issue of the day
Author | Message |
---|---|
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Please post any issues you are having today to this thread and we will try to fix them. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,609 RAC: 15 |
No output on the ALT+F1, ALT+F4 and ALT+F5 screens. Most logs are in the run-1 directory.
These logs are updated:
MasterLog 21-Aug-2015 10:00 1.9K
ProcLog 21-Aug-2015 10:00 311K
StartdLog 21-Aug-2015 09:57 419K
StarterLog 21-Aug-2015 09:42 16K
These logs are not updated and contain only the info from the very first job:
FrameworkJobReport.xml 21-Aug-2015 08:15 21K
_condor_stdout 21-Aug-2015 08:17 283K
cmsRun-stderr.log 21-Aug-2015 07:27 243
cmsRun-stdout.log 21-Aug-2015 08:15 35K
glidein-stderr 21-Aug-2015 07:25 28K
glidein-stdout 21-Aug-2015 07:25 4.3K
scramOutput.log 21-Aug-2015 07:27 659 |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I confirm Crystal Pellet's findings, and a run-2 folder was generated about 6h after run-1. Otherwise it is now behaving the same way.
/logs/cron-stdout:
00:32:01 +0200 2015-08-21 [INFO] Starting CMS Application - Run 1
00:32:01 +0200 2015-08-21 [INFO] Reading the BOINC volunteer's information
00:32:04 +0200 2015-08-21 [INFO] Volunteer: Rasputin42 (277) Host: 617
00:32:04 +0200 2015-08-21 [INFO] Requesting an X509 credential
00:32:06 +0200 2015-08-21 [INFO] Downloading glidein
00:32:06 +0200 2015-08-21 [INFO] Running glidein (check logs)
06:19:02 +0200 2015-08-21 [INFO] CMS glidein Run 1 ended
06:20:01 +0200 2015-08-21 [INFO] Starting CMS Application - Run 2
06:20:01 +0200 2015-08-21 [INFO] Reading the BOINC volunteer's information
06:20:06 +0200 2015-08-21 [INFO] Volunteer: Rasputin42 (277) Host: 617
06:20:06 +0200 2015-08-21 [INFO] Requesting an X509 credential
06:20:09 +0200 2015-08-21 [INFO] Downloading glidein
06:20:10 +0200 2015-08-21 [INFO] Running glidein (check logs) |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 444 |
Yes, I'm getting something similar.
Thu 20 Aug 2015 05:32:46 PM BST | CMS-dev | Resetting project
Thu 20 Aug 2015 05:32:54 PM BST | CMS-dev | work fetch resumed by user
Thu 20 Aug 2015 05:32:55 PM BST | CMS-dev | update requested by user
Thu 20 Aug 2015 05:32:58 PM BST | CMS-dev | Master file download succeeded
Thu 20 Aug 2015 05:33:03 PM BST | CMS-dev | Sending scheduler request: Requested by user.
Thu 20 Aug 2015 05:33:03 PM BST | CMS-dev | Requesting new tasks for CPU
Thu 20 Aug 2015 05:33:05 PM BST | CMS-dev | Scheduler request completed: got 1 new tasks
And now in "graphics" logs I see:
boot.log 20-Aug-2015 17:41 11K
[ ] cron-stderr 20-Aug-2015 17:42 64
[TXT] cron-stdout 21-Aug-2015 06:12 1.2K
[DIR] run-1/ 20-Aug-2015 17:47 -
[DIR] run-2/ 21-Aug-2015 00:09 -
[DIR] run-3/ 21-Aug-2015 06:14 -
but delving into the run directories:
run-1:
[TXT] FrameworkJobReport.xml 20-Aug-2015 18:28 22K
[ ] MasterLog 21-Aug-2015 00:05 2.2K
[ ] ProcLog 21-Aug-2015 00:05 775K
[ ] StartdLog 21-Aug-2015 00:05 1.1M
[ ] StarterLog 21-Aug-2015 00:05 43K
[TXT] _condor_stderr 20-Aug-2015 17:43 0
[ ] _condor_stdout 20-Aug-2015 18:30 282K
[ ] cmsRun-stderr.log 20-Aug-2015 17:46 243
[TXT] cmsRun-stdout.log 20-Aug-2015 18:28 35K
[ ] cron-stderr 20-Aug-2015 17:42 64
[TXT] cron-stdout 21-Aug-2015 06:12 1.2K
[ ] glidein-stderr 21-Aug-2015 00:05 126K
[ ] glidein-stdout 21-Aug-2015 00:05 8.9K
[ ] scramOutput.log 20-Aug-2015 17:46 658
run-2:
[TXT] FrameworkJobReport.xml 21-Aug-2015 00:37 22K
[ ] MasterLog 21-Aug-2015 06:10 2.2K
[ ] ProcLog 21-Aug-2015 06:10 737K
[ ] StartdLog 21-Aug-2015 06:10 1.1M
[ ] StarterLog 21-Aug-2015 06:09 42K
[TXT] _condor_stderr 21-Aug-2015 00:08 0
[ ] _condor_stdout 21-Aug-2015 00:38 282K
[ ] cmsRun-stderr.log 21-Aug-2015 00:08 244
[TXT] cmsRun-stdout.log 21-Aug-2015 00:37 35K
[ ] glidein-stderr 21-Aug-2015 06:10 126K
[ ] glidein-stdout 21-Aug-2015 06:10 8.9K
[ ] scramOutput.log 21-Aug-2015 00:08 661
run-3:
[TXT] FrameworkJobReport.xml 21-Aug-2015 06:43 22K
[ ] MasterLog 21-Aug-2015 09:09 2.3K
[ ] ProcLog 21-Aug-2015 10:39 547K
[ ] StartdLog 21-Aug-2015 10:35 828K
[ ] StarterLog 21-Aug-2015 10:25 35K
[TXT] _condor_stderr 21-Aug-2015 06:13 0
[ ] _condor_stdout 21-Aug-2015 06:45 282K
[ ] cmsRun-stderr.log 21-Aug-2015 06:13 244
[TXT] cmsRun-stdout.log 21-Aug-2015 06:43 35K
[ ] glidein-stderr 21-Aug-2015 06:13 28K
[ ] glidein-stdout 21-Aug-2015 06:13 4.2K
[ ] scramOutput.log 21-Aug-2015 06:13 661
So, it's not starting a new log directory for each job, and some, but not all, of the logs are ending up in the currently opened directory. ALT-F4 and ALT-F5 console displays are also still blank. |
Send message Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
I tested "Suspend" and "Resume" and they seem to work fine, as long as I don't exit BOINC. When I exit BOINC, all unfinished work seems to be lost and it starts again from zero; I don't know whether it first fetches a new job or restarts the old one. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,609 RAC: 15 |
Still running in run-1. Seven cmsRun processes have returned successfully (triggered by condor_exec.exe, I suppose):
08/21/15 07:26:15 (pid:7627) Create_Process succeeded, pid=7631
08/21/15 08:17:15 (pid:7627) Process exited, pid=7631, status=151
08/21/15 08:17:18 (pid:9763) Create_Process succeeded, pid=9769
08/21/15 09:02:28 (pid:9763) Process exited, pid=9769, status=151
08/21/15 09:02:31 (pid:11886) Create_Process succeeded, pid=11890
08/21/15 09:42:02 (pid:11886) Process exited, pid=11890, status=151
08/21/15 09:42:05 (pid:13600) Create_Process succeeded, pid=13605
08/21/15 10:02:33 (pid:13600) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9622
08/21/15 10:02:33 (pid:13600) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9622 failed; will try to reconnect in 60 seconds.
08/21/15 10:03:33 (pid:13600) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9622 as ccbid 130.246.180.120:9622#23054
08/21/15 10:03:38 (pid:13600) Accepted request to reconnect from <130.246.180.120:9818>
08/21/15 10:03:38 (pid:13600) Ignoring old shadow <130.246.180.120:9818?noUDP&sock=20016_d29f_11916>
08/21/15 10:03:38 (pid:13600) Communicating with shadow <130.246.180.120:9818?noUDP&sock=20016_d29f_11916>
08/21/15 10:27:32 (pid:13600) Process exited, pid=13605, status=151
08/21/15 10:27:36 (pid:15521) Create_Process succeeded, pid=15527
08/21/15 11:11:01 (pid:15521) Process exited, pid=15527, status=151
08/21/15 11:11:03 (pid:17375) Create_Process succeeded, pid=17379
08/21/15 11:56:30 (pid:17375) Process exited, pid=17379, status=151
08/21/15 11:56:33 (pid:19290) Create_Process succeeded, pid=19294
08/21/15 12:38:36 (pid:19290) Process exited, pid=19294, status=151 |
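For anyone who wants to check how long each cmsRun took without reading the StarterLog by eye, here is a minimal sketch. It assumes the two line shapes quoted in the post above ("Create_Process succeeded" and "Process exited"); the function name `job_durations` and the `SAMPLE` excerpt are only illustrative, and other StarterLog line types are simply skipped.

```python
import re
from datetime import datetime

# Matches only the two StarterLog line shapes quoted in this thread.
# Any other line (for example the CCBListener messages) is ignored.
LINE = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:\d+\) "
    r"(?:Create_Process succeeded, pid=(\d+)"
    r"|Process exited, pid=(\d+), status=(\d+))"
)

def job_durations(log_text):
    """Return a list of (pid, exit_status, duration) for each finished process."""
    started = {}   # child pid -> start time
    finished = []
    for line in log_text.splitlines():
        match = LINE.search(line)
        if not match:
            continue
        stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        if match.group(2):                      # Create_Process line
            started[match.group(2)] = stamp
        elif match.group(3) in started:         # matching Process exited line
            begin = started.pop(match.group(3))
            finished.append((int(match.group(3)), int(match.group(4)), stamp - begin))
    return finished

# Excerpt copied from the log lines posted above.
SAMPLE = """\
08/21/15 07:26:15 (pid:7627) Create_Process succeeded, pid=7631
08/21/15 08:17:15 (pid:7627) Process exited, pid=7631, status=151
08/21/15 08:17:18 (pid:9763) Create_Process succeeded, pid=9769
08/21/15 09:02:28 (pid:9763) Process exited, pid=9769, status=151
"""

for pid, status, duration in job_durations(SAMPLE):
    print(pid, status, duration)
```

On the excerpt this gives 51 minutes for the first process and 45 minutes 10 seconds for the second.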
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Rasputin42, Your log suggests the glidein is run every 6 hours so I would expect to see a run-x directory every six hours. If the glidein is getting multiple jobs then the log files may only show the first job. Will investigate further. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,609 RAC: 15 |
Rasputin42, Your log suggests the glidein is run every 6 hours so I would expect to see a run-x directory every six hours. If the glidein is getting multiple jobs then the log files may only show the first job. Will investigate further.
[DIR] run-1/ 21-Aug-2015 07:28 -
[DIR] run-2/ 21-Aug-2015 13:29 -
Looks like that's true. I hope the glidein waits until the running cmsRun has finished. It seems so, or was it just a coincidence that there were 6 hours and 1 minute in between?
used default retire time, 21600
using default retire spread, 2160
From the StarterLog:
08/21/15 12:38:39 (pid:21092) Create_Process succeeded, pid=21096
08/21/15 13:26:14 (pid:21092) Process exited, pid=21096, status=151
condor_exec.exe ended at least a few minutes before the run-2 directory was created. |
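The arithmetic behind the "6 hours and 1 minute" observation can be checked with a short sketch. It assumes the two glidein values quoted above are in seconds and that the timestamps in the web listing are close to the time each run directory was created; how the glidein actually applies the retire spread is not covered here.

```python
from datetime import datetime, timedelta

# Values quoted from the glidein output in the post above (assumed to be seconds).
RETIRE_TIME_S = 21600    # "used default retire time, 21600"
RETIRE_SPREAD_S = 2160   # "using default retire spread, 2160"

def listing_time(stamp):
    """Parse a timestamp in the format used by the web log listing."""
    return datetime.strptime(stamp, "%d-%b-%Y %H:%M")

# Gap between the run-1 and run-2 directory timestamps from the listing.
gap = listing_time("21-Aug-2015 13:29") - listing_time("21-Aug-2015 07:28")

print(timedelta(seconds=RETIRE_TIME_S))    # 6:00:00
print(timedelta(seconds=RETIRE_SPREAD_S))  # 0:36:00
print(gap)                                 # 6:01:00
```

So the observed gap is one minute longer than the default retire time, which fits the six-hour cycle rather than a coincidence, although a single interval is not proof.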
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,609 RAC: 15 |
Looping again: every 2 minutes a new run-x directory is created.
[ ] boot.log 21-Aug-2015 07:24 11K
[TXT] cron-stderr 21-Aug-2015 07:25 0
[TXT] cron-stdout 21-Aug-2015 13:56 7.9K
[DIR] run-1/ 21-Aug-2015 07:28 -
[DIR] run-2/ 21-Aug-2015 13:29 -
[DIR] run-3/ 21-Aug-2015 13:31 -
[DIR] run-4/ 21-Aug-2015 13:33 -
[DIR] run-5/ 21-Aug-2015 13:35 -
[DIR] run-6/ 21-Aug-2015 13:37 -
[DIR] run-7/ 21-Aug-2015 13:39 -
[DIR] run-8/ 21-Aug-2015 13:41 -
[DIR] run-9/ 21-Aug-2015 13:43 -
[DIR] run-10/ 21-Aug-2015 13:45 -
[DIR] run-11/ 21-Aug-2015 13:47 -
[DIR] run-12/ 21-Aug-2015 13:49 -
[DIR] run-13/ 21-Aug-2015 13:51 -
[DIR] run-14/ 21-Aug-2015 13:53 -
[DIR] run-15/ 21-Aug-2015 13:55 -
 |
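A rough way to spot this kind of looping from a copy of the directory listing is to look at the gaps between consecutive run-x timestamps. The sketch below assumes the listing format shown above; the helper names and the 5-minute threshold are hypothetical choices for illustration, not anything the project uses.

```python
import re
from datetime import datetime, timedelta

# Matches "run-N/ DD-Mon-YYYY HH:MM" as it appears in the web listing.
ENTRY = re.compile(r"run-(\d+)/\s+(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2})")

def run_gaps(listing):
    """Return (run_number, gap_since_previous_run) for each run after the first."""
    runs = sorted(
        (int(number), datetime.strptime(stamp, "%d-%b-%Y %H:%M"))
        for number, stamp in ENTRY.findall(listing)
    )
    return [(later[0], later[1] - earlier[1]) for earlier, later in zip(runs, runs[1:])]

def looping_runs(listing, threshold=timedelta(minutes=5)):
    """Runs that started suspiciously soon after the previous one (threshold is a guess)."""
    return [number for number, gap in run_gaps(listing) if gap < threshold]

# First few entries copied from the listing above.
SAMPLE = """\
[DIR] run-1/ 21-Aug-2015 07:28 -
[DIR] run-2/ 21-Aug-2015 13:29 -
[DIR] run-3/ 21-Aug-2015 13:31 -
[DIR] run-4/ 21-Aug-2015 13:33 -
"""

print(looping_runs(SAMPLE))  # [3, 4]
```

Run 2 is not flagged because it follows the normal six-hour cycle; runs 3 and 4 are flagged because they appear only two minutes apart.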
Send message Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
The machine I was watching yesterday was looping until 22:40, and then switched to the 6-hour cycle at about run-201. It's due to shut down the current BOINC task instance within an hour, and start a new one. It'll be interesting to see how the log map changes. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Please could you email me the glidein-stdout and glidein-stderr from one of the runs. Thanks, |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
The glidein creates a working directory glide_xxxx and the CMSRun is executed in a subdirectory called dir_xxxx. I have now replicated this structure in the Web logs so hopefully we will see what is happening. The update has been pushed and will appear shortly. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,609 RAC: 15 |
I rebooted the VM to get rid of the looping, and now 'top' (ALT+F3) is no longer available and the logs web page is not available at all. Only ALT+F1 shows "Running glidein (check logs)", and ALT+F2 shows me that cmsRun has been working for 35 minutes now (even ALT+F6 or 10 are not available). I will wait to see what happens once this first cmsRun after the boot has finished. A 2nd cmsRun started, but otherwise there are no differences and no improved access. I'll request a new BOINC task in the hope that Laurence's update is already available with a fresh VM. |
Send message Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
20 minutes into a new task, and the console has been stuck at for at least the last 10 of them. Having lit the blue touch paper, I'm now retiring to a safe distance and waiting... |
Send message Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
The glidein creates a working directory glide_xxxx and the CMSRun is executed in a subdirectory called dir_xxxx. I have now replicated this structure in the Web logs so hopefully we will see what is happening. The update has been pushed and will appear shortly.
Yes, I can see this structure growing now. Do you want to see anything from it? |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,609 RAC: 15 |
20 minutes into a new task, and the console has been stuck at
I had the same for over 30 minutes and decided to reboot the VM. Now it is working with the new directory structure underneath run-1/:
[DIR] glide_JsrsA8/ 21-Aug-2015 15:52 -
[ ] glidein-stderr 21-Aug-2015 15:50 28K
[ ] glidein-stdout 21-Aug-2015 15:50 4.3K
Edit: The glide_xxxxxx directory contains another sub-directory called dir_xxxxx, where xxxxx is the process ID of condor_starter, with the cmsRun output in it. cmsRun is busy and all the usual ALT+Fx screens display information. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 444 |
Yes, I'm seeing that directory structure under run-4 now. Managed to catch it between jobs so I aborted and started a new task. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Yep, hopefully the Web logs and console are working now for everyone. It is Friday afternoon so I won't touch anything now. Let's see how it goes over the weekend. I would be interested to know if anyone has any suspend/resume related issues. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,609 RAC: 15 |
It is Friday afternoon so I won't touch anything now. Yeah, hands off . . . only to touch a well-deserved drink. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Looks like each subdirectory:
/logs/run-3/glide_sN6j69/dir_14859/cmsRun-stdout.log
/logs/run-3/glide_sN6j69/dir_19030/cmsRun-stdout.log
contains the results of one lot of 200 runs. Between run 200 and run 1 of the next lot there is a gap of about 7-8 minutes. I guess that is the time it takes to reinitialize everything and get new data. |
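To keep track of which log belongs to which run and job, the new path layout can be split into its parts. This is only a sketch based on the two paths quoted above and on the earlier remark that the dir_ suffix is the process ID of condor_starter; the function name and the field names in the result are made up for illustration.

```python
import re

# Layout as described in this thread: /logs/run-N/glide_XXXXXX/dir_PID/<file>
LOG_PATH = re.compile(
    r"^/logs/run-(?P<run>\d+)/glide_(?P<glide>\w+)/dir_(?P<pid>\d+)/(?P<name>[^/]+)$"
)

def parse_log_path(path):
    """Split a web log path into run number, glide id, starter pid and file name.

    Returns None when the path does not follow the layout above.
    """
    match = LOG_PATH.match(path)
    if match is None:
        return None
    return {
        "run": int(match.group("run")),
        "glide": match.group("glide"),
        "starter_pid": int(match.group("pid")),
        "name": match.group("name"),
    }

print(parse_log_path("/logs/run-3/glide_sN6j69/dir_14859/cmsRun-stdout.log"))
```

Paths that sit directly under a run directory, such as glidein-stdout, do not match this pattern and return None.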
©2024 CERN