Thread 'issue of the day'

Author	Message
m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0
Message 1369 - Posted: 29 Oct 2015, 10:39:20 UTC - in response to Message 1367. Last modified: 29 Oct 2015, 11:05:07 UTC
That's what happens here. It loses work in progress whenever BOINC is shut down. Later versions of BOINC can save the VM when shut down gracefully (BOINC does a lot of cleaning up at the end of its task, so the VM is removed then), but the actual CMS job is "abandoned" and a new one is picked up. It isn't as bad as it sounds. Conventional BOINC projects, well most of them, use checkpoints, and this behaviour is equivalent to having a checkpoint interval equal to the CMS cycle length... or have I got something horribly wrong?
Edit: It is (or was) possible to make the BOINC task run time whatever you want (within reason) by using the anonymous platform mechanism, which lets you use your own job file. I don't do this now, but it used to work very well for T4T with run times of 48h or longer. Not sure if it would work here, and there is still the BOINC task deadline, which may need to be increased. It was also possible to run the VM outside BOINC (as I remember, Tullio did this very successfully); that way, the VM ran continuously. Updates had to be done manually, though.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0
Message 1376 - Posted: 29 Oct 2015, 14:34:30 UTC - in response to Message 1367.
Yes, this is the case and something that we need to optimize. Ideally, if the VM does not have enough time to run another job, it should shut itself down as a signal to the vboxwrapper that it has finished the task, but I am not sure that the vboxwrapper supports such behaviour.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,302 RAC: 63
Message 1377 - Posted: 29 Oct 2015, 15:13:14 UTC - in response to Message 1376. Last modified: 29 Oct 2015, 15:15:40 UTC
[quote]Yes, this is the case and something that we need to optimize. Ideally, if the VM does not have enough time to run another job, it should shut itself down as a signal to the vboxwrapper that it has finished the task, but I am not sure that the vboxwrapper supports such behaviour.[/quote]
It does:
completion_trigger_file
This provides a more bulletproof way for VM apps to exit; sometimes VMs fail to shut down, and the task hangs indefinitely. When the VM app is done, it writes a file of this name in the shared directory; the file can optionally contain an integer exit code (first line), a bool value for whether it should bubble up to the volunteer (second line), and stderr text (subsequent lines). If vboxwrapper finds this file, it cleans up the VM and exits with the given code (default 0).
You have to add the lines
<enable_shared_directory/>
<completion_trigger_file>filename</completion_trigger_file>
to your Vbox job description file.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0
Message 1378 - Posted: 29 Oct 2015, 15:13:23 UTC - in response to Message 1376.
There is not much point in optimizing job length if the last job in a 24h period is lost. The longer the job, the more is lost at the end. As far as I am concerned, this is an issue that deserves high priority.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,302 RAC: 63
Message 1380 - Posted: 29 Oct 2015, 20:55:05 UTC - in response to Message 1377.
[quote][quote]Yes, this is the case and something that we need to optimize. Ideally, if the VM does not have enough time to run another job, it should shut itself down as a signal to the vboxwrapper that it has finished the task, but I am not sure that the vboxwrapper supports such behaviour.[/quote]
It does:
completion_trigger_file
This provides a more bulletproof way for VM apps to exit; sometimes VMs fail to shut down, and the task hangs indefinitely. When the VM app is done, it writes a file of this name in the shared directory; the file can optionally contain an integer exit code (first line), a bool value for whether it should bubble up to the volunteer (second line), and stderr text (subsequent lines). If vboxwrapper finds this file, it cleans up the VM and exits with the given code (default 0).
You have to add the lines
<enable_shared_directory/>
<completion_trigger_file>filename</completion_trigger_file>
to your Vbox job description file.[/quote]
I simulated the use of the tags with success. The only things you have to implement in the VM are the check whether there is enough time left for at least one whole job and the writing of a "file to stop" into the shared directory.
2015-10-29 21:43:19 (7868): VM Completion File Detected.
2015-10-29 21:43:19 (7868): Powering off VM.
2015-10-29 21:43:21 (7868): Successfully stopped VM.
2015-10-29 21:43:26 (7868): Deregistering VM. (boinc_b811b2fa61782689, slot#0)
2015-10-29 21:43:26 (7868): Removing virtual disk drive(s) from VM.
2015-10-29 21:43:26 (7868): Removing network bandwidth throttle group from VM.
2015-10-29 21:43:26 (7868): Removing storage controller(s) from VM.
2015-10-29 21:43:26 (7868): Removing VM from VirtualBox.
21:43:31 (7868): called boinc_finish(0)
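
The VM-side logic described in the post above could look roughly like the following. This is only a sketch under stated assumptions, not the project's actual implementation: the function names, the example file name and the time figures are hypothetical, and only the exit code on the first line of the trigger file is written (the optional second and later lines are left out). The file name would have to match whatever is configured in <completion_trigger_file>.

```python
import tempfile
from pathlib import Path


def enough_time_for_another_job(elapsed_s, task_limit_s, expected_job_s):
    """Return True if a whole job of expected_job_s still fits in the task."""
    return task_limit_s - elapsed_s >= expected_job_s


def write_completion_trigger(shared_dir, filename, exit_code=0):
    """Write the trigger file; its first line carries the integer exit code."""
    target = Path(shared_dir) / filename
    target.write_text(f"{exit_code}\n")
    return target


def maybe_finish(shared_dir, filename, elapsed_s, task_limit_s, expected_job_s):
    """Write the trigger file if no further job fits; return True if written."""
    if enough_time_for_another_job(elapsed_s, task_limit_s, expected_job_s):
        return False
    write_completion_trigger(shared_dir, filename)
    return True


if __name__ == "__main__":
    # Hypothetical numbers: 24h task, 23h already used, jobs take about 2h.
    with tempfile.TemporaryDirectory() as shared:
        stopped = maybe_finish(shared, "trigger_demo",
                               elapsed_s=23 * 3600,
                               task_limit_s=24 * 3600,
                               expected_job_s=2 * 3600)
        print("trigger written:", stopped)
```

In a real VM the check would run between jobs, so the wrapper only ever powers the VM off at a job boundary and no partly finished job is lost.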

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0
Message 1402 - Posted: 1 Nov 2015, 14:49:51 UTC - in response to Message 1380.
Issue of the day today was a 90-minute power cut! On resuming, the 24hr task that was due to finish first has reset the percent counter back to 0 (but it is increasing). The Graphics and Show VM buttons are greyed out, but I found the web page address and looked at what was there:
[ ] boot.log 01-Nov-2015 13:57 11K
[TXT] cron-stderr 31-Oct-2015 14:18 0
[TXT] cron-stdout 01-Nov-2015 13:57 4.8K
[DIR] run-0/ 01-Nov-2015 13:58 -
All that remains of the original structure/files is the cron-stderr file; everything else is new since the reboot. My question is why all the other files/folders have been deleted but the cron-stderr file survived?

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 8
Message 1403 - Posted: 1 Nov 2015, 16:10:10 UTC - in response to Message 1402.
Possibly it was the only thing that had been written to the VM's disk image at the time of the outage. Alternatively, it may be the only thing not written to since the recovery. I believe a Linux directory's timestamp is updated when a file within the directory is altered. Moral: VMs aren't really designed to cope with unexpected power cuts?

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0
Message 1414 - Posted: 2 Nov 2015, 14:40:30 UTC Last modified: 2 Nov 2015, 14:52:39 UTC
I have a finished job 347 for the 75events/job lot. (Task: 151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev) The task monitor on the crab3 user monitor reports a walltime of 1h10m. However, the CMS-run_stdout.log reports a total time of 10491 sec, which is almost 3h. Why is that? All other jobs (8 jobs) have a very similar wallclock to total time. What exactly is walltime in this case?
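
For reference, the size of the discrepancy reported above can be worked out with a few lines of Python. This is just arithmetic on the two figures quoted in the post (1h10m and 10491 sec); the helper name format_duration is made up for the example.

```python
def format_duration(seconds):
    """Format a number of seconds as XhYYmZZs."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"


log_total_s = 10491                   # total time from CMS-run_stdout.log
dashboard_s = 1 * 3600 + 10 * 60      # walltime shown by the task monitor

print(format_duration(log_total_s))                # 2h54m51s
print(format_duration(log_total_s - dashboard_s))  # 1h44m51s
```

So the log's total time is about 1h45m longer than the monitor's walltime for this one job.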

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 8
Message 1415 - Posted: 2 Nov 2015, 15:44:18 UTC - in response to Message 1414.
I think I'd put that down to the general lack of trustworthiness of Dashboard. It does get things wrong sometimes, possibly as a result of a garbled message from the Condor server. I can't see anything in the log file to support that short wall-time.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0
Message 1416 - Posted: 2 Nov 2015, 16:19:19 UTC - in response to Message 1415.
Thanks for the reply. Has any progress been made on properly terminating the 24h BOINC task (without the loss of the final job)?

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0
Message 1419 - Posted: 2 Nov 2015, 21:58:27 UTC - in response to Message 1416.
No, but it is near the top of the todo list.

Tern Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0
Message 1425 - Posted: 5 Nov 2015, 3:09:10 UTC
Got a project message that a new wrapper was in place to fix "VirtualBox not running" on Mac (I think; it was talking about Mac path changes anyway). Still getting the same error here, unless there are still old tasks in the queue?
Silly question... since the task is not being downloaded and then failing, but is never being downloaded at all, how is a wrapper change going to fix the problem?

Tern Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0
Message 1426 - Posted: 5 Nov 2015, 3:11:36 UTC - in response to Message 1425.
Never mind, that message was from 10/22, about wrapper 26178. Already knew that didn't change anything.

Tern Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0
Message 1431 - Posted: 7 Nov 2015, 15:56:53 UTC
"VirtualBox exited unexpectedly" on Linux system error message; CMS BOINC task showed in "Running" state with % Complete 100%, time remaining "-". Rebooted to get clean VBox start, no change, still 100% complete but running. Gave it another couple hours then aborted it. I see nothing odd in stderr - looks like it finished okay! (26 hours run time, no credit). Problem when it shut down VBox? Big issue if a task NEVER completes...
User 306, task 69452, wu 64594, host 782.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 8
Message 1432 - Posted: 7 Nov 2015, 18:45:03 UTC - in response to Message 1431.
[quote]"VirtualBox exited unexpectedly" on Linux system error message; CMS BOINC task showed in "Running" state with % Complete 100%, time remaining "-". Rebooted to get clean VBox start, no change, still 100% complete but running. Gave it another couple hours then aborted it. I see nothing odd in stderr - looks like it finished okay! (26 hours run time, no credit). Problem when it shut down VBox? Big issue if a task NEVER completes... User 306, task 69452, wu 64594, host 782.[/quote]
Thanks for the report, Bill -- what time-zone are you in, looks to be UTC +6? Nothing leaps out at me in the stderr report. That host appears to have had just two jobs during the task:
151102_084842:ireid_crab_CMS_at_Home_TTbar_50ev_3/job_out.3706.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:42:54 GMT 2015 on 306-782-14078 with (short) status 65 ========
151102_084842:ireid_crab_CMS_at_Home_TTbar_50ev_3/job_out.385.1.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:19:19 GMT 2015 on 306-782-14078 with (short) status 65 ========
Was it doing other work at the time?
Hmm, the log isn't good:
== CMSSW: cmsRun -j FrameworkJobReport.xml PSet.py
== CMSSW: ----- Begin Fatal Exception 06-Nov-2015 17:19:15 CET-----------------------
== CMSSW: An exception of category 'PluginLibraryLoadError' occurred while
== CMSSW: [0] Constructing the EventProcessor
== CMSSW: [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
== CMSSW: Exception Message:
== CMSSW: unable to load /cvmfs/cms.cern.ch/slc6_amd64_gcc472/cms/cmssw/CMSSW_6_2_0_SLHC26/lib/slc6_amd64_gcc472/pluginJetMETCorrectionsModulesPlugins.so because libDataFormatsRecoCandidate.so: cannot open shared object file: No such file or directory
== CMSSW: ----- End Fatal Exception ------------------------------------------------
== CMSSW: Complete
== CMSSW: process id is 10043 status is 65
It looks to me like the /cvmfs file system in your VM might have become corrupted, possibly due to a network interruption during an update (cvmfs is a read-only file system used by CMS to distribute information; it is locally cached and updated as files are read and synched something like rsync does). I'd suggest you do a project reset on that host to get a clean VM image.
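
When checking a job log for this kind of failure, a small script can pull out the exception category and the final status. The following is a minimal sketch that assumes the log lines look like the excerpt quoted in the post above (the "Begin Fatal Exception" / "End Fatal Exception" markers, the "An exception of category '...'" line and the "status is N" line); the function name summarize_cmssw_log is made up for the example, and other CMSSW versions may format their logs differently.

```python
import re

CATEGORY = re.compile(r"An exception of category '([^']+)'")
STATUS = re.compile(r"status is (\d+)")


def summarize_cmssw_log(text):
    """Collect fatal-exception categories and the last reported status."""
    categories = []
    status = None
    in_block = False
    for line in text.splitlines():
        if "Begin Fatal Exception" in line:
            in_block = True
        elif "End Fatal Exception" in line:
            in_block = False
        elif in_block:
            match = CATEGORY.search(line)
            if match:
                categories.append(match.group(1))
        match = STATUS.search(line)
        if match:
            status = int(match.group(1))
    return {"categories": categories, "status": status}


SAMPLE = """\
== CMSSW: cmsRun -j FrameworkJobReport.xml PSet.py
== CMSSW: ----- Begin Fatal Exception 06-Nov-2015 17:19:15 CET-----------------------
== CMSSW: An exception of category 'PluginLibraryLoadError' occurred while
== CMSSW: [0] Constructing the EventProcessor
== CMSSW: ----- End Fatal Exception ------------------------------------------------
== CMSSW: Complete
== CMSSW: process id is 10043 status is 65
"""

if __name__ == "__main__":
    print(summarize_cmssw_log(SAMPLE))
```

A 'PluginLibraryLoadError' together with a non-zero status, as here, points at a library that could not be loaded from /cvmfs rather than at a problem in the job itself.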

Tern Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0
Message 1434 - Posted: 7 Nov 2015, 21:09:50 UTC - in response to Message 1432.
[quote]what time-zone are you in, looks to be UTC +6?[/quote]
Believe so - Dallas/Chicago Central Time.
[quote]Was it doing other work at the time?[/quote]
DENIS@Home, Malariacontrol, CSG probably; have CMS set at 50% to minimize swapping, but 1-task limit keeps it down to 1 core per host. That box has 4 (slow) cores, AMD APU.
[quote]It looks to me like the /cvmfs file system in your VM might have become corrupted, possibly due to a network interruption during an update (cvmfs is a read-only file system used by CMS to distribute information; it is locally cached and updated as files are read and synched something like rsync does). I'd suggest you do a project reset on that host to get a clean VM image.[/quote]
Will do! Thanks.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 8
Message 1435 - Posted: 7 Nov 2015, 23:02:55 UTC - in response to Message 1434.
Cheers!

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0
Message 1443 - Posted: 12 Nov 2015, 19:42:13 UTC
I'm getting the error "Cloud not get a proxy from CMS-Dev"! Then it says it is going to sleep for an hour, which it does :-) This is on 24 hour jobs that started after you announced the fix/update.
Just noticed that now it tries "Requesting an X509 credential from CMS-Dev" followed by "Requesting an X509 credential from LHC@home". Don't know when that started happening.
Would have cut and pasted, but I can't remember what keystrokes let me highlight and copy the text from ALT-F1.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 8
Message 1445 - Posted: 12 Nov 2015, 21:49:53 UTC - in response to Message 1443.
[quote]I'm getting the error "Cloud not get a proxy from CMS-Dev" ! Then says going to sleep for an hour, which it does :-) This is on 24 hour jobs that started after you announced the fix/update. Just noticed that now it tries "Requesting an X509 credential from CMS-Dev" followed by "Requesting an X509 credential from LHC@home" Don't know when that started happening. Would have cut and pasted but I can't remember what keystrokes let me highlight and copy the text from ALT-F1.[/quote]
I noticed that typo when I was looking up Yeti's "user-ID not found" message. :-) Don't know about the rest though; looks like a job or two for Laurence tomorrow (possibly a cut-'n'-paste gone wrong somewhere).

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0
Message 1449 - Posted: 13 Nov 2015, 12:07:14 UTC - in response to Message 1445.
This is in preparation for adding the CMS app to LHC@home (where T4T runs). First CMS-Dev is tried and, if this fails, it tries LHC@home. You can ignore the LHC@home attempt: it should fail, as you are using your CMS@home credential. What is concerning is that it is failing to get a proxy from CMS-Dev. How often does this happen?

Development for LHC@home