Message boards : Number crunching : issue of the day
Author | Message |
---|---|
Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 58 |
That's what happens here. It loses work in progress whenever BOINC is shut down. Later versions of BOINC can save the VM when shut down gracefully (BOINC does a lot of cleaning up at the end of its task, so the VM is removed then) but the actual CMS job is "abandoned" and a new one is picked up. It isn't as bad as it sounds. Conventional BOINC projects, well most of them, use checkpoints and this behaviour is equivalent to having a checkpoint interval equal to the CMS cycle length... or have I got something horribly wrong? Edit. It is (or was) possible to make the BOINC task run time whatever you want (within reason) by using the anonymous platform mechanism, which lets you use your own job file. I don't do this now but it used to work very well for T4T with run times of 48h or longer. Not sure if it would work here, and there is still the BOINC task deadline which may need to be increased. It was also possible to run the VM outside BOINC (as I remember, Tullio did this very successfully); that way, the VM ran continuously. Updates had to be done manually though. |
Joined: 12 Sep 14 Posts: 1150 Credit: 342,328 RAC: 1 |
Yes, this is the case and something that we need to optimize. Ideally if the VM does not have enough time to run another job it should shut itself down as a signal to the vboxwrapper that it has finished the task but I am not sure that the vboxwrapper supports such behaviour. |
Joined: 13 Feb 15 Posts: 1252 Credit: 996,888 RAC: 73 |
Yes, this is the case and something that we need to optimize. Ideally if the VM does not have enough time to run another job it should shut itself down as a signal to the vboxwrapper that it has finished the task but I am not sure that the vboxwrapper supports such behaviour. It does: completion_trigger_file. You have to add the lines <enable_shared_directory/> and <completion_trigger_file>filename</completion_trigger_file> to your Vbox job description file. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,215,383 RAC: 1 |
There is not much point optimizing job length if the last job in a 24h period is lost. The longer the job, the more is lost at the end. As far as I am concerned, this is an issue that deserves high priority. |
Joined: 13 Feb 15 Posts: 1252 Credit: 996,888 RAC: 73 |
Yes, this is the case and something that we need to optimize. Ideally if the VM does not have enough time to run another job it should shut itself down as a signal to the vboxwrapper that it has finished the task but I am not sure that the vboxwrapper supports such behaviour. I simulated the use of the tags with success. The only thing you have to implement in the VM is the check whether there is time enough left for at least 1 whole job and the writing of a "file to stop" into the shared directory. 2015-10-29 21:43:19 (7868): VM Completion File Detected. 2015-10-29 21:43:19 (7868): Powering off VM. 2015-10-29 21:43:21 (7868): Successfully stopped VM. 2015-10-29 21:43:26 (7868): Deregistering VM. (boinc_b811b2fa61782689, slot#0) 2015-10-29 21:43:26 (7868): Removing virtual disk drive(s) from VM. 2015-10-29 21:43:26 (7868): Removing network bandwidth throttle group from VM. 2015-10-29 21:43:26 (7868): Removing storage controller(s) from VM. 2015-10-29 21:43:26 (7868): Removing VM from VirtualBox. 21:43:31 (7868): called boinc_finish(0) |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 0 |
Issue of the day today was a 90-minute power cut! On resuming, the 24hr task that was due to finish first has reset the percent counter back to 0 (but is increasing). The Graphics and Show VM buttons are greyed out but I found the web page address and looked at what was there: [ ] boot.log 01-Nov-2015 13:57 11K [TXT] cron-stderr 31-Oct-2015 14:18 0 [TXT] cron-stdout 01-Nov-2015 13:57 4.8K [DIR] run-0/ 01-Nov-2015 13:58 - All that remains of the original structure/files is the cron-stderr file; everything else is new since the reboot. My question is why all the other files/folders have been deleted but the cron-stderr file survived? |
Joined: 20 Jan 15 Posts: 1152 Credit: 8,310,612 RAC: 0 |
Possibly it was the only thing that had been written to the VM's disk image at the time of the outage. Alternatively, it may be the only thing not written to since the recovery. I believe a Linux directory's timestamp is updated when a file within the directory is altered. Moral: VMs aren't really designed to cope with unexpected power cuts? |
Joined: 16 Aug 15 Posts: 966 Credit: 1,215,383 RAC: 1 |
I have a finished job 347 for the 75events/job lot. (Task: 151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev) The task monitor on crab3 user monitor reports a walltime of 1h10m. However, the CMS-run_stdout.log reports a total time of 10491 sec, which is almost 3h. Why is that? All other jobs (8 jobs) have a very similar wallclock to total time. What exactly is walltime in this case? |
Joined: 20 Jan 15 Posts: 1152 Credit: 8,310,612 RAC: 0 |
|
Joined: 16 Aug 15 Posts: 966 Credit: 1,215,383 RAC: 1 |
Thanks for the reply. Has any progress been made on properly terminating the 24h BOINC task (without the loss of the final job)? |
Joined: 12 Sep 14 Posts: 1150 Credit: 342,328 RAC: 1 |
No, but it is near the top of the todo list. |
Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
Got project message that new wrapper was in place to fix "VirtualBox not running" on Mac (I think, it was talking about Mac path changes anyway). Still getting same error here - unless there are still old tasks in the queue? Silly question... since the task is not being downloaded and failing, but is never being downloaded at all, how is a wrapper change going to fix the problem? |
Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
Never mind, that message was from 10/22, about wrapper 26178. Already knew that didn't change anything. |
Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
"VirtualBox exited unexpectedly" on Linux system error message; CMS BOINC task showed in "Running" state with % Complete 100%, time remaining "-". Rebooted to get clean VBox start, no change, still 100% complete but running. Gave it another couple hours then aborted it. I see nothing odd in stderr - looks like it finished okay! (26 hours run time, no credit). Problem when it shut down VBox? Big issue if a task NEVER completes... User 306, task 69452, wu 64594, host 782. |
Joined: 20 Jan 15 Posts: 1152 Credit: 8,310,612 RAC: 0 |
"VirtualBox exited unexpectedly" on Linux system error message; CMS BOINC task showed in "Running" state with % Complete 100%, time remaining "-". Rebooted to get clean VBox start, no change, still 100% complete but running. Gave it another couple hours then aborted it. I see nothing odd in stderr - looks like it finished okay! (26 hours run time, no credit). Problem when it shut down VBox? Big issue if a task NEVER completes... Thanks for the report, Bill -- what time-zone are you in, looks to be UTC +6? Nothing leaps out at me in the stderr report. That host appears to have had just two jobs during the task 151102_084842:ireid_crab_CMS_at_Home_TTbar_50ev_3/job_out.3706.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:42:54 GMT 2015 on 306-782-14078 with (short) status 65 ======== 151102_084842:ireid_crab_CMS_at_Home_TTbar_50ev_3/job_out.385.1.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:19:19 GMT 2015 on 306-782-14078 with (short) status 65 ======== Was it doing other work at the time? 
Hmm, the log isn't good: == CMSSW: cmsRun -j FrameworkJobReport.xml PSet.py == CMSSW: ----- Begin Fatal Exception 06-Nov-2015 17:19:15 CET----------------------- == CMSSW: An exception of category 'PluginLibraryLoadError' occurred while == CMSSW: [0] Constructing the EventProcessor == CMSSW: [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag' == CMSSW: Exception Message: == CMSSW: unable to load /cvmfs/cms.cern.ch/slc6_amd64_gcc472/cms/cmssw/CMSSW_6_2_0_SLHC26/lib/slc6_amd64_gcc472/pluginJetMETCorrectionsModulesPlugins.so because libDataFormatsRecoCandidate.so: cannot open shared object file: No such file or directory == CMSSW: ----- End Fatal Exception ------------------------------------------------ == CMSSW: Complete == CMSSW: process id is 10043 status is 65 It looks to me like the /cvmfs file system in your VM might have become corrupted, possibly due to a network interruption during an update (cvmfs is a read-only file system used by CMS to distribute information; it is locally cached and updated as files are read and synched something like rsync does). I'd suggest you do a project reset on that host to get a clean VM image. |
Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
what time-zone are you in, looks to be UTC +6? Believe so - Dallas/Chicago Central Time. Was it doing other work at the time? DENIS@Home, Malariacontrol, CSG probably; have CMS set at 50% to minimize swapping, but 1-task limit keeps it down to 1 core per host. That box has 4 (slow) cores, AMD APU. [quote]It looks to me like the /cvmfs file system in your VM might have become corrupted, possibly due to a network interruption during an update (cvmfs is a read-only file system used by CMS to distribute information; it is locally cached and updated as files are read and synched something like rsync does). I'd suggest you do a project reset on that host to get a clean VM image.[/quote] Will do! Thanks. |
Joined: 20 Jan 15 Posts: 1152 Credit: 8,310,612 RAC: 0 |
|
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 0 |
I'm getting the error "Cloud not get a proxy from CMS-Dev" ! Then says going to sleep for an hour, which it does :-) This is on 24 hour jobs that started after you announced the fix/update. Just noticed that now it tries "Requesting an X509 credential from CMS-Dev" followed by "Requesting an X509 credential from LHC@home" Don't know when that started happening. Would have cut and pasted but I can't remember what keystrokes let me highlight and copy the text from ALT-F1. |
Joined: 20 Jan 15 Posts: 1152 Credit: 8,310,612 RAC: 0 |
I'm getting the error "Cloud not get a proxy from CMS-Dev" ! I noticed that typo when I was looking up Yeti's "user-ID not found" message. :-) Don't know about the rest though, looks like a job or two for Laurence tomorrow (possibly a cut-'n'-paste gone wrong somewhere). |
Joined: 12 Sep 14 Posts: 1150 Credit: 342,328 RAC: 1 |
This is in preparation for adding the CMS app to LHC@home (where T4T runs). First the CMS-Dev is tried and if this fails, it tries LHC@home. You can ignore LHC@home as this should fail as you are using your CMS@home credential. What is concerning is why it is failing to get a proxy from CMS-Dev. How often does this happen? |
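
Editor's sketch of the VM-side check described earlier in the thread: before fetching another job, test whether a whole job still fits in the remaining task time, and if not, write the "file to stop" into the shared directory so that vboxwrapper's completion_trigger_file option can end the task cleanly. This is only a minimal illustration, not the project's implementation. The function names, the way the task start time and job length are obtained, and the shared directory path are all assumptions; the file is created empty on the assumption that its presence alone is the signal.

```python
import time
from pathlib import Path


def enough_time_for_job(task_start, task_length_s, job_length_s, now=None):
    """Return True if at least one whole job still fits before the task ends."""
    now = time.time() if now is None else now
    remaining_s = (task_start + task_length_s) - now
    return remaining_s >= job_length_s


def request_shutdown(shared_dir, trigger_name):
    """Create the (empty) trigger file in the shared directory; return its path."""
    trigger = Path(shared_dir) / trigger_name
    trigger.touch()
    return trigger


def check_before_next_job(task_start, task_length_s, job_length_s,
                          shared_dir, trigger_name, now=None):
    """Return True to fetch another job; otherwise write the trigger file."""
    if enough_time_for_job(task_start, task_length_s, job_length_s, now):
        return True
    request_shutdown(shared_dir, trigger_name)
    return False
```

For a 24h task (86400 s) and a job estimate of 10491 s, the figure reported above, `check_before_next_job(start, 86400, 10491, shared_dir, "filename")` would stop fetching jobs once less than about 2h55m remains. Here `shared_dir` is whatever path the shared directory is mounted at inside the VM, and `"filename"` must match the name given in the job description file.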
©2025 CERN