Message boards : Number crunching : issue of the day
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 181
Message 1369 - Posted: 29 Oct 2015, 10:39:20 UTC - in response to Message 1367.  
Last modified: 29 Oct 2015, 11:05:07 UTC

That's what happens here. It loses work in progress whenever BOINC is shut down.
Later versions of BOINC can save the VM when shut down gracefully (BOINC does
a lot of cleaning up at the end of its task, so the VM is removed then), but the
actual CMS job is "abandoned" and a new one is picked up. It isn't as bad as it
sounds. Conventional BOINC projects, well, most of them, use checkpoints, and this
behaviour is equivalent to having a checkpoint interval equal to the CMS cycle
length... or have I got something horribly wrong?

Edit. It is (or was) possible to make the BOINC task run time whatever you want
(within reason) by using the anonymous platform mechanism, which lets you use
your own job file. I don't do this now, but it used to work very well for T4T
with run times of 48h or longer. Not sure if it would work here, and there is
still the BOINC task deadline, which may need to be increased.
It was also possible to run the VM outside BOINC (as I remember, Tullio
did this very successfully). That way, the VM ran continuously. Updates had to be done manually, though.
ID: 1369 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,449
RAC: 199
Message 1376 - Posted: 29 Oct 2015, 14:34:30 UTC - in response to Message 1367.  

Yes, this is the case and something that we need to optimize. Ideally if the VM does not have enough time to run another job it should shut itself down as a signal to the vboxwrapper that it has finished the task but I am not sure that the vboxwrapper supports such behaviour.
ID: 1376 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 825,079
RAC: 1,059
Message 1377 - Posted: 29 Oct 2015, 15:13:14 UTC - in response to Message 1376.  
Last modified: 29 Oct 2015, 15:15:40 UTC

Yes, this is the case and something that we need to optimize. Ideally if the VM does not have enough time to run another job it should shut itself down as a signal to the vboxwrapper that it has finished the task but I am not sure that the vboxwrapper supports such behaviour.

It does:

completion_trigger_file
This provides a more bulletproof way for VM apps to exit; sometimes VMs fail to shut down, and the task hangs indefinitely.
When the VM app is done, it writes a file of this name in the shared directory;
the file can optionally contain an integer exit code (first line), a bool value for whether it should bubble up to the volunteer (second line), and stderr text (subsequent lines).
If vboxwrapper finds this file, it cleans up the VM and exits with the given code (default 0).

You have to add the lines

<enable_shared_directory/>
<completion_trigger_file>filename</completion_trigger_file>

to your Vbox job description file.
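
For anyone wanting to try this from inside the VM, here is a minimal sketch of writing such a trigger file with the three-part layout described above. The function name, the file name and the shared-directory path are hypothetical, and writing the bool as 0/1 is an assumption on my part; check how your vboxwrapper version parses that line before relying on it.

```python
import os
import tempfile
from pathlib import Path


def write_completion_file(shared_dir, filename, exit_code=0,
                          notify_volunteer=False, stderr_text=""):
    """Write a completion trigger file into the shared directory.

    Layout follows the description quoted above: exit code on line 1,
    a bool on line 2, optional stderr text on the following lines.
    The 0/1 encoding of the bool is an assumption, not confirmed.
    """
    lines = [str(int(exit_code)), "1" if notify_volunteer else "0"]
    if stderr_text:
        lines.append(stderr_text)

    target = Path(shared_dir) / filename
    # Write to a temporary file first and rename it into place, so the
    # wrapper never picks up a half-written trigger file.
    fd, tmp_path = tempfile.mkstemp(dir=shared_dir)
    with os.fdopen(fd, "w") as handle:
        handle.write("\n".join(lines) + "\n")
    os.replace(tmp_path, target)
    return target
```

The file name passed in has to match whatever you put in the completion_trigger_file tag.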
ID: 1377 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1378 - Posted: 29 Oct 2015, 15:13:23 UTC - in response to Message 1376.  

There is not much point in optimizing job length if the last job in a 24h period is lost.
The longer the job, the more is lost at the end.
As far as I am concerned, this is an issue that deserves high priority.
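
To put rough numbers on "the longer the job, the more is lost", here is a small sketch. It assumes jobs run back to back with no gaps and that whatever job is in progress when the task ends is simply thrown away; real tasks also spend time on start-up and on fetching jobs, which this ignores. The function name is made up for illustration.

```python
def lost_hours_at_task_end(task_hours, job_hours):
    """Hours spent on the final, unfinished job when the task ends.

    Simplified model (an assumption): jobs run back to back from time 0,
    and the job in progress at the end of the task is abandoned.
    """
    if job_hours <= 0:
        raise ValueError("job_hours must be positive")
    return task_hours % job_hours


for job_hours in (1.0, 2.5, 5.0, 7.0):
    lost = lost_hours_at_task_end(24.0, job_hours)
    print(f"{job_hours}h jobs in a 24h task: {lost}h lost at the end")
```

In this model the loss can never exceed one job length, so the worst case grows with the job length.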
ID: 1378 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 825,079
RAC: 1,059
Message 1380 - Posted: 29 Oct 2015, 20:55:05 UTC - in response to Message 1377.  

Yes, this is the case and something that we need to optimize. Ideally if the VM does not have enough time to run another job it should shut itself down as a signal to the vboxwrapper that it has finished the task but I am not sure that the vboxwrapper supports such behaviour.

It does:

completion_trigger_file
This provides a more bulletproof way for VM apps to exit; sometimes VMs fail to shut down, and the task hangs indefinitely.
When the VM app is done, it writes a file of this name in the shared directory;
the file can optionally contain an integer exit code (first line), a bool value for whether it should bubble up to the volunteer (second line), and stderr text (subsequent lines).
If vboxwrapper finds this file, it cleans up the VM and exits with the given code (default 0).

You have to add the lines

<enable_shared_directory/>
<completion_trigger_file>filename</completion_trigger_file>

to your Vbox job description file.

I simulated the use of the tags with success.
The only thing you have to implement in the VM is a check of whether there is enough time left for at least one whole job, plus the writing of a "file to stop" into the shared directory.


2015-10-29 21:43:19 (7868): VM Completion File Detected.
2015-10-29 21:43:19 (7868): Powering off VM.
2015-10-29 21:43:21 (7868): Successfully stopped VM.
2015-10-29 21:43:26 (7868): Deregistering VM. (boinc_b811b2fa61782689, slot#0)
2015-10-29 21:43:26 (7868): Removing virtual disk drive(s) from VM.
2015-10-29 21:43:26 (7868): Removing network bandwidth throttle group from VM.
2015-10-29 21:43:26 (7868): Removing storage controller(s) from VM.
2015-10-29 21:43:26 (7868): Removing VM from VirtualBox.
21:43:31 (7868): called boinc_finish(0)
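
The time check mentioned above could look something like this sketch. All names and the 10-minute safety margin are hypothetical; how the VM would actually learn the task limit and the expected job length is not covered here.

```python
def enough_time_for_another_job(elapsed_s, task_limit_s, expected_job_s,
                                safety_margin_s=600):
    """Return True if a whole job should still fit into the task.

    elapsed_s        seconds the task has run so far
    task_limit_s     total run time allowed for the task (e.g. 24h)
    expected_job_s   estimated duration of one job (an estimate, so a
                     safety margin is added on top)
    """
    remaining_s = task_limit_s - elapsed_s
    return remaining_s >= expected_job_s + safety_margin_s


# Example: 24h task, jobs expected to take about 3h each.
DAY_S = 24 * 3600
print(enough_time_for_another_job(20 * 3600, DAY_S, 3 * 3600))
print(enough_time_for_another_job(22 * 3600, DAY_S, 3 * 3600))
```

When the check comes back False, the VM would write the "file to stop" instead of asking for another job.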
ID: 1380 · Rating: 0
PDW

Joined: 20 May 15
Posts: 217
Credit: 5,668,904
RAC: 14,951
Message 1402 - Posted: 1 Nov 2015, 14:49:51 UTC - in response to Message 1380.  

Issue of the day today was a 90-minute power cut!

On resuming, the 24hr task that was due to finish first has reset the percent counter back to 0 (but is increasing). The Graphics and Show VM buttons are greyed out but I found the web page address and looked at what was there:

[ ] boot.log 01-Nov-2015 13:57 11K
[TXT] cron-stderr 31-Oct-2015 14:18 0
[TXT] cron-stdout 01-Nov-2015 13:57 4.8K
[DIR] run-0/ 01-Nov-2015 13:58 -

All that remains of the original structure/files is the cron-stderr file; everything else is new since the reboot. My question is: why have all the other files/folders been deleted while the cron-stderr file survived?
ID: 1402 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,894,492
RAC: 1,886
Message 1403 - Posted: 1 Nov 2015, 16:10:10 UTC - in response to Message 1402.  

Possibly it was the only thing that had been written to the VM's disk image at the time of the outage. Alternatively, it may be the only thing not written to since the recovery. I believe a Linux directory's timestamp is updated when a file within the directory is altered. Moral: VMs aren't really designed to cope with unexpected power cuts?
ID: 1403 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1414 - Posted: 2 Nov 2015, 14:40:30 UTC
Last modified: 2 Nov 2015, 14:52:39 UTC

I have a finished job, 347, from the 75 events/job batch.
(Task: 151030_103550:ireid_crab_CMS_at_Home_TTbar_75ev)

The task monitor on the crab3 user monitor reports a walltime of 1h10m.

However, the CMS-run_stdout.log reports a total time of 10491 sec, which is almost 3h.
Why is that?
All the other jobs (8 of them) have very similar wallclock and total times.

What exactly is walltime in this case?
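
For reference, the arithmetic behind the two figures quoted above, as a short sketch (the helper name is made up):

```python
def format_duration(seconds):
    """Convert a number of seconds into an 'XhYYmZZs' string."""
    hours, remainder = divmod(int(seconds), 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"


log_total_s = 10491                  # total time from CMS-run_stdout.log
dashboard_wall_s = 1 * 3600 + 10 * 60  # 1h10m reported by the monitor

print(format_duration(log_total_s))
print(round(log_total_s / dashboard_wall_s, 2))
```

So the log's total time is 2h54m51s, roughly 2.5 times the walltime shown by the monitor.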
ID: 1414 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,894,492
RAC: 1,886
Message 1415 - Posted: 2 Nov 2015, 15:44:18 UTC - in response to Message 1414.  

I think I'd put that down to the general lack of trustworthiness of Dashboard. It does get things wrong sometimes, possibly as a result of a garbled message from the Condor server. I can't see anything in the log file to support that short wall-time.
ID: 1415 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1416 - Posted: 2 Nov 2015, 16:19:19 UTC - in response to Message 1415.  

Thanks for the reply.
Has any progress been made on properly terminating the 24h BOINC task?
(without the loss of the final job)
ID: 1416 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,449
RAC: 199
Message 1419 - Posted: 2 Nov 2015, 21:58:27 UTC - in response to Message 1416.  

No, but it is near the top of the todo list.
ID: 1419 · Rating: 0
Tern

Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1425 - Posted: 5 Nov 2015, 3:09:10 UTC

Got project message that new wrapper was in place to fix "VirtualBox not running" on Mac (I think, it was talking about Mac path changes anyway). Still getting same error here - unless there are still old tasks in the queue?

Silly question... since the task is not being downloaded and failing, but is never being downloaded at all, how is a wrapper change going to fix the problem?
ID: 1425 · Rating: 0
Tern

Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1426 - Posted: 5 Nov 2015, 3:11:36 UTC - in response to Message 1425.  

Never mind, that message was from 10/22, about wrapper 26178. Already knew that didn't change anything.
ID: 1426 · Rating: 0
Tern

Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1431 - Posted: 7 Nov 2015, 15:56:53 UTC

"VirtualBox exited unexpectedly" on Linux system error message; CMS BOINC task showed in "Running" state with % Complete 100%, time remaining "-". Rebooted to get clean VBox start, no change, still 100% complete but running. Gave it another couple hours then aborted it. I see nothing odd in stderr - looks like it finished okay! (26 hours run time, no credit). Problem when it shut down VBox? Big issue if a task NEVER completes...

User 306, task 69452, wu 64594, host 782.
ID: 1431 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,894,492
RAC: 1,886
Message 1432 - Posted: 7 Nov 2015, 18:45:03 UTC - in response to Message 1431.  

"VirtualBox exited unexpectedly" on Linux system error message; CMS BOINC task showed in "Running" state with % Complete 100%, time remaining "-". Rebooted to get clean VBox start, no change, still 100% complete but running. Gave it another couple hours then aborted it. I see nothing odd in stderr - looks like it finished okay! (26 hours run time, no credit). Problem when it shut down VBox? Big issue if a task NEVER completes...

User 306, task 69452, wu 64594, host 782.

Thanks for the report, Bill -- what time-zone are you in, looks to be UTC +6? Nothing leaps out at me in the stderr report. That host appears to have had just two jobs during the task
151102_084842:ireid_crab_CMS_at_Home_TTbar_50ev_3/job_out.3706.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:42:54 GMT 2015 on 306-782-14078 with (short) status 65 ========
151102_084842:ireid_crab_CMS_at_Home_TTbar_50ev_3/job_out.385.1.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Fri Nov 6 16:19:19 GMT 2015 on 306-782-14078 with (short) status 65 ========

Was it doing other work at the time?

Hmm, the log isn't good:
== CMSSW: cmsRun -j FrameworkJobReport.xml PSet.py
== CMSSW: ----- Begin Fatal Exception 06-Nov-2015 17:19:15 CET-----------------------
== CMSSW: An exception of category 'PluginLibraryLoadError' occurred while
== CMSSW: [0] Constructing the EventProcessor
== CMSSW: [1] Constructing ESSource: class=PoolDBESSource label='GlobalTag'
== CMSSW: Exception Message:
== CMSSW: unable to load /cvmfs/cms.cern.ch/slc6_amd64_gcc472/cms/cmssw/CMSSW_6_2_0_SLHC26/lib/slc6_amd64_gcc472/pluginJetMETCorrectionsModulesPlugins.so because libDataFormatsRecoCandidate.so: cannot open shared object file: No such file or directory
== CMSSW: ----- End Fatal Exception ------------------------------------------------
== CMSSW: Complete
== CMSSW: process id is 10043 status is 65


It looks to me like the /cvmfs file system in your VM might have become corrupted, possibly due to a network interruption during an update (cvmfs is a read-only file system used by CMS to distribute information; it is locally cached and updated as files are read and synched something like rsync does). I'd suggest you do a project reset on that host to get a clean VM image.
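
If you want to check your own logs for the same symptom, a rough sketch like the following would pull out the exception category and the missing library. The regular expressions are based only on the log excerpt above, so other CMSSW versions may word these lines differently; the function name is hypothetical.

```python
import re

# Patterns derived from the log excerpt above (an assumption that other
# logs use the same wording).
CATEGORY_RE = re.compile(r"An exception of category '([^']+)' occurred")
MISSING_LIB_RE = re.compile(r"because (\S+): cannot open shared object file")


def summarize_fatal_exceptions(log_text):
    """Return exception categories and missing shared libraries in a log."""
    return {
        "categories": CATEGORY_RE.findall(log_text),
        "missing_libraries": MISSING_LIB_RE.findall(log_text),
    }


sample = (
    "== CMSSW: An exception of category 'PluginLibraryLoadError' occurred while\n"
    "because libDataFormatsRecoCandidate.so: cannot open shared object file: "
    "No such file or directory\n"
)
print(summarize_fatal_exceptions(sample))
```

A PluginLibraryLoadError together with a missing .so under /cvmfs is what pointed to the cache here; other categories would need a different diagnosis.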
ID: 1432 · Rating: 0
Tern

Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1434 - Posted: 7 Nov 2015, 21:09:50 UTC - in response to Message 1432.  

what time-zone are you in, looks to be UTC +6?


Believe so - Dallas/Chicago Central Time.

Was it doing other work at the time?


DENIS@Home, Malariacontrol, CSG probably; have CMS set at 50% to minimize swapping, but 1-task limit keeps it down to 1 core per host. That box has 4 (slow) cores, AMD APU.

[quote]It looks to me like the /cvmfs file system in your VM might have become corrupted, possibly due to a network interruption during an update (cvmfs is a read-only file system used by CMS to distribute information; it is locally cached and updated as files are read and synched something like rsync does). I'd suggest you do a project reset on that host to get a clean VM image.[/quote]

Will do! Thanks.
ID: 1434 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,894,492
RAC: 1,886
Message 1435 - Posted: 7 Nov 2015, 23:02:55 UTC - in response to Message 1434.  

Cheers!
ID: 1435 · Rating: 0
PDW

Joined: 20 May 15
Posts: 217
Credit: 5,668,904
RAC: 14,951
Message 1443 - Posted: 12 Nov 2015, 19:42:13 UTC

I'm getting the error "Cloud not get a proxy from CMS-Dev" !
Then says going to sleep for an hour, which it does :-)

This is on 24 hour jobs that started after you announced the fix/update.

Just noticed that now it tries
"Requesting an X509 credential from CMS-Dev" followed by
"Requesting an X509 credential from LHC@home"
Don't know when that started happening.

Would have cut and pasted but I can't remember what keystrokes let me highlight and copy the text from ALT-F1.
ID: 1443 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,894,492
RAC: 1,886
Message 1445 - Posted: 12 Nov 2015, 21:49:53 UTC - in response to Message 1443.  

I'm getting the error "Cloud not get a proxy from CMS-Dev" !
Then says going to sleep for an hour, which it does :-)

This is on 24 hour jobs that started after you announced the fix/update.

Just noticed that now it tries
"Requesting an X509 credential from CMS-Dev" followed by
"Requesting an X509 credential from LHC@home"
Don't know when that started happening.

Would have cut and pasted but I can't remember what keystrokes let me highlight and copy the text from ALT-F1.

I noticed that typo when I was looking up Yeti's "user-ID not found" message. :-)
Don't know about the rest though, looks like a job or two for Laurence tomorrow (possibly a cut-'n'-paste gone wrong somewhere).
ID: 1445 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,449
RAC: 199
Message 1449 - Posted: 13 Nov 2015, 12:07:14 UTC - in response to Message 1445.  

This is in preparation for adding the CMS app to LHC@home (where T4T runs). First CMS-Dev is tried and, if this fails, it tries LHC@home.

You can ignore the LHC@home attempt: it should fail because you are using your CMS@home credential. What is concerning is why it is failing to get a proxy from CMS-Dev.

How often does this happen?
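
The try-one-then-the-other behaviour described above amounts to something like this sketch. It is only an illustration of the fallback order, not the actual script in the VM; request_credential and the fetch callback are hypothetical names.

```python
from typing import Callable, Iterable, Optional, Tuple


def request_credential(
    sources: Iterable[str],
    fetch: Callable[[str], Optional[str]],
) -> Tuple[Optional[str], Optional[str]]:
    """Try each credential source in order and stop at the first success.

    fetch(source) is a hypothetical callback that returns a proxy string,
    or None when that source cannot supply one.
    """
    for source in sources:
        proxy = fetch(source)
        if proxy is not None:
            return source, proxy
    # Every source failed; the caller decides what to do next
    # (in the VM, that is where the one-hour sleep would come in).
    return None, None


# Example with a fake fetcher in which only CMS-Dev knows the user.
def fake_fetch(source):
    return "proxy-for-cms-dev" if source == "CMS-Dev" else None


print(request_credential(["CMS-Dev", "LHC@home"], fake_fetch))
```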
ID: 1449 · Rating: 0
©2024 CERN