Suspend/Resume Theory
Author | Message |
---|---|
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
A 1-minute suspend (LAIM off) was fine: the job continued after the resume. After a 73-minute suspend, the pythia8 job did not continue. I saw several condor processes, and a new job (herwig++) was started. There are no logs except job.out, which is overwritten by every new job. |
Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
The logs should be there in a new task. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
There are only the usual vbox log files and the stderr in the slot folder. No other logs in the shared folder. The console only shows the startup (F1), Events (F2) and TOP (F3). |
Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
They are in the Web logs (show graphics) |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
The F1 console is not much use, as it keeps scrolling. I can only read anything when it stops. The only log in "show graphics" is the job.log, which corresponds to the F2 console. |
Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
How old is the task? You should have some Condor Logs now. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
4h. Did you upgrade the VDI file? |
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
4h. I got a new task (no new BOINC-project files) and now I have these logs: MasterLog ....... 13-Apr-2016 17:07 6.3K; StartLog .......... 13-Apr-2016 17:21 18K; StarterLog ...... 13-Apr-2016 17:21 35K; running.log ..... 13-Apr-2016 17:51 60K (this was the old job.out); stderr.log ........ 13-Apr-2016 15:21 0; stdout.log ....... 13-Apr-2016 15:21 0. Console ALT-F2 shows no display. In MasterLog, this message appears every hour: 04/13/16 17:07:29 PERMISSION DENIED to condor@38-1075-19973 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE) |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I am getting the same. My new task started 40 min ago. |
Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
Will fix the console ASAP. |
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
16/17 days later. I tested suspend/resume of Theory jobs overnight. The 3 running jobs were suspended and the VMs' state saved to disk. This morning I resumed those 3 and they ran for a while. Somewhere in the middle of a job, the VMs were killed after about 7 minutes of run time without finishing the current job. The old version on the production project can be suspended even for weeks and still deliver a valid job after a resume. I know: very different setup with pilot agent. See the end of the results: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=166252 http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=166210 http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=166536 |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I tested disconnecting the internet for 10 min during event processing. The job appears to finish properly. However, it has not started a new job (for 30 min now) and sits idle.
I have to wait and see, if the task is terminated, or if it eventually starts a new job. After 41min it got a new job. But: Comix was compiled for multithreading. Matrix_Element_Handler::BuildProcesses(): Looking for processes ... done ( 31 MB, 0s / 0s ). Matrix_Element_Handler::InitializeProcesses(): Performing tests .. Amplitude::GaugeTest(): Large deviation { 16.5897415555 vs 16.5898336882 => -5.55356350729e-06 } . done ( 31 MB, 0s / 0s ). Initialized the Matrix_Element_Handler for the hard processes. Initialized the Beam_Remnant_Handler. Initialized the Soft_Photon_Handler. Process_Group::CalculateTotalXSec(): Calculate xs for '2_2__j__j__t__tb' (Comix) Starting the calculation at 11:13:26. Lean back and enjoy ... . WARNING in LHAPDF_Fortran_Interface::GetXPDF(G) not supported by this PDF! WARNING in LHAPDF_Fortran_Interface::GetXPDF(G) not supported by this PDF! WARNING in LHAPDF_Fortran_Interface::GetXPDF(G) not supported by this PDF! WARNING in LHAPDF_Fortran_Interface::GetXPDF(G) not supported by this PDF! WARNING in LHAPDF_Fortran_Interface::GetXPDF(G) not supported by this PDF! IT IS PRINTING THIS MILLIONS OF TIMES. Log file 30MB and rising! |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Log file size 45MB and rising. 2 other theory tasks have terminated. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=169350 http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=169217 Internet disconnect was from 10.20 to 10.30 local time. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Task appears to have recovered and started a new job. |
Joined: 13 Apr 15 Posts: 138 Credit: 2,969,210 RAC: 9 |
This evening I snoozed BOINC for c. 30 mins while I had to use my computer for something other than BOINC (Crazy, I know. Who would do such a thing?). With LAIM set, all 4 VMs (2x Theory, 1x CMS, 1x Virtual (T4T)) showed in VBox as Paused. On un-snooze: The Theory, after a few mins that I didn't time, dumped the job (within the task) it was doing and got a new one. The CMS processed 2 more events, then similarly dumped the job and got a new one, as above. The new jobs are running fine. A quick look at the logs shows only finished logs for completed runs, with apparently NO log of the dumped jobs (?), although StarLog shows a few lines: Cron: Killing all jobs CronJobList: Deleting all jobs The T4T resumed its job where it left off and carried on unaffected. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
"Please try this with the Theory app. This is where I am working. But don't try until tomorrow as I still need to do an update." (From the CMS thread, yesterday.) Is it ready to go? Is there any way to see whether uploaded jobs have passed (similar to the dashboard)? That way I could confirm whether any interruptions cause issues. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Disconnecting the internet for 10 min causes the following: After the resume the job finishes. The running.log and the last finished x.log are the same. No new job is started (for at least 1 hour). There is no "Job finished" entry in stderr.txt. 05/05/16 15:01:38 condor_write(): Socket closed when trying to write 365 bytes to <188.184.187.167:9618>, fd is 11 |
Joined: 15 Apr 15 Posts: 38 Credit: 227,251 RAC: 0 |
Disconnecting the internet for 10 min has the same effect for CMS: no new job is started. |
Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 18 |
I've seen the same after a suspend. The current job resumes normally and finishes (I'm not sure about a valid upload/result), but no new job arrives. All 'nobody' processes are gone. |
Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
Suspend/Resume should now work for up to 60 mins. |
©2024 CERN