Message boards : Theory Application : Suspend/Resume Theory

Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 50
Message 2701 - Posted: 13 Apr 2016, 8:09:22 UTC

A 1-minute suspend (LAIM off) was fine: the job continued after the resume.

After a 73-minute suspend, the pythia8 job did not continue. I saw several condor processes, and a new job (herwig++) was started.

There are no logs except job.out, which is overwritten by every new job.
ID: 2701 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 2702 - Posted: 13 Apr 2016, 9:00:59 UTC - in response to Message 2701.  

The logs should be there in a new task.
ID: 2702 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2703 - Posted: 13 Apr 2016, 9:08:51 UTC - in response to Message 2702.  

There are only the usual vbox log files in the slot folder and the stderr.
No other logs in the shared folder.
The console only shows the startup (F1), Events (F2) and TOP (F3).
ID: 2703 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 2705 - Posted: 13 Apr 2016, 9:45:33 UTC - in response to Message 2703.  

They are in the web logs (Show graphics).
ID: 2705 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2707 - Posted: 13 Apr 2016, 10:03:35 UTC - in response to Message 2705.  
Last modified: 13 Apr 2016, 10:05:47 UTC

The F1 console is not much use because it keeps scrolling. I can only read anything when it stops.
The only log in "Show graphics" is job.log, which corresponds to the F2 console.
ID: 2707 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 2712 - Posted: 13 Apr 2016, 10:53:15 UTC - in response to Message 2707.  

How old is the task? You should have some Condor Logs now.
ID: 2712 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2715 - Posted: 13 Apr 2016, 10:59:51 UTC - in response to Message 2712.  
Last modified: 13 Apr 2016, 11:00:55 UTC

4h.
Did you upgrade the VDI file?
ID: 2715 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 50
Message 2724 - Posted: 13 Apr 2016, 16:12:56 UTC - in response to Message 2715.  

4h.
Did you upgrade the VDI file?

I got a new task (no new BOINC-project files) and now I have these logs:

MasterLog    13-Apr-2016 17:07  6.3K
StartLog     13-Apr-2016 17:21   18K
StarterLog   13-Apr-2016 17:21   35K
running.log  13-Apr-2016 17:51   60K  (this was the old job.out)
stderr.log   13-Apr-2016 15:21     0
stdout.log   13-Apr-2016 15:21     0

The ALT-F2 console shows no display.

MasterLog shows this message every hour:

04/13/16 17:07:29 PERMISSION DENIED to condor@38-1075-19973 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE)
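
For anyone who wants to check the "every hour" pattern in their own MasterLog, here is a minimal Python sketch. It assumes every log line starts with a MM/DD/YY HH:MM:SS timestamp, as in the line above. The function name `denied_gaps` and the two earlier sample lines are made up for illustration; only the 17:07:29 line comes from the real log.

```python
from datetime import datetime

def denied_gaps(lines, needle="PERMISSION DENIED"):
    """Return the gaps, in minutes, between consecutive lines containing `needle`."""
    stamps = []
    for line in lines:
        if needle not in line:
            continue
        # The first 17 characters hold the timestamp, e.g. "04/13/16 17:07:29".
        stamps.append(datetime.strptime(line[:17], "%m/%d/%y %H:%M:%S"))
    return [(b - a).total_seconds() / 60 for a, b in zip(stamps, stamps[1:])]

sample = [
    # Hypothetical earlier occurrence, added only to have two points to compare.
    "04/13/16 16:07:29 PERMISSION DENIED to condor@38-1075-19973 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE)",
    # Hypothetical unrelated line, which the function skips.
    "04/13/16 16:30:00 some unrelated line",
    # The line quoted in the post.
    "04/13/16 17:07:29 PERMISSION DENIED to condor@38-1075-19973 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE)",
]
print(denied_gaps(sample))
```

To run it against a real file, pass the open file object instead of `sample`, since iterating a file yields its lines.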
ID: 2724 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2727 - Posted: 13 Apr 2016, 16:21:06 UTC - in response to Message 2724.  

I am getting the same. New task started 40min ago.
ID: 2727 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 2734 - Posted: 13 Apr 2016, 17:53:48 UTC - in response to Message 2724.  

Will fix the console ASAP.
ID: 2734 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 50
Message 3111 - Posted: 30 Apr 2016, 5:34:03 UTC

16/17 days later.

I tested suspend/resume of Theory jobs overnight.
The 3 running jobs were suspended and the VMs' state was saved to disk.
This morning I resumed those 3 and they ran for a while.
Somewhere in the middle of a job, the VMs were killed after about 7 minutes of run time, without finishing the current job.
The old version on the production project can be suspended even for weeks and still deliver a valid job after a resume. I know: it is a very different setup, with a pilot agent.

See the end of the results:
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=166252
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=166210
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=166536
ID: 3111 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3189 - Posted: 3 May 2016, 9:06:38 UTC
Last modified: 3 May 2016, 9:19:07 UTC

I tested disconnecting the internet for 10 min during event processing.

The job appears to be finishing properly.
However, it has not started a new job (for 30 min now) and sits idle.

05/03/16 10:20:43 condor_write(): Socket closed when trying to write 1378 bytes to collector alicondor01.cern.ch, fd is 10
05/03/16 10:20:43 Buf::write(): condor_write() failed
05/03/16 10:20:43 attempt to connect to <188.184.129.127:9618> failed: No route to host (connect errno = 113).
05/03/16 10:20:43 ERROR: SECMAN:2003:TCP connection to collector alicondor01.cern.ch failed.
05/03/16 10:20:43 Failed to start non-blocking update to <188.184.129.127:9618>.


I have to wait and see whether the task is terminated or eventually starts a new job.


After 41min it got a new job.
But:
Comix was compiled for multithreading.
Matrix_Element_Handler::BuildProcesses(): Looking for processes ... done ( 31 MB, 0s / 0s ).
Matrix_Element_Handler::InitializeProcesses(): Performing tests ..
Amplitude::GaugeTest(): Large deviation {
16.5897415555
vs 16.5898336882
=> -5.55356350729e-06
}
. done ( 31 MB, 0s / 0s ).
Initialized the Matrix_Element_Handler for the hard processes.
Initialized the Beam_Remnant_Handler.
Initialized the Soft_Photon_Handler.
Process_Group::CalculateTotalXSec(): Calculate xs for '2_2__j__j__t__tb' (Comix)
Starting the calculation at 11:13:26. Lean back and enjoy ... .
WARNING in LHAPDF_Fortran_Interface::GetXPDF(G) not supported by this PDF!
WARNING in LHAPDF_Fortran_Interface::GetXPDF(G) not supported by this PDF!
WARNING in LHAPDF_Fortran_Interface::GetXPDF(G) not supported by this PDF!
WARNING in LHAPDF_Fortran_Interface::GetXPDF(G) not supported by this PDF!
WARNING in LHAPDF_Fortran_Interface::GetXPDF(G) not supported by this PDF!




IT IS PRINTING THIS MILLIONS OF TIMES.
Log file 30MB and rising!
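
A log like this is much easier to read once the repeated lines are collapsed into a count. A minimal Python sketch of that idea follows. The function name `collapse_repeats` is made up, and the sample simply reuses lines from the output above. It only helps with reading the log and does nothing to stop the job from writing it.

```python
from itertools import groupby

def collapse_repeats(lines):
    """Yield (line, count) for each run of identical consecutive lines."""
    for line, run in groupby(lines):
        yield line, sum(1 for _ in run)

sample = [
    "Starting the calculation at 11:13:26. Lean back and enjoy ... .",
    # The warning repeated five times, as in the excerpt above.
    *["WARNING in LHAPDF_Fortran_Interface::GetXPDF(G) not supported by this PDF!"] * 5,
]
for line, count in collapse_repeats(sample):
    print(f"{count:>7}x {line}")
```

Because `groupby` works on any iterable, an open file can be passed in directly, so a large log does not have to be loaded into memory first.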
ID: 3189 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3192 - Posted: 3 May 2016, 9:23:58 UTC
Last modified: 3 May 2016, 9:31:39 UTC

Log file size 45MB and rising.
2 other theory tasks have terminated.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=169350

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=169217

The internet disconnect was from 10:20 to 10:30 local time.
ID: 3192 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3193 - Posted: 3 May 2016, 9:35:53 UTC

Task appears to have recovered and started a new job.
ID: 3193 · Rating: 0
Ray Murray
Joined: 13 Apr 15
Posts: 138
Credit: 2,969,210
RAC: 0
Message 3201 - Posted: 3 May 2016, 19:31:57 UTC
Last modified: 3 May 2016, 19:34:52 UTC

This evening I snoozed BOINC for c. 30 mins while I had to use my computer for something other than BOINC (Crazy, I know. Who would do such a thing?). With LAIM set, all 4 VMs (2x Theory, 1x CMS, 1x Virtual (T4T)) showed in VBox as Paused.
On un-snooze:
The Theory, after a few minutes (I didn't time it), dumped the Job (within the Task) it was doing and got a new one.
The CMS processed 2 more events, then similarly dumped the Job and got a new one.
The new Jobs are running fine.
A quick look at the logs shows only finished logs for completed runs, with apparently NO log of the dumped Jobs (?), although StarLog shows a few lines:
Cron: Killing all jobs
CronJobList: Deleting all jobs

The T4T resumed its Job where it left off and carried on unaffected.
ID: 3201 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3258 - Posted: 5 May 2016, 9:44:48 UTC

Please try this with the Theory app. This is where I am working. But don't try until tomorrow as I still need to do an update.

(From CMS-thread, yesterday)

Is it ready to go?

Is there any way to see whether uploaded jobs have passed (similar to the dashboard)?

That way, I could confirm whether any interruption causes issues.
ID: 3258 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3267 - Posted: 5 May 2016, 14:07:53 UTC
Last modified: 5 May 2016, 14:12:46 UTC

Disconnecting the internet for 10 min has the following effect:

After the connection is restored, the job finishes.
The running.log and the last finished x.log are the same.

No new job is started (for at least 1 hour).

There is no "Job finished" entry in stderr.txt.

05/05/16 15:01:38 condor_write(): Socket closed when trying to write 365 bytes to <188.184.187.167:9618>, fd is 11
05/05/16 15:01:38 Buf::write(): condor_write() failed
05/05/16 15:01:38 Failed to send job exit status to shadow
05/05/16 15:01:38 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
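
To see which server the VM was failing to reach, the address can be pulled out of the failure lines. A minimal Python sketch follows, assuming that addresses appear as `<ip:port>` in angle brackets, as in the lines above. The function name `failing_endpoints` and the choice of "failed" and "Socket closed" as markers are my own assumptions, not anything Condor defines.

```python
import re
from collections import Counter

# Matches an IPv4 address and port in angle brackets, e.g. <188.184.187.167:9618>.
ENDPOINT = re.compile(r"<(\d{1,3}(?:\.\d{1,3}){3}):(\d+)>")

def failing_endpoints(lines):
    """Count (ip, port) pairs that appear on lines reporting a failure."""
    counts = Counter()
    for line in lines:
        if "failed" in line or "Socket closed" in line:
            for host, port in ENDPOINT.findall(line):
                counts[(host, int(port))] += 1
    return counts

sample = [
    "05/05/16 15:01:38 condor_write(): Socket closed when trying to write 365 bytes to <188.184.187.167:9618>, fd is 11",
    "05/05/16 15:01:38 Buf::write(): condor_write() failed",
    "05/05/16 15:01:38 Failed to send job exit status to shadow",
    "05/05/16 15:01:38 JobExit() failed, waiting for job lease to expire or for a reconnect attempt",
]
print(failing_endpoints(sample))
```

Lines that report a failure without naming an address are simply skipped, so the count reflects only the lines that identify a peer.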
ID: 3267 · Rating: 0
rbpeake

Joined: 15 Apr 15
Posts: 38
Credit: 227,251
RAC: 0
Message 3268 - Posted: 5 May 2016, 14:15:11 UTC - in response to Message 3267.  

Disconnecting the internet for 10 min has the following effect:

After the connection is restored, the job finishes.
The running.log and the last finished x.log are the same.

No new job is started (for at least 1 hour).

There is no "Job finished" entry in stderr.txt.

05/05/16 15:01:38 condor_write(): Socket closed when trying to write 365 bytes to <188.184.187.167:9618>, fd is 11
05/05/16 15:01:38 Buf::write(): condor_write() failed
05/05/16 15:01:38 Failed to send job exit status to shadow
05/05/16 15:01:38 JobExit() failed, waiting for job lease to expire or for a reconnect attempt

No new job started for CMS either.
ID: 3268 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 50
Message 3270 - Posted: 5 May 2016, 14:47:56 UTC

I've seen the same after a suspend.

The current job resumes normally and finishes (I'm not sure about a valid upload/result), but no new job arrives.

All 'nobody' processes are gone.
ID: 3270 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 3286 - Posted: 6 May 2016, 14:42:31 UTC - in response to Message 2701.  

Suspend/Resume should now work for up to 60 mins.
ID: 3286 · Rating: 0


©2024 CERN