Message boards : Theory Application : New Version v47.22

Laurence
Project administrator
Project developer
Project tester

Joined: 12 Sep 14
Posts: 1067
Credit: 329,531
RAC: 199
Message 3766 - Posted: 21 Jul 2016, 13:52:29 UTC

This version updates HTCondor to version 8.4.8, which supports the MAX_TIME_SKIP configuration attribute, so we no longer need a binary-patched library to support suspend/resume.
ID: 3766 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 846,901
RAC: 2,193
Message 3767 - Posted: 21 Jul 2016, 17:18:49 UTC

What are the limits to the suspend period?

I suspended a new v2.02 mt_mcore task with 4 jobs after just 8 minutes of run time.
I'll wait 1hr3m and see what happens with the jobs after the resume.
ID: 3767 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 846,901
RAC: 2,193
Message 3768 - Posted: 21 Jul 2016, 18:37:39 UTC - in response to Message 3767.  

After the resume (the task was suspended at 18:53 with LAIM off, so the VM was saved to a snapshot on disk), the 4 jobs carried on and the connection was re-established.

The StarterLog files for slots 1, 2, 3 and 4 all have similar contents:

07/21/16 18:45:23 (pid:4254) Create_Process succeeded, pid=4265
07/21/16 19:56:44 (pid:4254) CCBListener: no activity from CCB server in 3981s; assuming connection is dead.
07/21/16 19:56:44 (pid:4254) CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/21/16 19:56:44 (pid:4254) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
07/21/16 19:56:44 (pid:4254) IO: Failed to read packet header
07/21/16 19:56:44 (pid:4254) Lost connection to shadow, waiting 86300 secs for reconnect
07/21/16 19:57:45 (pid:4254) CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#1165792
07/21/16 20:01:56 (pid:4254) Accepted request to reconnect from <:0>
07/21/16 20:01:56 (pid:4254) Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_458838>
07/21/16 20:01:56 (pid:4254) Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_458838>
07/21/16 20:01:56 (pid:4254) Recovered connection to shadow after 312 seconds
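
For anyone who wants to compare these logs across slots, here is a minimal Python sketch that pulls the interesting numbers out of StarterLog lines like the ones above. It is only an illustration: the function names (parse_line, summarize) are made up for this example, and the regular expressions are modelled on the lines quoted in this post, not on any HTCondor log specification. The "largest gap" is simply the longest silence in the log, which is only an upper bound on the actual suspension.

```python
import re
from datetime import datetime, timedelta

# Pattern modelled on the lines quoted in this thread (an assumption, not a spec).
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")
CCB_IDLE_RE = re.compile(r"no activity from CCB server in (\d+)s")
RECOVERED_RE = re.compile(r"Recovered connection to shadow after (\d+) seconds")


def parse_line(line):
    """Split one log line into (timestamp, pid, message); return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    timestamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
    return timestamp, int(match.group(2)), match.group(3)


def summarize(lines):
    """Report the longest silence in the log plus the CCB idle and shadow recovery times."""
    entries = [entry for entry in map(parse_line, lines) if entry is not None]

    # Longest interval between two consecutive log entries.
    largest_gap = timedelta(0)
    for (earlier, _, _), (later, _, _) in zip(entries, entries[1:]):
        largest_gap = max(largest_gap, later - earlier)

    ccb_idle = recovered = None
    for _, _, message in entries:
        found = CCB_IDLE_RE.search(message)
        if found:
            ccb_idle = int(found.group(1))
        found = RECOVERED_RE.search(message)
        if found:
            recovered = int(found.group(1))

    return {
        "largest_gap": largest_gap,
        "ccb_idle_seconds": ccb_idle,
        "shadow_recovery_seconds": recovered,
    }


# A few lines copied from the log above.
sample = [
    "07/21/16 18:45:23 (pid:4254) Create_Process succeeded, pid=4265",
    "07/21/16 19:56:44 (pid:4254) CCBListener: no activity from CCB server in 3981s; assuming connection is dead.",
    "07/21/16 19:56:44 (pid:4254) Lost connection to shadow, waiting 86300 secs for reconnect",
    "07/21/16 20:01:56 (pid:4254) Recovered connection to shadow after 312 seconds",
]

summary = summarize(sample)
print(summary["largest_gap"])              # 1:11:21
print(summary["ccb_idle_seconds"])         # 3981
print(summary["shadow_recovery_seconds"])  # 312
```

Running the same function over each slot's log would show whether all four slots recovered in roughly the same time.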
ID: 3768 · Rating: 0
Laurence
Project administrator
Project developer
Project tester

Joined: 12 Sep 14
Posts: 1067
Credit: 329,531
RAC: 199
Message 3770 - Posted: 21 Jul 2016, 22:45:40 UTC - in response to Message 3767.  

The limit is 24h
ID: 3770 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3771 - Posted: 22 Jul 2016, 5:06:09 UTC
Last modified: 22 Jul 2016, 5:21:17 UTC

Since about 05:44 UTC there seem to be no jobs available.

BTW, BOINC tasks should not error out if no work is available.

EDIT: And this message shows at the top of stderr.

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
The ring 2 stack is in use.
(0xcf) - exit code 207 (0xcf)
</message>
ID: 3771 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 846,901
RAC: 2,193
Message 3772 - Posted: 22 Jul 2016, 5:21:07 UTC - in response to Message 3770.  
Last modified: 22 Jul 2016, 5:51:53 UTC

I did the 'overnight'-suspend test. The resume seems to be successful.

07/21/16 21:51:45 (pid:84374) Running job as user nobody
07/21/16 21:51:45 (pid:84374) Create_Process succeeded, pid=84384
07/22/16 07:07:51 (pid:84374) CCBListener: no activity from CCB server in 29767s; assuming connection is dead.
07/22/16 07:07:51 (pid:84374) CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/22/16 07:07:51 (pid:84374) condor_read() failed: recv(fd=10) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
07/22/16 07:07:51 (pid:84374) IO: Failed to read packet header
07/22/16 07:07:51 (pid:84374) Lost connection to shadow, waiting 86300 secs for reconnect
07/22/16 07:08:51 (pid:84374) CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#1168778
07/22/16 07:12:46 (pid:84374) Accepted request to reconnect from <:0>
07/22/16 07:12:46 (pid:84374) Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_459802>
07/22/16 07:12:46 (pid:84374) Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_459802>
07/22/16 07:12:46 (pid:84374) Recovered connection to shadow after 295 seconds


After the 4 jobs finished, no new job was requested. Maybe that is because no jobs were available, but more likely (it worked that way in the past) it is because the overnight wall-clock time is treated as elapsed time for the VM, so the 12 hours are over. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=226241
ID: 3772 · Rating: 0
Ben Segal
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 12 Sep 14
Posts: 65
Credit: 544
RAC: 0
Message 3773 - Posted: 22 Jul 2016, 6:55:02 UTC - in response to Message 3771.  

Since about 05:44 UTC there seem to be no jobs available.

BTW, BOINC tasks should not error out if no work is available.

EDIT: And this message shows at the top of stderr.

<core_client_version>7.6.22</core_client_version>
<![CDATA[
<message>
The ring 2 stack is in use.
(0xcf) - exit code 207 (0xcf)
</message>

Yes, there was a problem over the last few hours with both job supply and result processing on the Theory Application. This has now been fixed, so things should clear up steadily.

@CP, this also explains your lack of jobs after resuming...
ID: 3773 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 846,901
RAC: 2,193
Message 3774 - Posted: 22 Jul 2016, 9:02:19 UTC - in response to Message 3773.  

@CP, this also explains your lack of jobs after resuming...

Hi Ben,

I don't think so.
With a newly started task, I got new jobs right away.

07:53:14 +0200 2016-07-22 [INFO] New Job Starting in slot2
07:53:14 +0200 2016-07-22 [INFO] New Job Starting in slot1
07:57:12 +0200 2016-07-22 [INFO] New Job Starting in slot4
07:57:12 +0200 2016-07-22 [INFO] New Job Starting in slot3


As explained, the VM treats paused/suspended time as elapsed time.
A VM started at e.g. 06:00, paused at 07:00 and resumed at e.g. 18:30 will finish the currently saved jobs but
not request new ones, because the 12 hours are over; it then drains until the last job has finished (or until 18 hours are reached).

Not a real problem, but nice to know. At least VMs now survive longer periods of suspension. Well done.
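
To make the 06:00 / 07:00 / 18:30 example concrete, here is a small Python sketch of the arithmetic. It is only a model of the behaviour described in this post, not the actual logic inside the VM: the function name vm_state and the count_suspended switch are invented for illustration, and the 12-hour and 18-hour values are the ones mentioned in this thread.

```python
from datetime import datetime, timedelta

# Values quoted in this thread; how the VM really enforces them is not shown here.
JOB_REQUEST_WINDOW = timedelta(hours=12)  # after this, no new jobs are requested
HARD_LIMIT = timedelta(hours=18)          # after this, the task ends regardless


def vm_state(started, now, suspended=timedelta(0), count_suspended=True):
    """Classify the VM as 'accepting', 'draining' or 'ended'.

    With count_suspended=True (the behaviour described above), time spent
    suspended counts towards the limits. With False, it is subtracted first.
    """
    wall_clock = now - started
    elapsed = wall_clock if count_suspended else wall_clock - suspended
    if elapsed >= HARD_LIMIT:
        return "ended"
    if elapsed >= JOB_REQUEST_WINDOW:
        return "draining"
    return "accepting"


# The example from the post: started 06:00, paused 07:00, resumed 18:30.
start = datetime(2016, 7, 22, 6, 0)
pause = datetime(2016, 7, 22, 7, 0)
resume = datetime(2016, 7, 22, 18, 30)
suspended_for = resume - pause  # 11 h 30 min

# Wall clock is 12 h 30 min, so the VM only drains its saved jobs.
print(vm_state(start, resume))  # draining

# If suspended time were not counted, only 1 h would have elapsed.
print(vm_state(start, resume, suspended=suspended_for, count_suspended=False))  # accepting
```

The second call shows what would change if suspended time were excluded: the same VM would still have 11 hours left in which to request new jobs.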
ID: 3774 · Rating: 0
Laurence
Project administrator
Project developer
Project tester

Joined: 12 Sep 14
Posts: 1067
Credit: 329,531
RAC: 199
Message 3775 - Posted: 22 Jul 2016, 9:50:52 UTC - in response to Message 3774.  

I will put the new image into production as soon as I can.
ID: 3775 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3776 - Posted: 22 Jul 2016, 10:00:51 UTC - in response to Message 3775.  

Please defragment the image before deployment (for conventional HDDs).

Obviously, this only helps if the volunteer keeps the image defragmented as well.
ID: 3776 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3780 - Posted: 22 Jul 2016, 13:16:18 UTC

<app_version>
<app_name>Theory</app_name>
<plan_class>vbox64_mt_mcore</plan_class>
<avg_ncpus>c</avg_ncpus>
<cmdline>--nthreads t</cmdline>
<cmdline>--memory_size_mb mmm</cmdline>
</app_version>


This has absolutely no effect.
I guess the wrapper cannot handle this.
ID: 3780 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 846,901
RAC: 2,193
Message 3781 - Posted: 22 Jul 2016, 14:49:14 UTC - in response to Message 3775.  

I will put the new image into production as soon as I can.

I repeated my longer suspend test with about 4 hours of suspension, but this time all 4 saved jobs were aborted and new ones were started.

07/22/16 11:01:17 (pid:123913) Create_Process succeeded, pid=123923
07/22/16 16:24:33 (pid:123913) condor_read() failed: recv(fd=8) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from <188.184.187.167:9618>.
07/22/16 16:24:33 (pid:123913) IO: Failed to read packet header
07/22/16 16:24:33 (pid:123913) Lost connection to shadow, waiting 86300 secs for reconnect
07/22/16 16:24:33 (pid:123913) CCBListener: no activity from CCB server in 12484s; assuming connection is dead.
07/22/16 16:24:33 (pid:123913) CCBListener: connection to CCB server alicondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/22/16 16:25:34 (pid:123913) CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#1184889
07/22/16 16:29:47 (pid:123913) Accepted request to reconnect from <:0>
07/22/16 16:29:47 (pid:123913) Ignoring old shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_466837>
07/22/16 16:29:47 (pid:123913) Communicating with shadow <188.184.187.167:9618?addrs=188.184.187.167-9618&noUDP&sock=548029_c1e6_466837>
07/22/16 16:29:47 (pid:123913) Recovered connection to shadow after 314 seconds
07/22/16 16:29:47 (pid:123913) Got SIGTERM. Performing graceful shutdown.
07/22/16 16:29:47 (pid:123913) ShutdownGraceful all jobs.
07/22/16 16:29:47 (pid:123913) Process exited, pid=123923, signal=15
ID: 3781 · Rating: 0
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 181
Message 3782 - Posted: 22 Jul 2016, 15:14:59 UTC - in response to Message 3774.  


As explained, the VM treats paused/suspended time as elapsed time.
A VM started at e.g. 06:00, paused at 07:00 and resumed at e.g. 18:30 will finish the currently saved jobs but
not request new ones, because the 12 hours are over; it then drains until the last job has finished (or until 18 hours are reached).

Not a real problem, but nice to know. At least VMs now survive longer periods of suspension. Well done.


It's a problem here, admittedly much worse with CMS than with Theory. If a VM starts but hasn't completed a job before the host shuts down, the restart is >18h later and the task often fails with the "Condor exited without running a job" error (206). Like this.
ID: 3782 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3783 - Posted: 22 Jul 2016, 18:29:10 UTC

Again, no jobs available since about 17:00 UTC.
ID: 3783 · Rating: 0



©2024 CERN