Message boards : CMS Application : Jobs terminating immediately
Message board moderation

To post messages, you must log in.

AuthorMessage
m
Volunteer tester

Send message
Joined: 20 Mar 15
Posts: 243
Credit: 875,205
RAC: 450
Message 3120 - Posted: 30 Apr 2016, 10:33:33 UTC
Last modified: 30 Apr 2016, 11:01:58 UTC

On Windows hosts, jobs terminate as soon as they start:-

2016-04-30 01:22:13 (2352): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev2016-04-30 01:22:25 (2352): Guest Log: [INFO] CMS application starting. Check log files.
2016-04-30 01:22:25 (2352): Guest Log: [INFO] Condor exited with 0
2016-04-30 01:22:25 (2352): Guest Log: [INFO] Shutting Down.
2016-04-30 01:22:25 (2352): VM Completion File Detected.
2016-04-30 01:22:25 (2352): VM Completion Message: Condor exited with 0
.
2016-04-30 01:22:25 (2352): Powering off VM.


I've reset the project in case it was a corrupted VM (although there are two affected hosts). Linux not affected. Not sure ('cos I was changing other things as well), but I think it also occurred with v47.01.

Edit;- This is Win7 - don't have any other version, VBox 4.3.30 and 4.3.26. BOINC 7.6.22, the unaffected Linux hosts are VBox 4.3.26 and BOINC 7.4.25.
ID: 3120 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3121 - Posted: 30 Apr 2016, 10:59:46 UTC - in response to Message 3120.  

I do not know, but your vbox version seems very old.
I would try something 5.0.x .
ID: 3121 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
m
Volunteer tester

Send message
Joined: 20 Mar 15
Posts: 243
Credit: 875,205
RAC: 450
Message 3134 - Posted: 1 May 2016, 1:13:44 UTC - in response to Message 3121.  
Last modified: 1 May 2016, 1:19:06 UTC

Never thought it might be VBox... thanks for the prod. Re-installed 4.3.26 (to avoid having to download another version) Now running OK... at this end anyway.
ID: 3134 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Ray Murray
Avatar

Send message
Joined: 13 Apr 15
Posts: 138
Credit: 2,945,852
RAC: 1
Message 3138 - Posted: 1 May 2016, 9:46:50 UTC

Is this maybe just the early finish of a Task when there are no Jobs available, so that Tasks aren't idling unnecessarily, which has been working with Theory, now being implemented here?
ID: 3138 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 3139 - Posted: 1 May 2016, 9:50:06 UTC - in response to Message 3138.  

Is this maybe just the early finish of a Task when there are no Jobs available, so that Tasks aren't idling unnecessarily, which has been working with Theory, now being implemented here?

Very likely: Queued: 0 Running: 20
ID: 3139 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3140 - Posted: 1 May 2016, 9:50:49 UTC - in response to Message 3138.  

I think, you are correct.
We are out of jobs.
As all are failing, probably not so bad.
ID: 3140 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3141 - Posted: 1 May 2016, 9:54:31 UTC - in response to Message 3140.  

Oops, my mistake I think. I was still out of sorts yesterday after Friday's misadventures and forgot to renew the 7-day proxy on the last batch of jobs. I'll go and submit a new batch immediately, hopefully we'll be under way again in about 20 minutes.
ID: 3141 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 3142 - Posted: 1 May 2016, 10:16:31 UTC - in response to Message 3141.  

Better now?
ID: 3142 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
m
Volunteer tester

Send message
Joined: 20 Mar 15
Posts: 243
Credit: 875,205
RAC: 450
Message 3144 - Posted: 1 May 2016, 10:21:26 UTC - in response to Message 3138.  
Last modified: 1 May 2016, 10:25:08 UTC

Is this maybe just the early finish of a Task when there are no Jobs available, so that Tasks aren't idling unnecessarily, which has been working with Theory, now being implemented here?


Don't think so, at least not to start with. Not all hosts were affected and the "CMS Jobs" link still showed work. I didn't install all the VBox components originally whereas I did on the re-install so maybe something that didn't affect v47.01 was needed for v47.02.

It would help if an explicit message could be generated in stderr to indicate "no work" as a likely cause. Better still, since volunteers probably don't have significant work buffers for these projects, if a link could be established between the VM job queue and the supply of BOINC tasks so that no VM work = no BOINC work and the client could behave in it's normal way and re-allocate slots.
ID: 3144 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
m
Volunteer tester

Send message
Joined: 20 Mar 15
Posts: 243
Credit: 875,205
RAC: 450
Message 3169 - Posted: 2 May 2016, 19:01:33 UTC - in response to Message 3144.  
Last modified: 2 May 2016, 19:07:52 UTC

Is this maybe just the early finish of a Task when there are no Jobs available, so that Tasks aren't idling unnecessarily, which has been working with Theory, now being implemented here?

Now I think you were right, Ray. I think I know the root cause but not how it happens. The client thinks that there are no jobs... but there are. Can one of the experts please explain exactly how the client knows that there are no jobs. For example, if it's a timeout, when does it start and how long is it. Perhaps the new exit codes will help, but they need to appear in stderr.

I caught this message on the F1 console for a recent failure, although it's not always present.

ID: 3169 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 3171 - Posted: 2 May 2016, 19:16:08 UTC - in response to Message 3169.  
Last modified: 2 May 2016, 19:18:11 UTC

I think the error you reported has already been fixed today.

The lifetime of the task should be 12 hours, i.e it should accept new jobs up to that deadline. If it already has a job it may run over that deadline for up to 18 hours. If it does not receive a new task within 5 minutes of the previous one, it will shut down with success. If a new task starts up and does not receive any job it will shut down after 5 minutes with a failure.


EDIT: So if a task runs a few jobs, finishes the last one and shuts down, don't worry. If a task only lasts for less than 20 mins and does not give an informative error message, worry. :)
ID: 3171 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
m
Volunteer tester

Send message
Joined: 20 Mar 15
Posts: 243
Credit: 875,205
RAC: 450
Message 3175 - Posted: 2 May 2016, 20:15:17 UTC - in response to Message 3171.  

So if a task runs a few jobs, finishes the last one and shuts down, don't worry. If a task only lasts for less than 20 mins and does not give an informative error message, worry.

Is this from now on, or can I apply this to previous tasks, such as these? . The root cause, I think, is a loaded ADSL connection, and peaks of project traffic, for example when machines start or recover from a hiatus such as proxy expiry. This has already been commented upon by Ivan. I don't know if it's simply slow data, upload, download or dropped packets or whatever. In any event, it looks to me at the moment that the project needs to be more tolerant of the effect (slow?) connections have on traffic peaks; especially in a production environment.
ID: 3175 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 3178 - Posted: 2 May 2016, 20:24:00 UTC - in response to Message 3175.  

From now on.

Please bear in mind that some of these workloads are resource intensive, which is one of the reasons this was not done 10 years ago. Yes, the system should be robust but there will be a certain performance threshold that is required.
ID: 3178 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
m
Volunteer tester

Send message
Joined: 20 Mar 15
Posts: 243
Credit: 875,205
RAC: 450
Message 3180 - Posted: 2 May 2016, 20:39:52 UTC - in response to Message 3178.  

From now on.


OK, thanks.
ID: 3180 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 249
Message 3181 - Posted: 2 May 2016, 20:56:27 UTC - in response to Message 3120.  

I think there were some issues on Saturday but I see similar issues from today.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=168949

2016-05-02 05:39:36 (826): Guest Log: [INFO] CMS application starting. Check log files.
2016-05-02 05:39:43 (826): Guest Log: [INFO] Condor exited with 0


What does it say in the MasterLog and StartLog?
ID: 3181 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
m
Volunteer tester

Send message
Joined: 20 Mar 15
Posts: 243
Credit: 875,205
RAC: 450
Message 3182 - Posted: 2 May 2016, 21:58:54 UTC - in response to Message 3181.  

I think there were some issues on Saturday but I see similar issues from today.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=168949

2016-05-02 05:39:36 (826): Guest Log: [INFO] CMS application starting. Check log files.
2016-05-02 05:39:43 (826): Guest Log: [INFO] Condor exited with 0


What does it say in the MasterLog and StartLog?


MasterLog contains (along with much else)

05/02/16 18:55:13 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1391288830)
05/02/16 18:55:13 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 4173
05/02/16 18:55:26 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.2.15,10.0.2.15, hostname size = 1, original ip address = 10.0.2.15
05/02/16 19:51:58 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
05/02/16 19:52:03 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
05/02/16 19:55:13 Preen pid is 4982
05/02/16 19:55:14 DefaultReaper unexpectedly called on pid 4982, status 0.
05/02/16 20:24:13 CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9623
05/02/16 20:24:13 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
05/02/16 20:24:23 condor_write(): Socket closed when trying to write 1023 bytes to collector lcggwms02.gridpp.rl.ac.uk:9623, fd is 10


and the similar period from StartLog:-

05/02/16 18:56:26 Got activate_claim request from shadow (130.246.180.120)
05/02/16 18:56:26 Remote job ID is 1118866.0
05/02/16 18:56:26 Got universe "VANILLA" (5) from request classad
05/02/16 18:56:26 State change: claim-activation protocol successful
05/02/16 18:56:26 Changing activity: Idle -> Busy
05/02/16 18:56:30 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.2.15,10.0.2.15, hostname size = 1, original ip address = 10.0.2.15
05/02/16 19:51:58 condor_write(): Socket closed when trying to write 53 bytes to daemon at <10.0.2.15:58279>, fd is 8
05/02/16 19:51:58 Buf::write(): condor_write() failed
05/02/16 19:51:58 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.0.2.15:58279> (try 1 of 3): CEDAR:6002:failed to send EOM
05/02/16 19:58:22 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
05/02/16 19:58:27 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
05/02/16 20:24:13 CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9623
05/02/16 20:24:13 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
05/02/16 20:24:31 condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9623, fd is 9
05/02/16 20:24:31 Buf::write(): condor_write() failed
05/02/16 20:25:14 CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#50623
05/02/16 20:48:30 condor_write(): Socket closed when trying to write 53 bytes to daemon at <10.0.2.15:58279>, fd is 9
05/02/16 20:48:30 Buf::write(): condor_write() failed
05/02/16 20:48:30 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.0.2.15:58279> (try 1 of 3): CEDAR:6002:failed to send EOM
05/02/16 20:48:35 condor_write(): Socket closed when trying to write 53 bytes to daemon at <10.0.2.15:58279>, fd is 9
05/02/16 20:48:35 Buf::write(): condor_write() failed
05/02/16 20:48:35 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.0.2.15:58279> (try 2 of 3): CEDAR:6002:failed to send EOM|CEDAR:6002:failed to send EOM
05/02/16 20:48:40 condor_write(): Socket closed when trying to write 53 bytes to daemon at <10.0.2.15:58279>, fd is 9
05/02/16 20:48:40 Buf::write(): condor_write() failed
05/02/16 20:48:40 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.0.2.15:58279> (try 3 of 3): CEDAR:6002:failed to send EOM|CEDAR:6002:failed to send EOM|CEDAR:6002:failed to send EOM
05/02/16 21:00:14 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason
05/02/16 21:00:19 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason


Is this what you mean?
ID: 3182 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3293 - Posted: 9 May 2016, 11:20:17 UTC
Last modified: 9 May 2016, 11:23:58 UTC

I have noticed, that when a task ends, it does not allways get a good tasks immediatly.
There can be one or two 7 min tasks, before it starts a normal one.
Jobs should have been present,so there was no reason, not to start proplerly.
(I had this happen twice)

I would also like to have the Job Ids back in the stderr. The Condor Ids are not much use on dashboard.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=172743

There is still no usful exit reason in the log.
ID: 3293 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3485 - Posted: 24 May 2016, 13:25:39 UTC

It seems, that if a waiting task is suspended, and resumed and eventually starts, it will end after a few minutes.
More than that, any task, that is suspened while not fully booted to actually run a job, will end after 6 to 7 minutes.
ID: 3485 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote

Message boards : CMS Application : Jobs terminating immediately


©2024 CERN