Thread 'Jobs terminating immediately'

Author	Message
m Volunteer tester Send message Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 3120 - Posted: 30 Apr 2016, 10:33:33 UTC Last modified: 30 Apr 2016, 11:01:58 UTC On Windows hosts, jobs terminate as soon as they start:- 2016-04-30 01:22:13 (2352): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev2016-04-30 01:22:25 (2352): Guest Log: [INFO] CMS application starting. Check log files. 2016-04-30 01:22:25 (2352): Guest Log: [INFO] Condor exited with 0 2016-04-30 01:22:25 (2352): Guest Log: [INFO] Shutting Down. 2016-04-30 01:22:25 (2352): VM Completion File Detected. 2016-04-30 01:22:25 (2352): VM Completion Message: Condor exited with 0 . 2016-04-30 01:22:25 (2352): Powering off VM. I've reset the project in case it was a corrupted VM (although there are two affected hosts). Linux not affected. Not sure ('cos I was changing other things as well), but I think it also occurred with v47.01. Edit;- This is Win7 - don't have any other version, VBox 4.3.30 and 4.3.26. BOINC 7.6.22, the unaffected Linux hosts are VBox 4.3.26 and BOINC 7.4.25. ID: 3120 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3121 - Posted: 30 Apr 2016, 10:59:46 UTC - in response to Message 3120. I do not know, but your vbox version seems very old. I would try something 5.0.x . ID: 3121 · Rating: 0 · rate: / Reply Quote

m Volunteer tester Send message Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 3134 - Posted: 1 May 2016, 1:13:44 UTC - in response to Message 3121. Last modified: 1 May 2016, 1:19:06 UTC Never thought it might be VBox... thanks for the prod. Re-installed 4.3.26 (to avoid having to download another version) Now running OK... at this end anyway. ID: 3134 · Rating: 0 · rate: / Reply Quote

Ray Murray Send message Joined: 13 Apr 15 Posts: 138 Credit: 3,015,630 RAC: 0	Message 3138 - Posted: 1 May 2016, 9:46:50 UTC Is this maybe just the early finish of a Task when there are no Jobs available, so that Tasks aren't idling unnecessarily, which has been working with Theory, now being implemented here? ID: 3138 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1258 Credit: 1,022,289 RAC: 164	Message 3139 - Posted: 1 May 2016, 9:50:06 UTC - in response to Message 3138. Is this maybe just the early finish of a Task when there are no Jobs available, so that Tasks aren't idling unnecessarily, which has been working with Theory, now being implemented here? Very likely: Queued: 0 Running: 20 ID: 3139 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3140 - Posted: 1 May 2016, 9:50:49 UTC - in response to Message 3138. I think, you are correct. We are out of jobs. As all are failing, probably not so bad. ID: 3140 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1155 Credit: 8,357,252 RAC: 547	Message 3141 - Posted: 1 May 2016, 9:54:31 UTC - in response to Message 3140. Oops, my mistake I think. I was still out of sorts yesterday after Friday's misadventures and forgot to renew the 7-day proxy on the last batch of jobs. I'll go and submit a new batch immediately, hopefully we'll be under way again in about 20 minutes. ID: 3141 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1155 Credit: 8,357,252 RAC: 547	Message 3142 - Posted: 1 May 2016, 10:16:31 UTC - in response to Message 3141. Better now? ID: 3142 · Rating: 0 · rate: / Reply Quote

m Volunteer tester Send message Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 3144 - Posted: 1 May 2016, 10:21:26 UTC - in response to Message 3138. Last modified: 1 May 2016, 10:25:08 UTC Is this maybe just the early finish of a Task when there are no Jobs available, so that Tasks aren't idling unnecessarily, which has been working with Theory, now being implemented here? Don't think so, at least not to start with. Not all hosts were affected and the "CMS Jobs" link still showed work. I didn't install all the VBox components originally whereas I did on the re-install so maybe something that didn't affect v47.01 was needed for v47.02. It would help if an explicit message could be generated in stderr to indicate "no work" as a likely cause. Better still, since volunteers probably don't have significant work buffers for these projects, if a link could be established between the VM job queue and the supply of BOINC tasks so that no VM work = no BOINC work and the client could behave in it's normal way and re-allocate slots. ID: 3144 · Rating: 0 · rate: / Reply Quote

m Volunteer tester Send message Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 3169 - Posted: 2 May 2016, 19:01:33 UTC - in response to Message 3144. Last modified: 2 May 2016, 19:07:52 UTC Is this maybe just the early finish of a Task when there are no Jobs available, so that Tasks aren't idling unnecessarily, which has been working with Theory, now being implemented here? Now I think you were right, Ray. I think I know the root cause but not how it happens. The client thinks that there are no jobs... but there are. Can one of the experts please explain exactly how the client knows that there are no jobs. For example, if it's a timeout, when does it start and how long is it. Perhaps the new exit codes will help, but they need to appear in stderr. I caught this message on the F1 console for a recent failure, although it's not always present. ID: 3169 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1150 Credit: 342,328 RAC: 0	Message 3171 - Posted: 2 May 2016, 19:16:08 UTC - in response to Message 3169. Last modified: 2 May 2016, 19:18:11 UTC I think the error you reported has already been fixed today. The lifetime of the task should be 12 hours, i.e it should accept new jobs up to that deadline. If it already has a job it may run over that deadline for up to 18 hours. If it does not receive a new task within 5 minutes of the previous one, it will shut down with success. If a new task starts up and does not receive any job it will shut down after 5 minutes with a failure. EDIT: So if a task runs a few jobs, finishes the last one and shuts down, don't worry. If a task only lasts for less than 20 mins and does not give an informative error message, worry. :) ID: 3171 · Rating: 0 · rate: / Reply Quote

m Volunteer tester Send message Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 3175 - Posted: 2 May 2016, 20:15:17 UTC - in response to Message 3171. So if a task runs a few jobs, finishes the last one and shuts down, don't worry. If a task only lasts for less than 20 mins and does not give an informative error message, worry. Is this from now on, or can I apply this to previous tasks, such as these? . The root cause, I think, is a loaded ADSL connection, and peaks of project traffic, for example when machines start or recover from a hiatus such as proxy expiry. This has already been commented upon by Ivan. I don't know if it's simply slow data, upload, download or dropped packets or whatever. In any event, it looks to me at the moment that the project needs to be more tolerant of the effect (slow?) connections have on traffic peaks; especially in a production environment. ID: 3175 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1150 Credit: 342,328 RAC: 0	Message 3178 - Posted: 2 May 2016, 20:24:00 UTC - in response to Message 3175. From now on. Please bear in mind that some of these workloads are resource intensive, which is one of the reasons this was not done 10 years ago. Yes, the system should be robust but there will be a certain performance threshold that is required. ID: 3178 · Rating: 0 · rate: / Reply Quote

m Volunteer tester Send message Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 3180 - Posted: 2 May 2016, 20:39:52 UTC - in response to Message 3178. From now on. OK, thanks. ID: 3180 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1150 Credit: 342,328 RAC: 0	Message 3181 - Posted: 2 May 2016, 20:56:27 UTC - in response to Message 3120. I think there were some issues on Saturday but I see similar issues from today. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=168949 2016-05-02 05:39:36 (826): Guest Log: [INFO] CMS application starting. Check log files. 2016-05-02 05:39:43 (826): Guest Log: [INFO] Condor exited with 0 What does it say in the MasterLog and StartLog? ID: 3181 · Rating: 0 · rate: / Reply Quote

m Volunteer tester Send message Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 3182 - Posted: 2 May 2016, 21:58:54 UTC - in response to Message 3181. I think there were some issues on Saturday but I see similar issues from today. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=168949 2016-05-02 05:39:36 (826): Guest Log: [INFO] CMS application starting. Check log files. 2016-05-02 05:39:43 (826): Guest Log: [INFO] Condor exited with 0 What does it say in the MasterLog and StartLog? MasterLog contains (along with much else) 05/02/16 18:55:13 Master restart (GRACEFUL) is watching /usr/sbin/condor_master (mtime:1391288830) 05/02/16 18:55:13 Started DaemonCore process "/usr/sbin/condor_startd", pid and pgroup = 4173 05/02/16 18:55:26 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.2.15,10.0.2.15, hostname size = 1, original ip address = 10.0.2.15 05/02/16 19:51:58 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason 05/02/16 19:52:03 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason 05/02/16 19:55:13 Preen pid is 4982 05/02/16 19:55:14 DefaultReaper unexpectedly called on pid 4982, status 0. 05/02/16 20:24:13 CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9623 05/02/16 20:24:13 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds. 05/02/16 20:24:23 condor_write(): Socket closed when trying to write 1023 bytes to collector lcggwms02.gridpp.rl.ac.uk:9623, fd is 10 and the similar period from StartLog:- 05/02/16 18:56:26 Got activate_claim request from shadow (130.246.180.120) 05/02/16 18:56:26 Remote job ID is 1118866.0 05/02/16 18:56:26 Got universe "VANILLA" (5) from request classad 05/02/16 18:56:26 State change: claim-activation protocol successful 05/02/16 18:56:26 Changing activity: Idle -> Busy 05/02/16 18:56:30 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: DAEMON authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.2.15,10.0.2.15, hostname size = 1, original ip address = 10.0.2.15 05/02/16 19:51:58 condor_write(): Socket closed when trying to write 53 bytes to daemon at <10.0.2.15:58279>, fd is 8 05/02/16 19:51:58 Buf::write(): condor_write() failed 05/02/16 19:51:58 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.0.2.15:58279> (try 1 of 3): CEDAR:6002:failed to send EOM 05/02/16 19:58:22 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason 05/02/16 19:58:27 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason 05/02/16 20:24:13 CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9623 05/02/16 20:24:13 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds. 05/02/16 20:24:31 condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9623, fd is 9 05/02/16 20:24:31 Buf::write(): condor_write() failed 05/02/16 20:25:14 CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#50623 05/02/16 20:48:30 condor_write(): Socket closed when trying to write 53 bytes to daemon at <10.0.2.15:58279>, fd is 9 05/02/16 20:48:30 Buf::write(): condor_write() failed 05/02/16 20:48:30 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.0.2.15:58279> (try 1 of 3): CEDAR:6002:failed to send EOM 05/02/16 20:48:35 condor_write(): Socket closed when trying to write 53 bytes to daemon at <10.0.2.15:58279>, fd is 9 05/02/16 20:48:35 Buf::write(): condor_write() failed 05/02/16 20:48:35 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.0.2.15:58279> (try 2 of 3): CEDAR:6002:failed to send EOM\|CEDAR:6002:failed to send EOM 05/02/16 20:48:40 condor_write(): Socket closed when trying to write 53 bytes to daemon at <10.0.2.15:58279>, fd is 9 05/02/16 20:48:40 Buf::write(): condor_write() failed 05/02/16 20:48:40 ChildAliveMsg: failed to send DC_CHILDALIVE to parent daemon at <10.0.2.15:58279> (try 3 of 3): CEDAR:6002:failed to send EOM\|CEDAR:6002:failed to send EOM\|CEDAR:6002:failed to send EOM 05/02/16 21:00:14 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason 05/02/16 21:00:19 PERMISSION DENIED to condor@178-1031-8726 from host 10.0.2.15 for command 60008 (DC_CHILDALIVE), access level DAEMON: reason: cached result for DAEMON; see first case for the full reason Is this what you mean? ID: 3182 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3293 - Posted: 9 May 2016, 11:20:17 UTC Last modified: 9 May 2016, 11:23:58 UTC I have noticed, that when a task ends, it does not allways get a good tasks immediatly. There can be one or two 7 min tasks, before it starts a normal one. Jobs should have been present,so there was no reason, not to start proplerly. (I had this happen twice) I would also like to have the Job Ids back in the stderr. The Condor Ids are not much use on dashboard. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=172743 There is still no usful exit reason in the log. ID: 3293 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 3485 - Posted: 24 May 2016, 13:25:39 UTC It seems, that if a waiting task is suspended, and resumed and eventually starts, it will end after a few minutes. More than that, any task, that is suspened while not fully booted to actually run a job, will end after 6 to 7 minutes. ID: 3485 · Rating: 0 · rate: / Reply Quote

Development for LHC@home