Thread 'Houston, we have a problem'

Author	Message
Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,438 RAC: 61
Message 977 - Posted: 3 Sep 2015, 8:00:47 UTC Last modified: 3 Sep 2015, 8:47:51 UTC
Hi Ivan, Laurence,
The problem is: every time a job is paused because BOINC or the user suspends the BOINC task, the current VM job will not finish. It will be killed by condor_master because of the (too) short timeouts. I don't know how long a pause is accepted, but IMO the limit is a bit short: less than an hour, I think.
It's the nature of BOINC that tasks are paused automatically:
- Another project has high-priority tasks.
- Another project has had too little CPU time compared with its resource share.
- BOINC pauses everything because the user has set the preference to suspend while the computer is in use.
- BOINC pauses because an executable from the exclusion list starts running.
- etc.
It doesn't matter whether BOINC keeps the tasks in memory or not. So you have to consider longer timeouts or accept a number of incomplete jobs.
CP
The system clocked jumped 4425 seconds unexpectedly. Restarting all daemons

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 7
Message 979 - Posted: 3 Sep 2015, 9:46:00 UTC - in response to Message 977. Last modified: 3 Sep 2015, 9:59:49 UTC
Hmm, OK, I've set the timeout to 7200 seconds -- it was 1200 by default. I can try increasing it for the next batch.
Hmm, I think I can set it on the fly. [checks] Actually, it looks like I applied the patch too early, or there is some other bug: some jobs had a Lease of 1200 -- I've increased them all to 7200. I hope this condor_qedit patch doesn't apply just to the 1000 jobs currently in the active queue! I'll have to remember to keep an eye on that in future.
Gah! It looks like it does. I'll have to set up a cron job. :-(
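The periodic re-patching mentioned above could be sketched roughly as follows. This is only an illustration, not the script the project used: it assumes the lease is stored in the job attribute JobLeaseDuration and that condor_qedit is called with a -constraint expression followed by an attribute name and value. The function name build_qedit_command is hypothetical, and the sketch prints the command instead of running it.

```python
import shlex


def build_qedit_command(new_lease_s=7200):
    """Build a condor_qedit call that raises short job leases.

    Only jobs whose JobLeaseDuration is below new_lease_s are selected,
    so running the command repeatedly (for example from cron) leaves
    jobs that were already patched untouched.
    """
    constraint = f"JobLeaseDuration < {new_lease_s}"
    return [
        "condor_qedit",
        "-constraint", constraint,            # which queued jobs to edit
        "JobLeaseDuration", str(new_lease_s)  # attribute name and new value
    ]


if __name__ == "__main__":
    # Print the shell command rather than executing it, so the sketch
    # is safe to try on a machine without HTCondor installed.
    print(shlex.join(build_qedit_command()))
```

A cron entry would then run the printed command at whatever interval matches how often new jobs enter the active queue.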

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,438 RAC: 61
Message 983 - Posted: 3 Sep 2015, 11:52:51 UTC
Thanks for the clarification, Ivan.
I didn't expect to lose jobs during the '24hrs' run of one BOINC task, only one at the very end, when BOINC destroys the virtual machine. The saved state of the VM when a CMS task is suspended is not as 'safe' as one would expect, because the saved cmsRun can be killed by its master after the VM is restored.
It's good that you've increased the allowed timeout to reduce the loss, but BOINC users should be aware that this project is not like VirtualLHC@home.
Shorter BOINC tasks (e.g. with a 1-day deadline) and shorter CMS jobs would also reduce the loss here, but that conflicts with the large number of MB that must be downloaded at each VM creation and first cmsRun.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 7
Message 985 - Posted: 3 Sep 2015, 13:40:06 UTC - in response to Message 983.
Yes, there are problems with jobs staying out of touch too long and timing out; it's an inherently different model from most other BOINC projects. However, we're heartened by the efficiency we get even so.

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0
Message 988 - Posted: 3 Sep 2015, 20:35:26 UTC
What efficiency are you getting? How are you measuring this? It would be nice, but maybe not practicable, to have access to the numbers of good jobs/bad jobs/lost jobs on an individual host basis.
It rather looks as though this project really needs hosts to be running continuously and swap projects more often than (now) every two hours. The default for this is every hour but can be set much higher; five hours here. Not sure about the last bit, think about it.
Many (most, probably) will be shut down overnight (overday for me), so the daily loss of the running jobs together with the protracted start-up process surely represents a significant waste.
Seems about as far from the original philosophy behind BOINC as you can get... but that's for another time.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,438 RAC: 61
Message 990 - Posted: 4 Sep 2015, 7:53:10 UTC - in response to Message 979. Last modified: 4 Sep 2015, 7:57:43 UTC
Quoting ivan (Message 979): Hmm, OK, I've set the timeout to 7200 seconds -- it was 1200 by default. I can try increasing it for the next batch. Hmm, I think I can set it on the fly. [checks] Actually, it looks like I applied the patch too early, or some other bug, some jobs had a Lease of 1200 -- I've increased them all to 7200. I hope this condor_qedit patch doesn't apply just to the 1000 jobs currently in the active queue! I'll have to remember to keep an eye on that in future. Gah! It looks like it does. I'll have to set up a cron job. :-(
It doesn't seem to work.
BoincLog:
5276 CMS-dev 04 Sep 09:05:14 task CMS_6300_1427806802.579772_0 suspended by user
5282 CMS-dev 04 Sep 09:22:00 task CMS_6300_1427806802.579772_0 resumed by user
After BOINC had a thread available and resumed the CMS task:
MasterLog:
09/04/15 09:28:37 (pid:7599) The system clocked jumped 1466 seconds unexpectedly. Restarting all daemons
09/04/15 09:28:37 (pid:7599) Sent SIGTERM to STARTD (pid 7602)
09/04/15 09:28:38 (pid:7599) CCBListener: no activity from CCB server in 1567s; assuming connection is dead.
09/04/15 09:28:38 (pid:7599) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9620 failed; will try to reconnect in 60 seconds.
09/04/15 09:29:38 (pid:7599) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#60215
09/04/15 09:30:37 (pid:7599) Timeout for graceful shutdown has expired for STARTD.
09/04/15 09:30:37 (pid:7599) Sent SIGQUIT to STARTD (pid 7602)
09/04/15 09:30:37 (pid:7599) AllReaper unexpectedly called on pid 7602, status 0.
09/04/15 09:30:37 (pid:7599) The STARTD (pid 7602) exited with status 0
09/04/15 09:30:37 (pid:7599) All daemons are gone. Restarting.
09/04/15 09:30:37 (pid:7599) Restarting master right away.
The end of cmsRun-stdout.log:
Begin processing the 1st record. Run 1, Event 111176, LumiSection 4448 at 04-Sep-2015 08:58:16.856 CEST
G4Fragment::CalculateExcitationEnergy(): WARNING Fragment: A = 26, Z = 12, U = -1.360e+00 MeV IsStable= 1 P = (6.039e+01,2.546e+02,-3.426e+01) MeV E = 2.420e+04 MeV
Begin processing the 2nd record. Run 1, Event 111177, LumiSection 4448 at 04-Sep-2015 08:59:43.771 CEST
Begin processing the 3rd record. Run 1, Event 111178, LumiSection 4448 at 04-Sep-2015 09:02:06.471 CEST
Begin processing the 4th record. Run 1, Event 111179, LumiSection 4448 at 04-Sep-2015 09:29:01.689 CEST
Thereafter a new cmsRun started, logging into a new directory.
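For anyone who wants to check their own MasterLog for the same symptom, a small parser along these lines might help. It is a sketch that assumes the log lines have exactly the layout quoted in this post (date, time, pid, then the "clocked jumped" message); the names JUMP_RE and find_clock_jumps are made up for the example.

```python
import re

# Matches lines such as:
# 09/04/15 09:28:37 (pid:7599) The system clocked jumped 1466 seconds unexpectedly.
JUMP_RE = re.compile(
    r"^(?P<stamp>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(?P<pid>\d+)\) "
    r"The system clocked jumped (?P<seconds>-?\d+) seconds unexpectedly"
)


def find_clock_jumps(log_text):
    """Return (timestamp, pid, jump_seconds) for every clock-jump line."""
    jumps = []
    for line in log_text.splitlines():
        match = JUMP_RE.match(line.strip())
        if match:
            jumps.append(
                (match["stamp"], int(match["pid"]), int(match["seconds"]))
            )
    return jumps


if __name__ == "__main__":
    sample = (
        "09/04/15 09:28:37 (pid:7599) The system clocked jumped 1466 seconds "
        "unexpectedly. Restarting all daemons\n"
        "09/04/15 09:28:37 (pid:7599) Sent SIGTERM to STARTD (pid 7602)\n"
    )
    for stamp, pid, seconds in find_clock_jumps(sample):
        print(f"{stamp}: master pid {pid} saw a jump of {seconds} s")
```

Each reported jump can then be compared with the suspend and resume times in the BOINC log to see which pauses cost a job.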

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 7
Message 992 - Posted: 4 Sep 2015, 8:39:55 UTC - in response to Message 990.
I'll ask the team for ideas on this one.

Yeti Joined: 29 May 15 Posts: 165 Credit: 3,681,653 RAC: 7,307
Message 994 - Posted: 4 Sep 2015, 9:59:49 UTC
It seems to differ from case to case. I had to suspend CMS during a run, let's say at Event 8, and it went on with Event 9 (and following) when I resumed CMS.
Just checked: the suspend was for 75 minutes.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 7
Message 995 - Posted: 4 Sep 2015, 10:44:20 UTC - in response to Message 992.
Quoting ivan (Message 992): I'll ask the team for ideas on this one.
Andrew replies:
It appears that the clock jump (caused by suspending the VM I guess) triggers the condor_master to restart all daemons, including the startd. This is confirmed by the source code, and it seems that there's no way to stop this. When a startd is restarted in this way you lose your job (it doesn't do a peaceful or even graceful restart). It's not obvious to me that there's any way around this. This is probably a question for the htcondor-users mailing list...
So I guess that's just another caveat that we have to make our volunteers aware of.
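The general idea of detecting such a jump can be illustrated with a toy check. A periodic timer knows how long it meant to sleep, and any large difference between that and the wall-clock time that actually passed is treated as a jump. This is not HTCondor's code; the function name, the 60-second interval and the 20-second tolerance are all assumptions chosen for the example.

```python
def detect_clock_jump(previous_wall, current_wall, expected_interval, tolerance):
    """Return the unexpected clock jump in seconds, or 0 if none.

    previous_wall and current_wall are wall-clock readings (in seconds)
    taken before and after a timer that should have lasted
    expected_interval seconds. A VM that was suspended in between sees
    far more wall-clock time pass than its timer accounts for.
    """
    drift = (current_wall - previous_wall) - expected_interval
    return drift if abs(drift) > tolerance else 0


if __name__ == "__main__":
    # Normal tick: 61 s passed for a 60 s timer, within tolerance.
    print(detect_clock_jump(1000, 1061, 60, 20))
    # VM suspended during the tick: 60 s timer, 1526 s of wall time.
    print(detect_clock_jump(1000, 2526, 60, 20))
```

Under this model any suspension longer than the tolerance looks like a jump after resume, regardless of whether the VM was paused in memory or saved to disk.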

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,438 RAC: 61
Message 996 - Posted: 4 Sep 2015, 11:36:28 UTC
My last example was with the VM in the paused state, not the saved state. The paused state is the result of suspending a task in BOINC when the user has 'Leave applications in memory while suspended' (LAIM) ticked. When LAIM is not ticked, the VM is put into the saved state when the CMS task is suspended by BOINC.
In my OP the VM was in the saved state and I exceeded the default timeout, which was still 1200 seconds at that point.
I'll retry suspending the task in BOINC with LAIM off this time and resume the task between 1200 seconds (old timeout) and 7200 seconds (your latest setting).

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,438 RAC: 61
Message 997 - Posted: 4 Sep 2015, 12:56:27 UTC - in response to Message 996. Last modified: 4 Sep 2015, 13:00:19 UTC
Quoting Crystal Pellet (Message 996): I'll retry suspending the task in BOINC with LAIM off this time and resume the task between 1200 seconds (old timeout) and 7200 seconds (your latest setting).
The 'Paused' state or the 'Saved' state of the VM makes no difference. 1 job lost.
09/04/15 14:42:42 (pid:28326) The system clocked jumped 2000 seconds unexpectedly. Restarting all daemons
condor_master must have closed one eye, because the 6th record was still processed after the resume:
Begin processing the 5th record. Run 1, Event 121730, LumiSection 4870 at 04-Sep-2015 14:08:24.678 CEST
Begin processing the 6th record. Run 1, Event 121731, LumiSection 4870 at 04-Sep-2015 14:44:19.051 CEST

Hendrik Project developer Project tester Joined: 1 Aug 14 Posts: 14 Credit: 884 RAC: 0
Message 1019 - Posted: 6 Sep 2015, 14:50:25 UTC - in response to Message 995. Last modified: 6 Sep 2015, 15:03:08 UTC
Quoting ivan (Message 995): Andrew replies: It appears that the clock jump (caused by suspending the VM I guess) triggers the condor_master to restart all daemons, including the startd. This is confirmed by the source code, and it seems that there's no way to stop this. When a startd is restarted in this way you lose your job (it doesn't do a peaceful or even graceful restart). It's not obvious to me that there's any way around this. This is probably a question for the htcondor-users mailing list... So I guess that's just another caveat that we have to make our volunteers aware of.
One could try to prevent the clock jumps by disabling time syncing in VirtualBox (host -> guest BIOS) and in the CernVM image (guest <-> internet). At the same time, this approach would put the Condor server (real time) and the job running in the VM (behind real time) significantly out of sync. I am not sure whether Condor is able to deal with this.
This post might be of interest on the VirtualBox side of time syncing: https://forums.virtualbox.org/viewtopic.php?t=8535#p152906
Edit: This tutorial is more detailed, but a bit Windows-specific: http://stevenormrod.com/2012/10/disabling-time-sync-in-virtualbox/

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,438 RAC: 61
Message 1961 - Posted: 11 Feb 2016, 19:54:58 UTC
It's on Laurence's ToDo-list (Workplan), but pausing the task in BOINC for only 15 minutes is enough to destroy a job.
02/11/16 18:36:19 (pid:19530) Running job as user (null)
02/11/16 18:36:20 (pid:19530) Create_Process succeeded, pid=19537
CMS-dev 11 Feb 18:51:03 task wu_1455118210_150_0 suspended by user
CMS-dev 11 Feb 19:06:19 task wu_1455118210_150_0 resumed by user
02/11/16 19:06:33 (pid:19530) CCBListener: no activity from CCB server in 1816s; assuming connection is dead.
02/11/16 19:06:33 (pid:19530) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9621 failed; will try to reconnect in 60 seconds.
02/11/16 19:07:34 (pid:19530) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9621 as ccbid 130.246.180.120:9621#108186
02/11/16 19:08:33 (pid:19530) Got SIGQUIT. Performing fast shutdown.
02/11/16 19:08:33 (pid:19530) ShutdownFast all jobs.
02/11/16 19:08:33 (pid:19530) Process exited, pid=19537, signal=9
02/11/16 19:08:33 (pid:19530) Last process exited, now Starter is exiting
02/11/16 19:08:33 (pid:19530) **** condor_starter (condor_STARTER) pid 19530 EXITING WITH STATUS 0
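The length of the pause in this log can be worked out from the two BOINC timestamps. The helper below is a minimal sketch that assumes both events fall on the same day; the function name is made up for the example.

```python
from datetime import datetime


def suspended_seconds(suspended_at, resumed_at):
    """Seconds between two same-day BOINC log times given as HH:MM:SS."""
    fmt = "%H:%M:%S"
    start = datetime.strptime(suspended_at, fmt)
    end = datetime.strptime(resumed_at, fmt)
    return int((end - start).total_seconds())


if __name__ == "__main__":
    pause = suspended_seconds("18:51:03", "19:06:19")
    minutes, seconds = divmod(pause, 60)
    print(f"paused for {pause} s ({minutes} min {seconds} s)")  # 916 s (15 min 16 s)
```

At 916 seconds the pause is shorter than even the 1200-second default lease mentioned earlier in the thread, which suggests the lease length alone does not decide whether the job survives.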

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0
Message 1962 - Posted: 11 Feb 2016, 21:58:32 UTC - in response to Message 1961.
Thanks for the great testing and report. I hope we can resolve this long outstanding issue very soon.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0
Message 1963 - Posted: 11 Feb 2016, 22:10:01 UTC
I think "M" is right:
... Seems about as far from the original philosophy behind BOINC as you can get.
The BOINC philosophy is the exact opposite of the CERN system. Therefore it will be impossible to fulfill all requirements for BOINC tasks. If jobs are lost because the CMS task is suspended for too long, then so be it. The volunteer needs to be made aware of the "special needs" of this project.
What is a reasonable timeout? 20 min, 20 h, 20 days?

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0
Message 1964 - Posted: 11 Feb 2016, 22:53:40 UTC - in response to Message 1963.
It is not really about one philosophy vs another; there are many different facets. The technical detail here is that by default Condor has the concept of a job lease which times out. The solution is to set this to a higher value, maybe even infinity. What makes sense? By default each BOINC task has a report deadline, and for Seti@home this seems to be 1 month. So the philosophy of deadlines for jobs sent out seems to be consistent.
The real question is one of quality of service: what is the required turnaround time for a job in order to do the science? If a task/job is no longer going to have any value because it will finish too late, what should we do? Continue processing and knowingly waste CPU time, or terminate it early?
One thing to consider is the approach to validation. Many projects run tasks multiple times and compare results, so if it takes a while for the third or fourth result to come in, no big deal. With the HEP applications it is not possible to follow such an approach, so only one job is run and statistical validations are done on collections, which means you need all job results in the collection.
Your final question is totally correct though: what is a reasonable timeout?

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0
Message 1965 - Posted: 11 Feb 2016, 23:12:50 UTC - in response to Message 1964. Last modified: 11 Feb 2016, 23:25:18 UTC
We cannot guess how long user-caused timeouts are. The standard switching interval between tasks is 1 h. Assuming an average user runs 3 projects, a timeout of slightly over 2 h must be somewhat reasonable.
These are estimates. So if anybody has a better suggestion, please post. Keep in mind, we obviously cannot accommodate every possible scenario.
(I prefer to leave it at 20 min, but I am not representative.)
EDIT: There is a difference between deadline and timeout. Timeout means the task needs to contact the server within that time. Deadline means the task has to be finished and reported by that time.
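The estimate above can be written out as a small calculation. It is a sketch under the same assumptions as the post (one core, each project getting one switch interval in turn); the 10-minute margin standing in for "slightly over" is an arbitrary choice, and the function names are made up for the example.

```python
def longest_expected_pause(n_projects, switch_interval_s=3600):
    """Wait while every other project takes one turn on a single core."""
    if n_projects < 1:
        raise ValueError("need at least one project")
    return (n_projects - 1) * switch_interval_s


def suggested_timeout(n_projects, switch_interval_s=3600, margin_s=600):
    """Longest expected pause plus a safety margin (margin is an assumption)."""
    return longest_expected_pause(n_projects, switch_interval_s) + margin_s


if __name__ == "__main__":
    # 3 projects, 1 h switching: 2 h pause, so a timeout slightly over 2 h.
    print(longest_expected_pause(3))  # 7200
    print(suggested_timeout(3))       # 7800
```

The same function shows how quickly the figure grows for hosts with a longer switch interval or more attached projects.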

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 7
Message 1966 - Posted: 11 Feb 2016, 23:38:39 UTC - in response to Message 977.
According to a Condor expert a few weeks ago, this happens with CMS jobs. I have to take his word for that, I'm not a Condor expert. He also said such jobs went back to the front of the queue. I don't know how much our CERN guys can say to confirm or deny this, but I do know it's on Laurence's list of things to do. (Yes, I got caught out by it today too...)

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 7
Message 1967 - Posted: 11 Feb 2016, 23:52:51 UTC - in response to Message 1965.
I currently have JobLeaseDuration set to 2 hours, but my observations are that it doesn't affect the problem at hand. When I believed it had an impact, I was persuaded (by Yeti IIRC) to increase it, so I set it to 12 hours. After the information that a job gets aborted anyway, I dropped it back down to 20 minutes, which is the Condor default -- some time ago Andrew had increased the limit on our server to 2 hours, before I decided to try 12.
The effect of decreasing it to 20 minutes was to decrease the number of "jobs in progress". I interpreted that to mean that the job was lost once the VM shut down, but Condor didn't notice until JobLeaseDuration expired. Meanwhile, the VM had abandoned the last saved state and started afresh. Then I increased JLD to the current two hours and saw a modest increase in the number of jobs-in-progress.
So, I think the problem is more the VM starting afresh rather than restarting from a saved state, and not the Condor lease timeout. Perhaps I should contact Dirk to confirm this, unless Laurence already knows the answer.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0
Message 1968 - Posted: 12 Feb 2016, 0:06:35 UTC - in response to Message 1967. Last modified: 12 Feb 2016, 0:12:01 UTC
Is there a way for the VM to wake up every once in a while, contact the server to say "I am still going", and then go back to sleep until BOINC wakes it up again to continue crunching?
Edit: It is like a heartbeat, just between the VM and the server.
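The heartbeat idea can be modelled with a toy lease that expires unless it is renewed in time. This is only a simulation of the concept, not of how HTCondor or BOINC behave; the class, the function and the numbers in the usage lines are all invented for the illustration.

```python
class Lease:
    """Toy lease: valid until expires_at, extended by each heartbeat."""

    def __init__(self, duration_s, granted_at=0):
        self.duration_s = duration_s
        self.expires_at = granted_at + duration_s

    def is_alive(self, now):
        return now <= self.expires_at

    def heartbeat(self, now):
        """Renew the lease if it has not expired yet; report success."""
        if not self.is_alive(now):
            return False
        self.expires_at = now + self.duration_s
        return True


def survives_pause(lease_s, pause_s, heartbeat_every_s=None):
    """Return True if the lease is still valid when the VM resumes."""
    lease = Lease(lease_s)
    if heartbeat_every_s:
        # The VM wakes briefly at each interval to renew the lease.
        for t in range(heartbeat_every_s, pause_s + 1, heartbeat_every_s):
            lease.heartbeat(t)
    return lease.is_alive(pause_s)


if __name__ == "__main__":
    print(survives_pause(1200, 3600))        # no heartbeat
    print(survives_pause(1200, 3600, 600))   # heartbeat every 10 min
    print(survives_pause(1200, 3600, 1800))  # heartbeat too infrequent
```

In this model the heartbeat interval has to be shorter than the lease, otherwise the first renewal already arrives too late.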

Development for LHC@home