Message boards : Number crunching : Houston, we have a problem

Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,116
Message 977 - Posted: 3 Sep 2015, 8:00:47 UTC
Last modified: 3 Sep 2015, 8:47:51 UTC

Hi Ivan, Laurence,

The problem is:

Every time a job is paused because BOINC or the user suspends the BOINC task, the current VM job will not finish completely.
It will be killed by condor_master because of the (too) short timeouts.
I don't know how long a pause is accepted, but IMO the limit is a bit short. Less than an hour, I think.
It's in the nature of BOINC that tasks are paused automatically:
- Another project has high-priority tasks
- Another project has had too little CPU time compared with its resource share
- BOINC pauses altogether because the user has set a preference to pause when the system is in use by a human being
- BOINC pauses because an executable from the exclusion list starts running
- etc.

It doesn't matter whether BOINC keeps the tasks in memory or not.
So you have to consider longer timeouts or accept a number of incomplete jobs.

CP

The system clocked jumped 4425 seconds unexpectedly. Restarting all daemons
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,945,813
RAC: 2,949
Message 979 - Posted: 3 Sep 2015, 9:46:00 UTC - in response to Message 977.  
Last modified: 3 Sep 2015, 9:59:49 UTC

Hmm, OK, I've set the timeout to 7200 seconds -- it was 1200 by default. I can try increasing it for the next batch. Hmm, I think I can set it on the fly. [checks] Actually, it looks like I applied the patch too early, or some other bug, some jobs had a Lease of 1200 -- I've increased them all to 7200. I hope this condor_qedit patch doesn't apply just to the 1000 jobs currently in the active queue!** I'll have to remember to keep an eye on that in future.
** Gah! It looks like it does. I'll have to set up a cron job. :-(
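The periodic fix-up mentioned above could be scripted roughly as follows. This is only a sketch: it assumes the lease is held in the `JobLeaseDuration` job attribute (the name used later in this thread) and that `condor_qedit` is on the PATH of the submit host. The constraint expression and the helper names `build_qedit_command` and `raise_leases` are illustrative assumptions, not the project's actual cron job.

```python
import subprocess
from typing import List

TARGET_LEASE = 7200  # seconds; the value chosen in this post


def build_qedit_command(target_lease: int = TARGET_LEASE) -> List[str]:
    """Build a condor_qedit call that raises the lease of queued jobs.

    Only jobs whose JobLeaseDuration is defined and below the target are
    selected, so running the command repeatedly is harmless.
    """
    constraint = f"JobLeaseDuration < {target_lease}"
    return [
        "condor_qedit",
        "-constraint", constraint,
        "JobLeaseDuration", str(target_lease),
    ]


def raise_leases(target_lease: int = TARGET_LEASE) -> int:
    """Run the command once and return its exit code (requires HTCondor)."""
    result = subprocess.run(build_qedit_command(target_lease), check=False)
    return result.returncode
```

A crontab entry that calls `raise_leases()` every few minutes would then also catch jobs that enter the active queue after the previous edit.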
Crystal Pellet
Message 983 - Posted: 3 Sep 2015, 11:52:51 UTC

Thanks for the clarification, Ivan.

I didn't expect to lose jobs during the '24hrs' run of one BOINC task, only one at the very end, when BOINC destroys the virtual machine.

The 'save' state of the VM when a CMS task is suspended is not as 'safe' as one would expect, because the saved cmsRun can be killed by its master after the VM is restored.
It's good that you've increased the allowed timeout to reduce the loss, but BOINC users should be aware that this project is not like VirtualLHC@home.

Here too, shorter BOINC tasks (e.g. with a 1-day deadline) and shorter CMS jobs would reduce the loss,
but that conflicts with the large number of MBs to download at each VM creation and first cmsRun.
ivan
Message 985 - Posted: 3 Sep 2015, 13:40:06 UTC - in response to Message 983.  

Yes, there are problems with jobs staying out of touch too long and timing out; it's an inherently different model to most other BOINC projects. However, we're heartened by the efficiency we get even so.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 75
Message 988 - Posted: 3 Sep 2015, 20:35:26 UTC

What efficiency are you getting? How are you measuring this? It would be
nice, but maybe not practicable, to have access to the numbers of good
jobs/bad jobs/lost jobs on an individual host basis.

It rather looks as though this project really needs hosts to be running
continuously and swap projects more often than (now) every two hours. The
default for this is every hour but can be set much higher; five hours here.
Not sure about the last bit, think about it.

Many (most, probably) will be shut down overnight (overday for me) so the
daily loss of the running jobs together with the protracted start up process
surely represents a significant waste. Seems about as far from the original
philosophy behind BOINC as you can get... but that's for another time.
Crystal Pellet
Message 990 - Posted: 4 Sep 2015, 7:53:10 UTC - in response to Message 979.  
Last modified: 4 Sep 2015, 7:57:43 UTC

Hmm, OK, I've set the timeout to 7200 seconds -- it was 1200 by default. I can try increasing it for the next batch. Hmm, I think I can set it on the fly. [checks] Actually, it looks like I applied the patch too early, or some other bug, some jobs had a Lease of 1200 -- I've increased them all to 7200. I hope this condor_qedit patch doesn't apply just to the 1000 jobs currently in the active queue!** I'll have to remember to keep an eye on that in future.
** Gah! It looks like it does. I'll have to set up a cron job. :-(

It doesn't seem to work.

BoincLog:
5276 CMS-dev 04 Sep 09:05:14 task CMS_6300_1427806802.579772_0 suspended by user
5282 CMS-dev 04 Sep 09:22:00 task CMS_6300_1427806802.579772_0 resumed by user


After BOINC had a thread available and resumed the CMS task:

MasterLog:
09/04/15 09:28:37 (pid:7599) The system clocked jumped 1466 seconds unexpectedly. Restarting all daemons
09/04/15 09:28:37 (pid:7599) Sent SIGTERM to STARTD (pid 7602)
09/04/15 09:28:38 (pid:7599) CCBListener: no activity from CCB server in 1567s; assuming connection is dead.
09/04/15 09:28:38 (pid:7599) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9620 failed; will try to reconnect in 60 seconds.
09/04/15 09:29:38 (pid:7599) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#60215
09/04/15 09:30:37 (pid:7599) Timeout for graceful shutdown has expired for STARTD.
09/04/15 09:30:37 (pid:7599) Sent SIGQUIT to STARTD (pid 7602)
09/04/15 09:30:37 (pid:7599) AllReaper unexpectedly called on pid 7602, status 0.
09/04/15 09:30:37 (pid:7599) The STARTD (pid 7602) exited with status 0
09/04/15 09:30:37 (pid:7599) All daemons are gone. Restarting.
09/04/15 09:30:37 (pid:7599) Restarting master right away.


The end of cmsRun-stdout.log:

Begin processing the 1st record. Run 1, Event 111176, LumiSection 4448 at 04-Sep-2015 08:58:16.856 CEST
G4Fragment::CalculateExcitationEnergy(): WARNING
Fragment: A = 26, Z = 12, U = -1.360e+00 MeV IsStable= 1
P = (6.039e+01,2.546e+02,-3.426e+01) MeV E = 2.420e+04 MeV

Begin processing the 2nd record. Run 1, Event 111177, LumiSection 4448 at 04-Sep-2015 08:59:43.771 CEST
Begin processing the 3rd record. Run 1, Event 111178, LumiSection 4448 at 04-Sep-2015 09:02:06.471 CEST
Begin processing the 4th record. Run 1, Event 111179, LumiSection 4448 at 04-Sep-2015 09:29:01.689 CEST


Thereafter a new cmsRun started, logging into a new directory.
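To spot these events across a longer MasterLog, the clock-jump lines can be pulled out with a small script. This is a minimal sketch; the regular expression is tied to the exact wording of the log line quoted above (including its "clocked" spelling), and the function name `clock_jumps` is just an illustrative choice.

```python
import re
from typing import List

# Matches the condor_master message as it appears in the MasterLog excerpt.
CLOCK_JUMP = re.compile(r"The system clocked jumped (-?\d+) seconds unexpectedly")


def clock_jumps(log_text: str) -> List[int]:
    """Return the size in seconds of every clock jump reported in the log."""
    return [int(m.group(1)) for m in CLOCK_JUMP.finditer(log_text)]


SAMPLE = (
    "09/04/15 09:28:37 (pid:7599) The system clocked jumped 1466 seconds "
    "unexpectedly. Restarting all daemons\n"
    "09/04/15 09:28:37 (pid:7599) Sent SIGTERM to STARTD (pid 7602)\n"
)

print(clock_jumps(SAMPLE))  # [1466]
```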
ivan
Message 992 - Posted: 4 Sep 2015, 8:39:55 UTC - in response to Message 990.  

I'll ask the team for ideas on this one.
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 994 - Posted: 4 Sep 2015, 9:59:49 UTC

It seems to differ from position to position.

I had to suspend CMS during a run, let's say at Event 8, and it went on with Event 9 (and following) when I resumed CMS.

Just checked: the suspend lasted 75 minutes.
ivan
Message 995 - Posted: 4 Sep 2015, 10:44:20 UTC - in response to Message 992.  

I'll ask the team for ideas on this one.

Andrew replies:
It appears that the clock jump (caused by suspending the VM I guess)
triggers the condor_master to restart all daemons, including the startd.
This is confirmed by the source code, and it seems that there's no way to
stop this. When a startd is restarted in this way you lose your job (it
doesn't do a peaceful or even graceful restart). It's not obvious to me that
there's any way around this. This is probably a question for the
htcondor-users mailing list...
So I guess that's just another caveat that we have to make our volunteers aware of.
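The mechanism described above can be pictured with a toy model: a daemon that wakes on a timer compares the wall-clock time that actually passed with the interval it asked for, and treats a large surplus as a clock jump. The sketch below is illustrative only. It is not HTCondor's code, and `max_skip` is a hypothetical tolerance whose value here is just an example.

```python
def clock_jump(prev_wall: float, now_wall: float,
               expected_interval: float, max_skip: float) -> float:
    """Return the unexpected jump in seconds, or 0.0 if within tolerance.

    prev_wall / now_wall: wall-clock timestamps at two consecutive timer ticks.
    expected_interval: how long the daemon intended to sleep between them.
    max_skip: hypothetical tolerance before the difference counts as a jump.
    """
    skip = (now_wall - prev_wall) - expected_interval
    return skip if abs(skip) > max_skip else 0.0


# A VM paused for about 24 minutes: once the guest clock catches up,
# a 5 s timer appears to have taken 1471 s.
print(clock_jump(0.0, 1471.0, expected_interval=5.0, max_skip=600.0))  # 1466.0

# An ordinary tick that ran 0.2 s late is not reported.
print(clock_jump(0.0, 5.2, expected_interval=5.0, max_skip=600.0))     # 0.0
```

In this picture the daemon cannot tell a suspended VM from a badly set clock, which would explain why it reacts by restarting everything.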
Crystal Pellet
Message 996 - Posted: 4 Sep 2015, 11:36:28 UTC

My last example was with the VM state 'paused', so not 'saved'.
The paused state is the result of suspending a task in BOINC when the user has 'Leave applications in memory while suspended' (LAIM) ticked.
When LAIM is not ticked, the VM is put into the saved state when the CMS task is suspended by BOINC.
In my OP the VM was in the saved state and I exceeded the timeout, which was then still the default 1200 seconds.

I'll retry suspending the task in BOINC with LAIM off this time and resume the task between 1200 seconds (old timeout) and 7200 seconds (your latest setting).
Crystal Pellet
Message 997 - Posted: 4 Sep 2015, 12:56:27 UTC - in response to Message 996.  
Last modified: 4 Sep 2015, 13:00:19 UTC

I'll retry suspending the task in BOINC with LAIM off this time and resume the task between 1200 seconds (old timeout) and 7200 seconds (your latest setting).

'Paused' state or 'Saved' state of the VM makes no difference. 1 job lost.

09/04/15 14:42:42 (pid:28326) The system clocked jumped 2000 seconds unexpectedly. Restarting all daemons

condor_master turned a blind eye for a moment, because the 6th record was still processed after the resume:

Begin processing the 5th record. Run 1, Event 121730, LumiSection 4870 at 04-Sep-2015 14:08:24.678 CEST
Begin processing the 6th record. Run 1, Event 121731, LumiSection 4870 at 04-Sep-2015 14:44:19.051 CEST
Hendrik
Project developer
Project tester
Joined: 1 Aug 14
Posts: 14
Credit: 884
RAC: 0
Message 1019 - Posted: 6 Sep 2015, 14:50:25 UTC - in response to Message 995.  
Last modified: 6 Sep 2015, 15:03:08 UTC


Andrew replies:
It appears that the clock jump (caused by suspending the VM I guess)
triggers the condor_master to restart all daemons, including the startd.
This is confirmed by the source code, and it seems that there's no way to
stop this. When a startd is restarted in this way you lose your job (it
doesn't do a peaceful or even graceful restart). It's not obvious to me that
there's any way around this. This is probably a question for the
htcondor-users mailing list...
So I guess that's just another caveat that we have to make our volunteers aware of.



One could try to prevent the clock jumps by disabling time syncing in VirtualBox (host -> guest BIOS) and in the CernVM image (guest <-> internet).
At the same time, this approach would put the Condor server (real time) and the job running in the VM (behind real time) significantly out of sync. I am not sure whether Condor is able to deal with this.

This post might be of interest on the VirtualBox side of time syncing: https://forums.virtualbox.org/viewtopic.php?t=8535#p152906
Edit: This tutorial is more detailed, but a bit Windows-specific: http://stevenormrod.com/2012/10/disabling-time-sync-in-virtualbox/
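On the VirtualBox side, the host-to-guest sync done by the Guest Additions can be switched off per VM with an extra-data key described in the VirtualBox manual. The sketch below only builds and (optionally) runs that command. The VM name `example_cms_vm` is a placeholder, the helper names are made up for illustration, and nothing here touches time syncing inside the guest (e.g. NTP), which would have to be handled separately.

```python
import subprocess
from typing import List

# Extra-data key from the VirtualBox manual for disabling the
# Guest Additions host-to-guest time synchronisation.
TIME_SYNC_KEY = "VBoxInternal/Devices/VMMDev/0/Config/GetHostTimeDisabled"


def disable_time_sync_command(vm_name: str) -> List[str]:
    """Build the VBoxManage call that disables host time sync for one VM."""
    return ["VBoxManage", "setextradata", vm_name, TIME_SYNC_KEY, "1"]


def disable_time_sync(vm_name: str) -> int:
    """Run the command and return its exit code (requires VirtualBox)."""
    return subprocess.run(disable_time_sync_command(vm_name), check=False).returncode


# Placeholder VM name, for illustration only.
print(" ".join(disable_time_sync_command("example_cms_vm")))
```

Whether Condor copes with a guest clock that falls behind real time is exactly the open question raised above, so this would need testing before being rolled out.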
Crystal Pellet
Message 1961 - Posted: 11 Feb 2016, 19:54:58 UTC

It's on Laurence's ToDo-list (Workplan), but pausing the task in BOINC for only 15 minutes is enough to destroy a job.


02/11/16 18:36:19 (pid:19530) Running job as user (null)
02/11/16 18:36:20 (pid:19530) Create_Process succeeded, pid=19537

    CMS-dev 11 Feb 18:51:03 task wu_1455118210_150_0 suspended by user
    CMS-dev 11 Feb 19:06:19 task wu_1455118210_150_0 resumed by user

02/11/16 19:06:33 (pid:19530) CCBListener: no activity from CCB server in 1816s; assuming connection is dead.
02/11/16 19:06:33 (pid:19530) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9621 failed; will try to reconnect in 60 seconds.
02/11/16 19:07:34 (pid:19530) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9621 as ccbid 130.246.180.120:9621#108186
02/11/16 19:08:33 (pid:19530) Got SIGQUIT. Performing fast shutdown.
02/11/16 19:08:33 (pid:19530) ShutdownFast all jobs.
02/11/16 19:08:33 (pid:19530) Process exited, pid=19537, signal=9
02/11/16 19:08:33 (pid:19530) Last process exited, now Starter is exiting
02/11/16 19:08:33 (pid:19530) **** condor_starter (condor_STARTER) pid 19530 EXITING WITH STATUS 0

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 87
Message 1962 - Posted: 11 Feb 2016, 21:58:32 UTC - in response to Message 1961.  

Thanks for the great testing and report. I hope we can resolve this long outstanding issue very soon.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1963 - Posted: 11 Feb 2016, 22:10:01 UTC

I think "M" is right:

... Seems about as far from the original
philosophy behind BOINC as you can get.


The BOINC philosophy is the exact opposite of the CERN system.

Therefore it will be impossible to fulfill all requirements for BOINC tasks.

If jobs are lost because the CMS task is suspended for too long, then so be it.
The volunteer needs to be made aware of the "special needs" of this project.

What is a reasonable timeout? 20min, 20h, 20 days?
Laurence
Message 1964 - Posted: 11 Feb 2016, 22:53:40 UTC - in response to Message 1963.  

It is not really about one philosophy versus another; there are many different facets.

The technical detail here is that, by default, Condor has the concept of a job lease which times out. The solution is to set this to a higher value, maybe even infinity. What makes sense? By default each BOINC task has a report deadline, and for SETI@home this seems to be 1 month. So the philosophy of deadlines for jobs sent out seems to be consistent.

The real question is one of quality of service: what is the required turnaround time for a job in order to do the science? If a task/job is no longer going to have any value because it will finish too late, what should we do? Continue processing and knowingly waste CPU time, or terminate it early?

One thing to consider is the approach to validation. Many projects run tasks multiple times and compare results, so if it takes a while for the third or fourth result to come in, it's no big deal. With the HEP applications it is not possible to follow such an approach, so only one job is run and statistical validations are done on collections, which means you need all the job results in the collection.

Your final question is exactly the right one, though: what is a reasonable timeout?
Rasputin42
Message 1965 - Posted: 11 Feb 2016, 23:12:50 UTC - in response to Message 1964.  
Last modified: 11 Feb 2016, 23:25:18 UTC

We cannot guess how long user-caused pauses last.
The standard interval for switching between tasks is 1 h. Assuming an average user runs 3 projects, a timeout of slightly over 2 h should be reasonable.

These are estimates. So if anybody has a better suggestion, please post.

Keep in mind, we obviously cannot accommodate every possible scenario.
(I prefer to leave it at 20 min, but I am not representative.)

EDIT: There is a difference between deadline and timeout.
Timeout means, the task needs to contact the server within that time.
Deadline means, the task has to be finished and reported by that time.
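The estimate above can be checked with a few lines of arithmetic. This is a rough sketch that assumes strict round-robin switching on a single core between equally weighted projects, which real BOINC scheduling only approximates. The function names are illustrative.

```python
def worst_case_wait(n_projects: int, switch_interval_s: int) -> int:
    """Longest time one project waits for its next turn, in seconds.

    Assumes strict round-robin on one core: after its own slot, a project
    waits while each of the other projects gets one full slot.
    """
    return (n_projects - 1) * switch_interval_s


def survives(wait_s: int, timeout_s: int) -> bool:
    """True if the task gets back in touch before the timeout expires."""
    return wait_s < timeout_s


wait = worst_case_wait(n_projects=3, switch_interval_s=3600)
print(wait)                  # 7200
print(survives(wait, 7200))  # False: exactly 2 h is not quite enough
print(survives(wait, 7500))  # True: "slightly over 2 h"
print(survives(wait, 1200))  # False: a 20 min timeout
```

This only covers automatic task switching; pauses caused by the user or by an overnight shutdown can be arbitrarily longer.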
ivan
Message 1966 - Posted: 11 Feb 2016, 23:38:39 UTC - in response to Message 977.  

According to a Condor expert a few weeks ago, this happens with CMS jobs. I have to take his word for that, I'm not a Condor expert. He also said such jobs went back to the front of the queue.
I don't know how much our CERN guys can say to confirm or deny this, but I do know it's on Laurence's list of things to do.
(Yes, I got caught out by it today too...)
ivan
Message 1967 - Posted: 11 Feb 2016, 23:52:51 UTC - in response to Message 1965.  

I currently have JobLeaseDuration set to 2 hours, but my observations are that it doesn't affect the problem at hand. When I believed it had an impact, I was persuaded (by Yeti IIRC) to increase it, so I set it to 12 hours.
After the information that a job gets aborted anyway, I dropped it back down to 20 minutes, which is the Condor default -- some time ago Andrew had increased the limit on our server to 2 hours, before I decided to try 12. The effect of decreasing it to 20 minutes was to decrease the number of "jobs in progress". I interpreted that to mean that the job was lost once the VM shut down, but Condor didn't notice until JobLeaseDuration expired. Meanwhile, the VM had abandoned the last saved state and started afresh.
Then I increased JLD to the current two hours and saw a modest increase in the number of jobs-in-progress.
So, I think the problem is more the VM starting afresh rather than restarting from a saved state, and not the Condor lease timeout. Perhaps I should contact Dirk to confirm this, unless Laurence already knows the answer.
Rasputin42
Message 1968 - Posted: 12 Feb 2016, 0:06:35 UTC - in response to Message 1967.  
Last modified: 12 Feb 2016, 0:12:01 UTC

Is there a way for the VM to wake up every once in a while, contact the server to say "I am still going", and then go back to sleep until BOINC wakes it up again to continue crunching?
Edit: it is like a heartbeat, just between VM and server.



©2024 CERN