Message boards : Number crunching : Current issues
Author | Message |
---|---|
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,182,521 RAC: 2,043 |
Let's start a new thread; the old one's begun rambling a bit... I've convinced myself that jobs get lost once BOINC is shut down for any reason. So, I've played with JobLeaseDuration, the time Condor allows before deeming a job to be lost (at least that's my understanding...). I set it to 600 seconds, and it seems to have an effect on the "jobs in progress" graph. |
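As a rough illustration of the lease idea described in the post above, the sketch below treats a job as lost once the worker has been unreachable for longer than the lease. It is a toy model under that single assumption, not Condor's actual logic. The function name `lease_expired` is hypothetical; only the 600-second value comes from the post.

```python
# Toy model of a job lease (an assumption, not Condor's real implementation).
JOB_LEASE_DURATION = 600  # seconds, the value quoted in the post


def lease_expired(seconds_unreachable, lease=JOB_LEASE_DURATION):
    """Return True if the outage lasted longer than the lease."""
    return seconds_unreachable > lease


if __name__ == "__main__":
    # Compare a few outage lengths against the 600-second lease.
    for minutes in (5, 10, 20):
        lost = lease_expired(minutes * 60)
        print(f"{minutes} min outage -> job deemed lost: {lost}")
```

In this model an outage shorter than the lease leaves the job untouched, while a longer one would make the scheduler hand the job to another worker.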
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I suspended the CMS task for 6 min. The VM reports the CMS virtual machine as "aborted", but it is then started. On resume, the logs are reset and everything starts at run-1 again, with a new glidein. The event number and lumisection number are the same as with the original job, just starting at the beginning. I did not check whether the job number is the same, but even if it is, the work done on the original job seems lost, as it starts from scratch again. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,182,521 RAC: 2,043 |
Interesting, thanks. I'm still not sure I exactly understand the behaviour, but it's certainly different to what I'd originally thought. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Just disconnecting the internet for 8 min has no effect. It keeps running. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 20 |
"I suspended the cms task for 6 min." This is a BOINC issue. With your BOINC version (7.6.9) the VM should save within 15 seconds after a suspend with 'Leave application in memory' off. Install the newest recommended BOINC version, 7.6.22, and the VM gets 60 seconds to save, which is normally enough as long as you do not save a lot of VMs at the same time. CP |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
I was just experimenting, starting and stopping BOINC for varying times, when I noticed that the 24hr timeout was approaching, as was 200 events. So, not keen to waste the work, I waited for it to stop at 200 events. It didn't. It had got to 262 when it was killed. Job 6566. Looking at another in-progress job, it seems set for 300 events. How many events should be in each job? The task name shows "200ev" but Condor is set for 300. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 20 |
"I was just experimenting, starting and stopping BOINC for varying times when I noticed that the 24hr timeout was approaching as was 200 events. So, not keen to waste the work, waited for it to stop at 200 events. It didn't. It had got to 262 when it was killed. Job 6566. Looking at another in-progress job, it seems set for 300 events." Read message: http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=79&postid=1572#1572 |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
Ah... yes. Now I remember... Thanks, CP. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I interrupted the internet connection for 11 min. After reconnecting, glidein was run over and over again, every 7 minutes or so. I suspended the CMS task for a few minutes and, after resume, it continued doing so, generating run-4, run-5, etc. Anybody not watching the logs would not notice that. Logs are available if needed. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,182,521 RAC: 2,043 |
"I interrupted the internet connection for 11 min." You seem to have recovered, one way or another: 160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6499.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 00:35:40 GMT 2016 on 277-617-29249 with (short) status 0 ======== 160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6580.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 02:09:36 GMT 2016 on 277-617-29249 with (short) status 0 ======== 160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6581.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 03:51:02 GMT 2016 on 277-617-29249 with (short) status 0 ======== 160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6663.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 05:38:35 GMT 2016 on 277-617-29249 with (short) status 0 ======== 160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6705.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 07:05:25 GMT 2016 on 277-617-29249 with (short) status 0 ======== 160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6751.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 08:49:40 GMT 2016 on 277-617-29249 with (short) status 0 ======== 160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6873.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 13:13:27 GMT 2016 on 277-617-21343 with (short) status 0 ======== [cms005@lcggwms02:~] > head 160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6873.0.txt ======== gWMS-CMSRunAnalysis.sh STARTING at Mon Jan 18 11:43:13 GMT 2016 on 277-617-21343 ======== Local time : Mon Jan 18 12:43:13 CET 2016 Current system : Linux 277-617-21343 3.10.64-85.cernvm.x86_64 #1 SMP Fri Jan 9 09:53:29 CET 2015 x86_64 x86_64 x86_64 GNU/Linux What version of BOINC are you using? |
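The FINISHING lines quoted above follow a regular pattern, so a short script can pull out the job number, finish time, host and exit status. The sketch below is only an illustration: the regular expression is inferred from the lines in this post, and the names `FINISH_RE` and `parse_finish` are hypothetical rather than part of any CMS or CRAB tool.

```python
import re
from datetime import datetime

# Pattern inferred from the grep output quoted in the post (an assumption).
FINISH_RE = re.compile(
    r"job_out\.(?P<job>\d+)\.\d+\.txt:=+ gWMS-CMSRunAnalysis\.sh FINISHING at "
    r"(?P<when>\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2}) GMT (?P<year>\d{4}) "
    r"on (?P<host>\S+) with \(short\) status (?P<status>\d+)"
)


def parse_finish(line):
    """Return a dict describing one FINISHING line, or None if it does not match."""
    m = FINISH_RE.search(line)
    if m is None:
        return None
    # Times in the log are GMT; the result is a naive datetime in GMT.
    finished = datetime.strptime(
        f"{m['when']} {m['year']}", "%a %b %d %H:%M:%S %Y"
    )
    return {
        "job": int(m["job"]),
        "finished": finished,
        "host": m["host"],
        "status": int(m["status"]),
    }


SAMPLE = (
    "160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6499.0.txt:"
    "======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 00:35:40 GMT 2016 "
    "on 277-617-29249 with (short) status 0 ========"
)

if __name__ == "__main__":
    print(parse_finish(SAMPLE))
```

Grouping the parsed records by `host` would show at a glance which machine finished which jobs, for example that job 6873 ran on a different host from the earlier six.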
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I am using 7.6.22, the latest. I interrupted the internet for 11 min, until 9:49 UTC. I stopped the CMS task eventually and started a new one, as the old one kept creating new run-x dirs. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,182,521 RAC: 2,043 |
OK, thanks. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I repeated the test: 15 min with no internet connection. This time it just started a new run with a new job. I guess that is what is supposed to happen. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
FYI: I found 6 jobs that have not been run but are labeled as finished: 7750, 7821, 7942, 8831, 8885, 8925. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,182,521 RAC: 2,043 |
"I repeated the test." Yes, I'd believed otherwise, but lately I've been informed that it will lose the job. We will apparently be depending on it at Point 5 when beam starts up, as we will use the vast high-level trigger (HLT) server farm for short jobs (~2 hrs) in between proton collision conditions ("inter-fill") which will be dropped when the farm is needed for the trigger again. I'm told Condor reschedules them at the front of the queue. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Yes, the job is just taken over by somebody else and finished. Not failed, just the IP address is kept. Do you need logs of 60311 failures? Or, alternatively, I have a job that just stopped at event 115. However, it is probably better to concentrate on the dominant error, not the oddball. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,182,521 RAC: 2,043 |
In fact, they did all finish, and returned result files. However, they were all done on the same machine from one of our friends in SETI.Germany. I don't immediately see anything amiss with his logs, but somehow Dashboard isn't parsing his returns properly, I guess. Hmm, Dashboard did properly record details of other jobs from the same host -- another Dashboard mystery, I guess. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Never mind, then. Just thought this was odd. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,182,521 RAC: 2,043 |
Oops, number of running jobs has fallen. Investigating... === 1615: The glide-ins aren't gliding in... 1630: I guess everyone is seeing this in their cron-stdout file: 15:43:01 +0000 2016-01-26 [INFO] Starting CMS Application - Run 4 15:43:01 +0000 2016-01-26 [INFO] Reading the BOINC volunteer's information 15:43:03 +0000 2016-01-26 [INFO] Volunteer: ivan (9) Host: 22 15:43:03 +0000 2016-01-26 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90 15:43:03 +0000 2016-01-26 [INFO] Requesting an X509 credential from CMS-Dev 15:43:05 +0000 2016-01-26 [INFO] Requesting an X509 credential from LHC@home curl -s -u 9:d9a4d5ccdeaae49eff789ed77c04b53c --capath /etc/grid-security/certificates/ https://cms-data-bridge.cern.ch/boinc-auth/get_proxy -o /tmp/x509up_u500 15:43:06 +0000 2016-01-26 [INFO] Cloud not get a proxy from CMS-Dev 15:43:06 +0000 2016-01-26 [INFO] Going to sleep for 1 hour Messages sent. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,182,521 RAC: 2,043 |
Looks like we may be up again: 08:31:08 +0000 2016-01-27 [INFO] Cloud not get a proxy from CMS-Dev 08:31:08 +0000 2016-01-27 [INFO] Going to sleep for 1 hour 09:33:01 +0000 2016-01-27 [INFO] Starting CMS Application - Run 8 09:33:01 +0000 2016-01-27 [INFO] Reading the BOINC volunteer's information 09:33:03 +0000 2016-01-27 [INFO] Volunteer: ivan (9) Host: 22 09:33:03 +0000 2016-01-27 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90 09:33:03 +0000 2016-01-27 [INFO] Requesting an X509 credential from CMS-Dev subject : /O=Volunteer Computing/O=CERN/CN=ivan 9/CN=1325983332 issuer : /O=Volunteer Computing/O=CERN/CN=ivan 9 identity : /O=Volunteer Computing/O=CERN/CN=ivan 9 type : RFC 3820 compliant impersonation proxy strength : 1024 bits path : /tmp/x509up_u500 timeleft : 129:59:56 (5.4 days) 09:33:08 +0000 2016-01-27 [INFO] Downloading glidein 09:33:08 +0000 2016-01-27 [INFO] Running glidein (check logs) A current task should pick up a new job once its one-hour pause is over. That's the theory, anyway... Thanks, Laurence. |
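The cron-stdout entries quoted above share a fixed layout (time, UTC offset, date, level, message), which makes it easy to work out when a sleeping task should try again. The sketch below is a hedged illustration: the layout is inferred from the quoted lines, the one-hour pause is taken from the "Going to sleep for 1 hour" message, and the names `parse_line` and `next_attempt` are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Layout inferred from the quoted log lines (an assumption).
LINE_RE = re.compile(
    r"^(?P<stamp>\d{2}:\d{2}:\d{2} [+-]\d{4} \d{4}-\d{2}-\d{2}) "
    r"\[(?P<level>\w+)\] (?P<message>.*)$"
)


def parse_line(line):
    """Split one log line into (timestamp, level, message), or return None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(m["stamp"], "%H:%M:%S %z %Y-%m-%d")
    return when, m["level"], m["message"]


def next_attempt(lines, pause=timedelta(hours=1)):
    """Return the earliest expected retry time after the last sleep message."""
    latest = None
    for line in lines:
        parsed = parse_line(line)
        if parsed and parsed[2].startswith("Going to sleep"):
            latest = parsed[0] + pause
    return latest


SAMPLE = [
    "08:31:08 +0000 2016-01-27 [INFO] Cloud not get a proxy from CMS-Dev",
    "08:31:08 +0000 2016-01-27 [INFO] Going to sleep for 1 hour",
]

if __name__ == "__main__":
    print("Earliest retry:", next_attempt(SAMPLE))
```

The value returned is only a lower bound. In the log above the next run started at 09:33:01, a little after the hour had elapsed.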
©2024 CERN