Thread 'Current issues'

Author	Message
ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 7	Message 1608 - Posted: 17 Jan 2016, 12:49:34 UTC Let's start a new thread, the old one's begun rambling a bit... I've convinced myself that jobs get lost once BOINC is shut down for any reason. So, I've played with JobLeaseDuration, the time Condor allows before deeming a job to be lost (at least that's my understanding...). I set it to 600 seconds, and it seems to have an effect on the "jobs in progress" graph.
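A minimal sketch of the lease idea as described in the post above, assuming the simple model that a job is deemed lost once the worker has been silent for longer than JobLeaseDuration. This is a toy model, not Condor's implementation; the helper name `lease_expired` and the outage lengths are hypothetical.

```python
# Toy model only: assumes a job counts as lost once the worker has been
# silent for longer than the lease. Not Condor's actual logic.
JOB_LEASE_DURATION_S = 600  # the value quoted in the post above


def lease_expired(silent_seconds: float,
                  lease_seconds: float = JOB_LEASE_DURATION_S) -> bool:
    """Return True if the silence outlasts the lease (hypothetical helper)."""
    return silent_seconds > lease_seconds


if __name__ == "__main__":
    # Illustrative outage lengths in minutes, chosen for this example.
    for minutes in (8, 11, 15):
        state = "lost" if lease_expired(minutes * 60) else "kept"
        print(f"{minutes} min silent -> job {state}")
```

Under this model, an outage has to exceed the full lease before the job is given up; whether the real system behaves this way is exactly what is in question here.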

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1609 - Posted: 17 Jan 2016, 14:21:49 UTC I suspended the CMS task for 6 min. The CMS virtual machine is reported as "aborted", but is then started again. On resume, the logs are reset and everything starts at run-1 again, with a new glidein. The event number and lumisection number are the same as with the original job, just starting at the beginning. I did not check whether the job number is the same, but even if it is, the work done on the original job seems lost, as it starts from scratch again.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 7	Message 1610 - Posted: 17 Jan 2016, 15:18:25 UTC - in response to Message 1609. Interesting, thanks. I'm still not sure I exactly understand the behaviour, but it's certainly different to what I'd originally thought.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1611 - Posted: 17 Jan 2016, 15:24:32 UTC - in response to Message 1610. Just disconnecting the internet for 8 min has no effect. It keeps running.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,438 RAC: 69	Message 1612 - Posted: 17 Jan 2016, 18:57:53 UTC - in response to Message 1609.
Quote from Message 1609:
    I suspended the CMS task for 6 min. The CMS virtual machine is reported as "aborted", but is then started again. On resume, the logs are reset and everything starts at run-1 again, with a new glidein. The event number and lumisection number are the same as with the original job, just starting at the beginning. I did not check whether the job number is the same, but even if it is, the work done on the original job seems lost, as it starts from scratch again.
This is a BOINC issue. With your BOINC version (7.6.9) the VM should save within 15 seconds after a suspend with 'Leave application in memory' off. Install the newest recommended BOINC version, 7.6.22, and the VM gets 60 seconds to save, which is normally enough as long as you are not saving a lot of VMs at the same time. CP

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 1613 - Posted: 18 Jan 2016, 2:22:37 UTC Last modified: 18 Jan 2016, 2:24:51 UTC I was just experimenting, starting and stopping BOINC for varying times, when I noticed that the 24hr timeout was approaching, as was 200 events. So, not keen to waste the work, I waited for it to stop at 200 events. It didn't. It had got to 262 when it was killed. Job 6566. Looking at another in-progress job, it seems set for 300 events. How many events should be in each job? The task name shows "200ev" but Condor is set for 300.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,438 RAC: 69	Message 1614 - Posted: 18 Jan 2016, 8:49:45 UTC - in response to Message 1613.
Quote from Message 1613:
    I was just experimenting, starting and stopping BOINC for varying times, when I noticed that the 24hr timeout was approaching, as was 200 events. So, not keen to waste the work, I waited for it to stop at 200 events. It didn't. It had got to 262 when it was killed. Job 6566. Looking at another in-progress job, it seems set for 300 events. How many events should be in each job? The task name shows "200ev" but Condor is set for 300.
Read message: http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=79&postid=1572#1572

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 1615 - Posted: 18 Jan 2016, 10:00:48 UTC - in response to Message 1614.
Quote from Message 1614:
    ... How many events should be in each job? The task name shows "200ev" but condor is set for 300.
    Read message: http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=79&postid=1572#1572
Ah... yes. Now I remember... Thanks, CP.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1616 - Posted: 18 Jan 2016, 10:40:29 UTC Last modified: 18 Jan 2016, 10:42:13 UTC I interrupted the internet connection for 11 min. After reconnecting, the glidein was run over and over again, every 7 minutes or so. I suspended the CMS task for a few minutes and after resume it continued doing so, generating run-4, run-5 etc. Anybody not watching the logs would not notice that. Logs are available, if needed.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 7	Message 1617 - Posted: 18 Jan 2016, 14:18:07 UTC - in response to Message 1616.
Quote from Message 1616:
    I interrupted the internet connection for 11 min. After reconnecting, the glidein was run over and over again, every 7 minutes or so. I suspended the CMS task for a few minutes and after resume it continued doing so, generating run-4, run-5 etc. Anybody not watching the logs would not notice that. Logs are available, if needed.
You seem to have recovered, one way or another:
    160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6499.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 00:35:40 GMT 2016 on 277-617-29249 with (short) status 0 ========
    160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6580.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 02:09:36 GMT 2016 on 277-617-29249 with (short) status 0 ========
    160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6581.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 03:51:02 GMT 2016 on 277-617-29249 with (short) status 0 ========
    160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6663.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 05:38:35 GMT 2016 on 277-617-29249 with (short) status 0 ========
    160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6705.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 07:05:25 GMT 2016 on 277-617-29249 with (short) status 0 ========
    160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6751.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 08:49:40 GMT 2016 on 277-617-29249 with (short) status 0 ========
    160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6873.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 13:13:27 GMT 2016 on 277-617-21343 with (short) status 0 ========
    [cms005@lcggwms02:~] > head 160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6873.0.txt
    ======== gWMS-CMSRunAnalysis.sh STARTING at Mon Jan 18 11:43:13 GMT 2016 on 277-617-21343 ========
    Local time : Mon Jan 18 12:43:13 CET 2016
    Current system : Linux 277-617-21343 3.10.64-85.cernvm.x86_64 #1 SMP Fri Jan 9 09:53:29 CET 2015 x86_64 x86_64 x86_64 GNU/Linux
What version of BOINC are you using?
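For anyone wanting to tabulate listings like the one above, here is a minimal Python sketch that pulls the job number, finish time, host and exit status out of a FINISHING line. The regular expression is modelled only on the lines quoted in this post; the function name and dictionary keys are my own, and the date parsing assumes an English (C) locale for day and month names.

```python
import re
from datetime import datetime

# Pattern modelled on the job_out lines quoted above (an assumption:
# other log variants may not match).
FINISH_RE = re.compile(
    r"job_out\.(?P<job>\d+)\.\d+\.txt:=+ gWMS-CMSRunAnalysis\.sh FINISHING at "
    r"(?P<when>.+?) GMT (?P<year>\d{4}) on (?P<host>\S+) "
    r"with \(short\) status (?P<status>\d+)"
)


def parse_finish(line):
    """Return a dict for a FINISHING line, or None if the line does not match."""
    m = FINISH_RE.search(line)
    if m is None:
        return None
    # Timestamps in the log are GMT; the result is a naive datetime in GMT.
    finished = datetime.strptime(
        m.group("when") + " " + m.group("year"), "%a %b %d %H:%M:%S %Y"
    )
    return {
        "job": int(m.group("job")),
        "finished": finished,
        "host": m.group("host"),
        "status": int(m.group("status")),
    }
```

Feeding each line of the grep output through `parse_finish` and grouping by `host` would show which VM finished which jobs, and when.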

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1618 - Posted: 18 Jan 2016, 16:03:24 UTC - in response to Message 1617. I am using 7.6.22, the latest. I interrupted the internet for 11 min, until 9:49 UTC. I eventually stopped the CMS task and started a new one, as the old one kept creating new run-x dirs.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 7	Message 1619 - Posted: 18 Jan 2016, 16:32:24 UTC - in response to Message 1618. OK, thanks.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1620 - Posted: 18 Jan 2016, 17:30:59 UTC I repeated the test: 15 min with no internet connection. This time it just started a new run with a new job. I guess that is what is supposed to happen.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1628 - Posted: 25 Jan 2016, 12:14:35 UTC FYI: I found 6 jobs that have not been run but are labeled as finished: 7750, 7821, 7942, 8831, 8885, 8925.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 7	Message 1629 - Posted: 25 Jan 2016, 14:37:35 UTC - in response to Message 1620.
Quote from Message 1620:
    I repeated the test: 15 min with no internet connection. This time it just started a new run with a new job. I guess that is what is supposed to happen.
Yes, I'd believed otherwise, but lately I've been informed that it will lose the job. We will apparently be depending on it at Point 5 when beam starts up, as we will use the vast high-level trigger (HLT) server farm for short jobs (~2 hrs) in between proton collision conditions ("inter-fill") which will be dropped when the farm is needed for the trigger again. I'm told Condor reschedules them at the front of the queue.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1630 - Posted: 25 Jan 2016, 15:00:34 UTC - in response to Message 1629. Last modified: 25 Jan 2016, 15:01:06 UTC Yes, the job is just taken over by somebody else and finished. Not failed, just the IP address is kept. Do you need logs of 60311 failures? Or, alternatively, I have a job that just stopped at event 115. However, it is probably better to concentrate on the dominant error, not the oddball.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 7	Message 1631 - Posted: 25 Jan 2016, 15:18:55 UTC - in response to Message 1628. In fact, they did all finish, and returned result files. However, they were all done on the same machine from one of our friends in SETI.Germany. I don't immediately see anything amiss with his logs, but somehow Dashboard isn't parsing his returns properly, I guess. Hmm, Dashboard did properly record details of other jobs from the same host -- another Dashboard mystery, I guess.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 1632 - Posted: 25 Jan 2016, 15:20:34 UTC - in response to Message 1631. Never mind, then. Just thought this was odd.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 7	Message 1644 - Posted: 26 Jan 2016, 15:41:50 UTC Last modified: 26 Jan 2016, 16:30:31 UTC
Oops, number of running jobs has fallen. Investigating...
===
1615: The glide-ins aren't gliding in...
1630: I guess everyone is seeing this in their cron-stdout file:
    15:43:01 +0000 2016-01-26 [INFO] Starting CMS Application - Run 4
    15:43:01 +0000 2016-01-26 [INFO] Reading the BOINC volunteer's information
    15:43:03 +0000 2016-01-26 [INFO] Volunteer: ivan (9) Host: 22
    15:43:03 +0000 2016-01-26 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90
    15:43:03 +0000 2016-01-26 [INFO] Requesting an X509 credential from CMS-Dev
    15:43:05 +0000 2016-01-26 [INFO] Requesting an X509 credential from LHC@home
    curl -s -u 9:d9a4d5ccdeaae49eff789ed77c04b53c --capath /etc/grid-security/certificates/ https://cms-data-bridge.cern.ch/boinc-auth/get_proxy -o /tmp/x509up_u500
    15:43:06 +0000 2016-01-26 [INFO] Cloud not get a proxy from CMS-Dev
    15:43:06 +0000 2016-01-26 [INFO] Going to sleep for 1 hour
Messages sent.
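To check a cron-stdout file for the failure shown above, a minimal Python sketch follows. It assumes every log line has the shape "HH:MM:SS +ZZZZ YYYY-MM-DD [LEVEL] message", as in the excerpt quoted in this post; lines that do not fit (such as the bare curl command) are skipped. The function names are hypothetical, and the match is on "not get a proxy" because the log itself spells the message "Cloud not get a proxy".

```python
import re
from datetime import datetime

# Line shape assumed from the excerpt above: time, UTC offset, date, [LEVEL], text.
LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2} [+-]\d{4} \d{4}-\d{2}-\d{2}) \[(\w+)\] (.*)$"
)


def parse_line(line):
    """Split one log line into (timestamp, level, message), or return None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    stamp = datetime.strptime(m.group(1), "%H:%M:%S %z %Y-%m-%d")
    return stamp, m.group(2), m.group(3)


def proxy_failures(lines):
    """Return the timestamps of lines reporting that no proxy was obtained."""
    hits = []
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and "not get a proxy" in parsed[2]:
            hits.append(parsed[0])
    return hits
```

Calling `proxy_failures(open(path))` on a log, where `path` is wherever your VM's cron-stdout lives, would list when each failure happened.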

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 7	Message 1646 - Posted: 27 Jan 2016, 9:41:42 UTC - in response to Message 1644.
Looks like we may be up again:
    08:31:08 +0000 2016-01-27 [INFO] Cloud not get a proxy from CMS-Dev
    08:31:08 +0000 2016-01-27 [INFO] Going to sleep for 1 hour
    09:33:01 +0000 2016-01-27 [INFO] Starting CMS Application - Run 8
    09:33:01 +0000 2016-01-27 [INFO] Reading the BOINC volunteer's information
    09:33:03 +0000 2016-01-27 [INFO] Volunteer: ivan (9) Host: 22
    09:33:03 +0000 2016-01-27 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90
    09:33:03 +0000 2016-01-27 [INFO] Requesting an X509 credential from CMS-Dev
    subject : /O=Volunteer Computing/O=CERN/CN=ivan 9/CN=1325983332
    issuer : /O=Volunteer Computing/O=CERN/CN=ivan 9
    identity : /O=Volunteer Computing/O=CERN/CN=ivan 9
    type : RFC 3820 compliant impersonation proxy
    strength : 1024 bits
    path : /tmp/x509up_u500
    timeleft : 129:59:56 (5.4 days)
    09:33:08 +0000 2016-01-27 [INFO] Downloading glidein
    09:33:08 +0000 2016-01-27 [INFO] Running glidein (check logs)
A current task should pick up a new job once its one-hour pause is over. That's the theory, anyway... Thanks, Laurence.
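The "timeleft" field in the proxy details above reads as hours:minutes:seconds. As a small worked check, assuming that reading of the format, this Python sketch converts it to days; the helper name is hypothetical.

```python
def timeleft_to_days(timeleft: str) -> float:
    """Convert an 'H:MM:SS' string (hours may exceed 24) to days."""
    hours, minutes, seconds = (int(part) for part in timeleft.split(":"))
    total_seconds = hours * 3600 + minutes * 60 + seconds
    return total_seconds / 86400  # 86400 seconds in a day


if __name__ == "__main__":
    # Value taken from the proxy listing above.
    print(round(timeleft_to_days("129:59:56"), 1))  # prints 5.4
```

129:59:56 is 467,996 seconds, or about 5.42 days, which agrees with the "(5.4 days)" shown in the log.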

Development for LHC@home