Message boards : Number crunching : Current issues

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1126
Credit: 7,861,186
RAC: 5
Message 1608 - Posted: 17 Jan 2016, 12:49:34 UTC

Let's start a new thread, the old one's begun rambling a bit...

I've convinced myself that jobs get lost once BOINC is shut down for any reason. So, I've played with JobLeaseDuration, the time Condor allows before deeming a job to be lost (at least that's my understanding...). I set it to 600 seconds, and it seems to have an effect on the "jobs in progress" graph.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1609 - Posted: 17 Jan 2016, 14:21:49 UTC

I suspended the cms task for 6 min.
The vm reports cms virtual machine as "aborted", but is then started.
On resume logs are reset and everything starts at run-1 again.
New glidein.

Event number and lumisection number are the same as with the original job, just starting at the beginning.
I did not check whether the job number is the same, but even if it is, the work done on the original job seems lost, as it starts from scratch again.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1126
Credit: 7,861,186
RAC: 5
Message 1610 - Posted: 17 Jan 2016, 15:18:25 UTC - in response to Message 1609.  

Interesting, thanks. I'm still not sure I exactly understand the behaviour, but it's certainly different to what I'd originally thought.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1611 - Posted: 17 Jan 2016, 15:24:32 UTC - in response to Message 1610.  

Just disconnecting the internet for 8 min has no effect. It keeps running.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1167
Credit: 785,515
RAC: 362
Message 1612 - Posted: 17 Jan 2016, 18:57:53 UTC - in response to Message 1609.  

I suspended the cms task for 6 min.
The vm reports cms virtual machine as "aborted", but is then started.
On resume logs are reset and everything starts at run-1 again.
New glidein.

Event number and lumisection number are the same as with the original job, just starting at the beginning.
I did not check whether the job number is the same, but even if it is, the work done on the original job seems lost, as it starts from scratch again.

This is a BOINC issue.
With your BOINC version (7.6.9) the VM should save within 15 seconds after a suspend with 'Leave application in memory' off.
Install the newest recommended BOINC version, 7.6.22, and the VM gets 60 seconds to save, which is normally enough unless you are saving a lot of VMs at the same time.

CP
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 867,936
RAC: 13
Message 1613 - Posted: 18 Jan 2016, 2:22:37 UTC
Last modified: 18 Jan 2016, 2:24:51 UTC

I was just experimenting, starting and stopping BOINC for varying times, when I noticed that the 24 hr timeout was approaching, as was 200 events. So, not keen to waste the work, I waited for it to stop at 200 events. It didn't. It had got to 262 when it was killed. Job 6566. Looking at another in-progress job, it seems set for 300 events.
How many events should be in each job? The task name shows "200ev" but condor is set for 300.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1167
Credit: 785,515
RAC: 362
Message 1614 - Posted: 18 Jan 2016, 8:49:45 UTC - in response to Message 1613.  

I was just experimenting, starting and stopping BOINC for varying times, when I noticed that the 24 hr timeout was approaching, as was 200 events. So, not keen to waste the work, I waited for it to stop at 200 events. It didn't. It had got to 262 when it was killed. Job 6566. Looking at another in-progress job, it seems set for 300 events.
How many events should be in each job? The task name shows "200ev" but condor is set for 300.

Read message: http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=79&postid=1572#1572
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 867,936
RAC: 13
Message 1615 - Posted: 18 Jan 2016, 10:00:48 UTC - in response to Message 1614.  


... How many events should be in each job? The task name shows "200ev" but condor is set for 300.

Read message: http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=79&postid=1572#1572

Ah... yes. Now I remember... Thanks, CP.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1616 - Posted: 18 Jan 2016, 10:40:29 UTC
Last modified: 18 Jan 2016, 10:42:13 UTC

I interrupted the internet connection for 11 min.
After reconnecting, glidein was run over and over again, every 7 minutes or so.
I suspended the cms-task for a few minutes, and after resuming it continued doing so, generating run-4, run-5, etc.
Anybody not watching the logs would not notice that.
Logs are available, if needed.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1126
Credit: 7,861,186
RAC: 5
Message 1617 - Posted: 18 Jan 2016, 14:18:07 UTC - in response to Message 1616.  

I interrupted the internet connection for 11 min.
After reconnecting, glidein was run over and over again, every 7 minutes or so.
I suspended the cms-task for a few minutes, and after resuming it continued doing so, generating run-4, run-5, etc.
Anybody not watching the logs would not notice that.
Logs are available, if needed.

You seem to have recovered, one way or another:
160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6499.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 00:35:40 GMT 2016 on 277-617-29249 with (short) status 0 ========
160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6580.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 02:09:36 GMT 2016 on 277-617-29249 with (short) status 0 ========
160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6581.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 03:51:02 GMT 2016 on 277-617-29249 with (short) status 0 ========
160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6663.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 05:38:35 GMT 2016 on 277-617-29249 with (short) status 0 ========
160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6705.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 07:05:25 GMT 2016 on 277-617-29249 with (short) status 0 ========
160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6751.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 08:49:40 GMT 2016 on 277-617-29249 with (short) status 0 ========
160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6873.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 13:13:27 GMT 2016 on 277-617-21343 with (short) status 0 ========

[cms005@lcggwms02:~] > head 160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6873.0.txt
======== gWMS-CMSRunAnalysis.sh STARTING at Mon Jan 18 11:43:13 GMT 2016 on 277-617-21343 ========
Local time : Mon Jan 18 12:43:13 CET 2016
Current system : Linux 277-617-21343 3.10.64-85.cernvm.x86_64 #1 SMP Fri Jan 9 09:53:29 CET 2015 x86_64 x86_64 x86_64 GNU/Linux
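
For anyone wanting to pull the job number, finish time, host and exit status out of summary lines like the ones above, here is a minimal Python sketch. The regular expression is inferred only from the lines quoted in this post, so it is an assumption about the format rather than anything documented, and `parse_finish_line` is a made-up helper name.

```python
import re
from datetime import datetime

# Pattern inferred from the "FINISHING" lines quoted above (an assumption,
# not a documented format).
FINISH_RE = re.compile(
    r"job_out\.(?P<job>\d+)\.\d+\.txt:=+ gWMS-CMSRunAnalysis\.sh FINISHING at "
    r"(?P<when>.+?) GMT (?P<year>\d{4}) on (?P<host>\S+) "
    r"with \(short\) status (?P<status>\d+)"
)

def parse_finish_line(line):
    """Hypothetical helper: return (job, finish time, host, status), or None."""
    m = FINISH_RE.search(line)
    if m is None:
        return None
    # The time zone token (GMT) is skipped by the pattern; times are kept naive.
    when = datetime.strptime(f"{m['when']} {m['year']}", "%a %b %d %H:%M:%S %Y")
    return int(m["job"]), when, m["host"], int(m["status"])

sample = (
    "160107_194556:ireid_crab_CMS_at_Home_MinBias_200ev/job_out.6873.0.txt:"
    "======== gWMS-CMSRunAnalysis.sh FINISHING at Mon Jan 18 13:13:27 GMT 2016 "
    "on 277-617-21343 with (short) status 0 ========"
)
print(parse_finish_line(sample))
```

Run over the whole grep output, this would make it easy to spot a change of host ID (as between jobs 6751 and 6873) or a non-zero status.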


What version of BOINC are you using?
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1618 - Posted: 18 Jan 2016, 16:03:24 UTC - in response to Message 1617.  

I am using 7.6.22, the latest.
I interrupted the internet for 11 min, until 09:49 UTC.
I eventually stopped the cms-task and started a new one, as the old one kept creating new run-x dirs.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1126
Credit: 7,861,186
RAC: 5
Message 1619 - Posted: 18 Jan 2016, 16:32:24 UTC - in response to Message 1618.  

OK, thanks.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1620 - Posted: 18 Jan 2016, 17:30:59 UTC

I repeated the test.
15 min no internet connection.
This time it just started a new run with a new job.
I guess that is what is supposed to happen.
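
As a rough illustration of the lease idea discussed in this thread: assuming the 600-second JobLeaseDuration mentioned in the first post is what decides whether a disconnected job is given up, the outage lengths tried so far (8, 11 and 15 minutes) would sort as below. This is only a sketch of that assumption, not Condor's actual logic, and `job_presumed_lost` is a made-up name.

```python
from datetime import timedelta

# Assumption: the lease value quoted in the first post of this thread.
JOB_LEASE_DURATION = timedelta(seconds=600)

def job_presumed_lost(outage, lease=JOB_LEASE_DURATION):
    """Hypothetical helper: True if the outage outlasts the lease."""
    return outage > lease

# Outage lengths (in minutes) from the tests reported in this thread.
for minutes in (8, 11, 15):
    outage = timedelta(minutes=minutes)
    verdict = "presumed lost" if job_presumed_lost(outage) else "keeps running"
    print(f"{minutes:>2} min offline: {verdict}")
```

Under that assumption the 8-minute outage stays inside the lease, while the 11- and 15-minute outages exceed it, which at least matches the 8-minute test (kept running) and the 15-minute one (new job).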
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1628 - Posted: 25 Jan 2016, 12:14:35 UTC

FYI:

I found 6 jobs that have not been run but are labeled as finished:
7750
7821
7942
8831
8885
8925
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1126
Credit: 7,861,186
RAC: 5
Message 1629 - Posted: 25 Jan 2016, 14:37:35 UTC - in response to Message 1620.  

I repeated the test.
15 min no internet connection.
This time it just started a new run with a new job.
I guess that is what is supposed to happen.

Yes, I'd believed otherwise, but lately I've been informed that it will lose the job. We will apparently be depending on it at Point 5 when beam starts up, as we will use the vast high-level trigger (HLT) server farm for short jobs (~2 hrs) in between proton collision conditions ("inter-fill") which will be dropped when the farm is needed for the trigger again. I'm told Condor reschedules them at the front of the queue.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1630 - Posted: 25 Jan 2016, 15:00:34 UTC - in response to Message 1629.  
Last modified: 25 Jan 2016, 15:01:06 UTC

Yes, the job is just taken over by somebody else and finished.
Not failed, just the IP address is kept.

Do you need logs of 60311 failures?

Or, alternatively, I have a job that just stopped at event 115.
However, it is probably better to concentrate on the dominant error, not the odd-ball.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1126
Credit: 7,861,186
RAC: 5
Message 1631 - Posted: 25 Jan 2016, 15:18:55 UTC - in response to Message 1628.  

In fact, they did all finish, and returned result files. However, they were all done on the same machine from one of our friends in SETI.Germany. I don't immediately see anything amiss with his logs, but somehow Dashboard isn't parsing his returns properly, I guess. Hmm, Dashboard did properly record details of other jobs from the same host -- another Dashboard mystery, I guess.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1632 - Posted: 25 Jan 2016, 15:20:34 UTC - in response to Message 1631.  

Never mind, then.
Just thought this was odd.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1126
Credit: 7,861,186
RAC: 5
Message 1644 - Posted: 26 Jan 2016, 15:41:50 UTC
Last modified: 26 Jan 2016, 16:30:31 UTC

Oops, number of running jobs has fallen. Investigating...
===
1615: The glide-ins aren't gliding in...

1630: I guess everyone is seeing this in their cron-stdout file:

15:43:01 +0000 2016-01-26 [INFO] Starting CMS Application - Run 4
15:43:01 +0000 2016-01-26 [INFO] Reading the BOINC volunteer's information
15:43:03 +0000 2016-01-26 [INFO] Volunteer: ivan (9) Host: 22
15:43:03 +0000 2016-01-26 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90
15:43:03 +0000 2016-01-26 [INFO] Requesting an X509 credential from CMS-Dev
15:43:05 +0000 2016-01-26 [INFO] Requesting an X509 credential from LHC@home
curl -s -u 9:d9a4d5ccdeaae49eff789ed77c04b53c --capath /etc/grid-security/certificates/ https://cms-data-bridge.cern.ch/boinc-auth/get_proxy -o /tmp/x509up_u500
15:43:06 +0000 2016-01-26 [INFO] Cloud not get a proxy from CMS-Dev
15:43:06 +0000 2016-01-26 [INFO] Going to sleep for 1 hour


Messages sent.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1126
Credit: 7,861,186
RAC: 5
Message 1646 - Posted: 27 Jan 2016, 9:41:42 UTC - in response to Message 1644.  

Looks like we may be up again:

08:31:08 +0000 2016-01-27 [INFO] Cloud not get a proxy from CMS-Dev
08:31:08 +0000 2016-01-27 [INFO] Going to sleep for 1 hour
09:33:01 +0000 2016-01-27 [INFO] Starting CMS Application - Run 8
09:33:01 +0000 2016-01-27 [INFO] Reading the BOINC volunteer's information
09:33:03 +0000 2016-01-27 [INFO] Volunteer: ivan (9) Host: 22
09:33:03 +0000 2016-01-27 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90
09:33:03 +0000 2016-01-27 [INFO] Requesting an X509 credential from CMS-Dev
subject : /O=Volunteer Computing/O=CERN/CN=ivan 9/CN=1325983332
issuer : /O=Volunteer Computing/O=CERN/CN=ivan 9
identity : /O=Volunteer Computing/O=CERN/CN=ivan 9
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 129:59:56 (5.4 days)
09:33:08 +0000 2016-01-27 [INFO] Downloading glidein
09:33:08 +0000 2016-01-27 [INFO] Running glidein (check logs)
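
If you want to check how long your own task actually paused, the timestamps in cron-stdout can be compared directly. The sketch below assumes every timestamped line has the `HH:MM:SS +ZZZZ YYYY-MM-DD [LEVEL] message` layout seen in the excerpt above; `parse_line` is a made-up helper, and untimestamped lines (such as the proxy details) are simply skipped.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt above: time, UTC offset, date, [LEVEL], message.
LINE_RE = re.compile(
    r"^(\d{2}:\d{2}:\d{2} [+-]\d{4} \d{4}-\d{2}-\d{2}) \[(\w+)\] (.*)$"
)

def parse_line(line):
    """Hypothetical helper: return (timestamp, level, message), or None."""
    m = LINE_RE.match(line)
    if m is None:
        return None  # e.g. the proxy details, which carry no timestamp
    stamp = datetime.strptime(m.group(1), "%H:%M:%S %z %Y-%m-%d")
    return stamp, m.group(2), m.group(3)

log = """\
08:31:08 +0000 2016-01-27 [INFO] Going to sleep for 1 hour
09:33:01 +0000 2016-01-27 [INFO] Starting CMS Application - Run 8
"""

entries = [e for e in map(parse_line, log.splitlines()) if e]
slept_at = next(t for t, _, msg in entries if msg.startswith("Going to sleep"))
restarted_at = next(
    t for t, _, msg in entries if msg.startswith("Starting CMS Application")
)
print(restarted_at - slept_at)  # 1:01:53
```

For the two lines taken from the log above, the gap comes out at just under 62 minutes, so the "1 hour" sleep is slightly longer in practice.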


A current task should pick up a new job once its one-hour pause is over. That's the theory, anyway...

Thanks, Laurence.