Message boards : News : Jobs incoming!

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,894,492
RAC: 1,886
Message 1042 - Posted: 8 Sep 2015, 14:00:36 UTC - in response to Message 1025.  
Last modified: 8 Sep 2015, 14:33:59 UTC

Hi Ivan,
is it possible to put an indicator of jobs available on the SSP?
The "Tasks ready to send" is a pointless figure.

It should be possible -- how to actually do it is another matter! I'll give it some thought.

OK, it took me far too long to install a local copy of Condor; the version in the standard Scientific Linux CERN repository was too old, so in the end I installed the very latest from rpm. So now I can get the number of running and available jobs remotely from the RAL machine. The question now is how to get them from here to the SSP?
Ah! I just realised that this isn't really the number of available jobs, as Condor only holds ~1000 jobs in its (visible) queue, so if we get this working you'll never see much more than 1000 available, but you would see when the number starts falling below 1000.
After a bit of playing around:
[eesridr:src] > . stats
Tue Sep  8 14:32:57 UTC 2015: 64 running, 168 idle
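A minimal sketch of how a line like the one above could be turned into numbers for a status page. It assumes the `stats` output always has this exact shape and that the "idle" figure is the "available" count discussed above; the names `parse_stats`, `is_draining` and `VISIBLE_QUEUE_CAP` are made up for illustration and are not part of the project.

```python
import re

# Matches lines such as "Tue Sep  8 14:32:57 UTC 2015: 64 running, 168 idle".
STATS_RE = re.compile(r"^(?P<when>.+?): (?P<running>\d+) running, (?P<idle>\d+) idle$")

# Assumption from the post above: Condor only shows about 1000 queued jobs,
# so an idle count below this suggests the batch is starting to run out.
VISIBLE_QUEUE_CAP = 1000


def parse_stats(line):
    """Split one stats line into a timestamp string and two integer counts."""
    match = STATS_RE.match(line.strip())
    if match is None:
        raise ValueError("unrecognised stats line: %r" % line)
    return {
        "when": match.group("when"),
        "running": int(match.group("running")),
        "idle": int(match.group("idle")),
    }


def is_draining(stats, cap=VISIBLE_QUEUE_CAP):
    """True when fewer idle jobs are visible than the assumed queue cap."""
    return stats["idle"] < cap


if __name__ == "__main__":
    stats = parse_stats("Tue Sep  8 14:32:57 UTC 2015: 64 running, 168 idle")
    print(stats, "draining" if is_draining(stats) else "queue full")
```

The timestamp is kept as a plain string on purpose, since parsing day and month names depends on the locale.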

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1043 - Posted: 8 Sep 2015, 14:40:38 UTC - in response to Message 1042.  

That is great! Thanks for your effort.
At least there would be an indication when jobs are going to run out.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1045 - Posted: 8 Sep 2015, 15:12:30 UTC - in response to Message 1041.  
Last modified: 8 Sep 2015, 15:20:32 UTC

I think I now know why there are missing run-folders.
cron-stdout:

16:25:01 +0200 2015-09-08 [INFO] Starting CMS Application - Run 1
16:25:01 +0200 2015-09-08 [INFO] Reading the BOINC volunteer's information
16:25:02 +0200 2015-09-08 [INFO] Volunteer: Rasputin42 (277) Host: 617
16:25:02 +0200 2015-09-08 [INFO] VMID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
16:25:02 +0200 2015-09-08 [INFO] Requesting an X509 credential
16:25:05 +0200 2015-09-08 [ERROR] Proxy error
16:25:05 +0200 2015-09-08 [INFO] Going to sleep for 1 hour
16:26:01 +0200 2015-09-08 [INFO] Starting CMS Application - Run 2
16:26:01 +0200 2015-09-08 [INFO] Reading the BOINC volunteer's information
16:26:02 +0200 2015-09-08 [INFO] Volunteer: Rasputin42 (277) Host: 617
16:26:02 +0200 2015-09-08 [INFO] VMID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
16:26:02 +0200 2015-09-08 [INFO] Requesting an X509 credential
subject : /O=Volunteer Computing/O=CERN/CN=Rasputin42 277/CN=803544860
issuer : /O=Volunteer Computing/O=CERN/CN=Rasputin42 277
identity : /O=Volunteer Computing/O=CERN/CN=Rasputin42 277
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 129:59:01 (5.4 days)
16:26:03 +0200 2015-09-08 [INFO] Downloading glidein
16:26:06 +0200 2015-09-08 [INFO] Running glidein (check logs)

cron-stderr:

chmod: cannot access `/tmp/x509up_u500': No such file or directory

ERROR: Couldn't find a valid proxy.
globus_sysconfig: Could not find a valid proxy certificate file location
globus_sysconfig: Error with key filename
globus_sysconfig: File does not exist: /tmp/x509up_u500 is not a valid file

Use -debug for further information.

This is similar to the error that was reported before.
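A rough sketch for checking this idea on other hosts: it walks through cron-stdout and lists the runs that logged an `[ERROR]` line. The log layout is taken from the excerpt above; the link between a failed run and a missing run-N folder is only the hypothesis from this post, and the function name `failed_runs` is made up for illustration.

```python
import re

# Layout taken from the excerpt: "HH:MM:SS +ZZZZ YYYY-MM-DD [LEVEL] message".
LINE_RE = re.compile(
    r"^(?P<time>\d{2}:\d{2}:\d{2}) (?P<tz>[+-]\d{4}) (?P<date>\d{4}-\d{2}-\d{2}) "
    r"\[(?P<level>[A-Z]+)\] (?P<msg>.*)$"
)
RUN_RE = re.compile(r"Starting CMS Application - Run (\d+)")

# Shortened copy of the log posted above.
SAMPLE_LOG = """\
16:25:01 +0200 2015-09-08 [INFO] Starting CMS Application - Run 1
16:25:02 +0200 2015-09-08 [INFO] Requesting an X509 credential
16:25:05 +0200 2015-09-08 [ERROR] Proxy error
16:25:05 +0200 2015-09-08 [INFO] Going to sleep for 1 hour
16:26:01 +0200 2015-09-08 [INFO] Starting CMS Application - Run 2
16:26:02 +0200 2015-09-08 [INFO] Requesting an X509 credential
subject : /O=Volunteer Computing/O=CERN/CN=Rasputin42 277/CN=803544860
16:26:03 +0200 2015-09-08 [INFO] Downloading glidein
16:26:06 +0200 2015-09-08 [INFO] Running glidein (check logs)
"""


def failed_runs(log_text):
    """Map run number -> list of error messages logged during that run."""
    errors = {}
    current_run = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue  # proxy details and other untimestamped lines
        started = RUN_RE.search(match.group("msg"))
        if started:
            current_run = int(started.group(1))
        elif match.group("level") == "ERROR" and current_run is not None:
            errors.setdefault(current_run, []).append(match.group("msg"))
    return errors


if __name__ == "__main__":
    print(failed_runs(SAMPLE_LOG))
```

Comparing the run numbers it reports with the run-N folders actually present would confirm or rule out the hypothesis.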

PDW
Joined: 20 May 15
Posts: 217
Credit: 5,660,983
RAC: 15,052
Message 1046 - Posted: 8 Sep 2015, 15:17:10 UTC - in response to Message 1045.  

I think I now know why there are missing run-folders.
cron-stdout:

16:25:01 +0200 2015-09-08 [INFO] Starting CMS Application - Run 1
16:25:01 +0200 2015-09-08 [INFO] Reading the BOINC volunteer's information
16:25:02 +0200 2015-09-08 [INFO] Volunteer: Rasputin42 (277) Host: 617
16:25:02 +0200 2015-09-08 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90
16:25:02 +0200 2015-09-08 [INFO] Requesting an X509 credential
16:25:05 +0200 2015-09-08 [ERROR] Proxy error
16:25:05 +0200 2015-09-08 [INFO] Going to sleep for 1 hour

This is similar to the error that was reported before.

So it isn't just me getting these errors then !

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1047 - Posted: 8 Sep 2015, 15:21:47 UTC - in response to Message 1046.  

Apparently not.
However, in my case, it recovers immediately.

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 202
Message 1048 - Posted: 8 Sep 2015, 16:14:08 UTC - in response to Message 1047.  

Took a quick look at a host here, seems to work first time so perhaps
a timing problem somewhere.

17:02:03 +0100 2015-09-08 [INFO] CMS glidein Run 0 ended
17:03:01 +0100 2015-09-08 [INFO] Starting CMS Application - Run 1
17:03:01 +0100 2015-09-08 [INFO] Reading the BOINC volunteer's information
17:03:05 +0100 2015-09-08 [INFO] Volunteer: m (178) Host: 243
17:03:05 +0100 2015-09-08 [INFO] VMID: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
17:03:05 +0100 2015-09-08 [INFO] Requesting an X509 credential
subject : /O=Volunteer Computing/O=CERN/CN=m 178/CN=2116421106
issuer : /O=Volunteer Computing/O=CERN/CN=m 178
identity : /O=Volunteer Computing/O=CERN/CN=m 178
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 129:59:58 (5.4 days)
17:03:13 +0100 2015-09-08 [INFO] Downloading glidein
17:03:20 +0100 2015-09-08 [INFO] Running glidein (check logs)

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,894,492
RAC: 1,886
Message 1049 - Posted: 8 Sep 2015, 16:19:42 UTC - in response to Message 1043.  

That is great! Thanks for your effort.
At least there would be an indication when jobs are going to run out.

Yeah, there are a couple of air-gaps/firewalls that I can't see my way through just yet, so don't hold your breath too long. :-/

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1050 - Posted: 8 Sep 2015, 16:24:50 UTC - in response to Message 1048.  
Last modified: 8 Sep 2015, 16:26:18 UTC

Do you (or anyone else) have any missing run-x folders on any of your hosts?

run-1, run-2, run-4.......
x

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 202
Message 1051 - Posted: 8 Sep 2015, 17:08:44 UTC - in response to Message 1050.  
Last modified: 8 Sep 2015, 17:13:35 UTC

The short answer is no. But the hosts here run overnight and are switched off
during the day. Although the directory structure does appear to be preserved
it may not be typical.
I started one manually to look at this, only run-1 and run-2 so far. However
it's spent the last hour trying in vain to get another CMS job.

Edit: Went to turn it off, only to find it working away so it's been reprieved
for now.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1184
Credit: 824,998
RAC: 1,080
Message 1052 - Posted: 8 Sep 2015, 19:05:37 UTC - in response to Message 1050.  

Do you (or anyone else) have any missing run-x folders on any of your hosts?

run-1, run-2, run-4.......
x

Yeah, I'd never seen this before, but now that you ask: it happens ;)

Folder run-1 was not created.

The first folder is run-2/

Contents of cron-stdout:
20:43:01 +0200 2015-09-08 [INFO] Starting CMS Application - Run 1
20:43:01 +0200 2015-09-08 [INFO] Reading the BOINC volunteer's information
20:43:02 +0200 2015-09-08 [INFO] Volunteer: Crystal Pellet (38) Host: 37
20:43:02 +0200 2015-09-08 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90
20:43:02 +0200 2015-09-08 [INFO] Requesting an X509 credential
20:43:05 +0200 2015-09-08 [ERROR] Proxy error
20:43:05 +0200 2015-09-08 [INFO] Going to sleep for 1 hour
20:44:01 +0200 2015-09-08 [INFO] Starting CMS Application - Run 2
20:44:01 +0200 2015-09-08 [INFO] Reading the BOINC volunteer's information
20:44:02 +0200 2015-09-08 [INFO] Volunteer: Crystal Pellet (38) Host: 37
20:44:02 +0200 2015-09-08 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90
20:44:02 +0200 2015-09-08 [INFO] Requesting an X509 credential
subject : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 38/CN=1660116057
issuer : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 38
identity : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 38
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 130:00:00 (5.4 days)
20:44:04 +0200 2015-09-08 [INFO] Downloading glidein
20:44:08 +0200 2015-09-08 [INFO] Running glidein (check logs)



The "sleep for 1 hour" actually lasts about 1 minute.
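A small sketch that measures the real pause from the log timestamps, assuming the cron-stdout layout shown above (time, UTC offset, date, then the message). The helper names `parse_time` and `sleep_gaps` are made up for illustration.

```python
from datetime import datetime

SLEEP_MSG = "Going to sleep for 1 hour"
START_MSG = "Starting CMS Application"


def parse_time(line):
    """Read the leading 'HH:MM:SS +ZZZZ YYYY-MM-DD' fields of a log line."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(stamp, "%H:%M:%S %z %Y-%m-%d")


def sleep_gaps(lines):
    """Seconds between each 'going to sleep' message and the next run start."""
    gaps = []
    slept_at = None
    for line in lines:
        if SLEEP_MSG in line:
            slept_at = parse_time(line)
        elif START_MSG in line and slept_at is not None:
            gaps.append((parse_time(line) - slept_at).total_seconds())
            slept_at = None
    return gaps


if __name__ == "__main__":
    log = [
        "20:43:05 +0200 2015-09-08 [ERROR] Proxy error",
        "20:43:05 +0200 2015-09-08 [INFO] Going to sleep for 1 hour",
        "20:44:01 +0200 2015-09-08 [INFO] Starting CMS Application - Run 2",
    ]
    print(sleep_gaps(log))  # [56.0]
```

For the two lines quoted above the gap is 56 seconds, which fits a job that is restarted every minute rather than one that waits for the announced hour.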

PDW
Joined: 20 May 15
Posts: 217
Credit: 5,660,983
RAC: 15,052
Message 1054 - Posted: 8 Sep 2015, 20:40:45 UTC - in response to Message 1052.  

Well it's nice to know it isn't just happening to me but at least all of your systems seem to be able to recover and eventually get a credential.

Once mine fail they don't recover. I've tried turning both the router and modem off today, but not for as long as I did on Saturday when I first noticed it. Neither of these actions had any effect, so for the moment I have suspended them :-(

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,894,492
RAC: 1,886
Message 1056 - Posted: 8 Sep 2015, 22:50:42 UTC - in response to Message 1051.  

I started one manually to look at this, only run-1 and run-2 so far. However
it's spent the last hour trying in vain to get another CMS job.

Edit: Went to turn it off, only to find it working away so it's been reprieved
for now.

Heh! We were running out of the batch I submitted yesterday (1000x 250-event MinBias jobs), so I submitted another 10,000 late this afternoon. I think that'll keep you busy for a week or so!
I'm off to a UK GRIDPP meeting tomorrow for the rest of the week (I decided not to go to the CMS Week in Ischia this week, Italian food is so boring...), so you may not hear too much from me until Saturday. (I might be up early Saturday, landscapers are coming to clear the jungle that I laughingly call my garden, and they have petrol-powered implements of destruction...)

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1060 - Posted: 9 Sep 2015, 0:21:52 UTC
Last modified: 9 Sep 2015, 0:26:34 UTC

Events are very short. A lot of them are only a few seconds long (3-9 per minute)!!!

Is that OK?

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,894,492
RAC: 1,886
Message 1062 - Posted: 9 Sep 2015, 9:08:23 UTC - in response to Message 1060.  

Events are very short. A lot of them are only a few seconds long (3-9 per minute)!!!

Is that OK?

Yes, they are minimal interactions with the occasional chance of something more complex. Though it sounds as if you have a reasonably fast processor...

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,894,492
RAC: 1,886
Message 1083 - Posted: 15 Sep 2015, 20:27:32 UTC
Last modified: 15 Sep 2015, 21:02:18 UTC

This batch of jobs is coming to a close soon:
Tue Sep 15 20:20:01 UTC 2015: 66 running, 833 idle
(Oh, Laurence is looking into whether we can get that info onto the Status Page.)
I'll submit another lot tomorrow with fewer events, to see if that affects the number of jobs that don't report results -- we may be reaching a limit on the practical size of result files.

[Later] Actually, that status may be misleading, there may be other jobs in the queue too -- Dashboard thinks there are still 5,000 jobs to run. I've fewer than 6,000 results returned and delving into the Condor status I see:
[cms005@lcggwms02:~] > grep STATUS_READY 150908_152652:ireid_crab_CMS_at_Home_MinBias2/node_state.txt|wc
4380 26280 170820
Tue Sep 15 21:51:25
[cms005@lcggwms02:~] > grep STATUS_DONE 150908_152652:ireid_crab_CMS_at_Home_MinBias2/node_state.txt|wc
4315 25890 163970
Tue Sep 15 21:51:40
[cms005@lcggwms02:~] > grep STATUS_ERROR 150908_152652:ireid_crab_CMS_at_Home_MinBias2/node_state.txt|wc
422 2532 16458
Tue Sep 15 21:51:44
[cms005@lcggwms02:~] > grep STATUS_PRERUN 150908_152652:ireid_crab_CMS_at_Home_MinBias2/node_state.txt|wc
0 0 0
Tue Sep 15 21:52:12
[cms005@lcggwms02:~] > grep STATUS_SUBMITTED 150908_152652:ireid_crab_CMS_at_Home_MinBias2/node_state.txt|wc
884 5305 38014
Tue Sep 15 21:52:28
[cms005@lcggwms02:~] > grep STATUS_POSTRUN 150908_152652:ireid_crab_CMS_at_Home_MinBias2/node_state.txt|wc
1 6 41

To be investigated tomorrow...
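The six separate grep runs above could be replaced by one pass over the file. This is only a sketch: the layout of node_state.txt is not shown in this thread, so it assumes each line carries at most one `STATUS_...` token, and the function names are made up for illustration. The `POSTED` figures are the line counts from the output above.

```python
import re
from collections import Counter

STATUS_RE = re.compile(r"STATUS_[A-Z]+")


def count_statuses(lines):
    """Count lines per status token, like one 'grep STATUS_X | wc' per status."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


def error_share(counts):
    """Errors as a fraction of jobs that have finished (done + error)."""
    errors = counts.get("STATUS_ERROR", 0)
    finished = counts.get("STATUS_DONE", 0) + errors
    return errors / finished if finished else 0.0


# Line counts (first column of the wc output) as posted above.
POSTED = {
    "STATUS_READY": 4380,
    "STATUS_DONE": 4315,
    "STATUS_ERROR": 422,
    "STATUS_PRERUN": 0,
    "STATUS_SUBMITTED": 884,
    "STATUS_POSTRUN": 1,
}

if __name__ == "__main__":
    print("total lines counted:", sum(POSTED.values()))
    print("error share of finished jobs: %.1f%%" % (100 * error_share(POSTED)))
```

With the posted figures the six counts add up to 10,002, close to the 10,000 jobs submitted, and errors make up roughly 9% of the jobs that have finished either way.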

PDW
Joined: 20 May 15
Posts: 217
Credit: 5,660,983
RAC: 15,052
Message 1084 - Posted: 16 Sep 2015, 7:19:15 UTC - in response to Message 1083.  

Roughly 10% error rate doesn't look good, do you know what's causing that ?

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,894,492
RAC: 1,886
Message 1085 - Posted: 16 Sep 2015, 8:57:17 UTC - in response to Message 1084.  

Roughly 10% error rate doesn't look good, do you know what's causing that ?

I suspect it's the larger result-file size, which should show up soon enough when I halve the size of the jobs (could be the longer run time leading to more losses on suspend/resume too). At the moment I'm more puzzled at the discrepancy between the Condor queue reports and the actual returned results.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,894,492
RAC: 1,886
Message 1086 - Posted: 16 Sep 2015, 11:19:53 UTC

Hmm, this doesn't look good!



Lots of jobs failing after two retries.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,449
RAC: 238
Message 1087 - Posted: 16 Sep 2015, 12:17:25 UTC - in response to Message 1086.  

Looking at the logs of my last job, everything seems fine (exit status 0). Might be something on the Condor server.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1184
Credit: 824,998
RAC: 1,080
Message 1088 - Posted: 16 Sep 2015, 12:58:15 UTC - in response to Message 1083.  

This batch of jobs is coming to a close soon:
Tue Sep 15 20:20:01 UTC 2015: 66 running, 833 idle

I've got job 5850 running, out of the 10,000 you announced last week.
So I'm wondering why you think we're coming to a close soon.
Have the later jobs been purged from the queue?


©2024 CERN