Message boards : News : Jobs incoming!
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,939,884
RAC: 3,177
Message 913 - Posted: 31 Aug 2015, 16:24:10 UTC
Last modified: 31 Aug 2015, 16:41:32 UTC

Patches have been applied, jobs should be ready when you want them, Enjoy!

[Edit] Confirmed, jobs are available now. [/Edit]
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,842,171
RAC: 17,203
Message 914 - Posted: 31 Aug 2015, 17:37:36 UTC - in response to Message 913.  

Are you sure ?
Not seeing anything coming through on the ones I've checked.
Just a bit of activity for wget then cvmfs2 and then an empty run- folder till the next one starts.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,939,884
RAC: 3,177
Message 915 - Posted: 31 Aug 2015, 18:01:48 UTC - in response to Message 914.  
Last modified: 31 Aug 2015, 18:06:37 UTC

Are you sure ?
Not seeing anything coming through on the ones I've checked.
Just a bit of activity for wget then cvmfs2 and then an empty run- folder till the next one starts.

Yes, I've got one starting up at home right now (a bit slow over a 2.5 Mbps link...). And the task at work has nearly finished a job (these jobs run 25 events each).
Perhaps you need to abort the task and start a new one, to pick up an active glide-in? Also note that these jobs spend a lot of time at start-up with low CPU usage before downloading lots of data and starting to compute; I suspect a conditions-database server is AWOL and the request has to time out before failing over to a backup server.
From the Condor server I see that there are 17 jobs currently running, and infer (from timestamps) that 10 jobs have already completed. CMS Dashboard is reporting one success, three failures and two in post-processing, but it's notoriously slow and therefore unreliable in the short term; it's reporting 1293 jobs in this batch when in fact I submitted 2000.
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,842,171
RAC: 17,203
Message 916 - Posted: 31 Aug 2015, 18:16:54 UTC - in response to Message 915.  

It's possible the ones running before things restarted might need stopping but one that kicked off since you posted your message says this at the bottom of the glidein-stderr file...

used default retire time, 21600
Proxy not long lived enough (73173 s left), shortened retire time to -56427
using default retire spread, -5642
Retire time after spread too low (-54001), remove spread
Mon Aug 31 10:56:37 PDT 2015 Error running 'condor_startup.sh'
Mon Aug 31 10:56:37 PDT 2015 Sleeping 306
Mon Aug 31 11:01:44 PDT 2015 Sleeping 252
Mon Aug 31 11:05:56 PDT 2015 Sleeping 264


At the moment I'm sitting in the damp UK but this point in the boot.log is when I move to what I expect is a nice sunny beach in the Pacific region...

Mon Aug 31 17:47:16 2015: Starting rpcbind: [ OK ]
Mon Aug 31 09:47:20 2015: Starting CernVM: [ OK ]
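
The eight-hour jump between those two boot.log lines is consistent with the VM clock switching from UK summer time (UTC+1) to Pacific Daylight Time (UTC-7). A minimal Python sketch of that conversion, assuming the first timestamp is in BST and using fixed offsets rather than anything read from the VM (the function name is made up for illustration):

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets, assumed here rather than read from the VM.
BST = timezone(timedelta(hours=1), "BST")
PDT = timezone(timedelta(hours=-7), "PDT")


def to_pdt(stamp: str) -> datetime:
    """Interpret a boot.log style timestamp as BST and convert it to PDT."""
    local = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y").replace(tzinfo=BST)
    return local.astimezone(PDT)


converted = to_pdt("Mon Aug 31 17:47:16 2015")
print(converted.strftime("%H:%M:%S %Z"))
```

The converted time lands four seconds before the "Starting CernVM" line, which fits the idea that only the time zone changed between the two entries.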
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,842,171
RAC: 17,203
Message 917 - Posted: 31 Aug 2015, 18:21:46 UTC - in response to Message 916.  

This was at the bottom of the glidein-stdout...


=== XML description of glidein activity ===
<?xml version="1.0"?>
<OSGTestResult id="glidein_startup.sh" version="4.3.1">
<operatingenvironment>
<env name="client_name"></env>
<env name="client_group">main</env>
<env name="user">boinc</env>
<env name="arch">x86_64</env>
<env name="os">Scientific Linux release 6.5 (Carbon)</env>
<env name="hostname">246-563-4107</env>
<env name="cwd">/home/boinc/CMSRun</env>
</operatingenvironment>
<test>
<tStart>2015-08-31T10:56:03-07:00</tStart>
<tEnd>2015-08-31T10:56:37-07:00</tEnd>
</test>
<result>
<status>ERROR</status>
<metric name="TestID" ts="2015-08-31T10:56:37-07:00" uri="local">condor_startup.sh</metric>
<metric name="failure" ts="2015-08-31T10:56:37-07:00" uri="local">Config</metric>
<metric name="retire_time" ts="2015-08-31T10:56:37-07:00" uri="local">-56427</metric>
<metric name="min_retire_time" ts="2015-08-31T10:56:37-07:00" uri="local">600</metric>
</result>
<detail>
Validation failed in condor_startup.sh.

Retire time still too low (-56427), aborting
</detail>
</OSGTestResult>
=== End XML description of glidein activity ===
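
For anyone wanting to check these reports across several machines, here is a minimal Python sketch that pulls the status and the two retire-time metrics out of a report like the one above. The embedded XML is a trimmed copy of that report (the `ts` and `uri` attributes are dropped), and the function name is made up for illustration; it is not part of the glidein tooling.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the report above; only the <result> block is kept.
REPORT = """<OSGTestResult id="glidein_startup.sh" version="4.3.1">
  <result>
    <status>ERROR</status>
    <metric name="TestID">condor_startup.sh</metric>
    <metric name="failure">Config</metric>
    <metric name="retire_time">-56427</metric>
    <metric name="min_retire_time">600</metric>
  </result>
</OSGTestResult>"""


def summarize(report_xml: str) -> dict:
    """Return the status, failure type and retire-time metrics of a report."""
    result = ET.fromstring(report_xml).find("result")
    metrics = {m.get("name"): m.text for m in result.findall("metric")}
    return {
        "status": result.findtext("status"),
        "failure": metrics.get("failure"),
        "retire_time": int(metrics["retire_time"]),
        "min_retire_time": int(metrics["min_retire_time"]),
    }


summary = summarize(REPORT)
# How many seconds short of the minimum the computed retire time is.
shortfall = summary["min_retire_time"] - summary["retire_time"]
print(summary["status"], summary["failure"], shortfall)
```

The shortfall makes the size of the problem visible: the computed retire time is nowhere near the 600 s minimum, so this does not look like a borderline timing issue.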
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,939,884
RAC: 3,177
Message 918 - Posted: 31 Aug 2015, 18:34:48 UTC - in response to Message 917.  
Last modified: 31 Aug 2015, 18:35:13 UTC

Hmm, I'll have to let Laurence et al. comment on that, it's beyond my ken. All I know is that it's working for me. :-/ (Well, the job at home still hasn't completed downloading whatever-it-is that cmsRun needs before it can start; CPU time is up to 19 secs, there's been 20-30 minutes of 2.5 Mbps download -- Whoops, download finished, cmsRun up to 90+% CPU time; ALT+F5 shows "Begin processing the 1st record"!)
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,842,171
RAC: 17,203
Message 919 - Posted: 31 Aug 2015, 18:40:45 UTC - in response to Message 918.  

The job on my laptop in front of me will start afresh in about 20 minutes so will see if that fares any better with a clean environment.
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,842,171
RAC: 17,203
Message 920 - Posted: 31 Aug 2015, 19:09:01 UTC - in response to Message 919.  

Laptop is doing the same, first run again says this...

-----------------------------------------------------
used default retire time, 21600
Proxy not long lived enough (69090 s left), shortened retire time to -60510
using default retire spread, -6051
Retire time after spread too low (-58695), remove spread
Mon Aug 31 12:04:41 PDT 2015 Error running 'condor_startup.sh'
Mon Aug 31 12:04:43 PDT 2015 Sleeping 253
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,939,884
RAC: 3,177
Message 921 - Posted: 31 Aug 2015, 19:21:03 UTC - in response to Message 920.  
Last modified: 31 Aug 2015, 19:27:42 UTC

That definitely is something for Laurence and the team, I'm not au fait with how the proxies are set up for our purposes.

Meanwhile, there are now 20 jobs running, and 13 reported as "done" by Condor. Dashboard says 11 success, 7 failed and two in post-processing; there are 14 results returned to stage-out.
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,842,171
RAC: 17,203
Message 922 - Posted: 31 Aug 2015, 19:33:56 UTC - in response to Message 921.  

Ok.

Any plans to add the Condor stats to the server status page so us mere mortals can see what is ready to run, running, done ?
Phil

Joined: 9 Apr 15
Posts: 57
Credit: 230,221
RAC: 0
Message 923 - Posted: 31 Aug 2015, 19:46:11 UTC - in response to Message 913.  

Patches have been applied, jobs should be ready when you want them, Enjoy!

[Edit] Confirmed, jobs are available now. [/Edit]


Yep, got 2 machines up and working.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,939,884
RAC: 3,177
Message 924 - Posted: 31 Aug 2015, 19:54:16 UTC - in response to Message 922.  
Last modified: 31 Aug 2015, 19:55:48 UTC

Ok.
Any plans to add the Condor stats to the server status page so us mere mortals can see what is ready to run, running, done ?

I'm not sure how easy that would be to do, and they can be hard to understand -- it took me a while to realise that the Condor queue only reports on the first 1000 or so jobs ready to run, e.g. at the moment the summary line is
1037 jobs; 0 completed, 0 removed, 1002 idle, 20 running, 15 held, 0 suspended
whereas there are something close to 2000 jobs still to run. The "20 running" includes the DAG job that's controlling the queue, the "15 held" are DAG jobs that have finished running previous batches (one day I'll have to work out how to delete them cleanly...)
It's probably trivial to run a cron job to export the summary somewhere (harder to pull the summary; the server is behind a firewall and not only requires a non-standard port, but also needs a registered certificate to be used). Getting it into the status page may require more gymnastics. Probably not impossible, but given the intense "interest" our latest problems have stirred up, probably also to be superseded by newer developments.
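
As a rough illustration of what an exported summary could be turned into, here is a minimal Python sketch that parses the summary line quoted above into per-state counts. The line format is taken from this post only; the function name is hypothetical, and how the line would actually be exported from the server is left open.

```python
import re

# Summary line as quoted in the post above.
SUMMARY = "1037 jobs; 0 completed, 0 removed, 1002 idle, 20 running, 15 held, 0 suspended"


def parse_summary(line: str) -> dict:
    """Turn a queue summary line into a {state: count} dictionary."""
    total_part, states_part = line.split(";", 1)
    counts = {"jobs": int(total_part.split()[0])}
    for count, state in re.findall(r"(\d+)\s+(\w+)", states_part):
        counts[state] = int(count)
    return counts


counts = parse_summary(SUMMARY)
# The per-state counts should add up to the reported total.
per_state = sum(value for key, value in counts.items() if key != "jobs")
print(counts["idle"], counts["running"], per_state == counts["jobs"])
```

Any status page built on this would still need the caveats above: the total only covers the first 1000 or so queued jobs, and the running and held counts include DAG jobs.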
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,842,171
RAC: 17,203
Message 925 - Posted: 31 Aug 2015, 20:22:44 UTC - in response to Message 924.  

Alright, add it to the project wishlist please :-)
Right up there with badges !
Did a search, couldn't believe no-one else has asked for them yet !

Stop teasing us with snippets of what is or isn't a political hot potato in the scientific CERN world, you said in an earlier thread that you were happy to be fired for telling us stuff ;-)
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 107
Message 926 - Posted: 31 Aug 2015, 20:35:16 UTC - in response to Message 917.  

The proxy should be valid for 130 hours (about 5.4 days) and the VMs only run for 24 hours. There may be an issue if you suspend the VM then resume a few days later. I have added some logging so that the proxy's remaining lifetime is shown. I would suggest that for now you just abort the task and see if this solves the problem. If not, it may be something else...
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,842,171
RAC: 17,203
Message 927 - Posted: 31 Aug 2015, 20:41:16 UTC - in response to Message 926.  

Nothing got suspended here, running flat out, will go and see which other machines have started new jobs since Ivan posted the news and see if any of them are working productively...
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,939,884
RAC: 3,177
Message 928 - Posted: 31 Aug 2015, 20:50:29 UTC - in response to Message 925.  
Last modified: 31 Aug 2015, 20:55:43 UTC

Alright, add it to the project wishlist please :-)
Right up there with badges !
Did a search, couldn't believe no-one else has asked for them yet !
I see badgers [Badgers, badgers, badgers, badgers, badgers,... snake!]
^Wbadges -- there is a switch somewhere to turn them off, perhaps you've activated it?

Stop teasing us with snippets of what is or isn't a political hot potato in the scientific CERN world, you said in an earlier thread that you were happy to be fired for telling us stuff ;-)

Ah, yes, well... It's one thing for me to be fired, it's another to jeopardise the project! The fact that we're seeing jobs again is evidence that, perhaps, some heads have cooled down again. But there's a whole lot of influence areas (for want of a better term) that have been woken up and find that others have been encroaching on their perceived domains. Or something... A meeting of the minds is planned to work out who gets the kudos, and who gets the -- well, I'll let you guess that one. I said I wasn't a politician.

Oh, we're up to 25 jobs running, and 30 results returned. We're running a bit behind the same task on the real GRID, but then I had over 1600 jobs at once running there -- we ain't gonna do that with 100 or so pre-beta testers!
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,842,171
RAC: 17,203
Message 929 - Posted: 31 Aug 2015, 21:13:46 UTC - in response to Message 928.  

I see badgers [Badgers, badgers, badgers, badgers, badgers,... snake!]
^Wbadges -- there is a switch somewhere to turn them off, perhaps you've activated it?

For a moment there I thought you'd got stuck at a rained off barbie and had a few too many, but now I see you just meant mushroom !
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 930 - Posted: 31 Aug 2015, 21:18:16 UTC - in response to Message 918.  

Well, the job at home still hasn't completed downloading whatever-it-is that cmsRun needs before it can start; CPU time is up to 19 secs, there's been 20-30 minutes of 2.5 Mbps download -- Whoops, download finished, cmsRun up to 90+% CPU time; ALT+F5 shows "Begin processing the 1st record"!)

The initial download, before the first cmsRun reached >90% CPU, made the virtual hard disk grow by 1.4 GB to 2.85 GB.

Glad there's progress again. Hopefully you are able to iron out all important issues, so you can start your BOINC <==> GRID comparison.
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 107
Message 931 - Posted: 31 Aug 2015, 21:58:52 UTC - in response to Message 930.  

One of the volunteers pointed out a potential issue with the proxy so I have tweaked a few timeout values. Hope that helps!
Profile PDW

Joined: 20 May 15
Posts: 217
Credit: 5,842,171
RAC: 17,203
Message 932 - Posted: 31 Aug 2015, 22:20:35 UTC - in response to Message 931.  

I aborted the job and even reset the project and can now see this sort of stuff in cron-stdout...

type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 16:05:58
15:15:12 -0700 2015-08-31 [INFO] Downloading glidein
15:15:28 -0700 2015-08-31 [INFO] Running glidein (check logs)

but it is still failing with...

used default retire time, 21600
Proxy not long lived enough (57538 s left), shortened retire time to -72062
using default retire spread, -7206
Retire time after spread too low (-68820), remove spread
Mon Aug 31 15:17:13 PDT 2015 Error running 'condor_startup.sh'
Mon Aug 31 15:17:15 PDT 2015 Sleeping 300
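
The three failing runs quoted in this thread are all consistent with the glidein subtracting a fixed 129600 s (36 hours) from the remaining proxy lifetime, and with the default spread being a tenth of the result. That constant is inferred from the logged numbers, not taken from the glidein source, so the Python below is only a sketch of the arithmetic, with hypothetical names throughout:

```python
# (proxy seconds left, logged retire time, logged spread), copied from the
# glidein-stderr excerpts posted in this thread.
OBSERVED = [
    (73173, -56427, -5642),
    (69090, -60510, -6051),
    (57538, -72062, -7206),
]

ASSUMED_RESERVE = 129600  # 36 h; inferred from the logs, not from the source
MIN_RETIRE_TIME = 600     # min_retire_time metric in the XML report posted earlier


def shortened_retire_time(proxy_left: int) -> int:
    """Retire time after the 'Proxy not long lived enough' adjustment."""
    return proxy_left - ASSUMED_RESERVE


def default_spread(retire_time: int) -> int:
    """A tenth of the retire time, truncated toward zero."""
    return int(retire_time / 10)


for proxy_left, logged_retire, logged_spread in OBSERVED:
    retire = shortened_retire_time(proxy_left)
    assert retire == logged_retire
    assert default_spread(retire) == logged_spread

# Under these assumptions, the proxy lifetime needed to pass validation:
needed = ASSUMED_RESERVE + MIN_RETIRE_TIME
print(f"proxy needs at least {needed} s ({needed / 3600:.1f} h) left")
```

If that reading is right, the 16:05:58 of proxy lifetime shown in cron-stdout is well short of what is needed, which would explain why the run still fails after aborting the job and resetting the project.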