Jobs incoming!
Author | Message |
---|---|
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
Patches have been applied, jobs should be ready when you want them, Enjoy! [Edit] Confirmed, jobs are available now. [/Edit] |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
Are you sure ? Not seeing anything coming through on the ones I've checked. Just a bit of activity for wget then cvmfs2 and then an empty run- folder till the next one starts. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"Are you sure ?" Yes, I've got one starting up at home right now (a bit slow over a 2.5 Mbps link...). And the task at work has nearly finished a job (these jobs run 25 events each). Perhaps you need to abort the task and start a new one, to pick up an active glide-in? Also note that these jobs spend a lot of time at start-up with low CPU usage before downloading lots of data and starting to compute; I suspect a conditions-database server is AWOL and the request has to time out before failing over to a backup server. From the Condor server I see that there are 17 jobs currently running, and infer (from timestamps) that 10 jobs have already completed. CMS Dashboard is reporting one success, three failures and two in post-processing, but it's notoriously slow and therefore unreliable in the short term; it's reporting 1293 jobs in this batch when in fact I submitted 2000. |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
It's possible the ones running before things restarted might need stopping, but one that kicked off since you posted your message says this at the bottom of the glidein-stderr file... used default retire time, 21600 Proxy not long lived enough (73173 s left), shortened retire time to -56427 using default retire spread, -5642 Retire time after spread too low (-54001), remove spread Mon Aug 31 10:56:37 PDT 2015 Error running 'condor_startup.sh' Mon Aug 31 10:56:37 PDT 2015 Sleeping 306 Mon Aug 31 11:01:44 PDT 2015 Sleeping 252 Mon Aug 31 11:05:56 PDT 2015 Sleeping 264 At the moment I'm sitting in the damp UK but this point in the boot.log is when I move to what I expect is a nice sunny beach in the Pacific region... Mon Aug 31 17:47:16 2015: Starting rpcbind: [ OK ] Mon Aug 31 09:47:20 2015: Starting CernVM: [ OK ] |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
This was at the bottom of the glidein-stdout... === XML description of glidein activity === <?xml version="1.0"?> <OSGTestResult id="glidein_startup.sh" version="4.3.1"> <operatingenvironment> <env name="client_name"></env> <env name="client_group">main</env> <env name="user">boinc</env> <env name="arch">x86_64</env> <env name="os">Scientific Linux release 6.5 (Carbon)</env> <env name="hostname">246-563-4107</env> <env name="cwd">/home/boinc/CMSRun</env> </operatingenvironment> <test> <tStart>2015-08-31T10:56:03-07:00</tStart> <tEnd>2015-08-31T10:56:37-07:00</tEnd> </test> <result> <status>ERROR</status> <metric name="TestID" ts="2015-08-31T10:56:37-07:00" uri="local">condor_startup.sh</metric> <metric name="failure" ts="2015-08-31T10:56:37-07:00" uri="local">Config</metric> <metric name="retire_time" ts="2015-08-31T10:56:37-07:00" uri="local">-56427</metric> <metric name="min_retire_time" ts="2015-08-31T10:56:37-07:00" uri="local">600</metric> </result> <detail> Validation failed in condor_startup.sh. Retire time still too low (-56427), aborting </detail> </OSGTestResult> === End XML description of glidein activity === |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
Hmm, I'll have to let Laurence et al. comment on that, it's beyond my ken. All I know is that it's working for me. :-/ (Well, the job at home still hasn't completed downloading whatever-it-is that cmsRun needs before it can start; CPU time is up to 19 secs, there's been 20-30 minutes of 2.5 Mbps download -- Whoops, download finished, cmsRun up to 90+% CPU time; ALT+F5 shows "Begin processing the 1st record"!) |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
The job on my laptop in front of me will start afresh in about 20 minutes so will see if that fares any better with a clean environment. |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
Laptop is doing the same, first run again says this... ----------------------------------------------------- used default retire time, 21600 Proxy not long lived enough (69090 s left), shortened retire time to -60510 using default retire spread, -6051 Retire time after spread too low (-58695), remove spread Mon Aug 31 12:04:41 PDT 2015 Error running 'condor_startup.sh' Mon Aug 31 12:04:43 PDT 2015 Sleeping 253 |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
That definitely is something for Laurence and the team, I'm not au fait with how the proxies are set up for our purposes. Meanwhile, there are now 20 jobs running, and 13 reported as "done" by Condor. Dashboard says 11 success, 7 failed and two in post-processing; there are 14 results returned to stage-out. |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
Ok. Any plans to add the Condor stats to the server status page so us mere mortals can see what is ready to run, running, done ? |
Joined: 9 Apr 15 Posts: 57 Credit: 230,221 RAC: 0 |
"Patches have been applied, jobs should be ready when you want them, Enjoy!" Yep, got 2 machines up and working. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
Ok. I'm not sure how easy that would be to do, and they can be hard to understand -- it took me a while to realise that the Condor queue only reports on the first 1000 or so jobs ready to run, e.g. at the moment the summary line is "1037 jobs; 0 completed, 0 removed, 1002 idle, 20 running, 15 held, 0 suspended", whereas there are close to 2000 jobs still to run. The "20 running" includes the DAG job that's controlling the queue; the "15 held" are DAG jobs that have finished running previous batches (one day I'll have to work out how to delete them cleanly...). It's probably trivial to run a cron job to export the summary somewhere (harder to pull the summary; the server is behind a firewall and not only requires a non-standard port, but also needs a registered certificate to be used). Getting it into the status page may require more gymnastics. Probably not impossible, but given the intense "interest" our latest problems have stirred up, probably also to be superseded by newer developments. |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
Alright, add it to the project wishlist please :-) Right up there with badges ! Did a search, couldn't believe no-one else has asked for them yet ! Stop teasing us with snippets of what is or isn't a political hot potato in the scientific CERN world, you said in an earlier thread that you were happy to be fired for telling us stuff ;-) |
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
The proxy should be valid for 130 hours (about 5.4 days) and the VMs only run for 24 hours. There may be an issue if you suspend the VM then resume a few days later. I have added some logging so the time left on the proxy is shown. I would suggest that for now you just abort the Task and see if this solves the problem. If not, it may be something else .... |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
Nothing got suspended here, running flat out, will go and see which other machines have started new jobs since Ivan posted the news and see if any of them are working productively... |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 10 |
"Alright, add it to the project wishlist please :-)" I see badgers [Badgers, badgers, badgers, badgers, badgers,... snake!] ^Wbadges -- there is a switch somewhere to turn them off, perhaps you've activated it?
Ah, yes, well... It's one thing for me to be fired, it's another to jeopardise the project! The fact that we're seeing jobs again is evidence that, perhaps, some heads have cooled down again. But there's a whole lot of influence areas (for want of a better term) that have been woken up and find that others have been encroaching on their perceived domains. Or something... A meeting of the minds is planned to work out who gets the kudos, and who gets the -- well, I'll let you guess that one. I said I wasn't a politician. Oh, we're up to 25 jobs running, and 30 results returned. We're running a bit behind the same task on the real GRID, but then I had over 1600 jobs at once running there -- we ain't gonna do that with 100 or so pre-beta testers! |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
"I see badgers [Badgers, badgers, badgers, badgers, badgers,... snake!]" For a moment there I thought you'd got stuck at a rained-off barbie and had a few too many, but now I see you just meant mushroom ! |
Joined: 13 Feb 15 Posts: 1188 Credit: 874,807 RAC: 1,040 |
"Well, the job at home still hasn't completed downloading whatever-it-is that cmsRun needs before it can start; CPU time is up to 19 secs, there's been 20-30 minutes of 2.5 Mbps download -- Whoops, download finished, cmsRun up to 90+% CPU time; ALT+F5 shows 'Begin processing the 1st record'!)" The initial download, before the first cmsRun went above 90% CPU, grew the virtual hard disk by 1.4 GB to 2.85 GB. Glad there's progress again. Hopefully you are able to iron out all important issues, so you can start your BOINC <==> GRID comparison. |
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
One of the volunteers pointed out a potential issue with the proxy so I have tweaked a few timeout values. Hope that helps! |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 134 |
I aborted the job and even reset the project and can now see this sort of stuff in cron-stdout... type : RFC 3820 compliant impersonation proxy strength : 1024 bits path : /tmp/x509up_u500 timeleft : 16:05:58 15:15:12 -0700 2015-08-31 [INFO] Downloading glidein 15:15:28 -0700 2015-08-31 [INFO] Running glidein (check logs) but it is still failing with... used default retire time, 21600 Proxy not long lived enough (57538 s left), shortened retire time to -72062 using default retire spread, -7206 Retire time after spread too low (-68820), remove spread Mon Aug 31 15:17:13 PDT 2015 Error running 'condor_startup.sh' Mon Aug 31 15:17:15 PDT 2015 Sleeping 300 |
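
The three glidein-stderr excerpts quoted in this thread follow one simple pattern: in each, the shortened retire time equals the proxy time left minus 129600 s (36 hours). That offset is inferred only from the quoted numbers, not from the glidein source, so the sketch below is a check of the arithmetic under that assumption. The helper names are hypothetical; the 600 s minimum comes from the `min_retire_time` metric in the glidein-stdout XML.

```python
# Values copied from the glidein-stderr excerpts quoted in this thread:
# (proxy seconds left, shortened retire time reported by the glidein).
SAMPLES = [
    (73173, -56427),
    (69090, -60510),
    (57538, -72062),
]

MIN_RETIRE_TIME = 600  # "min_retire_time" metric from the glidein-stdout XML


def implied_offset(proxy_left: int, shortened_retire: int) -> int:
    """Seconds subtracted from the proxy lifetime (inferred, not documented)."""
    return proxy_left - shortened_retire


def retire_spread(shortened_retire: int) -> int:
    """One tenth of the retire time, truncated toward zero (matches the logs)."""
    return int(shortened_retire / 10)


def min_proxy_lifetime(offset: int, min_retire: int = MIN_RETIRE_TIME) -> int:
    """Shortest proxy lifetime that still leaves min_retire seconds."""
    return offset + min_retire


offsets = {implied_offset(left, retire) for left, retire in SAMPLES}
print(offsets)  # {129600}: the same 36-hour offset in all three excerpts

offset = next(iter(offsets))
needed = min_proxy_lifetime(offset)
for left, retire in SAMPLES:
    print(f"proxy left {left} s, spread {retire_spread(retire)} s, "
          f"short by {needed - left} s")
```

On these numbers the proxy would need about 130200 s (a little over 36 hours) left for the validation to pass, whereas the cron-stdout excerpt shows a `timeleft` of roughly 16 hours.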
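
Earlier in the thread the Condor queue summary is quoted as "1037 jobs; 0 completed, 0 removed, 1002 idle, 20 running, 15 held, 0 suspended", with the note that one of the running jobs is the controlling DAG job. A cron job exporting that summary would first have to turn the line into numbers. Here is a minimal sketch of that parsing step, assuming the line always has the quoted shape; `parse_summary` and `payload_running` are hypothetical names, and fetching the line from the server is left out.

```python
import re

# Pattern for the summary line exactly as quoted in the thread.
SUMMARY_RE = re.compile(
    r"(?P<total>\d+) jobs; (?P<completed>\d+) completed, "
    r"(?P<removed>\d+) removed, (?P<idle>\d+) idle, "
    r"(?P<running>\d+) running, (?P<held>\d+) held, "
    r"(?P<suspended>\d+) suspended"
)


def parse_summary(line: str) -> dict[str, int]:
    """Return the per-state job counts from one queue summary line."""
    match = SUMMARY_RE.search(line)
    if match is None:
        raise ValueError(f"not a queue summary line: {line!r}")
    return {name: int(value) for name, value in match.groupdict().items()}


def payload_running(counts: dict[str, int], controlling_dags: int = 1) -> int:
    """Running jobs minus the DAG job(s) that only control the queue."""
    return counts["running"] - controlling_dags


line = ("1037 jobs; 0 completed, 0 removed, 1002 idle, "
        "20 running, 15 held, 0 suspended")
counts = parse_summary(line)
states = sum(value for name, value in counts.items() if name != "total")
print(counts["total"] == states)  # True: the per-state counts add up to 1037
print(payload_running(counts))    # 19
```

The total only covers what the queue reports (about the first 1000 idle jobs, per the post), so it undercounts the jobs still to run.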