Thread 'Jobs incoming!'

Author	Message
ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 913 - Posted: 31 Aug 2015, 16:24:10 UTC Last modified: 31 Aug 2015, 16:41:32 UTC Patches have been applied, jobs should be ready when you want them, Enjoy! [Edit] Confirmed, jobs are available now. [/Edit]

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 914 - Posted: 31 Aug 2015, 17:37:36 UTC - in response to Message 913. Are you sure? Not seeing anything coming through on the ones I've checked. Just a bit of activity for wget then cvmfs2 and then an empty run- folder till the next one starts.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 915 - Posted: 31 Aug 2015, 18:01:48 UTC - in response to Message 914. Last modified: 31 Aug 2015, 18:06:37 UTC
[Quote] Are you sure? Not seeing anything coming through on the ones I've checked. Just a bit of activity for wget then cvmfs2 and then an empty run- folder till the next one starts. [/Quote]
Yes, I've got one starting up at home right now (a bit slow over a 2.5 Mbps link...). And the task at work has nearly finished a job (these jobs run 25 events each). Perhaps you need to abort the task and start a new one, to pick up an active glide-in? Also note that these jobs spend a lot of time at start-up with low CPU usage before downloading lots of data and starting to compute; I suspect a conditions-database server is AWOL and the request has to time out before failing over to a backup server.
From the Condor server I see that there are 17 jobs currently running, and infer (from timestamps) that 10 jobs have already completed. CMS Dashboard is reporting one success, three failures and two in post-processing, but it's notoriously slow and therefore unreliable in the short term; it's reporting 1293 jobs in this batch when in fact I submitted 2000.

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 916 - Posted: 31 Aug 2015, 18:16:54 UTC - in response to Message 915.
It's possible the ones running before things restarted might need stopping, but one that kicked off since you posted your message says this at the bottom of the glidein-stderr file...
    used default retire time, 21600
    Proxy not long lived enough (73173 s left), shortened retire time to -56427
    using default retire spread, -5642
    Retire time after spread too low (-54001), remove spread
    Mon Aug 31 10:56:37 PDT 2015 Error running 'condor_startup.sh'
    Mon Aug 31 10:56:37 PDT 2015 Sleeping 306
    Mon Aug 31 11:01:44 PDT 2015 Sleeping 252
    Mon Aug 31 11:05:56 PDT 2015 Sleeping 264
At the moment I'm sitting in the damp UK but this point in the boot.log is when I move to what I expect is a nice sunny beach in the Pacific region...
    Mon Aug 31 17:47:16 2015: Starting rpcbind: ^[[60G[^[[0;32m OK ^[[0;39m]
    Mon Aug 31 09:47:20 2015: Starting CernVM: ^[[60G[^[[0;32m OK ^[[0;39m]

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 917 - Posted: 31 Aug 2015, 18:21:46 UTC - in response to Message 916.
This was at the bottom of the glidein-stdout...
=== XML description of glidein activity ===
<?xml version="1.0"?>
<OSGTestResult id="glidein_startup.sh" version="4.3.1">
  <operatingenvironment>
    <env name="client_name"></env>
    <env name="client_group">main</env>
    <env name="user">boinc</env>
    <env name="arch">x86_64</env>
    <env name="os">Scientific Linux release 6.5 (Carbon)</env>
    <env name="hostname">246-563-4107</env>
    <env name="cwd">/home/boinc/CMSRun</env>
  </operatingenvironment>
  <test>
    <tStart>2015-08-31T10:56:03-07:00</tStart>
    <tEnd>2015-08-31T10:56:37-07:00</tEnd>
  </test>
  <result>
    <status>ERROR</status>
    <metric name="TestID" ts="2015-08-31T10:56:37-07:00" uri="local">condor_startup.sh</metric>
    <metric name="failure" ts="2015-08-31T10:56:37-07:00" uri="local">Config</metric>
    <metric name="retire_time" ts="2015-08-31T10:56:37-07:00" uri="local">-56427</metric>
    <metric name="min_retire_time" ts="2015-08-31T10:56:37-07:00" uri="local">600</metric>
  </result>
  <detail>
    Validation failed in condor_startup.sh. Retire time still too low (-56427), aborting
  </detail>
</OSGTestResult>
=== End XML description of glidein activity ===
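
A report like the one above can be checked mechanically rather than by eye. The following is a minimal Python sketch, assuming the report has already been cut out from between the "=== XML description ===" marker lines; the `summarize` helper is hypothetical and the embedded XML is a shortened copy of the report quoted in this post.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the glidein-stdout report quoted above
# (the operatingenvironment and test sections are omitted).
REPORT = """<?xml version="1.0"?>
<OSGTestResult id="glidein_startup.sh" version="4.3.1">
  <result>
    <status>ERROR</status>
    <metric name="TestID" ts="2015-08-31T10:56:37-07:00" uri="local">condor_startup.sh</metric>
    <metric name="failure" ts="2015-08-31T10:56:37-07:00" uri="local">Config</metric>
    <metric name="retire_time" ts="2015-08-31T10:56:37-07:00" uri="local">-56427</metric>
    <metric name="min_retire_time" ts="2015-08-31T10:56:37-07:00" uri="local">600</metric>
  </result>
  <detail>Validation failed in condor_startup.sh. Retire time still too low (-56427), aborting</detail>
</OSGTestResult>"""


def summarize(report_xml):
    """Return (status, metrics) where metrics maps metric name to its text."""
    root = ET.fromstring(report_xml)
    status = root.findtext("result/status")
    metrics = {m.get("name"): m.text for m in root.iter("metric")}
    return status, metrics


status, metrics = summarize(REPORT)
retire_time = int(metrics["retire_time"])
min_retire_time = int(metrics["min_retire_time"])
print(status, metrics["failure"], retire_time, min_retire_time)
if retire_time < min_retire_time:
    print("retire_time is below min_retire_time, so the glidein aborts")
```

The comparison at the end only restates what the `<detail>` element says: the computed retire time is below the reported minimum of 600 seconds.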

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 918 - Posted: 31 Aug 2015, 18:34:48 UTC - in response to Message 917. Last modified: 31 Aug 2015, 18:35:13 UTC Hmm, I'll have to let Laurence et al. comment on that; it's beyond my ken. All I know is that it's working for me. :-/ (Well, the job at home still hasn't completed downloading whatever-it-is that cmsRun needs before it can start; CPU time is up to 19 secs, there's been 20-30 minutes of 2.5 Mbps download -- Whoops, download finished, cmsRun up to 90+% CPU time; ALT+F5 shows "Begin processing the 1st record"!)

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 919 - Posted: 31 Aug 2015, 18:40:45 UTC - in response to Message 918. The job on my laptop in front of me will start afresh in about 20 minutes so will see if that fares any better with a clean environment.

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 920 - Posted: 31 Aug 2015, 19:09:01 UTC - in response to Message 919.
Laptop is doing the same, first run again says this...
    used default retire time, 21600
    Proxy not long lived enough (69090 s left), shortened retire time to -60510
    using default retire spread, -6051
    Retire time after spread too low (-58695), remove spread
    Mon Aug 31 12:04:41 PDT 2015 Error running 'condor_startup.sh'
    Mon Aug 31 12:04:43 PDT 2015 Sleeping 253

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 921 - Posted: 31 Aug 2015, 19:21:03 UTC - in response to Message 920. Last modified: 31 Aug 2015, 19:27:42 UTC That definitely is something for Laurence and the team; I'm not au fait with how the proxies are set up for our purposes. Meanwhile, there are now 20 jobs running, and 13 reported as "done" by Condor. Dashboard says 11 success, 7 failed and two in post-processing; there are 14 results returned to stage-out.

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 922 - Posted: 31 Aug 2015, 19:33:56 UTC - in response to Message 921. Ok. Any plans to add the Condor stats to the server status page so us mere mortals can see what is ready to run, running, done?

Phil Joined: 9 Apr 15 Posts: 57 Credit: 230,221 RAC: 0	Message 923 - Posted: 31 Aug 2015, 19:46:11 UTC - in response to Message 913.
[Quote] Patches have been applied, jobs should be ready when you want them, Enjoy! [Edit] Confirmed, jobs are available now. [/Edit] [/Quote]
Yep, got 2 machines up and working.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 924 - Posted: 31 Aug 2015, 19:54:16 UTC - in response to Message 922. Last modified: 31 Aug 2015, 19:55:48 UTC
[Quote] Ok. Any plans to add the Condor stats to the server status page so us mere mortals can see what is ready to run, running, done? [/Quote]
I'm not sure how easy that would be to do, and they can be hard to understand -- it took me a while to realise that the Condor queue only reports on the first 1000 or so jobs ready to run, e.g. at the moment the summary line is
    1037 jobs; 0 completed, 0 removed, 1002 idle, 20 running, 15 held, 0 suspended
whereas there are close to 2000 jobs still to run. The "20 running" includes the DAG job that's controlling the queue; the "15 held" are DAG jobs that have finished running previous batches (one day I'll have to work out how to delete them cleanly...)
It's probably trivial to run a cron job to export the summary somewhere (harder to pull the summary; the server is behind a firewall and not only requires a non-standard port, but also needs a registered certificate to be used). Getting it into the status page may require more gymnastics. Probably not impossible, but given the intense "interest" our latest problems have stirred up, probably also to be superseded by newer developments.
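
For anyone wanting to export that summary, here is a minimal Python sketch that turns the summary line quoted above into numbers. It assumes the line always has the shape shown in this post; `parse_summary` is a hypothetical helper, and how the line would be fetched from the Condor server is left out.

```python
import re

# Summary line exactly as quoted in the post above.
SUMMARY = ("1037 jobs; 0 completed, 0 removed, 1002 idle, "
           "20 running, 15 held, 0 suspended")


def parse_summary(line):
    """Split a queue summary line into (total, {state: count})."""
    head = re.match(r"\s*(\d+) jobs;", line)
    if head is None:
        raise ValueError("not a queue summary line: %r" % line)
    # Everything after "N jobs;" is a comma-separated list of "<count> <state>".
    pairs = re.findall(r"(\d+) (\w+)", line[head.end():])
    counts = {state: int(number) for number, state in pairs}
    return int(head.group(1)), counts


total, counts = parse_summary(SUMMARY)
print(total, counts)
# The per-state counts should add up to the reported total.
print(sum(counts.values()) == total)
```

As the post notes, this total covers only what the queue currently reports, not the full batch, so it understates the jobs still to run.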

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 925 - Posted: 31 Aug 2015, 20:22:44 UTC - in response to Message 924. Alright, add it to the project wishlist please :-) Right up there with badges! Did a search, couldn't believe no-one else has asked for them yet! Stop teasing us with snippets of what is or isn't a political hot potato in the scientific CERN world, you said in an earlier thread that you were happy to be fired for telling us stuff ;-)

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 926 - Posted: 31 Aug 2015, 20:35:16 UTC - in response to Message 917. The proxy should be valid for 130 hours (5.6 days) and the VMs only run for 24 hours. There may be an issue if you suspend the VM then resume a few days later. I have added some logging so the time left of the proxy is shown. I would suggest that for now you just abort the Task and see if this solves the problem. If not, it may be something else ....

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 927 - Posted: 31 Aug 2015, 20:41:16 UTC - in response to Message 926. Nothing got suspended here, running flat out, will go and see which other machines have started new jobs since Ivan posted the news and see if any of them are working productively...

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 928 - Posted: 31 Aug 2015, 20:50:29 UTC - in response to Message 925. Last modified: 31 Aug 2015, 20:55:43 UTC
[Quote] Alright, add it to the project wishlist please :-) Right up there with badges! Did a search, couldn't believe no-one else has asked for them yet! [/Quote]
I see badgers [Badgers, badgers, badgers, badgers, badgers,... snake!] ^Wbadges -- there is a switch somewhere to turn them off, perhaps you've activated it?
[Quote] Stop teasing us with snippets of what is or isn't a political hot potato in the scientific CERN world, you said in an earlier thread that you were happy to be fired for telling us stuff ;-) [/Quote]
Ah, yes, well... It's one thing for me to be fired, it's another to jeopardise the project! The fact that we're seeing jobs again is evidence that, perhaps, some heads have cooled down again. But there's a whole lot of influence areas (for want of a better term) that have been woken up and find that others have been encroaching on their perceived domains. Or something... A meeting of the minds is planned to work out who gets the kudos, and who gets the -- well, I'll let you guess that one. I said I wasn't a politician.
Oh, we're up to 25 jobs running, and 30 results returned. We're running a bit behind the same task on the real GRID, but then I had over 1600 jobs at once running there -- we ain't gonna do that with 100 or so pre-beta testers!

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 929 - Posted: 31 Aug 2015, 21:13:46 UTC - in response to Message 928.
[Quote] I see badgers [Badgers, badgers, badgers, badgers, badgers,... snake!] ^Wbadges -- there is a switch somewhere to turn them off, perhaps you've activated it? [/Quote]
For a moment there I thought you'd got stuck at a rained off barbie and had a few too many, but now I see you just meant mushroom !

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1281 Credit: 1,047,486 RAC: 56	Message 930 - Posted: 31 Aug 2015, 21:18:16 UTC - in response to Message 918.
[Quote] (Well, the job at home still hasn't completed downloading whatever-it-is that cmsRun needs before it can start; CPU time is up to 19 secs, there's been 20-30 minutes of 2.5 Mbps download -- Whoops, download finished, cmsRun up to 90+% CPU time; ALT+F5 shows "Begin processing the 1st record"!) [/Quote]
The initial download, before the first cmsRun went above 90% CPU, grew the virtual hard disk by 1.4 GB to 2.85 GB.
Glad there's progress again. Hopefully you are able to iron out all important issues, so you can start your BOINC <==> GRID comparison.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 931 - Posted: 31 Aug 2015, 21:58:52 UTC - in response to Message 930. One of the volunteers pointed out a potential issue with the proxy so I have tweaked a few timeout values. Hope that helps!

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 932 - Posted: 31 Aug 2015, 22:20:35 UTC - in response to Message 931.
I aborted the job and even reset the project and can now see this sort of stuff in cron-stdout...
    type : RFC 3820 compliant impersonation proxy
    strength : 1024 bits
    path : /tmp/x509up_u500
    timeleft : 16:05:58
    15:15:12 -0700 2015-08-31 [INFO] Downloading glidein
    15:15:28 -0700 2015-08-31 [INFO] Running glidein (check logs)
but it is still failing with...
    used default retire time, 21600
    Proxy not long lived enough (57538 s left), shortened retire time to -72062
    using default retire spread, -7206
    Retire time after spread too low (-68820), remove spread
    Mon Aug 31 15:17:13 PDT 2015 Error running 'condor_startup.sh'
    Mon Aug 31 15:17:15 PDT 2015 Sleeping 300
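
The three glidein-stderr excerpts quoted in this thread are numerically consistent with one another. In each, the proxy lifetime minus the shortened retire time comes to 129600 s (36 hours), and the retire spread is one tenth of the retire time, truncated toward zero. Below is a minimal Python sketch of that arithmetic. The fixed 36-hour offset and the divide-by-ten rule are inferred from the logged numbers only, not taken from the glidein scripts, and the function names are hypothetical.

```python
# (proxy seconds left, shortened retire time, retire spread),
# copied from the three glidein-stderr excerpts in this thread.
LOGGED = [
    (73173, -56427, -5642),
    (69090, -60510, -6051),
    (57538, -72062, -7206),
]

MIN_RETIRE_TIME = 600  # min_retire_time metric from the glidein-stdout report


def implied_offset(proxy_left, retire_time):
    """Seconds subtracted from the proxy lifetime to get the retire time."""
    return proxy_left - retire_time


def spread_for(retire_time):
    """One tenth of the retire time; int() truncates toward zero."""
    return int(retire_time / 10)


for proxy_left, retire_time, spread in LOGGED:
    print(proxy_left,
          implied_offset(proxy_left, retire_time),
          spread_for(retire_time) == spread)

# ASSUMPTION: if the offset really is constant, this is the shortest proxy
# lifetime that would still leave the minimum retire time.
offsets = {implied_offset(p, r) for p, r, _ in LOGGED}
required_proxy = max(offsets) + MIN_RETIRE_TIME
print(required_proxy, round(required_proxy / 3600, 2))
```

If that inferred rule holds, a proxy would need a little over 36 hours of remaining lifetime to pass the check, while the logged proxies had roughly 16 to 20 hours left, far short of the 130 hours mentioned earlier in the thread.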

Development for LHC@home