Message boards : News : No new jobs
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Jobs are starting to run low.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
> Jobs are starting to run low.

Yes, I'll submit more later today -- smaller jobs again, to try to reduce the time-out problems we have been seeing.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
> Jobs are starting to run low.

This may take some time: "Failed to submit task 160107_181338:ireid_crab_CMS_at_Home_MinBias_200ev; 'Failure when submitting task to scheduler. Error reason: '''"
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Well done, Ivan! Just in time for the end of the batch.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
> Well done, Ivan!

Yeah, I tried several times at work with no luck. Got out the side gate of Uni at 7pm just as Security was preparing to lock it (saving an extra 60 or 70% on my walk home) and finally got it to go through from home with only 4 or 5 jobs still in the queue.
Joined: 13 Feb 15 Posts: 1188 Credit: 878,593 RAC: 33
Well done?? I got job 35 from the new batch, but it looks as though 300 events are to be processed:

https://cmsweb.cern.ch/crabcache
--jobNumber=35
--cmsswVersion=CMSSW_6_2_0_SLHC26_patch3
--scramArch=slc6_amd64_gcc472
--inputFile=job_input_file_list_35.txt
--runAndLumis=job_lumis_35.json
--lheInputFiles=False
--firstEvent=10201
--firstLumi=103
--lastEvent=10501
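A quick way to see the mismatch is to pull --firstEvent and --lastEvent out of the job arguments and compare the span with the 200 events implied by the task name (MinBias_200ev). This is only an illustrative sketch; whether the last event is counted inclusively only shifts the result by one.

```python
import re

# Job arguments copied from the listing above (illustrative check only).
args = ("--jobNumber=35 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 "
        "--scramArch=slc6_amd64_gcc472 --firstEvent=10201 "
        "--firstLumi=103 --lastEvent=10501")

params = dict(re.findall(r"--(\w+)=(\S+)", args))
span = int(params["lastEvent"]) - int(params["firstEvent"])

# The task name ends in "_200ev", i.e. 200 events per job were intended.
print(f"events spanned by job {params['jobNumber']}: {span} (expected 200)")
# -> events spanned by job 35: 300 (expected 200)
```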
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
> I got job 35 from the new batch, but it looks as though 300 events are to be processed.

Hmm, you're right -- it seems fat-fingered me deleted the 3 and then typed in 3 again instead of 2. Oh well, we can handle it. Today I sent private e-mails to the Volunteers with the most stage-out errors in the last batch, asking them to check their upload bandwidth utilisation; I hope most will see and act.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
> asking them to check their upload bandwidth utilisation

If it is a timeout issue, can't you just increase it? How long is it currently? And if you have multiple CMS computers on the same modem, how do you stop them uploading at the same time (unless you do some serious baby-sitting)?
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
> asking them to check their upload bandwidth utilisation

Stage-out time-out is one hour, but the time-out only comes into play because of congestion. Current jobs (the last batch and this one) produce ~150 MB of results. If you have a 1 Mbps ADSL upload link (as I do; some of our Volunteers report less), that takes ~25 minutes to upload. I think the average time for these jobs is ~3 hours; it could be less -- the median time for the 51 completed jobs in the new batch is 1h17', range 38' to 2h39' (biased to shorter times since we've just started). Ergo, needing 25 minutes to upload every 3 hours is no problem. But it means that on average you can only return about 7 results at a time before the link saturates -- and then jobs don't return within whatever time-out you set; they just don't fit the pipe.

> And if you have multiple CMS computers on the same modem, how do you stop them uploading at the same time (unless you do some serious baby-sitting)?

Uploading at the same time shouldn't be a problem -- back-off protocols, etc., take care of that -- but you just can't fit a load averaging 1.5 Mbps through a pipe limited to 1 Mbps.
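For illustration, the arithmetic above can be written out explicitly. The numbers are the ones quoted in the post (150 MB per result, ~3 hours per job, a 1 Mbps uplink); the script is just a sketch of the estimate, not anything the project runs.

```python
# Back-of-the-envelope saturation estimate, using the figures quoted above.
result_mb = 150          # MB of results per finished job
job_minutes = 3 * 60     # average wall-clock time per job
uplink_mbps = 1.0        # nominal ADSL upload bandwidth

# Ideal transfer time; with protocol overhead and other traffic it comes out
# nearer the ~25 minutes quoted in the post.
upload_minutes = result_mb * 8 / uplink_mbps / 60

# The link keeps up only as long as each upload finishes before the next
# result is ready, so this is roughly the number of concurrent jobs one
# uplink can sustain.
sustainable_jobs = job_minutes / upload_minutes

print(f"ideal upload time per result: {upload_minutes:.0f} min")      # ~20 min
print(f"concurrent jobs the uplink sustains: {sustainable_jobs:.0f}") # ~9 ideal, ~7 at 25 min/upload
```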
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
I was out in my estimates. Median time for the last batch was 1h25'; the range, discarding absurd Dashboard outliers, was ~30m to 20h. So the problem is twice as severe as I thought.
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0
Failures are still dominated by "stageout" (60311) failures; what's more, it doesn't look as though things are improving, rather the reverse. Are we still choking on our upstream ADSL capacity? Short of restricting the number of tasks per user (which might not be favourably received by volunteers not using ADSL), or somehow having jobs with different upload requirements that can be matched to users -- I think your BOINC server knows, or could know, hosts' upload speeds -- it all sounds a bit complicated. Or do we have to just live with it?
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
According to the failure list, the dominant error is 8001, which contributes over half of all failures.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
There should really be a measure in place to stop certain machines from producing error after error. Again, there are a couple that fail repeatedly and nothing is done to stop (block) them. There does not seem to be much point in running batch after batch hoping that somehow the error rate will go down -- and that is not even talking about the "stage out" errors.
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0
> According to the failure list, the dominant error is 8001, which contributes over half of all failures.

You're probably looking at a longer period than I was. Over the last fortnight I see (disclaimer: from Dashboard, as I write): 543 successes, 161 failures.

| Error code | Failures |
|---|---|
| 8002 | 1 |
| 8016 | 2 |
| 50664 | 1 |
| 60311 | 148 |
| 80000 | 9 |

It's not easy for volunteers to see how well (or badly) their hosts are doing. It could help a lot if a way could be provided for us to do this. Perhaps by means of separate results for VM jobs, as T4T do (volunteers would have to check these); perhaps by failing the BOINC task in some way, such as a "computation error". Maybe draconian, but this would (should) be obvious to the user, and the normal BOINC process of dealing with failing hosts could be used. This problem is really common to VM projects in general and so may be something for BOINC rather than CMS.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
You are correct; I looked at Dashboard's individual failed tasks for the current batch.

1. Volunteers do not know if, when, and how failures occur (and therefore cannot take corrective action).
2. Failing hosts are not stopped automatically.
3. Stage-out failures are not fixed, and no (workable) solution is in sight.

The obvious thing to do is to address the biggest groups of failures first (stage-out and failing hosts).

> perhaps by failing the BOINC task in some way, such as a "computation error"

I agree; I suggested something similar a while ago. Maybe not at the first error, but perhaps after three, since there are simply too many variables at play to make a single failure reliable. For the volunteers, some sort of feedback is essential.
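To make that suggestion concrete, here is a minimal sketch of the kind of rule being proposed: count consecutive failed jobs per host and flag the BOINC task as a computation error once a threshold is reached. The threshold of three, the function name, and the per-host streak counter are all assumptions for illustration; nothing like this is currently implemented in CMS@Home.

```python
# Hypothetical "three strikes" rule along the lines suggested above; this is
# NOT how CMS@Home currently behaves, only an illustration of the idea.
FAILURE_THRESHOLD = 3

failure_streak = {}   # host_id -> consecutive failed jobs


def record_job_result(host_id, exit_code):
    """Return True if this host's BOINC task should now end with a computation error."""
    if exit_code == 0:
        failure_streak[host_id] = 0               # a success resets the streak
        return False
    failure_streak[host_id] = failure_streak.get(host_id, 0) + 1
    return failure_streak[host_id] >= FAILURE_THRESHOLD


# Example: two failures are tolerated, the third would fail the task.
for code in (8001, 8001, 8001):
    print(record_job_result("host-42", code))     # False, False, True
```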
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
I agree with these points. Unfortunately I can't implement them myself, only make suggestions. I've worked up a script that tallies the exit statuses of all the jobs that produce log files for a given batch. Unfortunately I can't publish the output, as it contains personally-identifying information (UserID and HostID), which would be contrary to UK law at least (insofar as I understand it). Also unfortunately, I can't see how to make the info available on the Web on an individual basis; the air-gap between the Condor server and the BOINC server is rather substantial. I'm definitely coming around to the idea of one job per task, with the exit code being used for valid/invalid -- i.e. no credit for jobs ending in error.
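Ivan's script itself isn't published, but for a rough idea of what such a tally can look like, here is a sketch that walks a directory of per-job log files, pulls an exit-status line out of each, and counts the codes. The directory layout, file naming, and the exact "exit status" log line are assumptions; the real Condor/CRAB logs on the server will differ.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed layout: one *.log file per job under logs/<batch>/ (hypothetical).
LOG_DIR = Path("logs/160107_181338")
EXIT_RE = re.compile(r"exit status[:=]\s*(-?\d+)", re.IGNORECASE)  # assumed log line

tally = Counter()
for log_file in LOG_DIR.glob("*.log"):
    match = EXIT_RE.search(log_file.read_text(errors="replace"))
    tally[match.group(1) if match else "no-status"] += 1

for code, count in tally.most_common():
    print(f"{code:>10}: {count}")
```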
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
> I'm definitely coming around to the idea of one job per task, with the exit code being used for valid/invalid -- i.e. no credit for jobs ending in error.

Sounds good. However, the VM would have to sit on standby: as of now, just starting up to process the 1st event takes 23 minutes in my case (starting a new CMS task).
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
What are the plans, as the current batch is coming to an end? More of the same, or a major rebuild?
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
> What are the plans, as the current batch is coming to an end?

I'm ready to submit another batch of 10,000 Minimum Bias jobs just before I go to bed -- which might depend on how tricky today's Guardian Cryptic Crossword is... I've cut it back to 150 events this time (and double-checked! :-) to see what effect that has on the "151" failure rates. We seem to have had a systemic increase in failures from mid-Friday until Sunday, but I didn't find any particular host/volunteer/reason to chalk it up to. Overall, Dashboard reckons on an ultimate failure rate of 3% (after retries), so we're doing rather better than ATLAS by that metric. (OTOH, they have 10,000 jobs out at once!) So, new batch in an hour or so. No major rebuild just yet, but Laurence is working on some improvements. The new batch should take about a week or so; we'll see if anything new has come up the turnpike by then.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
> We seem to have had a systemic increase in failures from mid-Friday until Sunday

I think there is an automatic increase in failures towards the end of a batch, as inherently faulty jobs are passed through the three attempts before finally failing. There are 126 failures that went through all three attempts, a handful that took two attempts, and the rest (about 165) failed at the first attempt. What is 151? An exit code?