Message boards : News : No new jobs
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Jobs are starting to run low.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
> Jobs are starting to run low.

Yes, I'll submit more later today -- smaller jobs again, to try to reduce the time-out problems we have been seeing.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
> Jobs are starting to run low.

This may take some time: "Failed to submit task 160107_181338:ireid_crab_CMS_at_Home_MinBias_200ev; 'Failure when submitting task to scheduler. Error reason: '''"
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Well done, Ivan! Just in time for the end of the batch.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
> Well done, Ivan!

Yeah, I tried several times at work with no luck. Got out the side gate of Uni at 7pm just as Security was preparing to lock it (saving an extra 60 or 70% on my walk home) and finally got it to go through from home with only 4 or 5 jobs still in the queue.
Joined: 13 Feb 15 Posts: 1188 Credit: 878,593 RAC: 33
Well done?? I got job 35 from the new batch, but it looks as though 300 events are to be processed:

https://cmsweb.cern.ch/crabcache
--jobNumber=35
--cmsswVersion=CMSSW_6_2_0_SLHC26_patch3
--scramArch=slc6_amd64_gcc472
--inputFile=job_input_file_list_35.txt
--runAndLumis=job_lumis_35.json
--lheInputFiles=False
--firstEvent=10201
--firstLumi=103
--lastEvent=10501
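A quick way to see the mismatch is to pull --firstEvent and --lastEvent out of the job arguments and compare the span with the 200 events implied by the task name (MinBias_200ev). This is only an illustrative sketch; whether the last event is counted inclusively only shifts the result by one.

```python
import re

# Job arguments copied from the listing above (illustrative check only).
args = ("--jobNumber=35 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 "
        "--scramArch=slc6_amd64_gcc472 --firstEvent=10201 "
        "--firstLumi=103 --lastEvent=10501")

params = dict(re.findall(r"--(\w+)=(\S+)", args))
span = int(params["lastEvent"]) - int(params["firstEvent"])

# The task name ends in "_200ev", i.e. 200 events per job were intended.
print(f"events spanned by job {params['jobNumber']}: {span} (expected 200)")
# -> events spanned by job 35: 300 (expected 200)
```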
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
> I got job 35 from the new batch, but it looks as though 300 events are to be processed.

Hmm, you're right -- it seems fat-fingered me deleted the 3 and then typed in 3 again instead of 2. Oh well, we can handle it. Today I sent private e-mails to the Volunteers with the most stage-out errors in the last batch, asking them to check their upload bandwidth utilisation; I hope most will see and act.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
> asking them to check their upload bandwidth utilisation

If it is a timeout issue, can't you just increase it? How long is it currently? And if you have multiple CMS computers on the same modem, how do you stop them uploading at the same time (unless you do some serious baby-sitting)?
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
> asking them to check their upload bandwidth utilisation

Stage-out time-out is one hour, but the time-out only comes into play because of congestion. Current jobs (the last batch and this one) produce ~150 MB of results. If you have a 1 Mbps ADSL upload link (as I do; some of our Volunteers report less), that takes ~25 minutes to upload. I think the average time for these jobs is ~3 hours; it could be less -- the median time for the 51 completed jobs in the new batch is 1h17', range 38' to 2h39' (biased to shorter times since we've just started). Ergo, needing 25 minutes to upload every 3 hours is no problem. But it means that on average you can only return about 7 results at a time before the link saturates -- and then jobs don't return within whatever time-out you set; they just don't fit the pipe.

> And if you have multiple CMS computers on the same modem, how do you stop them uploading at the same time (unless you do some serious baby-sitting)?

Uploading at the same time shouldn't be a problem -- back-off protocols, etc., take care of that -- but you just can't fit a load averaging 1.5 Mbps through a pipe limited to 1 Mbps.
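For illustration, the arithmetic above can be written out explicitly. The numbers are the ones quoted in the post (150 MB per result, ~3 hours per job, a 1 Mbps uplink); the script is just a sketch of the estimate, not anything the project runs.

```python
# Back-of-the-envelope saturation estimate, using the figures quoted above.
result_mb = 150          # MB of results per finished job
job_minutes = 3 * 60     # average wall-clock time per job
uplink_mbps = 1.0        # nominal ADSL upload bandwidth

# Ideal transfer time; with protocol overhead and other traffic it comes out
# nearer the ~25 minutes quoted in the post.
upload_minutes = result_mb * 8 / uplink_mbps / 60

# The link keeps up only as long as each upload finishes before the next
# result is ready, so this is roughly the number of concurrent jobs one
# uplink can sustain.
sustainable_jobs = job_minutes / upload_minutes

print(f"ideal upload time per result: {upload_minutes:.0f} min")      # ~20 min
print(f"concurrent jobs the uplink sustains: {sustainable_jobs:.0f}") # ~9 ideal, ~7 at 25 min/upload
```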
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
I was out in my estimates. Median time for the last batch was 1h25'; the range, discarding absurd Dashboard outliers, was ~30m to 20h. So the problem is twice as severe as I thought.
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0
Failures are still dominated by "stageout" (60311) failures; what's more, it doesn't look as though things are improving, rather the reverse. Are we still choking on our upstream ADSL capacity? Short of restricting the number of tasks per user (which might not be favourably received by volunteers not using ADSL), or somehow having jobs with different upload requirements that can be matched to users -- I think your BOINC server knows, or could know, hosts' upload speeds -- it all sounds a bit complicated. Or do we have to just live with it?
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
According to the failure list, the dominant error is 8001, which contributes over half of all failures.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
There should really be a measure in place to stop certain machines from producing error after error. Again, there are a couple that fail repeatedly and nothing is done to stop (block) them. There does not seem to be much point in running batch after batch hoping that somehow the error rate will go down -- and that is not even talking about the "stage out" errors.
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0
> According to the failure list, the dominant error is 8001, which contributes over half of all failures.

You're probably looking at a longer period than I was. Over the last fortnight I see (disclaimer: from Dashboard, as I write): 543 successes, 161 failures.

| Error code | Failures |
|---|---|
| 8002 | 1 |
| 8016 | 2 |
| 50664 | 1 |
| 60311 | 148 |
| 80000 | 9 |

It's not easy for volunteers to see how well (or badly) their hosts are doing. It could help a lot if a way could be provided for us to do this. Perhaps by means of separate results for VM jobs, as T4T do (volunteers would have to check these); perhaps by failing the BOINC task in some way, such as a "computation error". Maybe draconian, but this would (should) be obvious to the user, and the normal BOINC process of dealing with failing hosts could be used. This problem is really common to VM projects in general and so may be something for BOINC rather than CMS.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
You are correct; I looked at Dashboard's individual failed tasks for the current batch.

1. Volunteers do not know if, when, and how failures occur (and therefore cannot take corrective action).
2. Failing hosts are not stopped automatically.
3. Stage-out failures are not fixed, and no (workable) solution is in sight.

The obvious thing to do is to address the biggest groups of failures first (stage-out and failing hosts).

> perhaps by failing the BOINC task in some way, such as a "computation error"

I agree; I suggested something similar a while ago. Maybe not at the first error, but perhaps after three, since there are simply too many variables at play to make a single failure reliable. For the volunteers, some sort of feedback is essential.
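To make that suggestion concrete, here is a minimal sketch of the kind of rule being proposed: count consecutive failed jobs per host and flag the BOINC task as a computation error once a threshold is reached. The threshold of three, the function name, and the per-host streak counter are all assumptions for illustration; nothing like this is currently implemented in CMS@Home.

```python
# Hypothetical "three strikes" rule along the lines suggested above; this is
# NOT how CMS@Home currently behaves, only an illustration of the idea.
FAILURE_THRESHOLD = 3

failure_streak = {}   # host_id -> consecutive failed jobs


def record_job_result(host_id, exit_code):
    """Return True if this host's BOINC task should now end with a computation error."""
    if exit_code == 0:
        failure_streak[host_id] = 0               # a success resets the streak
        return False
    failure_streak[host_id] = failure_streak.get(host_id, 0) + 1
    return failure_streak[host_id] >= FAILURE_THRESHOLD


# Example: two failures are tolerated, the third would fail the task.
for code in (8001, 8001, 8001):
    print(record_job_result("host-42", code))     # False, False, True
```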
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
I agree with these points. Unfortunately I can't implement them myself, only make suggestions. I've worked up a script that tallies the exit statuses of all the jobs that produce log files for a given batch. Unfortunately I can't publish the output, as it contains personally-identifying information (UserID and HostID), which would be contrary to UK law at least (insofar as I understand it). Also unfortunately, I can't see how to make the info available on the Web on an individual basis; the air-gap between the Condor server and the BOINC server is rather substantial. I'm definitely coming around to the idea of one job per task, with the exit code being used for valid/invalid -- i.e. no credit for jobs ending in error.
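Ivan's script itself isn't published, but for a rough idea of what such a tally can look like, here is a sketch that walks a directory of per-job log files, pulls an exit-status line out of each, and counts the codes. The directory layout, file naming, and the exact "exit status" log line are assumptions; the real Condor/CRAB logs on the server will differ.

```python
import re
from collections import Counter
from pathlib import Path

# Assumed layout: one *.log file per job under logs/<batch>/ (hypothetical).
LOG_DIR = Path("logs/160107_181338")
EXIT_RE = re.compile(r"exit status[:=]\s*(-?\d+)", re.IGNORECASE)  # assumed log line

tally = Counter()
for log_file in LOG_DIR.glob("*.log"):
    match = EXIT_RE.search(log_file.read_text(errors="replace"))
    tally[match.group(1) if match else "no-status"] += 1

for code, count in tally.most_common():
    print(f"{code:>10}: {count}")
```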
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
> I'm definitely coming around to the idea of one job per task, with the exit code being used for valid/invalid -- i.e. no credit for jobs ending in error.

Sounds good. However, the VM would have to sit on standby: as of now, just starting up to process the 1st event takes 23 minutes in my case (starting a new CMS task).
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
What are the plans, as the current batch is coming to an end? More of the same, or a major rebuild?
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0
> What are the plans, as the current batch is coming to an end?

I'm ready to submit another batch of 10,000 Minimum Bias jobs just before I go to bed -- which might depend on how tricky today's Guardian Cryptic Crossword is... I've cut it back to 150 events this time (and double-checked! :-) to see what effect that has on the "151" failure rates. We seem to have had a systemic increase in failures from mid-Friday until Sunday, but I didn't find any particular host/volunteer/reason to chalk it up to. Overall, Dashboard reckons on an ultimate failure rate of 3% (after retries), so we're doing rather better than ATLAS by that metric. (OTOH, they have 10,000 jobs out at once!) So, new batch in an hour or so. No major rebuild just yet, but Laurence is working on some improvements. The new batch should take about a week or so; we'll see if anything new has come up the turnpike by then.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
> We seem to have had a systemic increase in failures from mid-Friday until Sunday

I think there is an automatic increase in failures towards the end of a batch, as inherently faulty jobs are passed through the three attempts before finally failing. There are 126 failures that went through all three attempts, a handful that took two attempts, and the rest (about 165) failed at the first attempt. What is 151? An exit code?