Message boards : CMS Application : Dip?
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

I think there are way too many -151 stage-out errors. I would also suggest compressing the results before uploading; this would save a lot of time and bandwidth. I am certain that compression to 5% of the original size is easily achievable, as there is a lot of repeating text in the logs. A little time spent to optimize/fix this would make processing performance a lot better. Every second saved in optimizing the image file would be multiplied a thousandfold in the processing time saved on the volunteers' machines. (I am sure you know that, but I just wanted to mention it.)
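The point about repetitive log text compressing extremely well is easy to illustrate. The Python sketch below uses a made-up, highly repetitive log line (not an actual CMS job log) and the standard gzip module:

```python
import gzip

# A made-up, highly repetitive "log": the same line repeated many times.
# This is not an actual CMS job log, just an illustration of the idea.
line = "Begin processing the 1000th record. Run 1, Event 1000, LumiSection 1\n"
log = (line * 100_000).encode()

compressed = gzip.compress(log)
print(f"original:   {len(log) / 1e6:.1f} MB")
print(f"compressed: {len(compressed) / 1e6:.3f} MB")
print(f"ratio:      {len(compressed) / len(log):.2%}")
# Text this repetitive typically compresses to well under 5% of its size.
```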
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728

Unfortunately, things at CERN are very confused at the moment. There was a large network problem last Friday, which may have been implicated in our problems starting then. Because of, or concurrent with, that, the CRAB system became overloaded. This was eventually cleared, but became congested again yesterday. I think that was the reason we had so many jobs flagged as failures yesterday.

I traced a couple, one of which was a job I'd processed at Brunel, and they'd finished and staged out OK, but the post-processing step encountered an error when querying the AOS database (the system that transfers result files after jobs). In our case we don't need it, as we transfer directly to the data-bridge, but it appears that the AOS DB, or its server, couldn't cope with demand and we got errors reported back to Condor and Dashboard, with the result that the jobs were requeued in Condor even though they had in fact finished correctly. This is why there's a spike in "failures" in the graphs, but not a large number in Dashboard, as Dashboard moves jobs out of Failure to ToRetry to Pending as the jobs are requeued.

Oh, and Dashboard went down for a couple of hours this afternoon. I've not yet heard why, but that's the reason for a gap in our graphs.

It's hard to see what we can do in this case, except wait for things to clear. I guess the problem is so many people trying to reprocess the whole year's data now that pp collisions have stopped (I believe we started with pPb proton-ion collisions today). Please have patience; I'm sure there are things we can do to improve our performance, but it's hard to pinpoint them in a time of turbulence.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

Thanks, Ivan. I have noticed that there are some major problems. Although it might seem a good thing to have the entire computing system consolidated, the drawback is that if certain key components go down, the whole system is affected. I am getting a very large number of 151 failures, so I will suspend computing for a while until things have cleared up a bit.

I also noticed that communication has gone from bad to worse in the past month or so (except you). It would be nice if admin could post a few lines (along the lines of "...we have a major server restructuring at hand, so communication will be minimal..."), otherwise volunteers might get the impression that they do not care at all (there will always be enough i***** to contribute).

BTW, the retry count on Dashboard is going through the roof. There are errors/problems reported by volunteers which are completely ignored. At some point volunteers give up and think, why bother.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728

At the moment it feels like we're putting out the fire with gasoline...
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728

> I think there are way too many -151 stage-out errors.

As a matter of fact, we don't upload the logs per se -- I quickly learnt not to do that when we first started on this workflow. In fact the logs were compressed, and were still ~40 MB, but when I uncompressed one it was over 600 MB! (IIRC) That's when I got the Pythia experts in and found out how to reduce the Pythia output as much as possible, but there's still a certain amount of output every 100 events that I haven't found out how to reduce, so the on-disk log is still a bit large.

However, CRAB only returns the first 1000 lines and the last 3000 lines of the logs to the Condor server; that's currently running at ~400 kB/log and you are right, it compresses marvellously since it's mainly the same thing over and over again. I actually run a cron job on the server every hour to compress all the job_out.*.txt files over 100 kB. Anything smaller than that is likely an error log, and the placeholder files before a job finishes are 47 bytes.

So the stage-out errors concern the upload of the result .root file to the data bridge. These are already binary and compressed, and currently run about 16 MB. I haven't got as far as checking what went wrong last night, but the previous night's failures were actually failures in the post-processing, not in the job itself, due to overload in the CRAB servers as far as I can tell.
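The hourly compression job mentioned above is not shown in the thread; a hypothetical Python equivalent, assuming a placeholder log directory and the job_out.*.txt naming and ~100 kB threshold described above, might look like this:

```python
import gzip
import shutil
from pathlib import Path

# Hypothetical equivalent of the hourly cron job described above.
LOG_DIR = Path("/path/to/crab/job/logs")   # placeholder -- real location not given
THRESHOLD = 100 * 1024                     # only compress logs larger than ~100 kB

for log in LOG_DIR.glob("job_out.*.txt"):
    if log.stat().st_size > THRESHOLD:
        # gzip the file in place and drop the uncompressed original.
        with log.open("rb") as src, gzip.open(f"{log}.gz", "wb") as dst:
            shutil.copyfileobj(src, dst)
        log.unlink()
```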
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728

I think I understand the most recent batch of 151 errors. On Thursday night there were, as already mentioned, jobs that finished properly being flagged as bad in the post-processing due to communications problems at CERN. These jobs were re-queued and eventually re-run. However, when the job ran a second time, at stage-out it found that a result file already existed on the data-bridge. Since we don't have a flag set to allow overwrite (guess I need to look into that...) the stage-out threw a wobbly, set an error code and deleted the existing file!

So, Condor has set these jobs up for a second re-try, which should mostly succeed as there is now no result file on the data bridge, but it's a far from satisfactory situation.
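The actual CRAB stage-out code is not shown here; a rough Python sketch of the failure mode described above, with a hypothetical data_bridge object and error type, may make the sequence clearer:

```python
class StageOutError(Exception):
    """Hypothetical stand-in for the -151 stage-out failure."""

def stage_out(result_file, data_bridge, allow_overwrite=False):
    # Sketch of the behaviour described above, NOT the real CRAB stage-out code.
    # data_bridge is assumed to provide exists/delete/upload operations.
    if data_bridge.exists(result_file) and not allow_overwrite:
        # A re-run job finds the copy uploaded by the first, successful run:
        # without an overwrite flag the existing file is deleted and an error
        # is raised, so Condor queues yet another retry.
        data_bridge.delete(result_file)
        raise StageOutError("result file already exists on the data bridge (-151)")
    # When no conflicting copy exists (e.g. on the next retry, after the
    # deletion above), the upload goes through -- which is why the second
    # re-try should mostly succeed.
    data_bridge.upload(result_file)
```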
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

Thanks for the detailed response. Is there a way to update the guest-additions inside the image from the host? ATLAS is reporting significant speedups with VirtualBox versions >5.0 for multi-core operation, so it would probably speed things up a lot if the guest also had a matching version of the guest additions (5.1.8 in my case).
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728

I guess that's a question for the experts like Laurence or Nils. In the meantime I'm locked out of the Condor server due to an expired Certificate Revocation List -- I think.

    [eesridr@pion:~] > gsissh -p 9700 lcggwms02.gridpp.rl.ac.uk
    GSSAPI Error:
    GSS Major Status: Authentication Failed
    GSS Minor Status Error Chain:
    globus_gsi_gssapi: SSLv3 handshake problems
    globus_gsi_callback_module: Could not verify credential
    globus_gsi_callback_module: Could not verify credential
    globus_gsi_callback_module: Invalid CRL: The available CRL has expired

Hopefully I can fix this with Andrew's help tomorrow morning; we aren't due to run out of jobs just yet, but the Dashboard figures are extremely unreliable at the moment.
Joined: 12 Sep 14 · Posts: 1069 · Credit: 334,882 · RAC: 0

> Is there a way to update the guest-additions inside the image from the host?

We can look into updating it.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728

Andrew restarted the automatic CRL updates, and I can log into the condor server again.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

Something is really wrong. 4400 jobs running, none finished - according to Dashboard????
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728

> Something is really wrong.

One of our major contributors appears to have pulled out of CMS@Home (it is a large development cluster at CERN; I guess he has other work for it).

Oh, I misread that as 440 at first; now I see your point. I'm not sure what Dashboard page you look at, but there have been severe overloads on Grid submissions lately and Dashboard is suffering too. Ah, yes, I see that Dashboard thinks there are 4,000 jobs running from the newest batch. In reality at the HTCondor server I see:

    [cms005@lcggwms02:~] > ./stats.sh 161129_200356:ireid_crab_BPH-RunIISummer15GS-00046_BB
    9000 NodeStatus = 1; /* "STATUS_READY" */
    1000 NodeStatus = 3; /* "STATUS_SUBMITTED" */

i.e. 9,000 are ready, 1,000 are queued, but none are running yet. The "current" batch has

    4 NodeStatus = 4; /* "STATUS_POSTRUN" */
    8933 NodeStatus = 5; /* "STATUS_DONE" */
    206 NodeStatus = 6; /* "STATUS_ERROR" */
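stats.sh itself is not shown in the thread. Assuming it tallies NodeStatus entries from an HTCondor DAGMan node-status file (the file path and exact format here are assumptions), a rough Python equivalent would be:

```python
import re
import sys
from collections import Counter

# Hypothetical tally of NodeStatus codes from an HTCondor DAGMan node-status
# file, producing a summary like the one quoted above.
# Usage (path assumed):  python node_stats.py /path/to/node_status.txt
STATUS_NAMES = {1: "STATUS_READY", 3: "STATUS_SUBMITTED", 4: "STATUS_POSTRUN",
                5: "STATUS_DONE", 6: "STATUS_ERROR"}

counts = Counter()
with open(sys.argv[1]) as status_file:
    for line in status_file:
        match = re.match(r"\s*NodeStatus\s*=\s*(\d+)", line)
        if match:
            counts[int(match.group(1))] += 1

for code in sorted(counts):
    name = STATUS_NAMES.get(code, "?")
    print(f'{counts[code]:6d} NodeStatus = {code}; /* "{name}" */')
```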
Joined: 18 Aug 15 · Posts: 14 · Credit: 125,335 · RAC: 0

I just tried one of the multi-thread tasks and got this:

    2016-11-30 09:02:18 (15296): Guest Log: [DEBUG] HTCondor ping
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728

> I just tried one of the multi-thread tasks and got this:

I get that myself, but Laurence is quite busy on other things at the moment. Last time he tried, it worked. :-/
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 728

> I just tried one of the multi-thread tasks and got this:

Actually... Can you please post your app_config.xml so that the more-experienced volunteers can have a chance to critique it? Thanks.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

It looks like you used 3 cores and 2048 MB of memory. That is not enough for 3 cores. You might want to try 2 cores with 2560 MB of memory, or 3 cores with 3328 MB. There are still no exact memory figures available for any number of cores; I suggest the formula

    MEMORY(MB) = 1024 + n * 768   (no guarantees)

where n is the number of cores used per task.
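As a quick check, this rule of thumb (unofficial, per the post above) reproduces the 2560 MB and 3328 MB figures quoted:

```python
def suggested_memory_mb(cores: int) -> int:
    # Volunteer rule of thumb from the post above -- not an official figure:
    # 1024 MB base plus 768 MB per core.
    return 1024 + cores * 768

for n in (1, 2, 3, 4):
    print(f"{n} core(s): {suggested_memory_mb(n)} MB")
# 2 cores -> 2560 MB, 3 cores -> 3328 MB, matching the values quoted above.
```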
Joined: 12 Sep 14 · Posts: 1069 · Credit: 334,882 · RAC: 0

You should not need this formula or the app_config.xml. Multicore should now work out of the box; you just need to specify the number of cores to use in the project preferences.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

captainjack had 3 cores and 2048 MB specified: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=290215 Unless he was using an app_config, it appears not to be working.

BTW, selecting anything other than 1 core per task is not working, and has not been working for several weeks. It only runs single-core tasks, no matter what the settings are.
Joined: 12 Sep 14 · Posts: 1069 · Credit: 334,882 · RAC: 0

On the server, the plan class specifies 1 GB base memory plus 896 MB per core. If this is not working, it could be that the job itself is requesting 2 GB. This is something that Ivan can alter.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

As I mentioned before: http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=306&postid=4301#4301

NO MATTER WHAT THE "Max # of CPUs for this project" SETTINGS ARE, I ALWAYS GET A 1 CORE / 1896 MB TASK (when not using an app_config.xml).
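For what it's worth, the 1 core / 1896 MB figure is consistent with the plan class described a few posts above if its 1 GB base is counted as 1000 MB rather than 1024 MB; this is speculation, but the arithmetic is easy to check:

```python
def plan_class_memory_mb(cores: int, base_mb: int = 1000) -> int:
    # Speculative reading of "1 GB base memory plus 896 MB per core";
    # whether the base counts as 1000 MB or 1024 MB is an assumption.
    return base_mb + cores * 896

print(plan_class_memory_mb(1))  # 1896 -- matches the 1 core / 1896 MB tasks reported
print(plan_class_memory_mb(2))  # 2792
print(plan_class_memory_mb(3))  # 3688
```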