Message boards : CMS Application : Dip?

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4337 - Posted: 18 Nov 2016, 20:29:15 UTC

I think there are way too many -151 stage-out errors.

I would also suggest compressing the results before uploading. This would save a lot of time and bandwidth. I am certain that compression to 5% of the original size is easily achievable, as there is a lot of repeating text in the logs.

A little time spent optimizing/fixing this would make processing performance a lot better.

Every second saved by optimizing the image file would be multiplied 1000-fold in the processing time saved on the volunteers' machines.

(I am sure you know that, but I just wanted to mention it.)

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1138
Credit: 8,067,191
RAC: 2,972
Message 4338 - Posted: 18 Nov 2016, 22:15:41 UTC - in response to Message 4337.  
Last modified: 18 Nov 2016, 22:19:04 UTC

Unfortunately, things at CERN are very confused at the moment. There was a large network problem last Friday, which may have been implicated in our problems starting then. Because of that, or concurrently with it, the CRAB system became overloaded. This was eventually cleared, but it became congested again yesterday. I think that was the reason we had so many jobs flagged as failures yesterday. I traced a couple, one of which was a job I'd processed at Brunel: they'd finished and staged out OK, but the post-processing step encountered an error when querying the ASO database (ASO is the system that transfers result files after jobs). In our case we don't need it, as we transfer directly to the data-bridge, but it appears that the ASO DB, or its server, couldn't cope with demand, and we got errors reported back to Condor and Dashboard, with the result that the jobs were requeued in Condor even though they had in fact finished correctly. This is why there's a spike in "failures" in the graphs but not a large number in Dashboard, as Dashboard moves jobs from Failure to ToRetry to Pending as the jobs are requeued.

Oh, and Dashboard went down for a couple of hours this afternoon. I've not yet heard why, but that's the reason for a gap in our graphs.

It's hard to see what we can do in this case, except wait for things to clear. I guess the problem is so many people trying to reprocess the whole year's data now that pp collisions have stopped (I believe we started with pPb, i.e. proton-ion, collisions today). Please have patience; I'm sure there are things we can do to improve our performance, but it's hard to pinpoint them in a time of turbulence.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4339 - Posted: 19 Nov 2016, 9:34:47 UTC - in response to Message 4338.  
Last modified: 19 Nov 2016, 9:39:23 UTC

Thanks, Ivan.
I have noticed that there are some major problems.
Although it might seem a good thing to have the entire computing system consolidated, the drawback is that if certain key components go down, the whole system is affected.
I am getting a very large number of 151 failures, so I will suspend computing for a while until things have cleared up a bit.

I also noticed that communication has gone from bad to worse in the past month or so (except from you). It would be nice if the admins could post a few lines (along the lines of "...we have a major server restructuring at hand, so communication will be minimal..."); otherwise volunteers might get the impression that they do not care at all (there will always be enough i***** to contribute).

BTW, the retry count on Dashboard is going through the roof.

There are errors/problems reported by volunteers which are completely ignored.
At some point volunteers give up and think: why bother?

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1138
Credit: 8,067,191
RAC: 2,972
Message 4340 - Posted: 19 Nov 2016, 10:57:50 UTC - in response to Message 4339.  

At the moment it feels like we're putting out the fire with gasoline...

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1138
Credit: 8,067,191
RAC: 2,972
Message 4341 - Posted: 19 Nov 2016, 11:18:20 UTC - in response to Message 4337.  

I think there are way too many -151 stage-out errors.

I would also suggest compressing the results before uploading. This would save a lot of time and bandwidth. I am certain that compression to 5% of the original size is easily achievable, as there is a lot of repeating text in the logs.

A little time spent optimizing/fixing this would make processing performance a lot better.

Every second saved by optimizing the image file would be multiplied 1000-fold in the processing time saved on the volunteers' machines.

(I am sure you know that, but I just wanted to mention it.)

As a matter of fact, we don't upload the logs per se; I quickly learnt not to do that when we first started on this workflow. In fact the logs were compressed and were still ~40 MB, but when I uncompressed one it was over 600 MB! (IIRC) That's when I got the Pythia experts in and found out how to reduce the Pythia output as much as possible, but there's still a certain amount of output every 100 events that I haven't found out how to reduce, so the on-disk log is still a bit large. However, CRAB only returns the first 1000 lines and the last 3000 lines of the logs to the Condor server. That's currently running at ~400 kB/log and, you are right, it compresses marvellously since it's mainly the same thing over and over again. I actually run a cron job on the server every hour to compress all the job_out.*.txt files over 100 kB. Anything smaller than that is likely an error log, and the placeholder files before a job finishes are 47 bytes.
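
For anyone curious, the two log-handling steps described above (keeping only the head and tail of a log, then compressing the larger job_out.*.txt files) could be sketched roughly as follows. This is an illustration in Python, not the actual cron job or CRAB code: the function names, the reading of "100 kB" as 100,000 bytes and the choice of gzip are all assumptions.

```python
import gzip
import shutil
from pathlib import Path

SIZE_THRESHOLD = 100 * 1000  # "100 kB", assumed here to mean 100,000 bytes


def head_and_tail(lines, head=1000, tail=3000):
    """Keep only the first `head` and last `tail` lines of a log."""
    if len(lines) <= head + tail:
        return list(lines)
    return list(lines[:head]) + list(lines[-tail:])


def compress_large_logs(log_dir, threshold=SIZE_THRESHOLD):
    """Gzip every job_out.*.txt larger than `threshold`; return the .gz paths."""
    compressed = []
    for path in sorted(Path(log_dir).glob("job_out.*.txt")):
        if path.stat().st_size <= threshold:
            # Small files are likely error logs or 47-byte placeholders.
            continue
        target = path.with_name(path.name + ".gz")
        with path.open("rb") as src, gzip.open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        path.unlink()  # remove the uncompressed original
        compressed.append(target)
    return compressed
```

Run hourly (from cron, say), something like this would leave the small error logs and placeholders untouched and replace each large log with a .gz file next to it.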

So the stage-out errors concern the upload of the result .root file to the data bridge. These are already binary and compressed, and currently run about 16 MB. I haven't got as far as checking what went wrong last night, but the previous night's failures were actually failures in the post-processing, not in the job itself, due to overload in the CRAB servers as far as I can tell.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1138
Credit: 8,067,191
RAC: 2,972
Message 4342 - Posted: 19 Nov 2016, 11:46:21 UTC - in response to Message 4341.  

I think I understand the most-recent batch of 151 errors. On Thursday night there were, as already mentioned, jobs that finished properly being flagged as bad in the post-processing due to communications problems at CERN.
These jobs were re-queued and eventually re-run. However, when the job ran a second time, at stage-out it found that a result file already existed on the data-bridge. Since we don't have a flag set to allow over-write (guess I need to look into that...) the stage-out threw a wobbly, set an error code and deleted the existing file!
So, condor has set these jobs up for a 2nd re-try, which should mostly succeed as there is now no result file on the data bridge, but it's a far from satisfactory situation.
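
To make the sequence concrete, here is a toy model of it in Python. The data bridge is just a dict, and both function names are made up for illustration; none of this is the real CRAB or data-bridge code. The second function is one hypothetical way an overwrite flag could behave, not a description of any existing option.

```python
STAGE_OUT_ERROR = 151  # the code volunteers saw; its exact meaning is not modelled


def stage_out_as_described(bridge, name, data):
    """Model of the behaviour above: a name clash fails AND deletes the old file."""
    if name in bridge:
        del bridge[name]
        return STAGE_OUT_ERROR
    bridge[name] = data
    return 0


def stage_out_tolerant(bridge, name, data, overwrite=False):
    """Hypothetical alternative: a clash never destroys an existing result."""
    if name in bridge and not overwrite:
        return 0  # a result is already on the bridge; keep it and report success
    bridge[name] = data
    return 0


if __name__ == "__main__":
    bridge = {}
    # Run 1 succeeds (but is wrongly flagged bad later), run 2 clashes, run 3 succeeds.
    print([stage_out_as_described(bridge, "result.root", b"data") for _ in range(3)])
```

In this model the first behaviour needs three runs of the same job to end up with one result file, which is the unsatisfactory part.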

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4343 - Posted: 19 Nov 2016, 17:42:20 UTC - in response to Message 4342.  

Thanks for the detailed response.

Is there a way to update the guest-additions inside the image from the host?

ATLAS is reporting significant speedups with VirtualBox versions >5.0 for multi-core operation, so it would probably speed things up a lot if the guest also had a matching version of the Guest Additions (5.1.8 in my case).

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1138
Credit: 8,067,191
RAC: 2,972
Message 4344 - Posted: 20 Nov 2016, 23:58:04 UTC - in response to Message 4343.  

I guess that's a question for the experts like Laurence or Nils.
In the meantime I'm locked out of the Condor server due to an expired Certificate Revocation List -- I think.

[eesridr@pion:~] > gsissh -p 9700 lcggwms02.gridpp.rl.ac.uk
GSSAPI Error:
GSS Major Status: Authentication Failed

GSS Minor Status Error Chain:
globus_gsi_gssapi: SSLv3 handshake problems
globus_gsi_callback_module: Could not verify credential
globus_gsi_callback_module: Could not verify credential
globus_gsi_callback_module: Invalid CRL: The available CRL has expired


Hopefully I can fix this with Andrew's help tomorrow morning; we aren't due to run out of jobs just yet, but the Dashboard figures are extremely unreliable at the moment.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 2
Message 4348 - Posted: 21 Nov 2016, 21:41:22 UTC - in response to Message 4343.  

Is there a way to update the guest-additions inside the image from the host?


We can look into updating it.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1138
Credit: 8,067,191
RAC: 2,972
Message 4349 - Posted: 22 Nov 2016, 8:44:30 UTC - in response to Message 4344.  

Andrew restarted the automatic CRL updates, and I can log into the condor server again.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4364 - Posted: 30 Nov 2016, 13:11:36 UTC
Last modified: 30 Nov 2016, 13:14:19 UTC

Something is really wrong.

4400 jobs running, none finished- according to dashboard????

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1138
Credit: 8,067,191
RAC: 2,972
Message 4365 - Posted: 30 Nov 2016, 14:55:40 UTC - in response to Message 4364.  

Something is really wrong.

4400 jobs running, none finished- according to dashboard????

One of our major contributors appears to have pulled out of CMS@Home (it is a large development cluster at CERN; I guess he has other work for it). Oh, I misread that as 440 at first; now I see your point.
I'm not sure what Dashboard page you look at, but there have been severe overloads on Grid submissions lately and Dashboard is suffering too. Ah, yes, I see that Dashboard thinks there are 4,000 jobs running from the newest batch. In reality at the HTCondor server I see:

[cms005@lcggwms02:~] > ./stats.sh 161129_200356:ireid_crab_BPH-RunIISummer15GS-00046_BB
9000 NodeStatus = 1; /* "STATUS_READY" */
1000 NodeStatus = 3; /* "STATUS_SUBMITTED" */


i.e. 9,000 are ready, 1,000 are queued, but none are running yet.

The "current" batch has
    857 NodeStatus = 3; /* "STATUS_SUBMITTED" */
    4 NodeStatus = 4; /* "STATUS_POSTRUN" */
    8933 NodeStatus = 5; /* "STATUS_DONE" */
    206 NodeStatus = 6; /* "STATUS_ERROR" */

so 8,933 are supposed to have been successful, 206 failed, 4 not yet fully processed, and 857 are in the queue, out of which condor_status tells me 403 are actually running.
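
Totals like these can be pulled out of the listing mechanically. The sketch below parses lines in the format shown above; it is not the contents of stats.sh (which aren't given here), and the function name is invented for the example.

```python
import re

# Matches lines such as:  8933 NodeStatus = 5; /* "STATUS_DONE" */
STATUS_LINE = re.compile(
    r'^\s*(\d+)\s+NodeStatus\s*=\s*(\d+);\s*/\*\s*"(\w+)"\s*\*/'
)


def parse_node_status(text):
    """Return {status_name: job_count} for every matching line in `text`."""
    counts = {}
    for line in text.splitlines():
        match = STATUS_LINE.match(line)
        if match:
            count, _code, name = match.groups()
            counts[name] = counts.get(name, 0) + int(count)
    return counts


if __name__ == "__main__":
    sample = '''
    857 NodeStatus = 3; /* "STATUS_SUBMITTED" */
    4 NodeStatus = 4; /* "STATUS_POSTRUN" */
    8933 NodeStatus = 5; /* "STATUS_DONE" */
    206 NodeStatus = 6; /* "STATUS_ERROR" */
    '''
    counts = parse_node_status(sample)
    print(counts, "total:", sum(counts.values()))
```

Summing the counts is a quick sanity check that every job in the batch is accounted for in one state or another.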



captainjack
Joined: 18 Aug 15
Posts: 14
Credit: 125,335
RAC: 0
Message 4366 - Posted: 30 Nov 2016, 15:19:23 UTC

I just tried one of the multi-thread tasks and got this:

2016-11-30 09:02:18 (15296): Guest Log: [DEBUG] HTCondor ping
2016-11-30 09:02:28 (15296): Guest Log: [DEBUG] 0
2016-11-30 09:12:59 (15296): Guest Log: [ERROR] Condor exited after 637s without running a job.
2016-11-30 09:12:59 (15296): Guest Log: [INFO] Shutting Down.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1138
Credit: 8,067,191
RAC: 2,972
Message 4367 - Posted: 30 Nov 2016, 17:31:13 UTC - in response to Message 4366.  

I just tried one of the multi-thread tasks and got this:

2016-11-30 09:02:18 (15296): Guest Log: [DEBUG] HTCondor ping
2016-11-30 09:02:28 (15296): Guest Log: [DEBUG] 0
2016-11-30 09:12:59 (15296): Guest Log: [ERROR] Condor exited after 637s without running a job.
2016-11-30 09:12:59 (15296): Guest Log: [INFO] Shutting Down.

I get that myself, but Laurence is quite busy on other things at the moment. Last time he tried, it worked. :-/

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1138
Credit: 8,067,191
RAC: 2,972
Message 4368 - Posted: 30 Nov 2016, 22:06:17 UTC - in response to Message 4366.  

I just tried one of the multi-thread tasks and got this:

2016-11-30 09:02:18 (15296): Guest Log: [DEBUG] HTCondor ping
2016-11-30 09:02:28 (15296): Guest Log: [DEBUG] 0
2016-11-30 09:12:59 (15296): Guest Log: [ERROR] Condor exited after 637s without running a job.
2016-11-30 09:12:59 (15296): Guest Log: [INFO] Shutting Down.

Actually... Can you please post your app_config.xml so that the more-experienced volunteers can have a chance to critique it? Thanks.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4369 - Posted: 1 Dec 2016, 8:16:06 UTC - in response to Message 4366.  
Last modified: 1 Dec 2016, 8:17:59 UTC

It looks like you used 3 cores and 2048 MB of memory. That is not enough for 3 cores. You might want to try 2 cores and 2560 MB of memory, or 3 cores with 3328 MB.

There are still no exact memory figures available for any number of cores.

I suggest the formula MEMORY(MB)=1024MB+n*768MB (no guarantees)
n is the number of cores used per task.
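
The suggested formula as a small Python helper, purely as a sketch; the function name is made up and, as said, the numbers carry no guarantees.

```python
def suggested_memory_mb(cores, base_mb=1024, per_core_mb=768):
    """MEMORY(MB) = 1024 MB + n * 768 MB, where n is the cores used per task."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return base_mb + cores * per_core_mb


if __name__ == "__main__":
    for n in (1, 2, 3, 4):
        print(n, "cores:", suggested_memory_mb(n), "MB")
```

For 2 and 3 cores this reproduces the 2560 MB and 3328 MB figures mentioned above.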

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 2
Message 4371 - Posted: 1 Dec 2016, 8:30:21 UTC - in response to Message 4369.  


I suggest the formula MEMORY(MB)=1024MB+n*768MB (no guarantees)
n is the number of cores used per task.


You should not need this formula or the app_config.xml. Multicore should now work out of the box; you just need to specify the number of cores to use in the project preferences.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4372 - Posted: 1 Dec 2016, 10:00:32 UTC - in response to Message 4371.  
Last modified: 1 Dec 2016, 10:08:59 UTC

captainjack had 3 cores and 2048MB specified.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=290215

Unless he was using an app_config, it appears not to be working.

BTW, selecting anything other than 1 core per task is not working, and has not been working for several weeks.
It only runs single-core tasks, no matter what the settings are.

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 2
Message 4374 - Posted: 1 Dec 2016, 10:42:14 UTC - in response to Message 4372.  

On the server the plan class specifies 1 GB of base memory plus 896 MB per core. If this is not working, it could be that the job itself is requesting 2 GB. This is something that Ivan can alter.
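
As a rough cross-check, the plan-class rule can be inverted to see which core count a given memory allocation corresponds to. This sketch assumes "1GB" means 1000 MB and that the per-core figure is in MB; neither is stated explicitly, and the function name is invented.

```python
def cores_implied_by_memory(memory_mb, base_mb=1000, per_core_mb=896):
    """Return n if memory_mb == base_mb + n * per_core_mb for a whole n >= 1, else None.

    base_mb=1000 is an assumption about what "1GB" means in the plan class.
    """
    extra = memory_mb - base_mb
    if extra <= 0 or extra % per_core_mb != 0:
        return None
    return extra // per_core_mb


if __name__ == "__main__":
    for mb in (1896, 2048, 3688):
        print(mb, "MB ->", cores_implied_by_memory(mb))
```

Under these assumptions 1896 MB corresponds to one core, while 2048 MB matches no whole number of cores, which would fit a memory figure that comes from somewhere other than the plan class.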

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4375 - Posted: 1 Dec 2016, 11:07:31 UTC - in response to Message 4374.  
Last modified: 1 Dec 2016, 11:08:40 UTC

As I mentioned before:

http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=306&postid=4301#4301

NO MATTER WHAT THE "Max # of CPUs for this project" SETTINGS ARE, I ALWAYS GET A 1 CORE 1896MB TASK.

(when not using an app_config.xml)