Message boards : CMS Application : Ready For Production

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3611 - Posted: 25 Jun 2016, 21:59:54 UTC - in response to Message 3608.  

OK, I've had a message back from my contact (it helps to have PhD students from your group that you helped through their theses...):
This is a known issue with WMAgent; the devs are aware of it and are supposed to provide a fix. It's not just the activity name not being reported, it's all the task metadata, unfortunately.

So, we just have to be patient. There's no point in pushing it further.
ID: 3611
Crystal Pellet (Volunteer tester)
Joined: 13 Feb 15 · Posts: 1180 · Credit: 815,336 · RAC: 431
Message 3612 - Posted: 27 Jun 2016, 5:44:52 UTC - in response to Message 3605.  

If you manage to catch one (sometime early next week if the WMAgent backlog continues) do let me know how it fared.

Of your 10 test jobs, 3 succeeded, 2 are running somewhere and 5 are still pending.
I didn't get a single one of those, but I already have some from the newest batch: 160626_161246:ireid_crab_CMS_at_Home_TTbar_50ev_prodJ
ID: 3612
ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3613 - Posted: 27 Jun 2016, 7:22:52 UTC - in response to Message 3612.  

If you manage to catch one (sometime early next week if the WMAgent backlog continues) do let me know how it fared.

Of your 10 test jobs, 3 succeeded, 2 are running somewhere and 5 are still pending.
I didn't get a single one of those, but I already have some from the newest batch: 160626_161246:ireid_crab_CMS_at_Home_TTbar_50ev_prodJ

Oh, good, something to look at when I get into the office. The latest batch is "security" to keep you guys going while I assess the B-physics results.
ID: 3613
Crystal Pellet (Volunteer tester)
Joined: 13 Feb 15 · Posts: 1180 · Credit: 815,336 · RAC: 431
Message 3614 - Posted: 27 Jun 2016, 20:08:02 UTC - in response to Message 3613.  

Got three of your 46_B batch. IDs: 88, 95 and 96.

150,000 events creates a huge running.log file in the VM: 29 MB after 13,000 events.
ID: 3614
m (Volunteer tester)
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 486
Message 3615 - Posted: 27 Jun 2016, 21:29:15 UTC - in response to Message 3614.  

160627_195824:ireid_crab_BPH-RunIISummer15GS-00046_C

Unless someone fixes the shutdown/resume process, I'm never going to finish one of these, so I'll put hosts to other things.
ID: 3615
ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3616 - Posted: 27 Jun 2016, 21:50:47 UTC - in response to Message 3614.  
Last modified: 27 Jun 2016, 21:52:41 UTC

Got three of your 46_B batch. IDs: 88, 95 and 96.

150,000 events creates a huge running.log file in the VM: 29 MB after 13,000 events.

Yes, I'm concerned about that. I'd turned on the "return log file" option and was gobsmacked to see that the jobs returned 12 MB result files and 45 MB log.tar.gz files. Then I unpacked one of the tar files to find the log file expands to 390 MB! Not very good...
Looks like the culprit is the event generator "PYTHIA" -- it seems it gets called from scratch(?) for every event and each invocation prints out a huge prologue complete with all copyright and legal references, an ASCII-art banner, etc.
I've turned off returning log files; I'll have to see if there's a way to turn off the flag pages.
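(For reference: in a CRAB3 configuration the log-return switch is normally General.transferLogs. A minimal sketch, assuming that is the knob in question; the request name below is purely hypothetical:)

    from CRABClient.UserUtilities import config

    config = config()
    # Hypothetical request name, for illustration only
    config.General.requestName = 'CMS_at_Home_TTbar_50ev_prodK'
    # Stop shipping log.tar.gz back with every job; results are still returned
    config.General.transferLogs = False
    config.General.transferOutputs = True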
ID: 3616
ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3620 - Posted: 30 Jun 2016, 19:30:21 UTC - in response to Message 3616.  

The good news is that after posting a query on an expert forum, I have received some suggestions on how to change PYTHIA parameters to eliminate the verbosity. I shall be testing them tomorrow (as soon as I work out how to translate PYTHIA-style declarations into CMSSW-style python statements).
Apparently the problem is that PYTHIA does an init() after every 100 events (Aside: as an experienced -- not necessarily expert -- programmer, I have to wonder why. These "traditional" methods worry me, what real error are they sweeping under the carpet?). And, by default, the init() prints info about the generator state. Solution: turn off all init verbosity.
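(For illustration, a minimal sketch of how such settings are typically written in a CMSSW python generator fragment, using standard Pythia 8 flag names to silence the init() tables and per-event print-out; the exact flags used for these batches are an assumption:)

    import FWCore.ParameterSet.Config as cms

    # Sketch of a Pythia 8 generator fragment with the print-out suppressed.
    # The flag names are standard Pythia 8 settings; whether these are the
    # exact ones used in production is an assumption.
    generator = cms.EDFilter("Pythia8GeneratorFilter",
        comEnergy = cms.double(13000.0),
        maxEventsToPrint = cms.untracked.int32(0),       # no full event dumps
        pythiaPylistVerbosity = cms.untracked.int32(0),
        pythiaHepMCVerbosity = cms.untracked.bool(False),
        PythiaParameters = cms.PSet(
            processParameters = cms.vstring(
                'Init:showProcesses = off',              # silence the init() tables
                'Init:showChangedSettings = off',
                'Init:showChangedParticleData = off',
                'Init:showMultipartonInteractions = off',
                'Next:numberShowInfo = 0',               # no per-event banner lines
                'Next:numberShowProcess = 0',
                'Next:numberShowEvent = 0',
            ),
            parameterSets = cms.vstring('processParameters'),
        ),
    )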
ID: 3620
ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3623 - Posted: 2 Jul 2016, 17:57:12 UTC - in response to Message 3620.  
Last modified: 2 Jul 2016, 17:59:23 UTC

The good news is that after posting a query on an expert forum, I have received some suggestions on how to change PYTHIA parameters to eliminate the verbosity. I shall be testing them tomorrow (as soon as I work out how to translate PYTHIA-style declarations into CMSSW-style python statements).
Apparently the problem is that PYTHIA does an init() after every 100 events (Aside: as an experienced -- not necessarily expert -- programmer, I have to wonder why. These "traditional" methods worry me, what real error are they sweeping under the carpet?). And, by default, the init() prints info about the generator state. Solution: turn off all init verbosity.

OK, I've managed to get the stdout down to 20 MB or less, but I've also dropped Nevents from 150,000 to 100,000 in the next batch. I'm not convinced that will affect our error rates, as initially one individual was responsible for nearly half our failures due to some problem reading files from /cvmfs -- it looks like a network problem, but at the moment I don't have time to go to Poland to sort it out...

Of course, the inflated stdout files also affected the logs, and we almost ran out of space on the Condor server again. I think I noticed just in time; there weren't any PostProc failures that I could see.
ID: 3623
Crystal Pellet (Volunteer tester)
Joined: 13 Feb 15 · Posts: 1180 · Credit: 815,336 · RAC: 431
Message 3624 - Posted: 3 Jul 2016, 6:43:45 UTC

Well done on the weekend work, Ivan!
ID: 3624
ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3625 - Posted: 3 Jul 2016, 9:19:19 UTC - in response to Message 3624.  

Well done on the weekend work, Ivan!

Thanks. I'll be away Mon-Wed, but I believe Laurence is back tomorrow.
ID: 3625
Rasputin42 (Volunteer tester)
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
Message 3626 - Posted: 3 Jul 2016, 10:19:05 UTC

It looks like a few hosts are producing a large number of fails in the latest batch, again.
Wasn't there a fix that stops hosts from producing large numbers of fails?
I guess not, because the actual BOINC tasks do not error out.
ID: 3626
ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3627 - Posted: 3 Jul 2016, 14:12:20 UTC - in response to Message 3626.  

It looks like a few hosts are producing a large number of fails in the latest batch, again.
Wasn't there a fix that stops hosts from producing large numbers of fails?
I guess not, because the actual BOINC tasks do not error out.

There's one user producing a majority of the fails, and another most of the rest. I'm not too concerned about it for the present, because they are only hurting themselves, not the rest of the Volunteers (and I've got other things to do just now). And it should not be a great concern to the physicist who requested this workflow -- it's this or nothing for some investigations; he's only getting 7/8ths of the submissions, but that's better than nothing out of nothing.

I think some of the protections may have fallen out when the project was refactored to cope with several projects, or maybe they were not transferred to the standard project (both users are from the vLHC beta). With Laurence back tomorrow, we might find a better way to guard against these (possibly kill a task if it turns up an error 65 job-result?). I'll try to get a summary to him before I start packing tonight.
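(Purely as a sketch of the "kill a task on repeated error 65" idea; the threshold, input format and parser below are all assumptions, not an existing guard:)

    from collections import Counter

    ERROR_CODE = 65      # the job-result error code mentioned above
    MAX_FAILS = 10       # hypothetical threshold before a host is blocked

    def hosts_to_block(job_results):
        """job_results: iterable of (host_id, exit_code) tuples from finished jobs."""
        fails = Counter(host for host, code in job_results if code == ERROR_CODE)
        return [host for host, n in fails.items() if n >= MAX_FAILS]

    # Example usage: feed in parsed results, get back hosts to stop sending work to.
    # blocked = hosts_to_block(parse_condor_history('history.log'))  # parser hypothetical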
ID: 3627
Rasputin42 (Volunteer tester)
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
Message 3628 - Posted: 3 Jul 2016, 16:27:30 UTC - in response to Message 3627.  

Thanks, Ivan.
Have a safe trip.
ID: 3628
ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3637 - Posted: 6 Jul 2016, 20:00:33 UTC - in response to Message 3628.  

Sorry about the glitch tonight. I'd submitted a small batch to see what effect another attempt at reducing Pythia's verbosity had made, and set up a small cron job to compress the log files returned to the Condor server -- they were depleting free space on the /home partition more than "usual". Then I got distracted by travel, and a request to help test a new Compute Element in our Grid farm. There were a lot of Condor jobs waiting for input by the time I noticed the well was dry. :-)
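(The cron job itself isn't shown here; a minimal sketch of what such a compressor might look like, assuming the returned logs sit under a hypothetical directory like /home/condor/logs:)

    #!/usr/bin/env python
    # Hypothetical log compressor; e.g. run hourly from cron:
    #   0 * * * * /usr/local/bin/compress_condor_logs.py
    import gzip, os, shutil, time

    LOG_DIR = '/home/condor/logs'   # hypothetical location of the returned logs
    MIN_AGE = 3600                  # skip files modified in the last hour

    for root, _dirs, files in os.walk(LOG_DIR):
        for name in files:
            if not name.endswith('.log'):
                continue
            path = os.path.join(root, name)
            if time.time() - os.path.getmtime(path) < MIN_AGE:
                continue            # might still be written to
            with open(path, 'rb') as src, gzip.open(path + '.gz', 'wb') as dst:
                shutil.copyfileobj(src, dst)
            os.remove(path)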
ID: 3637