Message boards : CMS Application : Ready For Production
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,874,101 RAC: 172
OK, message back from my contact (it helps to have PhD students from your group that you helped through their theses...): this is a known issue with WMAgent; the devs are aware of it and are supposed to provide a fix. It's not just the activity name not being reported but all the task metadata, unfortunately. So we just have to be patient; there's no point in pushing it further.
Send message Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 431
If you manage to catch one (sometime early next week if the WMAgent backlog continues) do let me know how it fared. Of your 10 test jobs, 3 succeeded, 2 are running somewhere, and 5 are still pending. I didn't get a single one of those, but I already have some from the newest batch: 160626_161246:ireid_crab_CMS_at_Home_TTbar_50ev_prodJ
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,874,101 RAC: 172
If you manage to catch one (sometime early next week if the WMAgent backlog continues) do let me know how it fared. Oh, good, something to look at when I get into the office. The latest batch is "security" to keep you guys going while I assess the B-physics results.
Send message Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 431
Got three of your 46_B batch - IDs 88, 95 and 96. 150,000 events creates a huge running.log file in the VM: 29 MB after 13,000 events.
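The growth rate reported above extrapolates to an uncomfortably large file over a full job; a quick back-of-the-envelope check:

```python
# Extrapolate the reported running.log growth (29 MB after 13,000 events)
# to a full 150,000-event job. Pure arithmetic on the figures in the post.
mb_per_event = 29 / 13_000              # about 0.0022 MB per event
projected_mb = mb_per_event * 150_000   # projected size for a full job
print(round(projected_mb))              # roughly 335 MB
```

That projection is the same order as the ~390 MB unpacked log reported in the next reply.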
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 486
160627_195824:ireid_crab_BPH-RunIISummer15GS-00046_C. Unless someone fixes the shutdown/resume process, I'm never going to finish one of these, so I'll put my hosts to other things.
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,874,101 RAC: 172
Got three of your 46_B batch - IDs 88, 95 and 96. Yes, I'm concerned about that. I'd turned on the "return log file" option and was gobsmacked to see that the jobs returned 12 MB result files and 45 MB log.tar.gz files. Then I unpacked one of the tar files to find the log file expands to 390 MB! Not very good... It looks like the culprit is the event generator PYTHIA -- it seems to get called from scratch(?) for every event, and each invocation prints out a huge prologue complete with all copyright and legal references, an ASCII-art banner, etc. I've turned off returning log files; I'll have to see if there's a way to turn off the flag pages.
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,874,101 RAC: 172
The good news is that after posting a query on an expert forum, I have received some suggestions on how to change PYTHIA parameters to eliminate the verbosity. I shall be testing them tomorrow (as soon as I work out how to translate PYTHIA-style declarations into CMSSW-style python statements). Apparently the problem is that PYTHIA does an init() after every 100 events (Aside: as an experienced -- not necessarily expert -- programmer, I have to wonder why. These "traditional" methods worry me, what real error are they sweeping under the carpet?). And, by default, the init() prints info about the generator state. Solution: turn off all init verbosity.
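For readers curious what "turn off all init verbosity" looks like in practice, here is a sketch of the standard Pythia 8 settings for silencing init() and per-event printout. The flag names are standard Pythia 8 ones; where they go in the actual CMSSW production config (typically a PythiaParameters PSet) is assumed, not taken from the thread:

```python
# Standard Pythia 8 flags that suppress the init() dump and per-event
# progress printout. In a CMSSW config these strings would typically be
# placed in the generator's PythiaParameters PSet (layout assumed here).
quiet_pythia_settings = [
    'Init:showProcesses = off',             # no process table at init()
    'Init:showChangedSettings = off',       # no changed-settings dump
    'Init:showChangedParticleData = off',   # no particle-data dump
    'Init:showMultipartonInteractions = off',
    'Next:numberCount = 0',                 # no "event N" progress counters
    'Next:numberShowEvent = 0',             # no per-event listings
]
print(len(quiet_pythia_settings))
```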
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,874,101 RAC: 172
The good news is that after posting a query on an expert forum, I have received some suggestions on how to change PYTHIA parameters to eliminate the verbosity. I shall be testing them tomorrow (as soon as I work out how to translate PYTHIA-style declarations into CMSSW-style python statements). OK, I've managed to get the stdout down to 20 MB or less, but I've also dropped Nevents from 150,000 to 100,000 in the next batch. I'm not convinced that will affect our error rates, as initially one individual was responsible for nearly half our failures due to some problem reading files from /cvmfs -- looks like a network problem but at the moment I don't have time to go to Poland to sort it out... Of course, the inflated stdout files also affected the logs and we almost ran out of space on the Condor server again. I think I just noticed in time; there weren't any PostProc failures that I could see.
Send message Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 431
Well done weekend work, Ivan!
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,874,101 RAC: 172
Well done weekend work, Ivan! Thanks. I'll be away Mon-Wed, but I believe Laurence is back tomorrow.
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
It looks like a few hosts are producing a large number of fails in the latest batch, again. Wasn't there a fix that stops hosts from producing large numbers of fails? I guess not, because the actual BOINC tasks do not error out.
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,874,101 RAC: 172
It looks like a few hosts are producing a large number of fails in the latest batch, again. There's one user producing a majority of the fails, and another most of the rest. I'm not totally concerned about it for the present, because they are only hurting themselves, not the rest of the Volunteers (and I've got other things to do just at present). And it should not be a great concern to the physicist who requested this workflow -- it's this or nothing for some investigations; he's only getting 7/8ths of the submissions, but that's better than nothing out of nothing. I think some of the protections may have fallen out when the project was refactored to cope with several projects, or maybe they were not transferred to the standard project (both users are from the vLHC beta). With Laurence back tomorrow, we might find a better way to guard against these (possibly kill a task if it turns up an error 65 job-result?). I'll try to get a summary to him before I start packing tonight.
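The "kill a task on an error 65 job-result" idea could take roughly this shape. This is purely a sketch: the function and callback names are invented stand-ins, not the actual CMS@Home wrapper code:

```python
# Hypothetical guard: if a returned job result carries exit code 65,
# abort the surrounding task so the host stops churning out failures.
# abort_task is an invented stand-in for whatever hook the real
# wrapper would expose.
FATAL_JOB_EXIT = 65

def handle_job_result(exit_code, abort_task):
    """Return True if the task may continue, False if it was aborted."""
    if exit_code == FATAL_JOB_EXIT:
        abort_task(f"job exited with code {exit_code}; aborting task")
        return False
    return True

messages = []
print(handle_job_result(0, messages.append))   # True: healthy job, task continues
print(handle_job_result(65, messages.append))  # False: task aborted
```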
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Thanks, Ivan. Have a safe trip.
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,874,101 RAC: 172
Sorry about the glitch tonight. I'd submitted a small batch to see what effect another attempt at reducing Pythia's verbosity had made, and set up a small cron job to compress the log files returned to the Condor server -- they were depleting free space on the /home partition more than "usual". Then I got distracted by travel, and a request to help test a new Compute Element in our Grid farm. There were a lot of Condor jobs waiting for input by the time I noticed the well was dry. :-)
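A log-compressing cron job of that kind might look something like the following. The paths and schedule are assumptions for illustration, not the actual Condor server layout; the demonstration runs against a throwaway directory:

```shell
# A cron entry for this could look like (path and schedule assumed):
#   0 * * * * find /home/condor/logs -name '*.log' -exec gzip -9 {} +
# Demonstration of the find/gzip step against a temporary directory:
demo=$(mktemp -d)
printf 'lots of PYTHIA banner text\n' > "$demo/job_1.log"
find "$demo" -name '*.log' -exec gzip -9 {} +
ls "$demo"   # gzip replaces job_1.log with job_1.log.gz
```

gzip removes the original file after compressing, so repeated cron runs only touch logs that have arrived since the last pass.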
©2024 CERN