Message boards : CMS Application : Ready For Production

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3611 - Posted: 25 Jun 2016, 21:59:54 UTC - in response to Message 3608.  

OK, I've had a message back from my contact (it helps to have PhD students from your group that you helped through their theses...):
This is a known issue with WMAgent; the devs are aware of it and are supposed to provide a fix. It's not just the activity name not being reported, it's all the task metadata, unfortunately.

So, we just have to be patient. There's no point in pushing it further.
ID: 3611
Crystal Pellet (Volunteer tester)
Joined: 13 Feb 15 · Posts: 1180 · Credit: 815,336 · RAC: 431
Message 3612 - Posted: 27 Jun 2016, 5:44:52 UTC - in response to Message 3605.  

If you manage to catch one (sometime early next week if the WMAgent backlog continues) do let me know how it fared.

Of your 10 test jobs, 3 succeeded, 2 are running somewhere and 5 are still pending.
I didn't get a single one of those, but I already have some from the newest batch: 160626_161246:ireid_crab_CMS_at_Home_TTbar_50ev_prodJ
ID: 3612
ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3613 - Posted: 27 Jun 2016, 7:22:52 UTC - in response to Message 3612.  

If you manage to catch one (sometime early next week if the WMAgent backlog continues) do let me know how it fared.

Of your 10 test jobs, 3 succeeded, 2 are running somewhere and 5 are still pending.
I didn't get a single one of those, but I already have some from the newest batch: 160626_161246:ireid_crab_CMS_at_Home_TTbar_50ev_prodJ

Oh, good, something to look at when I get into the office. The latest batch is "security" to keep you guys going while I assess the B-physics results.
ID: 3613
Crystal Pellet (Volunteer tester)
Joined: 13 Feb 15 · Posts: 1180 · Credit: 815,336 · RAC: 431
Message 3614 - Posted: 27 Jun 2016, 20:08:02 UTC - in response to Message 3613.  

Got three of your 46_B batch. IDs: 88, 95 and 96.

150,000 events creates a huge running.log file in the VM: 29 MB after 13,000 events.
ID: 3614
m (Volunteer tester)
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 486
Message 3615 - Posted: 27 Jun 2016, 21:29:15 UTC - in response to Message 3614.  

160627_195824:ireid_crab_BPH-RunIISummer15GS-00046_C

Unless someone fixes the shutdown/resume process, I'm never going to finish one of these, so I'll put hosts to other things.
ID: 3615
ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3616 - Posted: 27 Jun 2016, 21:50:47 UTC - in response to Message 3614.  
Last modified: 27 Jun 2016, 21:52:41 UTC

Got three of your 46_B batch. IDs: 88, 95 and 96.

150,000 events creates a huge running.log file in the VM: 29 MB after 13,000 events.

Yes, I'm concerned about that. I'd turned on the "return log file" option and was gobsmacked to see that the jobs returned 12 MB result files and 45 MB log.tar.gz files. Then I unpacked one of the tar files to find the log file expands to 390 MB! Not very good...
Looks like the culprit is the event generator "PYTHIA" -- it seems it gets called from scratch(?) for every event and each invocation prints out a huge prologue complete with all copyright and legal references, an ASCII-art banner, etc.
I've turned off returning log files; I'll have to see if there's a way to turn off the flag pages.
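(For reference: in a CRAB3 configuration the log-return switch is normally General.transferLogs. A minimal sketch, assuming that is the knob in question; the request name below is purely hypothetical:)

    from CRABClient.UserUtilities import config

    config = config()
    # Hypothetical request name, for illustration only
    config.General.requestName = 'CMS_at_Home_TTbar_50ev_prodK'
    # Stop shipping log.tar.gz back with every job; results are still returned
    config.General.transferLogs = False
    config.General.transferOutputs = True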
ID: 3616
ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3620 - Posted: 30 Jun 2016, 19:30:21 UTC - in response to Message 3616.  

The good news is that after posting a query on an expert forum, I have received some suggestions on how to change PYTHIA parameters to eliminate the verbosity. I shall be testing them tomorrow (as soon as I work out how to translate PYTHIA-style declarations into CMSSW-style python statements).
Apparently the problem is that PYTHIA does an init() after every 100 events (Aside: as an experienced -- not necessarily expert -- programmer, I have to wonder why. These "traditional" methods worry me, what real error are they sweeping under the carpet?). And, by default, the init() prints info about the generator state. Solution: turn off all init verbosity.
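(For illustration, a minimal sketch of how such settings are typically written in a CMSSW python generator fragment, using standard Pythia 8 flag names to silence the init() tables and per-event print-out; the exact flags used for these batches are an assumption:)

    import FWCore.ParameterSet.Config as cms

    # Sketch of a Pythia 8 generator fragment with the print-out suppressed.
    # The flag names are standard Pythia 8 settings; whether these are the
    # exact ones used in production is an assumption.
    generator = cms.EDFilter("Pythia8GeneratorFilter",
        comEnergy = cms.double(13000.0),
        maxEventsToPrint = cms.untracked.int32(0),       # no full event dumps
        pythiaPylistVerbosity = cms.untracked.int32(0),
        pythiaHepMCVerbosity = cms.untracked.bool(False),
        PythiaParameters = cms.PSet(
            processParameters = cms.vstring(
                'Init:showProcesses = off',              # silence the init() tables
                'Init:showChangedSettings = off',
                'Init:showChangedParticleData = off',
                'Init:showMultipartonInteractions = off',
                'Next:numberShowInfo = 0',               # no per-event banner lines
                'Next:numberShowProcess = 0',
                'Next:numberShowEvent = 0',
            ),
            parameterSets = cms.vstring('processParameters'),
        ),
    )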
ID: 3620
ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3623 - Posted: 2 Jul 2016, 17:57:12 UTC - in response to Message 3620.  
Last modified: 2 Jul 2016, 17:59:23 UTC

The good news is that after posting a query on an expert forum, I have received some suggestions on how to change PYTHIA parameters to eliminate the verbosity. I shall be testing them tomorrow (as soon as I work out how to translate PYTHIA-style declarations into CMSSW-style python statements).
Apparently the problem is that PYTHIA does an init() after every 100 events (Aside: as an experienced -- not necessarily expert -- programmer, I have to wonder why. These "traditional" methods worry me, what real error are they sweeping under the carpet?). And, by default, the init() prints info about the generator state. Solution: turn off all init verbosity.

OK, I've managed to get the stdout down to 20 MB or less, but I've also dropped Nevents from 150,000 to 100,000 in the next batch. I'm not convinced that will affect our error rates, as initially one individual was responsible for nearly half our failures due to some problem reading files from /cvmfs -- it looks like a network problem, but at the moment I don't have time to go to Poland to sort it out...

Of course, the inflated stdout files also affected the logs, and we almost ran out of space on the Condor server again. I think I noticed just in time; there weren't any PostProc failures that I could see.
ID: 3623
Crystal Pellet (Volunteer tester)
Joined: 13 Feb 15 · Posts: 1180 · Credit: 815,336 · RAC: 431
Message 3624 - Posted: 3 Jul 2016, 6:43:45 UTC

Well done on the weekend work, Ivan!
ID: 3624
ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3625 - Posted: 3 Jul 2016, 9:19:19 UTC - in response to Message 3624.  

Well done on the weekend work, Ivan!

Thanks. I'll be away Mon-Wed, but I believe Laurence is back tomorrow.
ID: 3625
Rasputin42 (Volunteer tester)
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
Message 3626 - Posted: 3 Jul 2016, 10:19:05 UTC

It looks like a few hosts are producing a large number of fails in the latest batch, again.
Wasn't there a fix that stops hosts from producing large numbers of fails?
I guess not, because the actual BOINC tasks do not error out.
ID: 3626
ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3627 - Posted: 3 Jul 2016, 14:12:20 UTC - in response to Message 3626.  

It looks like a few hosts are producing a large number of fails in the latest batch, again.
Wasn't there a fix that stops hosts from producing large numbers of fails?
I guess not, because the actual BOINC tasks do not error out.

There's one user producing a majority of the fails, and another most of the rest. I'm not too concerned about it for the present, because they are only hurting themselves, not the rest of the Volunteers (and I've got other things to do just now). And it should not be a great concern to the physicist who requested this workflow -- it's this or nothing for some investigations; he's only getting 7/8ths of the submissions, but that's better than nothing out of nothing.

I think some of the protections may have fallen out when the project was refactored to cope with several projects, or maybe they were not transferred to the standard project (both users are from the vLHC beta). With Laurence back tomorrow, we might find a better way to guard against these (possibly kill a task if it turns up an error 65 job-result?). I'll try to get a summary to him before I start packing tonight.
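(Purely as a sketch of the "kill a task on repeated error 65" idea; the threshold, input format and parser below are all assumptions, not an existing guard:)

    from collections import Counter

    ERROR_CODE = 65      # the job-result error code mentioned above
    MAX_FAILS = 10       # hypothetical threshold before a host is blocked

    def hosts_to_block(job_results):
        """job_results: iterable of (host_id, exit_code) tuples from finished jobs."""
        fails = Counter(host for host, code in job_results if code == ERROR_CODE)
        return [host for host, n in fails.items() if n >= MAX_FAILS]

    # Example usage: feed in parsed results, get back hosts to stop sending work to.
    # blocked = hosts_to_block(parse_condor_history('history.log'))  # parser hypothetical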
ID: 3627
Rasputin42 (Volunteer tester)
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0
Message 3628 - Posted: 3 Jul 2016, 16:27:30 UTC - in response to Message 3627.  

Thanks, Ivan.
Have a safe trip.
ID: 3628
ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Joined: 20 Jan 15 · Posts: 1129 · Credit: 7,874,101 · RAC: 172
Message 3637 - Posted: 6 Jul 2016, 20:00:33 UTC - in response to Message 3628.  

Sorry about the glitch tonight. I'd submitted a small batch to see what effect another attempt at reducing Pythia's verbosity had made, and set up a small cron job to compress the log files returned to the Condor server -- they were depleting free space on the /home partition more than "usual". Then I got distracted by travel, and a request to help test a new Compute Element in our Grid farm. There were a lot of Condor jobs waiting for input by the time I noticed the well was dry. :-)
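(The cron job itself isn't shown here; a minimal sketch of what such a compressor might look like, assuming the returned logs sit under a hypothetical directory like /home/condor/logs:)

    #!/usr/bin/env python
    # Hypothetical log compressor; e.g. run hourly from cron:
    #   0 * * * * /usr/local/bin/compress_condor_logs.py
    import gzip, os, shutil, time

    LOG_DIR = '/home/condor/logs'   # hypothetical location of the returned logs
    MIN_AGE = 3600                  # skip files modified in the last hour

    for root, _dirs, files in os.walk(LOG_DIR):
        for name in files:
            if not name.endswith('.log'):
                continue
            path = os.path.join(root, name)
            if time.time() - os.path.getmtime(path) < MIN_AGE:
                continue            # might still be written to
            with open(path, 'rb') as src, gzip.open(path + '.gz', 'wb') as dst:
                shutil.copyfileobj(src, dst)
            os.remove(path)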
ID: 3637