Message boards :
Number crunching :
Expect errors eventually
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
Point taken, and I know this is a priority for the development team. I just don't know how far off it is from being implemented, sorry. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
I'm getting an error submitting a new batch; please bear with me... [Edit] Fixed now. Some CRAB parameters changed names in the latest release... [/Edit] |
Send message Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0 |
... I don't expect it to be implemented anytime soon, perhaps not even before I retire. :-( Which might be April 2017. Hope you'll still be able to read that small print when the day comes… I can (just!) and my day came in 2002.... Ben |
Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
There was some discussion yesterday on whether CMS should implement a different model, an "event server" rather than a "job server", i.e. the job starts and then requests events as it finishes the previous lot, rather than the current model where each job starts, processes a given number of events and then stops. I mention this for completeness; I don't expect it to be implemented anytime soon, perhaps not even before I retire. :-( Which might be April 2017.

I have been trying to watch CPU utilization, and the current approach bothers me. It SOUNDS like (sort of verified by the data I'm seeing, with some odd anomalies) the current method is to launch the result on our host and either a) do "n" events, then sit there and do nothing for the rest of the 24 hours, taking up an unnecessary "slot" on our host, or b) work for 24 hours then terminate even though there are "n-x" events still to be processed (losing the one that was in process as well).

The "different" model sounds like the plan is to completely replicate BOINC on a micro-scale, where the "result" acts like BOINC Manager, requesting work from the project, and either a) getting and processing an "event" (micro-result) or b) getting "no work available" and sitting there, again taking up an unnecessary slot on our host, until the 24 hours is up or more work becomes available. And if work becomes available at 23:59, well, oops! Lost event. Yes, if work was constantly available, no problem (except on calculating credits?), but if not, big problem.

BOTH approaches seem not only to be unnecessarily complex on YOUR side, but to have a built-in inefficiency when it comes to utilization of our volunteer computing time. From a volunteer standpoint, I still strongly maintain that the CORRECT way to do this is to stick with the current model, but terminate the result when the last event (whether it be #50 or #500) has been processed!!!
This is how EVERY other project I am familiar with works (I don't know about LHC because I don't run it).

BOTH CMS approaches, as implemented or described, also CREATE a huge difficulty in awarding the proper number of credits, at least if you follow the "rules" and award credit based on Cobblestones (actual event processing). Namely, how do you "pay" for the time you have occupied my host without actually processing anything? If you pay nothing (i.e. fixed credit per result, whether based on the number of events - method a - or not - method b, since you don't know how many events if work wasn't available at some point), bye-bye volunteers. If instead you pay as if the entire 24 hours were actually "used", then you're guilty of credit inflation, and BOINC admin will have a problem with that. (See WCG and the fact that there, you can earn more "badges" by running the _slowest_ computers you can find. At least there it doesn't affect credit, only badges, so BOINC doesn't care. The super-crunchers do, though.)

You also get into the "how the heck do we describe what we're doing so new crunchers know what to expect" issue, since you'll be so much different from other projects. Surprises = more lost volunteers. Quake-Catcher is the only project I know that just pays "slot rent", but their results take almost _0_ CPU time (watching a sensor), so other projects are not prevented from running. (They don't really occupy a CPU slot at all.) They also give very little credit, only have a few thousand volunteers, most of whom are inactive, and are closing down.

The current CMS approach also causes problems with "estimated time" - my little Linux box THINKS it's going to take 30+ hours per CMS result, which means it gets less work from other projects to compensate, then has to "catch up" once the CMS result's time falls to more like reality.
BOINC uses "estimated Cobblestones" to calculate this time, and CMS obviously doesn't have a clue how many Cobblestones will be done by each result beforehand with EITHER of the above approaches! Knowing how many events were "in" a result at the time of sending it to us would let you calculate this with pretty good accuracy. |
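For readers unfamiliar with the mechanism being described: BOINC predicts a task's runtime roughly as the work unit's estimated floating-point operations divided by the host's benchmarked speed, so a server that doesn't know how many events a job contains can't set a sensible estimate. A toy illustration of the "30+ hours" effect (the function name and numbers are mine, not BOINC's actual internals):

```python
def estimated_runtime_hours(est_gflop, host_gflops):
    """BOINC-style runtime estimate: claimed work divided by host speed."""
    return est_gflop / host_gflops / 3600.0

# If the server guesses 500,000 GFLOP for a job that really needs 50,000,
# a 5 GFLOPS host predicts ~27.8 h instead of ~2.8 h, and the scheduler
# withholds work from other projects to compensate.
pessimistic = estimated_runtime_hours(500_000, 5)  # ~27.8 hours
realistic = estimated_runtime_hours(50_000, 5)     # ~2.8 hours
```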
Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
Oh yeah - what happens with method "b" if the host internet connection is not continuously online? There is NO way within CMS to "buffer" work (unless you COMPLETELY rewrite BOINC), so the host would connect, get a result from CMS with what, one "event" in it? Then disconnect and sit there for 24 hours, doing only that one event, before connecting again to upload it. Meanwhile a lot of work from other projects goes undone. How the heck would you pay credits for that??? |
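For readers trying to picture the two models being debated, here is a deliberately simplified sketch of the pull-based "event server" loop (all names are hypothetical; this is not actual CMS or BOINC code). Note that the slot stays occupied whether or not the server has events to hand out, which is exactly the inefficiency described above:

```python
import time

def event_server_loop(fetch_event, process, walltime=24 * 3600, poll_interval=60):
    """Pull model: keep asking the server for single events until the
    wall-clock limit expires. When the server has no work, the job
    idles but still holds a CPU slot on the volunteer host."""
    start = time.monotonic()
    done = 0
    while time.monotonic() - start < walltime:
        event = fetch_event()          # ask the project for one event
        if event is None:              # "no work available"
            time.sleep(poll_interval)  # idle, but still occupying the slot
            continue
        process(event)
        done += 1
    return done
```

Under the volunteer's preferred fix, the loop would instead exit as soon as the last event of a fixed batch is processed, freeing the slot immediately rather than waiting out the 24 hours.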
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
There is a very large number of initial failures (approx. 33%), now listed as "pending" rather than waiting for retry. The first attempt in Dashboard shows "60311 Stage Out Failure in Prod Agent". Is this a real problem or a Dashboard quirk? If it is real, I think it needs to be looked at. Generally, the failure rate seems quite low in the last batch, but increasing slightly. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
There is a very large number of initial failures (approx. 33%), now listed as "pending" rather than waiting for retry. It's real, I'm afraid, but it's not an area I can diagnose or correct myself. Some solutions were applied last week but had little effect. It _might_ be that I've been too ambitious with the jobs lately and their result files are too big (~150 MB) for household connections. I'll drop the size in the next batch. The saving grace is that Condor resubmits most jobs up to two more times, so the perceived failure rate will end up looking more like 3%. I'll tickle CERN today in case anyone's still around to have another look at it. |
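A quick sanity check of that arithmetic, assuming (my assumption, not stated in the post) that failures are independent across attempts: a job only counts as lost if the original submission and both Condor resubmits all fail.

```python
p_fail = 0.33  # observed per-attempt failure rate
attempts = 3   # original submission + up to two Condor resubmits

# A job is lost only if every attempt fails (independence assumed)
perceived_failure = p_fail ** attempts
print(f"{perceived_failure:.1%}")  # -> 3.6%
```

Which indeed lands at roughly the "more like 3%" figure quoted above.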
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
There seems to be a "runaway" computer. It produced 10 fails in a row with code 8022 and keeps going. |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 101 |
There seems to be a "runaway" computer. As far as I can tell from Dashboard, so far there are 3 IPs altogether, two with one failure each and one with 11. Ivan, could you use your magical powers to find out which hostID finished job 2388? Thanks. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
There seems to be a "runaway" computer. Yes, found him. Seems to have started around 0700 GMT today -- I thought there was an upward tick on failures in the jobs chart. I'll check the full logs and see what I need to tell him. Looks like corruption in the VM:

== CMSSW: R__unzip: error -3 in inflate (zlib)
== CMSSW: ----- Begin Fatal Exception 04-Jan-2016 00:36:13 KST-----------------------
== CMSSW: An exception of category 'FatalRootError' occurred while
== CMSSW: [0] Processing run: 1 lumi: 8164 event: 816382
== CMSSW: [1] Running path 'simulation_step'
== CMSSW: [2] Calling event method for module OscarProducer/'g4SimHits'
== CMSSW: Additional Info:
== CMSSW: [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
== CMSSW: fNbytes = 7218511, fKeylen = 109, fObjlen = 10166624, noutot = 0, nout=0, nin=7218402, nbuf=10166624
== CMSSW:
== CMSSW: ----- End Fatal Exception -------------------------------------------------

I'll send a PM, but given his time-zone he might not reply for a few hours. |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 101 |
There seems to be a "runaway" computer. One of the IPs is mine, although the timestamp is a "bit" (like maybe 10 mins after they should shut down) wrong, I'd like to know what host it was. Thanks. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
There seems to be a "runaway" computer. I can only find one job on Dashboard for 2388, and only one log at RAL for it. The timestamps match between them, but the IP on Dashboard is yours, not the Korean machine that actually ran it. You have 24 recorded completed jobs in the current batch, all with exit status 0. |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 101 |
OK. Thanks Ivan, looks like Dashboard strikes again... |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
There seems to be another "runaway" developing. This time code 8001, 4 fails in a row. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
There seems to be another "runaway" developing. Right, and a lot of 137s, which seem to have been resubmitted. Looks like another corrupt VM, PM on the way... |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,545 RAC: 1,472 |
I saw today twice that a cmsRun started and after a few minutes a new cmsRun started with a new jobNumber, with no reason apparent to me. Example: job 3212 was killed after running this:

Beginning CMSSW wrapper script
slc6_amd64_gcc472 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Untarring /home/boinc/CMSRun/glide_GDBaae/execute/dir_10968/sandbox.tar.gz
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py
Set Driver verbosity to -2
New QGSP_FTFP_BERT physics list, replaces LEP with FTF/P for p/n/pi (/K?)
Thresholds: 1) between BERT and FTF/P over the interval 6 to 8 GeV.
2) between FTF/P and QGS/P over the interval 12 to 25 GeV.
-- quasiElastic was asked to be 1 Changed to 1 for QGS and to 0 (must be false) for FTF

From the StarterLog:

01/04/16 11:03:52 (pid:10968) Using wrapper /home/boinc/CMSRun/glide_GDBaae/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_GDBaae/execute/dir_10968/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=3212 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_3212.txt --runAndLumis=job_lumis_3212.json --lheInputFiles=False --firstEvent=963301 --firstLumi=9634 --lastEvent=963601 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
01/04/16 11:03:52 (pid:10968) Running job as user (null)
01/04/16 11:03:52 (pid:10968) Create_Process succeeded, pid=10977
01/04/16 11:04:02 (pid:10968) Got SIGTERM. Performing graceful shutdown.
01/04/16 11:04:02 (pid:10968) ShutdownGraceful all jobs.
01/04/16 11:06:02 (pid:10968) ShutdownFast all jobs.
01/04/16 11:06:02 (pid:10968) Process exited, pid=10977, signal=15
01/04/16 11:06:02 (pid:10968) Last process exited, now Starter is exiting
01/04/16 11:06:02 (pid:10968) **** condor_starter (condor_STARTER) pid 10968 EXITING WITH STATUS 0
01/04/16 11:06:03 (pid:11361) ******************************************************
01/04/16 11:06:03 (pid:11361) ** condor_starter (CONDOR_STARTER) STARTING UP

Edit: To add, from _condor_stdout:

./CMSRunAnalysis.sh: line 159: 11054 Killed python CMSRunAnalysis.py -r "`pwd`" "$@"
+ jobrc=137
+ set +x
== The job had an exit code of 137 |
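A note on the 137, in case it helps others reading these logs: it follows the usual Unix shell convention of 128 plus the signal number, so the Python payload was killed by signal 9 (SIGKILL), presumably after Condor's SIGTERM grace period ran out. A small decoding sketch (the helper name is mine, not from any CMS script):

```python
import signal

def describe_exit(rc):
    """Decode a shell-style exit code: values above 128 mean the
    process was terminated by signal (rc - 128)."""
    if rc > 128:
        return f"killed by {signal.Signals(rc - 128).name}"
    return f"exited normally with status {rc}"

# 137 = 128 + 9: the job was SIGKILLed (ShutdownFast after the grace period)
print(describe_exit(137))  # -> killed by SIGKILL
print(describe_exit(143))  # -> killed by SIGTERM
```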
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I had the same happening to me a number of times in the past. I only run one computer on my internet connection, so there is no conflict with other devices. Jobs 1651 and 1708 from the current batch. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
I had the same happening to me a number of times in the past. In both those cases, and CP's case, the original job returned an empty log file, so I haven't got much of a handle on it. They've all been requeued to run again. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Next time it happens, I will log the log (funny?). Which one do you need? |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
Next time it happens, i will log the log(funny?) The cmsRun-stdout in the first instance, I guess. |
©2024 CERN