Message boards :
Number crunching :
Expect errors eventually
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
Point taken, and I know this is a priority for the development team. I just don't know how far off it is from being implemented, sorry. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
I'm getting an error submitting a new batch; please bear with me... [Edit] Fixed now. Some CRAB parameters changed names in the latest release... [/Edit] |
Send message Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0 |
... I don't expect it to be implemented anytime soon, perhaps not even before I retire. :-( Which might be April 2017. Hope you'll still be able to read that small print when the day comes… I can (just!) and my day came in 2002.... Ben |
Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
There was some discussion yesterday on whether CMS should implement a different model, an "event server" rather than a "job server", i.e. the job starts and then requests events as it finishes the previous lot, rather than the current model where each job starts, processes a given number of events and then stops. I mention this for completeness; I don't expect it to be implemented anytime soon, perhaps not even before I retire. :-( Which might be April 2017.

I have been trying to watch CPU utilization, and the current approach bothers me. It SOUNDS like (sort of verified by the data I'm seeing, with some odd anomalies) the current method is to launch the result on our host and either a) do "n" events, then sit there and do nothing for the rest of the 24 hours, taking up an unnecessary "slot" on our host, or b) work for 24 hours then terminate even though there are "n-x" events still to be processed (losing the one that was in process as well).

The "different" model sounds like the plan is to completely replicate BOINC on a micro-scale, where the "result" acts like BOINC Manager, requesting work from the project, and either a) getting and processing an "event" (micro-result) or b) getting "no work available" and sitting there, again taking up an unnecessary slot on our host, until the 24 hours is up or more work becomes available. And if work becomes available at 23:59, well, oops! Lost event. Yes, if work was constantly available, no problem (except on calculating credits?), but if not, big problem.

BOTH approaches seem not only to be unnecessarily complex on YOUR side, but to have a built-in inefficiency when it comes to utilization of our volunteer computing time. From a volunteer standpoint, I still strongly maintain that the CORRECT way to do this is to stick with the current model, but terminate the result when the last event (whether it be #50 or #500) has been processed!!!
This is how EVERY other project I am familiar with works (I don't know about LHC because I don't run it).

BOTH CMS approaches, as implemented or described, also CREATE a huge difficulty in awarding the proper number of credits, at least if you follow the "rules" and award credit based on Cobblestones (actual event processing). Namely, how do you "pay" for the time you have occupied my host without actually processing anything? If you pay nothing (i.e. fixed credit per result, whether based on the number of events - method a - or not - method b, since you don't know how many events if work wasn't available at some point), bye-bye volunteers. If instead you pay as if the entire 24 hours were actually "used", then you're guilty of credit inflation, and BOINC admin will have a problem with that. (See WCG and the fact that there, you can earn more "badges" by running the _slowest_ computers you can find. At least there it doesn't affect credit, only badges, so BOINC doesn't care. The super-crunchers do, though.)

You also get into the "how the heck do we describe what we're doing so new crunchers know what to expect" issue, since you'll be so much different from other projects. Surprises = more lost volunteers. Quake-Catcher is the only project I know that just pays "slot rent", but their results take almost _0_ CPU time (watching a sensor), so other projects are not prevented from running. (They don't really occupy a CPU slot at all.) They also give very little credit, only have a few thousand volunteers, most of whom are inactive, and are closing down.

The current CMS approach also causes problems with "estimated time" - my little Linux box THINKS it's going to take 30+ hours per CMS result, which means it gets less work from other projects to compensate, then has to "catch up" once the CMS result's time falls to more like reality.
BOINC uses "estimated Cobblestones" to calculate this time, and CMS obviously doesn't have a clue how many Cobblestones will be done by each result beforehand with EITHER of the above approaches! Knowing how many events were "in" a result at the time of sending it to us would let you calculate this with pretty good accuracy. |
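For readers unfamiliar with the mechanism being described: BOINC predicts a task's runtime roughly as the work unit's estimated floating-point operations divided by the host's benchmarked speed, so a server that doesn't know how many events a job contains can't set a sensible estimate. A toy illustration of the "30+ hours" effect (the function name and numbers are mine, not BOINC's actual internals):

```python
def estimated_runtime_hours(est_gflop, host_gflops):
    """BOINC-style runtime estimate: claimed work divided by host speed."""
    return est_gflop / host_gflops / 3600.0

# If the server guesses 500,000 GFLOP for a job that really needs 50,000,
# a 5 GFLOPS host predicts ~27.8 h instead of ~2.8 h, and the scheduler
# withholds work from other projects to compensate.
pessimistic = estimated_runtime_hours(500_000, 5)  # ~27.8 hours
realistic = estimated_runtime_hours(50_000, 5)     # ~2.8 hours
```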
Send message Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0 |
Oh yeah - what happens with method "b" if the host internet connection is not continuously online? There is NO way within CMS to "buffer" work (unless you COMPLETELY rewrite BOINC), so the host would connect, get a result from CMS with what, one "event" in it? Then disconnect and sit there for 24 hours, doing only that one event, before connecting again to upload it. Meanwhile a lot of work from other projects goes undone. How the heck would you pay credits for that??? |
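For readers trying to picture the two models being debated, here is a deliberately simplified sketch of the pull-based "event server" loop (all names are hypothetical; this is not actual CMS or BOINC code). Note that the slot stays occupied whether or not the server has events to hand out, which is exactly the inefficiency described above:

```python
import time

def event_server_loop(fetch_event, process, walltime=24 * 3600, poll_interval=60):
    """Pull model: keep asking the server for single events until the
    wall-clock limit expires. When the server has no work, the job
    idles but still holds a CPU slot on the volunteer host."""
    start = time.monotonic()
    done = 0
    while time.monotonic() - start < walltime:
        event = fetch_event()          # ask the project for one event
        if event is None:              # "no work available"
            time.sleep(poll_interval)  # idle, but still occupying the slot
            continue
        process(event)
        done += 1
    return done
```

Under the volunteer's preferred fix, the loop would instead exit as soon as the last event of a fixed batch is processed, freeing the slot immediately rather than waiting out the 24 hours.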
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
There is a very large number of initial failures (approx. 33%), now listed as "pending" rather than waiting for retry. The first attempt in Dashboard shows "60311 Stage Out Failure in Prod Agent". Is this a real problem or a Dashboard quirk? If it is real, I think it needs to be looked at. Generally, the failure rate seems quite low in the last batch, but increasing slightly. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
There is a very large number of initial failures (approx. 33%), now listed as "pending" rather than waiting for retry. It's real, I'm afraid, but it's not an area I can diagnose or correct myself. Some solutions were applied last week but had little effect. It _might_ be that I've been too ambitious with the jobs lately and their result files are too big (~150 MB) for household connections. I'll drop the size in the next batch. The saving grace is that Condor resubmits most jobs up to two more times, so the perceived failure rate will end up looking more like 3%. I'll tickle CERN today in case anyone's still around to have another look at it. |
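A quick sanity check of that arithmetic, assuming (my assumption, not stated in the post) that failures are independent across attempts: a job only counts as lost if the original submission and both Condor resubmits all fail.

```python
p_fail = 0.33  # observed per-attempt failure rate
attempts = 3   # original submission + up to two Condor resubmits

# A job is lost only if every attempt fails (independence assumed)
perceived_failure = p_fail ** attempts
print(f"{perceived_failure:.1%}")  # -> 3.6%
```

Which indeed lands at roughly the "more like 3%" figure quoted above.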
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
There seems to be a "runaway" computer. It produced 10 fails in a row with code 8022 and keeps going. |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 101 |
There seems to be a "runaway" computer. As far as I can tell from Dashboard, so far there are 3 IPs altogether, two with one failure each and one with 11. Ivan, could you use your magical powers to find out which hostID finished job 2388? Thanks. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
There seems to be a "runaway" computer. Yes, found him. Seems to have started around 0700 GMT today -- I thought there was an upward tick on failures in the jobs chart. I'll check the full logs and see what I need to tell him. Looks like corruption in the VM:

== CMSSW: R__unzip: error -3 in inflate (zlib)
== CMSSW: ----- Begin Fatal Exception 04-Jan-2016 00:36:13 KST-----------------------
== CMSSW: An exception of category 'FatalRootError' occurred while
== CMSSW: [0] Processing run: 1 lumi: 8164 event: 816382
== CMSSW: [1] Running path 'simulation_step'
== CMSSW: [2] Calling event method for module OscarProducer/'g4SimHits'
== CMSSW: Additional Info:
== CMSSW: [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
== CMSSW: fNbytes = 7218511, fKeylen = 109, fObjlen = 10166624, noutot = 0, nout=0, nin=7218402, nbuf=10166624
== CMSSW:
== CMSSW: ----- End Fatal Exception -------------------------------------------------

I'll send a PM, but given his time-zone he might not reply for a few hours. |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 101 |
There seems to be a "runaway" computer. One of the IPs is mine, although the timestamp is a "bit" (like maybe 10 mins after they should shut down) wrong, I'd like to know what host it was. Thanks. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
There seems to be a "runaway" computer. I can only find one job on Dashboard for 2388, and only one log at RAL for it. The timestamps match between them, but the IP on Dashboard is yours, not the Korean machine that actually ran it. You have 24 recorded completed jobs in the current batch, all with exit status 0. |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 101 |
OK. Thanks Ivan, looks like Dashboard strikes again... |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
There seems to be another "runaway" developing. This time code 8001, 4 fails in a row. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
There seems to be another "runaway" developing. Right, and a lot of 137s, which seem to have been resubmitted. Looks like another corrupt VM, PM on the way... |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,545 RAC: 1,472 |
I saw today twice that a cmsRun started and after a few minutes a new cmsRun started with a new jobNumber, with no reason apparent to me. Example: job 3212 was killed after running this:

Beginning CMSSW wrapper script
slc6_amd64_gcc472 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Untarring /home/boinc/CMSRun/glide_GDBaae/execute/dir_10968/sandbox.tar.gz
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py
Set Driver verbosity to -2
New QGSP_FTFP_BERT physics list, replaces LEP with FTF/P for p/n/pi (/K?)
Thresholds: 1) between BERT and FTF/P over the interval 6 to 8 GeV.
2) between FTF/P and QGS/P over the interval 12 to 25 GeV.
-- quasiElastic was asked to be 1 Changed to 1 for QGS and to 0 (must be false) for FTF

From the StarterLog:

01/04/16 11:03:52 (pid:10968) Using wrapper /home/boinc/CMSRun/glide_GDBaae/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_GDBaae/execute/dir_10968/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=3212 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_3212.txt --runAndLumis=job_lumis_3212.json --lheInputFiles=False --firstEvent=963301 --firstLumi=9634 --lastEvent=963601 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
01/04/16 11:03:52 (pid:10968) Running job as user (null)
01/04/16 11:03:52 (pid:10968) Create_Process succeeded, pid=10977
01/04/16 11:04:02 (pid:10968) Got SIGTERM. Performing graceful shutdown.
01/04/16 11:04:02 (pid:10968) ShutdownGraceful all jobs.
01/04/16 11:06:02 (pid:10968) ShutdownFast all jobs.
01/04/16 11:06:02 (pid:10968) Process exited, pid=10977, signal=15
01/04/16 11:06:02 (pid:10968) Last process exited, now Starter is exiting
01/04/16 11:06:02 (pid:10968) **** condor_starter (condor_STARTER) pid 10968 EXITING WITH STATUS 0
01/04/16 11:06:03 (pid:11361) ******************************************************
01/04/16 11:06:03 (pid:11361) ** condor_starter (CONDOR_STARTER) STARTING UP

Edit: To add, from _condor_stdout:

./CMSRunAnalysis.sh: line 159: 11054 Killed python CMSRunAnalysis.py -r "`pwd`" "$@"
+ jobrc=137
+ set +x
== The job had an exit code of 137 |
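A note on the 137, in case it helps others reading these logs: it follows the usual Unix shell convention of 128 plus the signal number, so the Python payload was killed by signal 9 (SIGKILL), presumably after Condor's SIGTERM grace period ran out. A small decoding sketch (the helper name is mine, not from any CMS script):

```python
import signal

def describe_exit(rc):
    """Decode a shell-style exit code: values above 128 mean the
    process was terminated by signal (rc - 128)."""
    if rc > 128:
        return f"killed by {signal.Signals(rc - 128).name}"
    return f"exited normally with status {rc}"

# 137 = 128 + 9: the job was SIGKILLed (ShutdownFast after the grace period)
print(describe_exit(137))  # -> killed by SIGKILL
print(describe_exit(143))  # -> killed by SIGTERM
```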
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I had the same happening to me a number of times in the past. I only run one computer on my internet connection, so there is no conflict with other devices. Jobs 1651 and 1708 from the current batch. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
I had the same happening to me a number of times in the past. In both those cases, and CP's case, the original job returned an empty log file, so I haven't got much of a handle on it. They've all been requeued to run again. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Next time it happens, I will log the log (funny?). Which one do you need? |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,934,535 RAC: 3,078 |
Next time it happens, i will log the log(funny?) The cmsRun-stdout in the first instance, I guess. |
©2024 CERN