Message boards : Number crunching : Expect errors eventually

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,934,535
RAC: 3,078
Message 1502 - Posted: 4 Dec 2015, 9:15:32 UTC - in response to Message 1501.  

Point taken, and I know this is a priority for the development team. I just don't know how far off it is from being implemented, sorry.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,934,535
RAC: 3,078
Message 1503 - Posted: 4 Dec 2015, 14:13:26 UTC
Last modified: 4 Dec 2015, 15:06:15 UTC

I'm getting an error submitting a new batch; please bear with me...

[Edit] Fixed now. Some CRAB parameters changed names in the latest release... [/Edit]

Ben Segal
Volunteer moderator
Volunteer developer
Volunteer tester
Joined: 12 Sep 14
Posts: 65
Credit: 544
RAC: 0
Message 1504 - Posted: 4 Dec 2015, 15:22:29 UTC - in response to Message 1500.  
Last modified: 4 Dec 2015, 15:22:55 UTC

... I don't expect it to be implemented anytime soon, perhaps not even before I retire. :-( Which might be April 2017.

Hope you'll still be able to read that small print when the day comes… I can (just!) and my day came in 2002....

Ben

Tern
Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1505 - Posted: 4 Dec 2015, 17:28:27 UTC - in response to Message 1500.  

There was some discussion yesterday on whether CMS should implement a different model, an "event server" rather than a "job server", i.e. the job starts and then requests events as it finishes the previous lot rather than the current model where each job starts, processes a given number of events and then stops. I mention this for completeness, I don't expect it to be implemented anytime soon, perhaps not even before I retire. :-( Which might be April 2017.


I have been trying to watch CPU utilization, and the current approach bothers me... SOUNDS like (sort of verified by the data I'm seeing, with some odd anomalies) the current method is to launch the result on our host and either a) do "n" events, then sit there and do nothing for the rest of the 24 hours, taking up an unnecessary "slot" on our host, or b) work for 24 hours then terminate even though there are "n-x" events still to be processed (losing the one that was in process as well).

The "different" model sounds like the plan is to completely replicate BOINC on a micro-scale, where the "result" acts like BOINC Manager, requesting work from the project, and either a) getting and processing an "event" (micro-result) or b) getting "no work available" and sitting there, again taking up an unnecessary slot on our host, until the 24 hours is up or more work becomes available. And if work becomes available at 23:59, well, oops! Lost event. Yes, if work was constantly available, no problem (except on calculating credits?) but if not, big problem.

BOTH approaches seem not only to be unnecessarily complex on YOUR side, but to have a built-in inefficiency when it comes to utilization of our volunteer computing time. From a volunteer standpoint, I still strongly maintain that the CORRECT way to do this is to stick with the current model, but terminate the result when the last event (whether it be #50 or #500) has been processed!!! This is how EVERY other project I am familiar with works (don't know about LHC because I don't run it).

BOTH CMS approaches, as implemented or described, also CREATE a huge difficulty in awarding the proper number of credits, at least if you follow the "rules" and award credit based on Cobblestones (actual event processing). Namely, how do you "pay" for the time you have occupied my host without actually processing anything? If you pay nothing (i.e., fixed credit per result, whether based on the number of events - method a - or not - method b, since you don't know how many events if work wasn't available at some point), bye-bye volunteers. Unless you pay as if the entire 24 hours were actually "used", and then you're guilty of credit inflation, and BOINC admin will have a problem with that. (See WCG and the fact that there, you can earn more "badges" by running the _slowest_ computers you can find. At least there it doesn't affect credit, only badges, so BOINC doesn't care. The super-crunchers do though.)

You also get into the "how the heck do we describe what we're doing so new crunchers know what to expect" issue, since you'll be so much different than other projects. Surprises = more lost volunteers.

Quake-Catcher is the only project I know that just pays "slot rent", but their results take almost _0_ CPU time (watching a sensor) so other projects are not prevented from running. (They don't really occupy a CPU slot at all.) They also give very little credit and only have a few thousand volunteers, most of whom are inactive. They also are closing down.

The current CMS approach also causes problems with "estimated time" - my little Linux box THINKS it's going to take 30+ hours per CMS result, which means it gets less work from other projects to compensate, then has to "catch up" once the CMS result's time falls to more like reality. BOINC uses "estimated Cobblestones" to calculate this time, and CMS obviously doesn't have a clue how many Cobblestones will be done by each result beforehand with EITHER of the above approaches! Knowing how many events were "in" a result at time of sending it to us would let you calculate this with pretty good efficiency.
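To put rough numbers on the slot-occupancy and estimated-time points above, here is a minimal sketch. Every figure in it (50 events per result, 600 seconds per event) is invented for illustration, and the function names are hypothetical; nothing here is taken from CMS or BOINC code.

```python
SLOT_SECONDS = 24 * 3600  # the fixed 24-hour run described in the thread


def slot_utilization(events, seconds_per_event, stop_when_done=False):
    """Fraction of the occupied slot time spent actually processing events.

    stop_when_done=False models a result that holds its slot for the
    full 24 hours; True models one that ends after its last event.
    """
    busy = min(events * seconds_per_event, SLOT_SECONDS)
    occupied = busy if stop_when_done else SLOT_SECONDS
    if occupied == 0:
        return 0.0  # nothing processed and no time occupied
    return busy / occupied


def estimated_runtime_hours(events, seconds_per_event):
    """Runtime estimate available up front if the event count is known."""
    return events * seconds_per_event / 3600


# Hypothetical result: 50 events at about 10 minutes each.
print(f"fixed 24 h slot:    {slot_utilization(50, 600):.0%} busy")
print(f"stop at last event: {slot_utilization(50, 600, stop_when_done=True):.0%} busy")
print(f"estimated runtime:  {estimated_runtime_hours(50, 600):.1f} h")
```

With these made-up inputs the host would be busy for only about a third of a fixed 24-hour slot, which is the inefficiency being argued about; the real ratio depends entirely on actual event counts and per-event times.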

Tern
Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1506 - Posted: 4 Dec 2015, 17:33:49 UTC

Oh yeah - what happens with method "b" if the host internet connection is not continuously online? There is NO way within CMS to "buffer" work (unless you COMPLETELY rewrite BOINC), so the host would connect, get a result from CMS with what, one "event" in it? Then disconnect and sit there for 24 hours, doing only that one event, before connecting again to upload it. Meanwhile having a lot of work from other projects going undone. How the heck would you pay credits for that???

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1511 - Posted: 21 Dec 2015, 18:18:12 UTC
Last modified: 21 Dec 2015, 18:49:56 UTC

There is a very large number of initial failures (approx. 33%), now listed as "pending" rather than waiting for retry.
The first attempt in the dashboard shows: "60311 Stage Out Failure in Prod Agent".

Is this a real problem or a dashboard quirk?

If it is real, I think it needs to be looked at.

Generally, the failure rate seems quite low in the last batch, but it is increasing slightly.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,934,535
RAC: 3,078
Message 1512 - Posted: 22 Dec 2015, 9:19:47 UTC - in response to Message 1511.  

It's real, I'm afraid, but it's not an area I can diagnose or correct myself. Some solutions were applied last week but had little effect. It _might_ be that I've been too ambitious with the jobs lately and their result files are too big (~150 MB) for household connections. I'll drop the size in the next batch. The saving grace is that Condor resubmits most jobs up to two more times so the perceived failure rate will end up looking more like 3%.
I'll tickle CERN today in case anyone's still around to have another look at it.
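The retry arithmetic can be checked with a short sketch. It assumes each attempt fails independently with the same probability, which the thread does not confirm (a host on a slow connection may well fail every time), and the function name is made up for illustration.

```python
def residual_failure_rate(first_try_failure: float, attempts: int) -> float:
    """Chance that a job fails on every one of `attempts` tries.

    Assumption (not from the thread): attempts fail independently and
    with the same probability each time.
    """
    if not 0.0 <= first_try_failure <= 1.0:
        raise ValueError("failure rate must be between 0 and 1")
    if attempts < 1:
        raise ValueError("need at least one attempt")
    return first_try_failure ** attempts


# About 33% first-try failures; one original run plus two resubmissions.
rate = residual_failure_rate(0.33, attempts=3)
print(f"{rate:.1%}")  # prints 3.6%
```

Under that independence assumption, a 33% first-try failure rate and three attempts in total leave roughly 3.6% of jobs failing, in line with the "more like 3%" figure.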

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1542 - Posted: 3 Jan 2016, 10:41:56 UTC

There seems to be a "runaway" computer.
It produced 10 fails in a row with code 8022 and keeps going.
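Spotting a "runaway" like this amounts to counting consecutive failures per host. A minimal sketch follows, assuming job results are available as chronological (host, exit code) pairs; the host labels, the threshold of 4 and the function name are all hypothetical, and this is not how Dashboard or the project actually detects such hosts.

```python
from collections import defaultdict


def find_runaway_hosts(results, threshold=4):
    """Return {host: longest failure streak} for hosts at or above threshold.

    `results` is an iterable of (host_id, exit_code) pairs in
    chronological order; exit code 0 counts as success.
    """
    streak = defaultdict(int)
    flagged = {}
    for host, exit_code in results:
        if exit_code == 0:
            streak[host] = 0  # a success breaks the streak
            continue
        streak[host] += 1
        if streak[host] >= threshold:
            flagged[host] = max(flagged.get(host, 0), streak[host])
    return flagged


# Invented sample: host "A" fails ten times in a row with code 8022,
# host "B" has a single failure between two successes.
sample = [("A", 8022)] * 10 + [("B", 0), ("B", 8022), ("B", 0)]
print(find_runaway_hosts(sample))
```

A host that fails once between successes is not flagged, so an occasional bad job does not trigger a false alarm.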

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 101
Message 1543 - Posted: 3 Jan 2016, 12:18:56 UTC - in response to Message 1542.  
Last modified: 3 Jan 2016, 12:25:57 UTC

As far as I can tell from Dashboard, so far there are 3 IPs altogether,
two with one failure each and one with 11.
Ivan, could you use your magical powers to find out which hostID finished
job 2388? Thanks.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,934,535
RAC: 3,078
Message 1545 - Posted: 3 Jan 2016, 15:55:21 UTC - in response to Message 1543.  
Last modified: 3 Jan 2016, 16:05:44 UTC

Yes, found him. Seems to have started around 0700 GMT today -- I thought there was an upward tick on failures in the jobs chart. I'll check the full logs and see what I need to tell him.
Looks like corruption in the VM:
== CMSSW: R__unzip: error -3 in inflate (zlib)
== CMSSW: ----- Begin Fatal Exception 04-Jan-2016 00:36:13 KST-----------------------
== CMSSW: An exception of category 'FatalRootError' occurred while
== CMSSW: [0] Processing run: 1 lumi: 8164 event: 816382
== CMSSW: [1] Running path 'simulation_step'
== CMSSW: [2] Calling event method for module OscarProducer/'g4SimHits'
== CMSSW: Additional Info:
== CMSSW: [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
== CMSSW: fNbytes = 7218511, fKeylen = 109, fObjlen = 10166624, noutot = 0, nout=0, nin=7218402, nbuf=10166624
== CMSSW:
== CMSSW: ----- End Fatal Exception -------------------------------------------------

I'll send a PM, but given his time-zone he might not reply for a few hours.
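For reference, return code -3 from zlib's inflate is Z_DATA_ERROR, meaning the compressed input itself was damaged. The sketch below does not reproduce the ROOT/CMSSW failure; it only shows the same zlib error code surfacing in Python when a compressed stream is altered. The payload is invented.

```python
import zlib

# Stand-in data; the real job was reading a ROOT file.
payload = b"simulated event data " * 1000
compressed = zlib.compress(payload)

# Flip one bit in the last byte, which is part of the stream's checksum.
damaged = compressed[:-1] + bytes([compressed[-1] ^ 0x01])

try:
    zlib.decompress(damaged)
except zlib.error as exc:
    error_text = str(exc)
else:
    error_text = "no error"

print(error_text)
```

A single flipped bit is enough for decompression to fail with error -3, which is why a corrupted file or disk image inside the VM is a plausible reading of the log.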

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 101
Message 1546 - Posted: 3 Jan 2016, 16:18:36 UTC - in response to Message 1545.  

One of the IPs is mine, although the timestamp is a "bit" wrong (like maybe 10 mins
after they should shut down). I'd like to know what host it was. Thanks.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,934,535
RAC: 3,078
Message 1547 - Posted: 3 Jan 2016, 17:37:30 UTC - in response to Message 1546.  

I can only find one job on Dashboard for 2388, and only one log at RAL for it. The timestamps match between them, but the IP on Dashboard is yours, not the Korean machine that actually ran it. You have 24 recorded completed jobs in the current batch, all with exit status 0.

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 101
Message 1548 - Posted: 3 Jan 2016, 18:07:27 UTC - in response to Message 1547.  

OK. Thanks Ivan, looks like Dashboard strikes again...

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1549 - Posted: 3 Jan 2016, 19:42:00 UTC
Last modified: 3 Jan 2016, 20:13:17 UTC

There seems to be another "runaway" developing.
This time code 8001, 4 fails in a row.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,934,535
RAC: 3,078
Message 1550 - Posted: 4 Jan 2016, 11:39:04 UTC - in response to Message 1549.  

Right, and a lot of 137s, which seem to have been resubmitted.
Looks like another corrupt VM, PM on the way...

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1185
Credit: 849,545
RAC: 1,472
Message 1551 - Posted: 4 Jan 2016, 11:41:01 UTC
Last modified: 4 Jan 2016, 12:36:29 UTC

Twice today I saw a cmsRun start and then, after a few minutes, a new cmsRun start with a new jobNumber, for no reason I could see.

For example, job 3212 was killed after running this:

Beginning CMSSW wrapper script
slc6_amd64_gcc472 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Untarring /home/boinc/CMSRun/glide_GDBaae/execute/dir_10968/sandbox.tar.gz
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py
Set Driver verbosity to -2
New QGSP_FTFP_BERT physics list, replaces LEP with FTF/P for p/n/pi (/K?) Thresholds:
1) between BERT and FTF/P over the interval 6 to 8 GeV.
2) between FTF/P and QGS/P over the interval 12 to 25 GeV.
-- quasiElastic was asked to be 1
Changed to 1 for QGS and to 0 (must be false) for FTF


From the StarterLog:

01/04/16 11:03:52 (pid:10968) Using wrapper /home/boinc/CMSRun/glide_GDBaae/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_GDBaae/execute/dir_10968/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=3212 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_3212.txt --runAndLumis=job_lumis_3212.json --lheInputFiles=False --firstEvent=963301 --firstLumi=9634 --lastEvent=963601 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
01/04/16 11:03:52 (pid:10968) Running job as user (null)
01/04/16 11:03:52 (pid:10968) Create_Process succeeded, pid=10977
01/04/16 11:04:02 (pid:10968) Got SIGTERM. Performing graceful shutdown.
01/04/16 11:04:02 (pid:10968) ShutdownGraceful all jobs.
01/04/16 11:06:02 (pid:10968) ShutdownFast all jobs.
01/04/16 11:06:02 (pid:10968) Process exited, pid=10977, signal=15
01/04/16 11:06:02 (pid:10968) Last process exited, now Starter is exiting
01/04/16 11:06:02 (pid:10968) **** condor_starter (condor_STARTER) pid 10968 EXITING WITH STATUS 0
01/04/16 11:06:03 (pid:11361) ******************************************************
01/04/16 11:06:03 (pid:11361) ** condor_starter (CONDOR_STARTER) STARTING UP


Edit: To add from _condor_stdout:

./CMSRunAnalysis.sh: line 159: 11054 Killed python CMSRunAnalysis.py -r "`pwd`" "$@"
+ jobrc=137
+ set +x
== The job had an exit code of 137
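Exit code 137 follows the usual shell convention of 128 plus the signal number, so it corresponds to signal 9 (SIGKILL), which matches the "Killed" line above. The StarterLog shows a SIGTERM first and a ShutdownFast two minutes later, which would be consistent with the fast shutdown force-killing what was left, though the logs do not say so outright. A minimal decoding sketch, with a hypothetical function name and only the two signals that appear in this thread:

```python
# Linux/POSIX signal numbers seen in the logs above.
KNOWN_SIGNALS = {9: "SIGKILL", 15: "SIGTERM"}


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status (128 + N means killed by signal N)."""
    if code > 128:
        signum = code - 128
        name = KNOWN_SIGNALS.get(signum, f"signal {signum}")
        return f"terminated by {name}"
    return f"exited normally with status {code}"


print(describe_exit_code(137))  # terminated by SIGKILL
print(describe_exit_code(143))  # terminated by SIGTERM
print(describe_exit_code(0))    # exited normally with status 0
```

This only decodes the status; it says nothing about which process sent the signal or why.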

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1552 - Posted: 4 Jan 2016, 12:26:32 UTC

I have had the same thing happen to me a number of times in the past.
I only run one computer on my internet connection, so there is no conflict with other devices.
Jobs: 1651 and 1708 from the current batch.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,934,535
RAC: 3,078
Message 1553 - Posted: 4 Jan 2016, 13:55:58 UTC - in response to Message 1552.  
Last modified: 4 Jan 2016, 13:56:15 UTC

In both of those cases, and in CP's case, the original job returned an empty log file, so I haven't got much of a handle on it. They've all been requeued to run again.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1554 - Posted: 4 Jan 2016, 14:12:32 UTC - in response to Message 1553.  

Next time it happens, I will log the log (funny?).
Which one do you need?

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,934,535
RAC: 3,078
Message 1555 - Posted: 4 Jan 2016, 16:08:19 UTC - in response to Message 1554.  

The cmsRun-stdout in the first instance, I guess.


©2024 CERN