Message boards : Number crunching : Expect errors eventually

Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 2036 - Posted: 17 Feb 2016, 11:18:19 UTC - in response to Message 2035.  

I think there is an issue with the glidein at that URL.

Yeah, the description file should be found one directory level lower, in the entry_volunteer folder.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2037 - Posted: 17 Feb 2016, 11:27:44 UTC - in response to Message 2035.  
Last modified: 17 Feb 2016, 11:56:42 UTC

My mistake, I changed the glidein argument for -descript and should have changed -descriptentry.

Should be fixed once CVMFS has updated.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2038 - Posted: 17 Feb 2016, 13:01:56 UTC - in response to Message 2037.  

Should be working again.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 2039 - Posted: 17 Feb 2016, 13:27:06 UTC - in response to Message 2038.  

Should be working again.

Fixed!
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 2041 - Posted: 17 Feb 2016, 14:17:11 UTC - in response to Message 2030.  

We appear to have a problem.
Large numbers of jobs stuck in WNPostProc on server.
Some jobs have 2 WNPostProc running simultaneously.

Oops, my bad -- probably not the JLD. The /home partition filled up with the Condor logs and took me by surprise. We've been going through jobs a leetle bit more quickly just lately. Archived off logs from a couple more completed batches.
Must remember that symptom, we've had it before!
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 2048 - Posted: 17 Feb 2016, 22:09:54 UTC

Not sure what to make of the current wall-time plot. Must be a few old jobs turning up after a long time, as there aren't many failures in the middle graph.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2055 - Posted: 18 Feb 2016, 13:59:02 UTC

Another runaway.
Responsible for almost all 134 exit codes.
Caused 19 of currently 65 fails.

I am not reporting these any more.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2057 - Posted: 18 Feb 2016, 16:29:51 UTC - in response to Message 2055.  

It is on the todo list. Will start working on it once the suspend-resume issue is fixed and I have released boinc v7.6.22 on Fedora. A protection for this should come at the same time as handling the situation with no jobs.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2060 - Posted: 19 Feb 2016, 8:41:37 UTC

I have a job that ran to event 5 and then started a new job (not a new run) and continued.
The job is listed in dashboard as running.
Logs are available, if someone wants them.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2063 - Posted: 19 Feb 2016, 10:57:34 UTC
Last modified: 19 Feb 2016, 11:26:08 UTC

Jobs in WNPostProc state are going through the roof.
I really think you need to do something about that.

edit: The last "finished" reported job was yesterday at 20:45 UTC.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 2068 - Posted: 19 Feb 2016, 23:31:16 UTC - in response to Message 2063.  
Last modified: 19 Feb 2016, 23:32:42 UTC

I have no idea what's going on. My Linux machine is sending _condor_stdout files that are totally fine, but the jobs end up in WNPostProc state and no log files appear on the server. It's not a disk-space problem this time:
[cms005@lcggwms02:~] > df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 9.9G 6.6G 2.9G 70% /
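
For anyone wanting to script the same check as the df call above, here is a minimal sketch using only the Python standard library. The 90% threshold and the function names are assumptions made for illustration, not anything the project actually runs.

```python
import shutil


def usage_percent(path="."):
    """Percentage of the filesystem holding `path` that is in use.

    Computed as used / (used + available), which is roughly how df
    arrives at its Use% column (df rounds up, this does not).
    """
    usage = shutil.disk_usage(path)
    denominator = usage.used + usage.free
    if denominator == 0:
        return 0.0
    return 100.0 * usage.used / denominator


def is_nearly_full(path=".", threshold_percent=90.0):
    """True if the filesystem is at or above the (assumed) threshold."""
    return usage_percent(path) >= threshold_percent


if __name__ == "__main__":
    print(f"{usage_percent('.'):.0f}% used, nearly full: {is_nearly_full('.')}")
```

Run from the directory of interest (for example the one holding the Condor logs), it gives an early warning before a partition fills up.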

I'm having to leave this to Laurence et al. to sort out, it's beyond my ken.

Oh, according to Condor we're almost out of jobs, so I submitted another 10,000.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2069 - Posted: 19 Feb 2016, 23:57:46 UTC - in response to Message 2068.  

Thanks, Ivan.
As vLHC does not put any jobs on, we should not need them soon (still 6000 to go).
I am sick and tired of reminding them, so s$%&* them. Nobody cares.

Maybe this project could allow 2 or 3 tasks or, better, enable multi-core operation.
I will not start another task until I can verify that the jobs are validated (for all I know, they could all be failing).
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 2095 - Posted: 26 Feb 2016, 14:37:53 UTC
Last modified: 26 Feb 2016, 14:39:50 UTC

Again I got the mysterious kill of a cmsRun (jobNumber 9) only 4 seconds after it started:

02/26/16 12:49:56 (pid:7634) Create_Process succeeded, pid=7640
02/26/16 12:50:00 (pid:7634) Got SIGTERM. Performing graceful shutdown.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2096 - Posted: 26 Feb 2016, 14:41:32 UTC - in response to Message 2095.  

Maybe they just ran out of jobs. There were only 100. All gone.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 2097 - Posted: 26 Feb 2016, 15:18:19 UTC - in response to Message 2096.  

Maybe they just ran out of jobs. There were only 100. All gone.

New ones in the pipeline: 7077 jobs of the batch 160226_150549:ireid_crab_CMS_at_Home_MinBias_250evE and increasing
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 2099 - Posted: 26 Feb 2016, 16:10:00 UTC - in response to Message 2095.  

Again I got the mysterious kill of a cmsRun (jobNumber 9) only 4 seconds after it started:

02/26/16 12:49:56 (pid:7634) Create_Process succeeded, pid=7640
02/26/16 12:50:00 (pid:7634) Got SIGTERM. Performing graceful shutdown.
That job had two retries before it finally returned a log file:
ls -l 160226_104947:ireid_crab_CMS_at_Home_MinBias_250evTest/job_out.9.*txt
-rw-r--r-- 1 cms005 cms005 0 Feb 26 11:01 160226_104947:ireid_crab_CMS_at_Home_MinBias_250evTest/job_out.9.0.txt
-rw-r--r-- 1 cms005 cms005 0 Feb 26 12:01 160226_104947:ireid_crab_CMS_at_Home_MinBias_250evTest/job_out.9.1.txt
-rw-r--r-- 1 cms005 cms005 138693 Feb 26 12:56 160226_104947:ireid_crab_CMS_at_Home_MinBias_250evTest/job_out.9.2.txt

Can't tell much from that.
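
The listing above shows the first two attempts leaving zero-byte files and only the third returning a log. A rough sketch for spotting such empty retry logs, assuming only the job_out.<job>.<retry>.txt naming seen in that listing (the function name is made up for illustration):

```python
from pathlib import Path


def empty_retry_logs(batch_dir, job_number):
    """Return the names of zero-byte job_out.<job>.<retry>.txt files, sorted."""
    pattern = f"job_out.{job_number}.*.txt"
    return sorted(
        p.name
        for p in Path(batch_dir).glob(pattern)
        if p.stat().st_size == 0
    )
```

Called with the batch directory and job number 9, it should name the two empty files from the listing and leave out the 138693-byte one.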
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2100 - Posted: 26 Feb 2016, 18:58:08 UTC

I have one of these.

Now running the CMSRunAnalysis.py job in /home/boinc/CMSRun/glide_CgDUbU/execute/dir_18333...
++ pwd
+ python CMSRunAnalysis.py -r /home/boinc/CMSRun/glide_CgDUbU/execute/dir_18333 -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=53 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_53.txt --runAndLumis=job_lumis_53.json --lheInputFiles=False --firstEvent=13001 --firstLumi=157 --lastEvent=13251 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 '--scriptArgs=[]' -o '{}' --oneEventMode=0
Job Running time in seconds: 3
Job runtime is less than 20minutes. Sleeping 1197
==== CMSRunAnalysis.py STARTING at Fri Feb 26 16:35:01 2016 ====


Why is it still referring to 100 events instead of 250?
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2103 - Posted: 26 Feb 2016, 19:22:46 UTC

In the batch of 100 there are, again, 2 jobs that have been stuck in PostProc for more than 5 hours now.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 2104 - Posted: 26 Feb 2016, 19:25:15 UTC - in response to Message 2100.  

I have one of these.

Now running the CMSRunAnalysis.py job in /home/boinc/CMSRun/glide_CgDUbU/execute/dir_18333...
++ pwd
+ python CMSRunAnalysis.py -r /home/boinc/CMSRun/glide_CgDUbU/execute/dir_18333 -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=53 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_53.txt --runAndLumis=job_lumis_53.json --lheInputFiles=False --firstEvent=13001 --firstLumi=157 --lastEvent=13251 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 '--scriptArgs=[]' -o '{}' --oneEventMode=0
Job Running time in seconds: 3
Job runtime is less than 20minutes. Sleeping 1197
==== CMSRunAnalysis.py STARTING at Fri Feb 26 16:35:01 2016 ====


Why is it still referring to 100 events instead of 250?

That's 100 events per "luminosity section". In the real world the luminosity of the LHC beams dies away roughly exponentially due to various losses. We can't efficiently monitor the luminosity for every bunch-crossing, but we can define a period of time short enough that the luminosity doesn't change significantly over it, and then define (tabulate or parametrise?) the luminosity for each of these "lumis", to be used in working out collision cross-sections. I'm not sure how the period is defined; I think I once saw it given as some power-of-two count of the bunch crossings. Since the calculated Monte-Carlo data have to look as real as possible, they also have "lumi sections". I'm not sure how CRAB (or CMSSW) decided to calculate the events-per-lumi for these batches; I was under the impression that each job had to be an LS or a multiple of one to allow run-merging. In fact the numbers don't seem to add up at first -- the first event of job 53 should be 13001 (52x250+1), which matches, but how do you get a lumi number of 157?
Ah, they must be taking every job as 3 lumis (events 1-100 => 1, 101-200 => 2, and 201-250 => 3), so the 53rd job would be starting at lumi 52x3+1 = 157. I suppose it makes sense to someone...
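
That guess can be checked with a few lines of arithmetic. This is only a sketch of the numbering as inferred in the post: it assumes 250 events per job (from the batch name) and lumi sections of at most 100 events that restart with each job. The function names are invented for illustration and are not CRAB or CMSSW calls.

```python
import math

EVENTS_PER_JOB = 250   # assumption: taken from the "250ev" batch name
EVENTS_PER_LUMI = 100  # from --eventsPerLumi=100 in the job log


def lumis_per_job(events_per_job=EVENTS_PER_JOB, events_per_lumi=EVENTS_PER_LUMI):
    """Number of lumi sections needed to hold one job's events (last one may be short)."""
    return math.ceil(events_per_job / events_per_lumi)


def first_event(job_number, events_per_job=EVENTS_PER_JOB):
    """First event number of a job, with jobs and events both counted from 1."""
    return (job_number - 1) * events_per_job + 1


def first_lumi(job_number):
    """First lumi section of a job, assuming each job starts a fresh lumi."""
    return (job_number - 1) * lumis_per_job() + 1


if __name__ == "__main__":
    # Job 53 from the log: --firstEvent=13001 --firstLumi=157
    print(first_event(53), first_lumi(53))
```

Under those assumptions job 53 comes out at first event 13001 and first lumi 157, the same values as in the logged command line.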
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2107 - Posted: 26 Feb 2016, 19:51:18 UTC - in response to Message 2104.  

Thanks for the explanation.
I just wanted to make sure that it was not an error of some kind.