Message boards : Number crunching : Expect errors eventually
Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466
I think there is an issue with the glidein at that URL. Yeah, the description file should be found one directory level lower, in the entry_volunteer map.
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129
My mistake: I changed the glidein argument for -descript when I should have changed -descriptentry. It should be fixed once CVMFS has updated.
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129
Should be working again.
Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466
> Should be working again.
Fixed!
Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148
> We appear to have a problem.
Oops, my bad; probably not the JLD. The /home partition filled up with the Condor logs and took me by surprise; we've been going through jobs a leetle bit more quickly just lately. I've archived off the logs from a couple more completed batches. Must remember that symptom, we've had it before!
Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148
Not sure what to make of the current wall-time plot. It must be a few old jobs turning up after a long time, as there aren't many failures in the middle graph.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Another runaway, responsible for almost all of the exit code 134s; it has caused 19 of the current 65 failures. I am not reporting these any more.
Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129
It is on the to-do list. I will start working on it once the suspend-resume issue is fixed and I have released BOINC v7.6.22 on Fedora. Protection against this should come at the same time as the handling of the no-jobs situation.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
I have a job that ran to event 5, then started a new job (not a new run) and continued. The job is listed in the dashboard as running. Logs are available if someone wants them.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Jobs in the WNPostProc state are going through the roof. I really think you need to do something about that.
Edit: the last job reported as "finished" was yesterday at 20:45 UTC.
Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148
I have no idea what's going on. My Linux machine is sending _condor_stdout files that are totally fine, but the jobs end up in the WNPostProc state and no log files appear on the server. It's not a disk-space problem this time:

[cms005@lcggwms02:~] > df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 9.9G 6.6G 2.9G 70% /

I'm having to leave this to Laurence et al. to sort out; it's beyond my ken. Oh, and according to Condor we're almost out of jobs, so I submitted another 10,000.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Thanks, Ivan. As vLHC is not putting any jobs on, we should not need them soon (still 6,000 to go). I am sick and tired of reminding them, so s$%&* them; nobody cares. Maybe this project could allow 2 or 3 tasks, or better, enable multi-core operation. I will not start another task until I can verify that the jobs are validated (for all I know, they could all be failing).
Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466
Again I got the mysterious kill of a cmsRun (jobNumber 9) only 4 seconds after it started:

02/26/16 12:49:56 (pid:7634) Create_Process succeeded, pid=7640
02/26/16 12:50:00 (pid:7634) Got SIGTERM. Performing graceful shutdown.
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Maybe they just ran out of jobs. There were only 100. All gone.
Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466
> Maybe they just ran out of jobs. There were only 100. All gone.
New ones are in the pipeline: 7077 jobs in batch 160226_150549:ireid_crab_CMS_at_Home_MinBias_250evE, and increasing.
Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148
> Again I got the mysterious kill of a cmsRun (jobNumber 9) only 4 seconds after it started:
That job had two retries before it finally returned a log file:

ls -l 160226_104947:ireid_crab_CMS_at_Home_MinBias_250evTest/job_out.9.*txt
-rw-r--r-- 1 cms005 cms005 0 Feb 26 11:01 160226_104947:ireid_crab_CMS_at_Home_MinBias_250evTest/job_out.9.0.txt
-rw-r--r-- 1 cms005 cms005 0 Feb 26 12:01 160226_104947:ireid_crab_CMS_at_Home_MinBias_250evTest/job_out.9.1.txt
-rw-r--r-- 1 cms005 cms005 138693 Feb 26 12:56 160226_104947:ireid_crab_CMS_at_Home_MinBias_250evTest/job_out.9.2.txt

Can't tell much from that.
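The listing above follows CRAB's job_out.&lt;jobNumber&gt;.&lt;retry&gt;.txt naming, so the retry history of a job can be read straight off the directory listing. A small sketch of that bookkeeping; the helper name is mine, not a CRAB tool:

```python
import re
from collections import defaultdict

# Filenames as in the ls output above; the first two retries came back empty.
listing = [
    "job_out.9.0.txt",
    "job_out.9.1.txt",
    "job_out.9.2.txt",
]

def retries_per_job(names):
    """Map job number -> sorted retry indices found among the filenames."""
    pattern = re.compile(r"job_out\.(\d+)\.(\d+)\.txt$")
    out = defaultdict(list)
    for name in names:
        m = pattern.search(name)
        if m:
            out[int(m.group(1))].append(int(m.group(2)))
    return {job: sorted(tries) for job, tries in out.items()}

print(retries_per_job(listing))  # {9: [0, 1, 2]} -> job 9 needed two retries
```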
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
I have one of these:

Now running the CMSRunAnalysis.py job in /home/boinc/CMSRun/glide_CgDUbU/execute/dir_18333...
++ pwd
+ python CMSRunAnalysis.py -r /home/boinc/CMSRun/glide_CgDUbU/execute/dir_18333 -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=53 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_53.txt --runAndLumis=job_lumis_53.json --lheInputFiles=False --firstEvent=13001 --firstLumi=157 --lastEvent=13251 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 '--scriptArgs=[]' -o '{}' --oneEventMode=0
Job Running time in seconds: 3
Job runtime is less than 20minutes. Sleeping 1197
==== CMSRunAnalysis.py STARTING at Fri Feb 26 16:35:01 2016 ====

Why is it still referring to 100 events instead of 250?
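The "Sleeping 1197" line in the log above is consistent with the wrapper padding short jobs up to a 20-minute floor: 3 s of runtime plus 1197 s of sleep is exactly 1200 s. A sketch of that padding rule, assuming that is indeed what the wrapper does:

```python
MIN_RUNTIME = 20 * 60  # the wrapper's apparent 20-minute floor, in seconds

def pad_seconds(runtime):
    """Seconds to sleep so that runtime + sleep >= MIN_RUNTIME."""
    return max(0, MIN_RUNTIME - runtime)

print(pad_seconds(3))     # 1197, matching the "Sleeping 1197" log line
print(pad_seconds(1500))  # 0: jobs already longer than 20 minutes are not padded
```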
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
In the batch of 100 there are, again, 2 jobs that have been stuck in PostProc for more than 5 hours now.
Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148
> I have one of these.
That's 100 events per "luminosity section". In the real world the luminosity of the LHC beams dies away roughly exponentially due to various losses. We can't efficiently monitor the luminosity for every bunch-crossing, but we can define a period of time short enough that the luminosity doesn't change significantly over it, and then define (tabulate or parametrise?) the luminosity for each of these "lumis", to be used in working out collision cross-sections. I'm not sure how the period is defined; I think I saw it once as some power-of-two count of the bunch crossings. Since the calculated Monte-Carlo data have to look as real as possible, they also have "lumi sections". I'm not sure how CRAB (or CMSSW) decided to calculate the events-per-lumi for these batches; I was under the impression that each job had to be an LS, or a multiple of one, to allow run-merging. In fact the numbers don't add up at first sight: the first event of job 53 should be 13001, but how do you get a lumi number of 157? Ah, they must be taking every job as 3 lumis (events 1-100 => lumi 1, 101-200 => 2, 201-250 => 3); then the 53rd job would be starting at lumi 52x3+1 = 157. I suppose it makes sense to someone...
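The arithmetic above can be checked in a few lines. The constants come from the batch (250 events per job, 100 events per lumi, as in the --eventsPerLumi=100 argument); the function names are mine, purely for illustration:

```python
EVENTS_PER_JOB = 250
EVENTS_PER_LUMI = 100  # from --eventsPerLumi=100 in the CMSRunAnalysis.py call

def lumis_per_job():
    # 250 events span lumis 1-100, 101-200, 201-250 -> 3 lumi sections per job
    return -(-EVENTS_PER_JOB // EVENTS_PER_LUMI)  # ceiling division

def first_event(job_number):
    # Jobs are numbered from 1, so job 53 starts at event 52*250 + 1 = 13001
    return (job_number - 1) * EVENTS_PER_JOB + 1

def first_lumi(job_number):
    # Each job occupies 3 lumis, so job 53 starts at lumi 52*3 + 1 = 157
    return (job_number - 1) * lumis_per_job() + 1

print(first_event(53), first_lumi(53))  # 13001 157, matching --firstEvent/--firstLumi
```

This reproduces the --firstEvent=13001 --firstLumi=157 pair seen in the job's command line.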
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Thanks for the explanation. I just wanted to make sure that it was not an error of some kind.
©2024 CERN