Expect errors eventually

Author	Message
Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466	Message 2036 - Posted: 17 Feb 2016, 11:18:19 UTC - in response to Message 2035.
I think there is an issue with the glidein at that URL. Yeah, the description file should be found one directory level lower, in the entry_volunteer folder.

Laurence Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129	Message 2037 - Posted: 17 Feb 2016, 11:27:44 UTC - in response to Message 2035. Last modified: 17 Feb 2016, 11:56:42 UTC
My mistake, I changed the glidein argument for -descript and should have changed -descriptentry. Should be fixed once CVMFS has updated.

Laurence Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129	Message 2038 - Posted: 17 Feb 2016, 13:01:56 UTC - in response to Message 2037.
Should be working again.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466	Message 2039 - Posted: 17 Feb 2016, 13:27:06 UTC - in response to Message 2038.
"Should be working again."
Fixed!

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148	Message 2041 - Posted: 17 Feb 2016, 14:17:11 UTC - in response to Message 2030.
We appear to have a problem. Large numbers of jobs stuck in WNPostProc on server. Some jobs have 2 WNPostProc running simultaneously. Oops, my bad -- probably not the JLD. The /home partition filled up with the Condor logs and took me by surprise. We've been going through jobs a leetle bit more quickly just lately. Archived off logs from a couple more completed batches. Must remember that symptom, we've had it before!

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148	Message 2048 - Posted: 17 Feb 2016, 22:09:54 UTC
Not sure what to make of the current wall-time plot. Must be a few old jobs turning up after a long time, as there aren't many failures in the middle graph.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2055 - Posted: 18 Feb 2016, 13:59:02 UTC
Another runaway. Responsible for almost all 134 exit codes. Caused 19 of currently 65 fails. I am not reporting these any more.

Laurence Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129	Message 2057 - Posted: 18 Feb 2016, 16:29:51 UTC - in response to Message 2055.
It is on the todo list. I will start working on it once the suspend-resume issue is fixed and I have released BOINC v7.6.22 on Fedora. A protection for this should come at the same time as handling the situation with no jobs.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2060 - Posted: 19 Feb 2016, 8:41:37 UTC
I have a job that ran to event 5 and then started a new job, not a new run, and continued. The job is listed in the dashboard as running. Logs are available if someone wants them.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2063 - Posted: 19 Feb 2016, 10:57:34 UTC Last modified: 19 Feb 2016, 11:26:08 UTC
Jobs in WNPostProc state are going through the roof. I really think you need to do something about that.
Edit: The last "finished" reported job was yesterday at 20:45 UTC.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148	Message 2068 - Posted: 19 Feb 2016, 23:31:16 UTC - in response to Message 2063. Last modified: 19 Feb 2016, 23:32:42 UTC
I have no idea what's going on. My Linux machine is sending _condor_stdout files that are totally fine, but the jobs end up in WNPostProc state and no log files appear on the server. It's not a disk-space problem this time:
[cms005@lcggwms02:~] > df -h .
Filesystem Size Used Avail Use% Mounted on
/dev/sda2 9.9G 6.6G 2.9G 70% /
I'm having to leave this to Laurence et al. to sort out; it's beyond my ken. Oh, according to Condor we're almost out of jobs, so I submitted another 10,000.
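The df check above can also be scripted. The following is a minimal sketch, assuming Python is available on the server; the function names and the 90% threshold are illustrative choices, not anything used by the project. Note that df computes Use% from used and available space, so its figure can differ slightly from the used/total ratio computed here.

```python
import shutil


def usage_percent(path="."):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path=".", threshold=90.0):
    """True when usage is at or above `threshold` percent (hypothetical limit)."""
    return usage_percent(path) >= threshold


if __name__ == "__main__":
    print(f"{usage_percent('.'):.0f}% used, nearly full: {is_nearly_full('.')}")
```

Run periodically against the partition holding the Condor logs, a check like this would flag the full-partition symptom before jobs start piling up.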

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2069 - Posted: 19 Feb 2016, 23:57:46 UTC - in response to Message 2068.
Thanks, Ivan. As vLHC does not put any jobs on, we should not need them soon (still 6000 to go). I am sick and tired of reminding them, so s$%&* them. Nobody cares. Maybe this project could allow 2 or 3 tasks or, better, enable multi-core operation. I will not start another task until I can verify that the jobs are validated (for all I know, they could all be failing).

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466	Message 2095 - Posted: 26 Feb 2016, 14:37:53 UTC Last modified: 26 Feb 2016, 14:39:50 UTC
Again I got the mysterious kill of a cmsRun (jobNumber 9) only 4 seconds after it started:
02/26/16 12:49:56 (pid:7634) Create_Process succeeded, pid=7640
02/26/16 12:50:00 (pid:7634) Got SIGTERM. Performing graceful shutdown.
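The 4-second gap can be read straight off the two log timestamps. As a minimal sketch, assuming every log line starts with an MM/DD/YY HH:MM:SS stamp as in the two lines quoted above (the helper names are hypothetical):

```python
from datetime import datetime


def parse_stamp(line):
    """Parse the leading 'MM/DD/YY HH:MM:SS' timestamp of a log line."""
    date_part, time_part = line.split()[:2]
    return datetime.strptime(f"{date_part} {time_part}", "%m/%d/%y %H:%M:%S")


def seconds_between(first_line, second_line):
    """Elapsed seconds from the first log line to the second."""
    return (parse_stamp(second_line) - parse_stamp(first_line)).total_seconds()


started = "02/26/16 12:49:56 (pid:7634) Create_Process succeeded, pid=7640"
killed = "02/26/16 12:50:00 (pid:7634) Got SIGTERM. Performing graceful shutdown."
print(seconds_between(started, killed))  # 4.0
```

Scanning a whole log for Create_Process lines followed within a few seconds by SIGTERM would be one way to count how often this happens.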

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2096 - Posted: 26 Feb 2016, 14:41:32 UTC - in response to Message 2095.
Maybe they just ran out of jobs. There were only 100. All gone.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466	Message 2097 - Posted: 26 Feb 2016, 15:18:19 UTC - in response to Message 2096.
"Maybe they just ran out of jobs. There were only 100. All gone."
New ones in the pipeline: 7077 jobs of the batch 160226_150549:ireid_crab_CMS_at_Home_MinBias_250evE and increasing.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148	Message 2099 - Posted: 26 Feb 2016, 16:10:00 UTC - in response to Message 2095.
"Again I got the mysterious kill of a cmsRun (jobNumber 9) only 4 seconds after it started: 02/26/16 12:49:56 (pid:7634) Create_Process succeeded, pid=7640 02/26/16 12:50:00 (pid:7634) Got SIGTERM. Performing graceful shutdown."
That job had two retries before it finally returned a log file:
ls -l 160226_104947:ireid_crab_CMS_at_Home_MinBias_250evTest/job_out.9.*txt
-rw-r--r-- 1 cms005 cms005 0 Feb 26 11:01 160226_104947:ireid_crab_CMS_at_Home_MinBias_250evTest/job_out.9.0.txt
-rw-r--r-- 1 cms005 cms005 0 Feb 26 12:01 160226_104947:ireid_crab_CMS_at_Home_MinBias_250evTest/job_out.9.1.txt
-rw-r--r-- 1 cms005 cms005 138693 Feb 26 12:56 160226_104947:ireid_crab_CMS_at_Home_MinBias_250evTest/job_out.9.2.txt
Can't tell much from that.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2100 - Posted: 26 Feb 2016, 18:58:08 UTC
I have one of these.
Now running the CMSRunAnalysis.py job in /home/boinc/CMSRun/glide_CgDUbU/execute/dir_18333...
++ pwd
+ python CMSRunAnalysis.py -r /home/boinc/CMSRun/glide_CgDUbU/execute/dir_18333 -a sandbox.tar.gz --sourceURL=https://cmsweb.cern.ch/crabcache --jobNumber=53 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_53.txt --runAndLumis=job_lumis_53.json --lheInputFiles=False --firstEvent=13001 --firstLumi=157 --lastEvent=13251 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 '--scriptArgs=[]' -o '{}' --oneEventMode=0
Job Running time in seconds: 3
Job runtime is less than 20minutes. Sleeping 1197
==== CMSRunAnalysis.py STARTING at Fri Feb 26 16:35:01 2016 ====
Why is it still referring to 100 events instead of 250?
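The "Sleeping 1197" figure in this log is consistent with the wrapper padding a short job up to a 20-minute minimum: 20 × 60 − 3 = 1197. A minimal sketch of that arithmetic follows. The 20-minute minimum is inferred from the log message only, and the function name is hypothetical; this is not taken from the CMSRunAnalysis source.

```python
# Minimum job duration in seconds, assumed from the
# "Job runtime is less than 20minutes" log message.
MIN_RUNTIME_SECONDS = 20 * 60


def padding_sleep(runtime_seconds, minimum=MIN_RUNTIME_SECONDS):
    """Seconds to sleep so that a short job lasts at least `minimum` seconds."""
    return max(0, minimum - runtime_seconds)


print(padding_sleep(3))     # 1197
print(padding_sleep(1500))  # 0
```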

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2103 - Posted: 26 Feb 2016, 19:22:46 UTC
In the batch of 100 there are, again, 2 jobs that have been stuck in Postproc for more than 5 hours now.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148	Message 2104 - Posted: 26 Feb 2016, 19:25:15 UTC - in response to Message 2100.
"Why is it still referring to 100events instead of 250?"
That's 100 events per "luminosity section". In the real world the luminosity of the LHC beams dies away roughly exponentially due to various losses. We can't efficiently monitor the luminosity for every bunch-crossing, but we can define a period of time short enough that the luminosity doesn't change significantly over it, and then define (tabulate or parametrise?) the luminosity for each of these "lumis", to be used in working out collision cross-sections. I'm not sure how the period is defined; I think I saw it once as some power-of-two count of the bunch crossings. Since the calculated Monte Carlo data have to look as real as possible, they also have "lumi sections". I'm not sure how CRAB (or CMSSW) decided to calculate the events-per-lumi for these batches; I was under the impression that each job had to be an LS or a multiple of one to allow run-merging.
In fact the numbers don't add up at first sight -- the first event of job 53 should be 13001 (52×250+1), which matches, but how do you get a lumi number of 157? Ah, they must be taking every job as 3 lumis: events 1-100 => 1, 101-200 => 2, and 201-250 => 3. Then the 53rd job would start at lumi 52×3+1 = 157. I suppose it makes sense to someone...
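The arithmetic in this post can be checked with a short sketch. It assumes, as guessed above, that each job's events are split into lumi sections of at most 100 events, with the last section partly filled; the function names are hypothetical and this is not how CRAB is known to compute the values.

```python
import math


def job_first_event(job_number, events_per_job):
    """First event number of a job, counting jobs and events from 1."""
    return (job_number - 1) * events_per_job + 1


def job_first_lumi(job_number, events_per_job, events_per_lumi):
    """First lumi section of a job, if every job takes a whole number of lumis."""
    lumis_per_job = math.ceil(events_per_job / events_per_lumi)  # 250/100 -> 3
    return (job_number - 1) * lumis_per_job + 1


print(job_first_event(53, 250))      # 13001
print(job_first_lumi(53, 250, 100))  # 157
```

Both results match the --firstEvent=13001 and --firstLumi=157 arguments shown in the job log for jobNumber 53.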

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 2107 - Posted: 26 Feb 2016, 19:51:18 UTC - in response to Message 2104.
Thanks for the explanation. I just wanted to make sure that it was not an error of some kind.

Development for LHC@home