Message boards : Number crunching : issue of the day
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 334,882 RAC: 0 |
I think we are out of jobs again. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks. It would be nice to post a message when such things occur. That is not too much to ask, is it? I like to help, but I also do not want to waste my resources for no reason.
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,175,966 RAC: 1,853 |
Sorry, I thought I'd made it clear. We're short of jobs at the moment. I can submit new ones but we're chasing a bug where both the Condor server and the Dashboard reporting service don't get the message that a job has successfully completed and written its output to storage. So each job gets run three times. Which is a waste of your computers. We've made progress on other problems yesterday, but this one is eluding us; I'm running several short jobs from time to time to get logs for Laurence et al., but I don't have the expertise or the access that they have to attack the problem. I could, and possibly may, just run a large batch of jobs and ignore the fact that most jobs are run three times too many -- you guys get the same credit one way or the other, but we do need to get this sorted before we go anywhere near a beta release. |
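For anyone curious what that looks like in practice, here is a minimal sketch of the failure mode (illustrative only, not the project's actual submission code; the retry limit of 2 is an assumption chosen to match the observed three runs per job):

```python
# Illustrative sketch: if the "job completed" report never reaches the Condor
# server / Dashboard, the submit side never marks the node done and keeps
# resubmitting until the retry budget is exhausted.
MAX_RETRIES = 2  # assumed value: 1 original run + 2 retries = 3 runs per job

def runs_needed(completion_report_arrives: bool, max_retries: int = MAX_RETRIES) -> int:
    runs = 0
    for _ in range(max_retries + 1):
        runs += 1                       # the job runs (and actually uploads its output)
        if completion_report_arrives:   # healthy path: the report is seen upstream
            return runs                 # ...and the node is marked done after one run
    return runs                         # report lost: the whole retry budget is burned

print(runs_needed(True))   # 1 -- normal behaviour
print(runs_needed(False))  # 3 -- the bug being chased
```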
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Strange directory usage! First it generated run-1, run-2, etc. (there was no work at that time). Then, at run-33, I got some jobs. At some point during run-33 it decided to generate additional logs in run-1 up to run-30 again. Then it started putting new logs into run-1 again and is currently up to run-7. If there is any logic to this, it escapes me. I hope there is useful info in the results.
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,175,966 RAC: 1,853 |
Yeah, job availability is sporadic at the moment. The generation of new log directories every 15(?) minutes is a consequence of there being no work; looks like a timeout we could probably adjust. Just waiting to hear from the development team on their conclusions from last night's mini-blitz; I'm very tempted to just go for a large-scale test over the weekend (it's a three-day holiday weekend here). |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I understand that there are only a few jobs. If this messy directory structure is not a problem, that's fine with me. I just thought I'd mention it.
Send message Joined: 20 May 15 Posts: 217 Credit: 6,111,068 RAC: 8,460 |
The log does say it is "Resetting run number to rotate logs", so it now goes from 1 to 30.
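For what it's worth, a wrap-around run counter would explain the pattern; a minimal sketch of the assumed behaviour (the limit of 30 and the reset to 1 are inferred from that log message, not read from the actual glidein script):

```python
# Assumed log-rotation scheme: the run number climbs to a fixed limit and then
# resets to 1, so directories run-1 ... run-30 get reused on the next pass.
MAX_RUNS = 30  # assumed rotation limit, matching the observed run-30 -> run-1 reset

def next_run(current: int) -> int:
    """Return the next run number, wrapping back to 1 to rotate logs."""
    return 1 if current >= MAX_RUNS else current + 1

# prints run-28, run-29, run-30, run-1, run-2
run = 28
for _ in range(5):
    print(f"run-{run}")
    run = next_run(run)
```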
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,175,966 RAC: 1,853 |
I just killed a bunch of jobs to submit some new ideas from the team. This will probably show some weirdness in your VMs but as I just said to Ben, "DON'T PANIC!" |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 25 |
I paused the VM just after the start of JobNumber 5143 and resumed it two hours later. A new job (5250) started and 5143 never returned, with no error either. In CMS-Dashboard, 5143 is still in the running state and will probably time out after 24 hours.

cmsRun-stdout.log of job 5143:

Beginning CMSSW wrapper script
slc6_amd64_gcc472 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Untarring /home/boinc/CMSRun/glide_hY3D6I/execute/dir_14417/sandbox.tar.gz
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py

From StarterLog:

10/06/15 15:09:49 (pid:14417) Using wrapper /home/boinc/CMSRun/glide_hY3D6I/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_hY3D6I/execute/dir_14417/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb-testbed.cern.ch/crabcache --jobNumber=5143 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_5143.txt --runAndLumis=job_lumis_5143.json --lheInputFiles=False --firstEvent=128551 --firstLumi=5143 --lastEvent=128576 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
10/06/15 15:09:49 (pid:14417) Running job as user (null)
10/06/15 15:09:49 (pid:14417) Create_Process succeeded, pid=14424
10/06/15 17:15:43 (pid:14777) FILETRANSFER: "/home/boinc/CMSRun/glide_hY3D6I/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/06/15 17:15:43 (pid:14777) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_hY3D6I/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_hY3D6I/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/06/15 17:15:43 (pid:14777) WARNING: Initializing plugins returned: FILETRANSFER:1:"/home/boinc/CMSRun/glide_hY3D6I/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/06/15 17:16:08 (pid:14796) ******************************************************
10/06/15 17:16:08 (pid:14796) ** condor_starter (CONDOR_STARTER) STARTING UP
10/06/15 17:16:08 (pid:14796) ** /home/boinc/CMSRun/glide_hY3D6I/main/condor/sbin/condor_starter
10/06/15 17:16:08 (pid:14796) ** SubsystemInfo: name=STARTER type=STARTER(8) class=DAEMON(1)
10/06/15 17:16:08 (pid:14796) ** Configuration: subsystem:STARTER local:<NONE> class:DAEMON
10/06/15 17:16:08 (pid:14796) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
10/06/15 17:16:08 (pid:14796) ** $CondorPlatform: x86_64_RedHat5 $
10/06/15 17:16:08 (pid:14796) ** PID = 14796
10/06/15 17:16:08 (pid:14796) ** Log last touched 10/6 17:15:43
10/06/15 17:16:08 (pid:14796) ******************************************************
10/06/15 17:16:08 (pid:14796) Using config source: /home/boinc/CMSRun/glide_hY3D6I/condor_config
10/06/15 17:16:08 (pid:14796) config Macros = 212, Sorted = 212, StringBytes = 10694, TablesBytes = 7672
10/06/15 17:16:08 (pid:14796) CLASSAD_CACHING is OFF
10/06/15 17:16:08 (pid:14796) Daemon Log is logging: D_ALWAYS D_ERROR
10/06/15 17:16:08 (pid:14796) DaemonCore: command socket at <10.0.2.15:58763?noUDP>
10/06/15 17:16:08 (pid:14796) DaemonCore: private command socket at <10.0.2.15:58763>
10/06/15 17:16:09 (pid:14796) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#89568
10/06/15 17:16:09 (pid:14796) Communicating with shadow <130.246.180.120:9818?noUDP&sock=20016_d29f_56892>
10/06/15 17:16:09 (pid:14796) Submitting machine is "lcggwms02.gridpp.rl.ac.uk"
10/06/15 17:16:09 (pid:14796) setting the orig job name in starter
10/06/15 17:16:09 (pid:14796) setting the orig job iwd in starter
10/06/15 17:16:09 (pid:14796) Chirp config summary: IO false, Updates false, Delayed updates true.
10/06/15 17:16:09 (pid:14796) Initialized IO Proxy.
10/06/15 17:16:09 (pid:14796) Done setting resource limits
10/06/15 17:16:09 (pid:14796) FILETRANSFER: "/home/boinc/CMSRun/glide_hY3D6I/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/06/15 17:16:09 (pid:14796) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_hY3D6I/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_hY3D6I/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/06/15 17:16:10 (pid:14796) File transfer completed successfully.
10/06/15 17:16:10 (pid:14796) Job 138242.0 set to execute immediately
10/06/15 17:16:10 (pid:14796) Starting a VANILLA universe job with ID: 138242.0
10/06/15 17:16:10 (pid:14796) IWD: /home/boinc/CMSRun/glide_hY3D6I/execute/dir_14796
10/06/15 17:16:10 (pid:14796) Output file: /home/boinc/CMSRun/glide_hY3D6I/execute/dir_14796/_condor_stdout
10/06/15 17:16:10 (pid:14796) Error file: /home/boinc/CMSRun/glide_hY3D6I/execute/dir_14796/_condor_stderr
10/06/15 17:16:11 (pid:14796) Renice expr "0" evaluated to 0
10/06/15 17:16:11 (pid:14796) Using wrapper /home/boinc/CMSRun/glide_hY3D6I/condor_job_wrapper.sh to exec /home/boinc/CMSRun/glide_hY3D6I/execute/dir_14796/condor_exec.exe -a sandbox.tar.gz --sourceURL=https://cmsweb-testbed.cern.ch/crabcache --jobNumber=5250 --cmsswVersion=CMSSW_6_2_0_SLHC26_patch3 --scramArch=slc6_amd64_gcc472 --inputFile=job_input_file_list_5250.txt --runAndLumis=job_lumis_5250.json --lheInputFiles=False --firstEvent=131226 --firstLumi=5250 --lastEvent=131251 --firstRun=1 --seeding=AutomaticSeeding --scriptExe=None --eventsPerLumi=100 --scriptArgs=[] -o {}
10/06/15 17:16:11 (pid:14796) Running job as user (null)
10/06/15 17:16:11 (pid:14796) Create_Process succeeded, pid=14800
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,175,966 RAC: 1,853 |
Curious. 5143 appears to have finished (its output is on the DataBridge and its stdout is complete). The Condor node_state.txt doesn't show it was resubmitted:

[ Type = "NodeStatus"; Node = "Job5143"; NodeStatus = 5; /* "STATUS_DONE" */ StatusDetails = ""; RetryCount = 0; JobProcsQueued = 0; JobProcsHeld = 0; ]

Its successful incarnation started at 0114 UTC:

gWMS-CMSRunAnalysis.sh STARTING at Wed Oct 7 01:13:55 GMT 2015 on 246-563-16291

and finished at 0145:

gWMS-CMSRunAnalysis.sh FINISHING at Wed Oct 7 01:44:52 GMT 2015 on 246-563-16291 with (short) status 0

I haven't found anything yet in the logs I can access to identify which user/machine it ran on.
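In case anyone else wants to check a job the same way, here is a rough sketch of pulling one node's status out of a DAGMan node-status file like that node_state.txt (the regex assumes the simple "key = value;" layout quoted above; the real file may differ):

```python
import re

# NodeStatus 5 is reported as "STATUS_DONE" in the file quoted above.
STATUS_DONE = 5

def node_status(path: str, node: str):
    """Return (NodeStatus, RetryCount) for one node, or None if not found."""
    text = open(path).read()
    for block in re.findall(r"\[(.*?)\]", text, re.S):
        fields = dict(re.findall(r"(\w+)\s*=\s*([^;]+);", block))
        if fields.get("Node", "").strip('" ') == node:
            return int(fields["NodeStatus"]), int(fields["RetryCount"])
    return None

# e.g. node_status("node_state.txt", "Job5143") -> (5, 0): done, never retried
```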
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
The best I could find is the Dashboard job details page, which has the public IP of the machine, so for those who have fixed IPs might this not help show the user?

PS: If jobs could be sorted on this (WNIp) I could see all the ones I had done... or not. Very useful, if only I knew how...

Edit: 5143 shows
StartedRunningTimeStamp 2015-10-07 01:13:57
FinishedTimeStamp 2015-10-07 01:37:40
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,175,966 RAC: 1,853 |
The best I could find is the dashboard job details page which has the public IP of the machine so for those who have fixed IPs might this not help show the user? Which page is that exactly? |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Once or twice during a 24h task I get an empty dir xxxx. Condor does not seem to start. StarterLog:
10/07/15 13:53:59 (pid:30125) Running job as user (null)
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
The best I could find is the dashboard job details page which has the public IP of the machine so for those who have fixed IPs might this not help show the user? This is an example of another job |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 25 |
Curious. 5143 appears to have finished (its output is on the DataBridge and its stdout is complete). The Condor node_state.txt doesn't show it was resubmitted:

This is the BOINC task serving the jobs (also job 5143):
66911 62053 37 6 Oct 2015, 10:31:55 UTC 6 Oct 2015, 21:37:09 UTC Completed and validated
It was completed and reported hours before the Dashboard STARTING and FINISHING times of job 5143. What baffles me is that the Dashboard has stored my WNIp, yet the job needed only 24 minutes, whereas my machine needs about 1 hour for this kind of job when running the CPU at 100%.
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,175,966 RAC: 1,853 |
Following up from m's post I found this page for job 5143, which ties in with what I found on the Condor server. This gets more and more curious. |
Send message Joined: 20 May 15 Posts: 217 Credit: 6,111,068 RAC: 8,460 |
Should those 1970 time stamps be set to something more recent?
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 859,751 RAC: 25 |
Should those 1970 time stamps be set to something more recent?
1970-01-01 00:00:00 means zero, i.e. not (yet) set. Unix time counts the seconds since 1970-01-01 00:00:00 UTC. If you want to join the next big 2,000,000,000 celebration party, make a note in your calendar for 18 May 2033 03:33:20 UTC.
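A quick check in Python (illustrative, nothing project-specific):

```python
from datetime import datetime, timezone

# Unix time counts seconds since the epoch, 1970-01-01 00:00:00 UTC,
# so an unset (zero) timestamp renders as that 1970 date.
print(datetime.fromtimestamp(0, tz=timezone.utc))              # 1970-01-01 00:00:00+00:00
print(datetime.fromtimestamp(2_000_000_000, tz=timezone.utc))  # 2033-05-18 03:33:20+00:00
```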
Send message Joined: 20 May 15 Posts: 217 Credit: 6,111,068 RAC: 8,460 |
Should those 1970 time stamps be set to something more recent?
I know, and I can understand the ScheduledTimeStamp not getting set, but I thought the GridFinishedTimeStamp might have had a value. I've put it in my calendar; I assume we are all coming to your house for the party ;-)
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,175,966 RAC: 1,853 |
The best I could find is the dashboard job details page which has the public IP of the machine so for those who have fixed IPs might this not help show the user?
Yeah, thanks, but now that I've got a chance to wind down a bit after a hectic day, how did you get there? Dashboard is a bit like the old game of Colossal Cave Adventure: "a maze of twisty little passages, all alike!"