Message boards : News : New jobs available
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,304,147 | RAC: 3,524
I've now submitted a larger batch of jobs, since the failure rate seems manageable. A few host IP addresses kept recurring amongst the failures; I'll keep an eye out for them in future and contact the owners if they continue to misbehave. You can start running tasks again now if you wish.
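For anyone wanting to do a similar per-host tally of failures, here is a minimal sketch. It assumes a CSV export of the job list with "status" and "host_ip" columns; the file name and column names are hypothetical and would need adjusting to whatever the monitoring pages actually export.

```python
# Tally failed jobs per host IP from a CSV export of the job list.
# "job_report.csv" and the "status"/"host_ip" column names are assumptions;
# adjust them to match whatever the monitoring export actually provides.
import csv
from collections import Counter

failures = Counter()
with open("job_report.csv", newline="") as fh:
    for row in csv.DictReader(fh):
        if row["status"].lower() != "success":
            failures[row["host_ip"]] += 1

# Print the ten hosts with the most failures.
for ip, count in failures.most_common(10):
    print(f"{ip}: {count} failed jobs")
```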
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
"You can start running tasks again now if you wish."
Ok. Job 35 is running here and, apart from much downloading to start with, is running OK.
Edit: times and IP shown in Dashboard are correct... so far.
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,304,147 | RAC: 3,524
"Ok. Job 35 is running here and, apart from much downloading to start with, is running OK."
Probably re-synching changed files in cvmfs.
Joined: 13 Feb 15 | Posts: 1188 | Credit: 861,475 | RAC: 15
Sorry to say, Job 71 will not return from me. I suspended the task in BOINC and, instead of saving the VM, the VM stopped. So after resuming, the VM could not restore from a saved point and had to reboot.
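A hedged aside: for VirtualBox-wrapper tasks like these, enabling BOINC's "leave applications in memory while suspended" computing preference usually keeps a suspended VM paused in RAM rather than relying on a save-state, which may avoid this kind of forced reboot. A minimal global_prefs_override.xml sketch, assuming a standard BOINC client installation:

```xml
<!-- global_prefs_override.xml in the BOINC data directory (sketch only;
     merge with any existing overrides rather than replacing them). -->
<global_preferences>
   <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>
```

After editing the file, the client can be told to re-read it with `boinccmd --read_global_prefs_override`, or via the manager's preferences dialog.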
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
"Ok. Job 35 is running here and, apart from much downloading to start with, is running OK."
Seems reasonable. It seems to have downloaded about 500 MB. The next one, 94, started (well, fairly) quickly. When it's done, I'll put this box back to bed. Again, the Dashboard info seems OK. It will be interesting to see what Dashboard shows when a job fails.
Edit: just noticed this:
*** G4Exception : had012 issued by G4HadronicProcess:CheckResult()
Warning: Bad energy non-conservation detected, will re-sample the interaction
Process / Model: NeutronInelastic / FTFP
Primary: neutron (2112), E= 20055.7, target nucleus (6,12)
E(initial - final) = 7856.64 MeV.
*** This is just a warning message
That looks good; maybe it means we get something for nothing.
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
"Ok. Job 35 is running here and, apart from much downloading to start with, is running OK."
Maybe I was a bit too eager: I shut down after seeing exit status 0 appear in Dashboard, but the job has been stuck in "post processing" for over half an hour now whilst many other jobs have come and gone. Is the post-processing step done on the volunteer machine? Dashboard shows "wnpostproc", which implies that it is; if so, I cut it off in its prime, so to speak.
Joined: 13 Feb 15 | Posts: 1188 | Credit: 861,475 | RAC: 15
First attempt of Job 150 failed:
10/20/15 17:48:09 (pid:13725) Initialized IO Proxy.
10/20/15 17:48:09 (pid:13725) Done setting resource limits
10/20/15 17:48:09 (pid:13725) FILETRANSFER: "/home/boinc/CMSRun/glide_soCNBI/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/20/15 17:48:09 (pid:13725) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_soCNBI/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_soCNBI/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/20/15 17:48:09 (pid:13725) Got SIGTERM. Performing graceful shutdown.
10/20/15 17:48:09 (pid:13725) ShutdownGraceful all jobs.
10/20/15 17:48:10 (pid:13725) ERROR "FileTransfer::UpLoadFiles called during active transfer! " at line 1159 in file /slots/12/dir_4417/userdir/src/condor_utils/file_transfer.cpp
10/20/15 17:48:10 (pid:13725) ShutdownFast all jobs.
10/20/15 17:48:10 (pid:13728) Failed to receive transfer queue response from schedd at <130.246.180.120:59704> for job 156053.0 (initial file /var/lib/condor/spool/5897/0/cluster155897.proc0.subproc0/CMSRunAnalysis.sh).
10/20/15 17:48:10 (pid:13731) ******************************************************
Joined: 29 May 15 | Posts: 147 | Credit: 2,842,484 | RAC: 0
My machine sits here for hours now and seems to be looping through something endless.

From http://localhost:52386/logs/run-1/glide_ETgbjq/StartdLog :

10/20/15 17:50:38 (pid:7603) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#102700
10/20/15 18:00:13 (pid:7603) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9620, fd is 8
10/20/15 18:00:13 (pid:7603) Buf::write(): condor_write() failed
10/20/15 18:11:29 (pid:7603) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9620, fd is 8
10/20/15 18:11:29 (pid:7603) Buf::write(): condor_write() failed
[... the same condor_write() / Buf::write() failure pair repeats roughly every 11 minutes, the last at 19:30:22 ...]

And this is the last five minutes from http://localhost:52386/logs/run-1/glide_ETgbjq/ProcLog :

10/20/15 19:31:00 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:00 : gathering usage data for family with root pid 11118
10/20/15 19:31:01 : taking a snapshot...
10/20/15 19:31:01 : ...snapshot complete
10/20/15 19:31:05 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:05 : gathering usage data for family with root pid 11118
10/20/15 19:31:10 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:10 : gathering usage data for family with root pid 11118
10/20/15 19:31:15 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:15 : gathering usage data for family with root pid 11118
10/20/15 19:31:16 : taking a snapshot...
10/20/15 19:31:16 : ...snapshot complete
[... the same PROC_FAMILY_GET_USAGE / snapshot cycle repeats every few seconds, with an occasional "ProcAPI: new boottime = 1445350682; old_boottime = 1445350682; /proc/stat boottime = 1445350682; /proc/uptime boottime = 1445350682" line, right up to 19:37:46 ...]
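The repeated condor_write() failures above are attempts to reach the Condor collector/CCB at lcggwms02.gridpp.rl.ac.uk:9620. As a rough, hedged first-pass check that outbound traffic to that port is not being blocked on the host side, something like the following can be run on the host machine; note that it only tests plain TCP reachability from the host, not from inside the VM.

```python
# Quick TCP reachability check for the collector the StartdLog keeps failing
# to write to (host name and port taken from the log excerpt above).
import socket

HOST, PORT = "lcggwms02.gridpp.rl.ac.uk", 9620

try:
    with socket.create_connection((HOST, PORT), timeout=10):
        print(f"TCP connection to {HOST}:{PORT} succeeded")
except OSError as exc:
    print(f"Could not reach {HOST}:{PORT}: {exc}")
```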
Joined: 29 May 15 | Posts: 147 | Credit: 2,842,484 | RAC: 0
"My machine sits here for hours now and seems to be looping through something endless."
Thanks to Microsoft: they had published a patch which forced my desktop to reboot, and now I'm back crunching.
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,304,147 | RAC: 3,524
"Ok. Job 35 is running here and, apart from much downloading to start with, is running OK."
As far as I understand it, post-processing is done on the server, given that before it's done the job output file says "waiting for postproc" and the postproc log says "queued for postproc". In any event, the output for 35 showed up on the databridge at 1407 UTC, so Condor, which feeds Dashboard, was being slow to report.
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
Thanks, Ivan, but the job in question is 94 and is one of four which seem similarly stuck. The others are 72, 104 and 152. I can't get to any of the job logs from Dashboard; the request just times out and never asks for any credentials.
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,304,147 | RAC: 3,524
"Thanks, Ivan, but the job in question is 94 and is one of four which seem similarly stuck."
Ah, sorry, my eyes aren't what they used to be. (Omnes: They used to be my ears!) Let's see: 94 returned a result at 0353 Wed, 72 at 0350, 104 at 0715, and 152 at 1611 Tues. I've not been able to get job logs from Dashboard either; they get served up from Condor -- its web server may need tickling -- and I'd expect they come from the output files I see, which aren't available until after post-processing. Hmm, yes, http://lcggwms02.gridpp.rl.ac.uk/mon/cms005/151020_120153:ireid_crab_CMS_at_Home_TTbar_Phedex14/job_out.94.0.txt -- that's some sort of a redirect to my home directory. I'll ask Andrew to check it.
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
"Thanks, Ivan, but the job in question is 94 and is one of four which seem similarly stuck."
Something very strange is going on... maybe the workings of a cache or a cleanup script somewhere. Job 94 now shows as "finished", as do 72 and 104, whilst 152 is back to "running". No retries. The only job of mine is 94, and the start and finish times shown are different (and wrong) by about +13h from those shown when it was in the "wnpostproc" state, which were correct. As I remember, job 72 has been similarly affected. Maybe the times aren't supposed to refer to the actual processing, which the user doesn't care about, but to the availability of a result, which they do. BUT this only really applies to the finish time, so why is the start time altered as well? Dashboard Strikes Again (must remember: no article, capital D, capital D, capital D...).
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,304,147 | RAC: 3,524
I long ago gave up trying to decipher the workings of Dashboard...
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,304,147 | RAC: 3,524
"http://lcggwms02.gridpp.rl.ac.uk/mon/cms005/151020_120153:ireid_crab_CMS_at_Home_TTbar_Phedex14/job_out.94.0.txt -- that's some sort of a redirect to my home directory. I'll ask Andrew to check it."
Turns out it's not easily activated -- port 80 doesn't get through their firewall, for a start -- so let's not pursue it at this time.
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
OK, thanks.
Joined: 17 Aug 15 | Posts: 17 | Credit: 228,358 | RAC: 0
While running the new CMS-dev jobs on my i7-4790 (Win7 64-bit, ASRock Z97 motherboard), I am starting to see "Error ID 19", which is an "Internal parity error". This machine is not overclocked in any way, and I have never seen this error before in any circumstance. However, I am now also running two vLHC jobs, which I hadn't done before alongside CMS-dev, so it may be the combination that is triggering the error. It may not really be an error but more of a bug, since it has all the same symptoms as this: https://communities.vmware.com/thread/471348
At present it is occurring about every half hour to an hour and a half, and the processor ID reported may differ each time. It has not caused any operational difficulties or crashes thus far. In case anyone is interested, the full event is:
Log Name: System
Joined: 13 Feb 15 | Posts: 1188 | Credit: 861,475 | RAC: 15
Now that the last batch of 2,000 jobs from task 151020_120153:ireid_crab_CMS_at_Home_TTbar_Phedex14 is (almost) done, some figures:
Still running: 1, and only 11 failed.
In WNPostProc state: 2 (I don't know why post-processing takes that long for those two).
Success: 1,986.
Of those successful jobs, 28 needed 3 attempts to finish and 249 succeeded after 2 attempts. IMO that number of extra attempts is excessive.
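To put the retry figures in perspective, a quick back-of-the-envelope calculation using only the numbers quoted above:

```python
# Retry overhead for the batch, using the figures quoted above.
total_success = 1986
second_attempt = 249   # succeeded on the 2nd attempt (1 retry each)
third_attempt = 28     # succeeded on the 3rd attempt (2 retries each)

retried_jobs = second_attempt + third_attempt
extra_attempts = second_attempt * 1 + third_attempt * 2

print(f"{retried_jobs} of {total_success} successes ({retried_jobs / total_success:.1%}) needed at least one retry")
print(f"{extra_attempts} extra attempts on top of the {total_success} eventual successes ({extra_attempts / total_success:.1%} overhead)")
```

That works out to roughly 14% of successful jobs needing a retry, before even counting the 11 outright failures.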
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,304,147 | RAC: 3,524
"Now that the last batch of 2,000 jobs from task 151020_120153:ireid_crab_CMS_at_Home_TTbar_Phedex14 is (almost) done, some figures:"
I would agree. You just reminded me that I never got a reply from the chap with the excessive failures; I'll have to try e-mail rather than PM. If you check the failures and sort by IP, he had ~90 of them.
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
Has anybody carefully followed what happens to those jobs that are abandoned when the host is shut down, rebooted or whatever? Could they show up like this: "N/A / Error return without specification"? Many initial failures do. Maybe Condor puts the "error" bit in because some timeout or heartbeat fails; there certainly wouldn't be a "specification". They all require at least one retry. At the moment there will probably be a higher rate of these events because people are "poking around", looking for problems and suchlike. Many hosts (mine...) don't run continuously, which tends towards one extra attempt per host per day. IMO you can't reasonably require volunteers to run machines continuously, although I realise many do.