Message boards : News : New jobs available
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,304,147 | RAC: 3,524
I've now submitted a larger batch of jobs, since the failure rate seems manageable. A few host IP addresses kept recurring amongst the failures; I'll keep an eye out for them in future and contact the owners if they continue to misbehave. You can start running tasks again now if you wish.
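For anyone wanting to do a similar per-host tally of failures, here is a minimal sketch. It assumes a CSV export of the job list with "status" and "host_ip" columns; the file name and column names are hypothetical and would need adjusting to whatever the monitoring pages actually export.

```python
# Tally failed jobs per host IP from a CSV export of the job list.
# "job_report.csv" and the "status"/"host_ip" column names are assumptions;
# adjust them to match whatever the monitoring export actually provides.
import csv
from collections import Counter

failures = Counter()
with open("job_report.csv", newline="") as fh:
    for row in csv.DictReader(fh):
        if row["status"].lower() != "success":
            failures[row["host_ip"]] += 1

# Print the ten hosts with the most failures.
for ip, count in failures.most_common(10):
    print(f"{ip}: {count} failed jobs")
```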
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
"You can start running tasks again now if you wish."
Ok. Job 35 is running here and, apart from much downloading to start with, is running OK.
Edit: times and IP shown in Dashboard are correct... so far.
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,304,147 | RAC: 3,524
"Ok. Job 35 is running here and, apart from much downloading to start with, is running OK."
Probably re-synching changed files in cvmfs.
Joined: 13 Feb 15 | Posts: 1188 | Credit: 861,475 | RAC: 15
Sorry to say, Job 71 will not return from me. I suspended the task in BOINC and, instead of saving the VM, the VM stopped. So after resuming, the VM could not restore from a saved point and had to reboot.
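A hedged aside: for VirtualBox-wrapper tasks like these, enabling BOINC's "leave applications in memory while suspended" computing preference usually keeps a suspended VM paused in RAM rather than relying on a save-state, which may avoid this kind of forced reboot. A minimal global_prefs_override.xml sketch, assuming a standard BOINC client installation:

```xml
<!-- global_prefs_override.xml in the BOINC data directory (sketch only;
     merge with any existing overrides rather than replacing them). -->
<global_preferences>
   <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>
```

After editing the file, the client can be told to re-read it with `boinccmd --read_global_prefs_override`, or via the manager's preferences dialog.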
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
"Ok. Job 35 is running here and, apart from much downloading to start with, is running OK."
Seems reasonable. It seems to have downloaded about 500 MB. The next one, 94, started (well, fairly) quickly. When it's done, I'll put this box back to bed. Again, the Dashboard info seems OK. It will be interesting to see what Dashboard shows when a job fails.
Edit: just noticed this:
*** G4Exception : had012 issued by G4HadronicProcess:CheckResult()
Warning: Bad energy non-conservation detected, will re-sample the interaction
Process / Model: NeutronInelastic / FTFP
Primary: neutron (2112), E= 20055.7, target nucleus (6,12)
E(initial - final) = 7856.64 MeV.
*** This is just a warning message
That looks good; maybe it means we get something for nothing.
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
"Ok. Job 35 is running here and, apart from much downloading to start with, is running OK."
Maybe I was a bit too eager: I shut down after seeing exit status 0 appear in Dashboard, but the job has been stuck in "post processing" for over half an hour now whilst many other jobs have come and gone. Is the post-processing step done on the volunteer machine? Dashboard shows "wnpostproc", which implies that it is; if so, I cut it off in its prime, so to speak.
Joined: 13 Feb 15 | Posts: 1188 | Credit: 861,475 | RAC: 15
First attempt of Job 150 failed:
10/20/15 17:48:09 (pid:13725) Initialized IO Proxy.
10/20/15 17:48:09 (pid:13725) Done setting resource limits
10/20/15 17:48:09 (pid:13725) FILETRANSFER: "/home/boinc/CMSRun/glide_soCNBI/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/20/15 17:48:09 (pid:13725) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_soCNBI/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_soCNBI/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/20/15 17:48:09 (pid:13725) Got SIGTERM. Performing graceful shutdown.
10/20/15 17:48:09 (pid:13725) ShutdownGraceful all jobs.
10/20/15 17:48:10 (pid:13725) ERROR "FileTransfer::UpLoadFiles called during active transfer! " at line 1159 in file /slots/12/dir_4417/userdir/src/condor_utils/file_transfer.cpp
10/20/15 17:48:10 (pid:13725) ShutdownFast all jobs.
10/20/15 17:48:10 (pid:13728) Failed to receive transfer queue response from schedd at <130.246.180.120:59704> for job 156053.0 (initial file /var/lib/condor/spool/5897/0/cluster155897.proc0.subproc0/CMSRunAnalysis.sh).
10/20/15 17:48:10 (pid:13731) ******************************************************
Joined: 29 May 15 | Posts: 147 | Credit: 2,842,484 | RAC: 0
My machine sits here for hours now and seems to be looping through something endless.

From http://localhost:52386/logs/run-1/glide_ETgbjq/StartdLog :

10/20/15 17:50:38 (pid:7603) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#102700
10/20/15 18:00:13 (pid:7603) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9620, fd is 8
10/20/15 18:00:13 (pid:7603) Buf::write(): condor_write() failed
10/20/15 18:11:29 (pid:7603) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9620, fd is 8
10/20/15 18:11:29 (pid:7603) Buf::write(): condor_write() failed
[... the same condor_write() / Buf::write() failure pair repeats roughly every 11 minutes, the last at 19:30:22 ...]

And this is the last five minutes from http://localhost:52386/logs/run-1/glide_ETgbjq/ProcLog :

10/20/15 19:31:00 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:00 : gathering usage data for family with root pid 11118
10/20/15 19:31:01 : taking a snapshot...
10/20/15 19:31:01 : ...snapshot complete
10/20/15 19:31:05 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:05 : gathering usage data for family with root pid 11118
10/20/15 19:31:10 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:10 : gathering usage data for family with root pid 11118
10/20/15 19:31:15 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:15 : gathering usage data for family with root pid 11118
10/20/15 19:31:16 : taking a snapshot...
10/20/15 19:31:16 : ...snapshot complete
[... the same PROC_FAMILY_GET_USAGE / snapshot cycle repeats every few seconds, with an occasional "ProcAPI: new boottime = 1445350682; old_boottime = 1445350682; /proc/stat boottime = 1445350682; /proc/uptime boottime = 1445350682" line, right up to 19:37:46 ...]
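The repeated condor_write() failures above are attempts to reach the Condor collector/CCB at lcggwms02.gridpp.rl.ac.uk:9620. As a rough, hedged first-pass check that outbound traffic to that port is not being blocked on the host side, something like the following can be run on the host machine; note that it only tests plain TCP reachability from the host, not from inside the VM.

```python
# Quick TCP reachability check for the collector the StartdLog keeps failing
# to write to (host name and port taken from the log excerpt above).
import socket

HOST, PORT = "lcggwms02.gridpp.rl.ac.uk", 9620

try:
    with socket.create_connection((HOST, PORT), timeout=10):
        print(f"TCP connection to {HOST}:{PORT} succeeded")
except OSError as exc:
    print(f"Could not reach {HOST}:{PORT}: {exc}")
```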
Joined: 29 May 15 | Posts: 147 | Credit: 2,842,484 | RAC: 0
"My machine sits here for hours now and seems to be looping through something endless."
Thanks to Microsoft: they had published a patch which forced my desktop to reboot, and now I'm back crunching.
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,304,147 | RAC: 3,524
"Ok. Job 35 is running here and, apart from much downloading to start with, is running OK."
As far as I understand it, post-processing is done on the server, given that before it's done the job output file says "waiting for postproc" and the postproc log says "queued for postproc". In any event, the output for 35 showed up on the databridge at 1407 UTC, so Condor, which feeds Dashboard, was being slow to report.
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
Thanks, Ivan, but the job in question is 94 and is one of four which seem similarly stuck. The others are 72, 104 and 152. I can't get to any of the job logs from Dashboard; the request just times out and never asks for any credentials.
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,304,147 | RAC: 3,524
"Thanks, Ivan, but the job in question is 94 and is one of four which seem similarly stuck."
Ah, sorry, my eyes aren't what they used to be. (Omnes: They used to be my ears!) Let's see: 94 returned a result at 0353 Wed, 72 at 0350, 104 at 0715, and 152 at 1611 Tues. I've not been able to get job logs from Dashboard either; they get served up from Condor -- its web server may need tickling -- and I'd expect they come from the output files I see, which aren't available until after post-processing. Hmm, yes, http://lcggwms02.gridpp.rl.ac.uk/mon/cms005/151020_120153:ireid_crab_CMS_at_Home_TTbar_Phedex14/job_out.94.0.txt -- that's some sort of a redirect to my home directory. I'll ask Andrew to check it.
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
"Thanks, Ivan, but the job in question is 94 and is one of four which seem similarly stuck."
Something very strange is going on... maybe the workings of a cache or a cleanup script somewhere. Job 94 now shows as "finished", as do 72 and 104, whilst 152 is back to "running". No retries. The only job of mine is 94, and the start and finish times shown are different (and wrong) by about +13h from those shown when it was in the "wnpostproc" state, which were correct. As I remember, job 72 has been similarly affected. Maybe the times aren't supposed to refer to the actual processing, which the user doesn't care about, but to the availability of a result, which they do. BUT this only really applies to the finish time, so why is the start time altered as well? Dashboard Strikes Again (must remember: no article, capital D, capital D, capital D...).
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,304,147 | RAC: 3,524
I long ago gave up trying to decipher the workings of Dashboard...
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,304,147 | RAC: 3,524
"http://lcggwms02.gridpp.rl.ac.uk/mon/cms005/151020_120153:ireid_crab_CMS_at_Home_TTbar_Phedex14/job_out.94.0.txt -- that's some sort of a redirect to my home directory. I'll ask Andrew to check it."
Turns out it's not easily activated -- port 80 doesn't get through their firewall, for a start -- so let's not pursue it at this time.
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
OK, thanks.
Joined: 17 Aug 15 | Posts: 17 | Credit: 228,358 | RAC: 0
While running the new CMS-dev jobs on my i7-4790 (Win7 64-bit, ASRock Z97 motherboard), I am starting to see "Error ID 19", which is an "Internal parity error". This machine is not overclocked in any way, and I have never seen this error before in any circumstance. However, I am now also running two vLHC jobs, which I hadn't done before alongside CMS-dev, so it may be the combination that is triggering the error. It may not really be an error but more of a bug, since it has all the same symptoms as this: https://communities.vmware.com/thread/471348
At present it is occurring about every half hour to an hour and a half, and the processor ID reported may differ each time. It has not caused any operational difficulties or crashes thus far. In case anyone is interested, the full event is:
Log Name: System
Joined: 13 Feb 15 | Posts: 1188 | Credit: 861,475 | RAC: 15
Now that the last batch of 2,000 jobs from task 151020_120153:ireid_crab_CMS_at_Home_TTbar_Phedex14 is (almost) done, some figures:
Still running: 1, and only 11 failed.
In WNPostProc state: 2 (I don't know why post-processing takes that long for those two).
Success: 1,986.
Of those successful jobs, 28 needed 3 attempts to finish and 249 succeeded after 2 attempts. IMO that number of extra attempts is excessive.
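To put the retry figures in perspective, a quick back-of-the-envelope calculation using only the numbers quoted above:

```python
# Retry overhead for the batch, using the figures quoted above.
total_success = 1986
second_attempt = 249   # succeeded on the 2nd attempt (1 retry each)
third_attempt = 28     # succeeded on the 3rd attempt (2 retries each)

retried_jobs = second_attempt + third_attempt
extra_attempts = second_attempt * 1 + third_attempt * 2

print(f"{retried_jobs} of {total_success} successes ({retried_jobs / total_success:.1%}) needed at least one retry")
print(f"{extra_attempts} extra attempts on top of the {total_success} eventual successes ({extra_attempts / total_success:.1%} overhead)")
```

That works out to roughly 14% of successful jobs needing a retry, before even counting the 11 outright failures.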
Joined: 20 Jan 15 | Posts: 1139 | Credit: 8,304,147 | RAC: 3,524
"Now that the last batch of 2,000 jobs from task 151020_120153:ireid_crab_CMS_at_Home_TTbar_Phedex14 is (almost) done, some figures:"
I would agree. You just reminded me that I never got a reply from the chap with the excessive failures; I'll have to try e-mail rather than PM. If you check the failures and sort by IP, he had ~90 of them.
Joined: 20 Mar 15 | Posts: 243 | Credit: 886,442 | RAC: 0
Has anybody carefully followed what happens to those jobs that are abandoned when the host is shut down, rebooted or whatever? Could they show up like this: "N/A / Error return without specification"? Many initial failures do. Maybe Condor puts the "error" bit in because some timeout or heartbeat fails; there certainly wouldn't be a "specification". They all require at least one retry. At the moment there will probably be a higher rate of these events because people are "poking around", looking for problems and suchlike. Many hosts (mine...) don't run continuously, which tends towards one extra attempt per host per day. IMO you can't reasonably require volunteers to run machines continuously, although I realise many do.