Message boards : News : New jobs available

ivan (Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist)
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,304,147 · RAC: 3,524
Message 1264 - Posted: 20 Oct 2015, 12:20:57 UTC
Last modified: 20 Oct 2015, 12:21:44 UTC

I've now submitted a larger batch of jobs, since the failure rate seems manageable. A few host IP addresses recurred amongst the failures; I'll keep an eye out for them in future and contact the owners if they continue to misbehave. You can start running tasks again now if you wish.

m (Volunteer tester)
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0
Message 1265 - Posted: 20 Oct 2015, 13:01:32 UTC - in response to Message 1264.  
Last modified: 20 Oct 2015, 13:19:52 UTC

You can start running tasks again now if you wish.

Ok. Job 35 is running here and, apart from much downloading to start with, is running OK.

Edit. Times and IP shown in Dashboard are correct.... so far.

ivan (Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist)
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,304,147 · RAC: 3,524
Message 1266 - Posted: 20 Oct 2015, 13:17:19 UTC - in response to Message 1265.  

Ok. Job 35 is running here and, apart from much downloading to start with, is running OK.

Probably re-synching changed files in cvmfs.
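
If anyone wants to check that, something along these lines should work from a shell inside the VM. It's only a sketch: it assumes the cvmfs client tools are on the path and that the repository in question is cms.cern.ch, neither of which I've verified here.

# Rough check of what the cvmfs client has actually pulled down.
# The repository name "cms.cern.ch" is an assumption -- adjust to whatever the VM mounts.
import subprocess

for cmd in (["cvmfs_config", "probe", "cms.cern.ch"],
            ["cvmfs_config", "stat", "cms.cern.ch"]):
    print("$", " ".join(cmd))
    result = subprocess.run(cmd, capture_output=True, text=True)
    # "stat" prints a header line and a data line; the cache-usage field there
    # should roughly match the ~500MB download mentioned above.
    print(result.stdout or result.stderr)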

Crystal Pellet (Volunteer tester)
Joined: 13 Feb 15 · Posts: 1188 · Credit: 861,475 · RAC: 15
Message 1267 - Posted: 20 Oct 2015, 14:12:59 UTC

Sorry to say: Job 71 will not be returned from here.

I suspended the task in BOINC and, instead of the VM being saved, the VM simply stopped.

So after resuming, the VM could not restore from a saved point and had to reboot.
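
For what it's worth, the distinction can be reproduced by hand with VirtualBox's command-line tool: a clean suspend should amount to a savestate, whereas a plain stop throws the saved point away. A minimal sketch, assuming the VirtualBox tools are installed; "CMS_VM" is just a placeholder for whatever name "VBoxManage list vms" reports.

# Sketch of the suspend/resume round trip a clean BOINC suspend should perform.
# "CMS_VM" is a placeholder VM name, not the real slot VM.
import subprocess

VM = "CMS_VM"

def vbox(*args):
    print("$ VBoxManage", " ".join(args))
    return subprocess.run(["VBoxManage", *args], check=False)

vbox("controlvm", VM, "savestate")         # what suspending the task should do
vbox("startvm", VM, "--type", "headless")  # resume: restores from the saved state
# By contrast, "controlvm CMS_VM poweroff" (or the VM just stopping, as above)
# discards the saved point, so the guest has to reboot on the next start.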

m (Volunteer tester)
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0
Message 1268 - Posted: 20 Oct 2015, 14:28:16 UTC - in response to Message 1266.  
Last modified: 20 Oct 2015, 15:06:31 UTC

Ok. Job 35 is running here and, apart from much downloading to start with, is running OK.

Probably re-synching changed files in cvmfs.

Seems reasonable. It seems to have downloaded about 500MB. The next one, 94, started (well, fairly) quickly. When it's done, I'll put this box back to bed.
Again, dashboard info seems OK. It will be interesting to see what dashboard shows when a job fails.

Edit. Just noticed this:-

*** G4Exception : had012 issued by G4HadronicProcess:CheckResult()
Warning: Bad energy non-conservation detected, will re-sample the interaction
Process / Model: NeutronInelastic / FTFP
Primary: neutron (2112), E= 20055.7, target nucleus (6,12)
E(initial - final) = 7856.64 MeV.

*** This is just a warning message

That looks good, maybe it means we get something for nothing.

m (Volunteer tester)
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0
Message 1269 - Posted: 20 Oct 2015, 16:06:31 UTC - in response to Message 1268.  

Ok. Job 35 is running here and, apart from much downloading to start with, is running OK.

Probably re-synching changed files in cvmfs.

Seems reasonable. It seems to have downloaded about 500MB. The next one, 94, started (well, fairly) quickly. When it's done, I'll put this box back to bed.

Maybe I was a bit too eager; I shut down after seeing exit status 0 appear in dashboard, but it's been stuck in "post processing" for over half an hour now whilst many other jobs have come and gone. Is the post-processing step done on the volunteer machine? Dashboard shows "wnpostproc", which implies that it is; if so, I cut it off in its prime, so to speak.

Crystal Pellet (Volunteer tester)
Joined: 13 Feb 15 · Posts: 1188 · Credit: 861,475 · RAC: 15
Message 1270 - Posted: 20 Oct 2015, 17:06:15 UTC

First attempt of Job 150 failed:

10/20/15 17:48:09 (pid:13725) Initialized IO Proxy.
10/20/15 17:48:09 (pid:13725) Done setting resource limits
10/20/15 17:48:09 (pid:13725) FILETRANSFER: "/home/boinc/CMSRun/glide_soCNBI/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/20/15 17:48:09 (pid:13725) FILETRANSFER: failed to add plugin "/home/boinc/CMSRun/glide_soCNBI/main/condor/libexec/curl_plugin" because: FILETRANSFER:1:"/home/boinc/CMSRun/glide_soCNBI/main/condor/libexec/curl_plugin -classad" did not produce any output, ignoring
10/20/15 17:48:09 (pid:13725) Got SIGTERM. Performing graceful shutdown.
10/20/15 17:48:09 (pid:13725) ShutdownGraceful all jobs.
10/20/15 17:48:10 (pid:13725) ERROR "FileTransfer::UpLoadFiles called during active transfer!
" at line 1159 in file /slots/12/dir_4417/userdir/src/condor_utils/file_transfer.cpp
10/20/15 17:48:10 (pid:13725) ShutdownFast all jobs.
10/20/15 17:48:10 (pid:13728) Failed to receive transfer queue response from schedd at <130.246.180.120:59704> for job 156053.0 (initial file /var/lib/condor/spool/5897/0/cluster155897.proc0.subproc0/CMSRunAnalysis.sh).
10/20/15 17:48:10 (pid:13731) ******************************************************

Yeti
Joined: 29 May 15 · Posts: 147 · Credit: 2,842,484 · RAC: 0
Message 1271 - Posted: 20 Oct 2015, 17:40:24 UTC

My machine has been sitting here for hours now and seems to be looping through something endlessly:

from http://localhost:52386/logs/run-1/glide_ETgbjq/StartdLog

-----------------------------------------
10/20/15 17:50:38 (pid:7603) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#102700
10/20/15 18:00:13 (pid:7603) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9620, fd is 8
10/20/15 18:00:13 (pid:7603) Buf::write(): condor_write() failed
10/20/15 18:11:29 (pid:7603) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9620, fd is 8
10/20/15 18:11:29 (pid:7603) Buf::write(): condor_write() failed
10/20/15 18:22:45 (pid:7603) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9620, fd is 8
10/20/15 18:22:45 (pid:7603) Buf::write(): condor_write() failed
10/20/15 18:34:01 (pid:7603) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9620, fd is 8
10/20/15 18:34:01 (pid:7603) Buf::write(): condor_write() failed
10/20/15 18:45:18 (pid:7603) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9620, fd is 8
10/20/15 18:45:18 (pid:7603) Buf::write(): condor_write() failed
10/20/15 18:56:34 (pid:7603) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9620, fd is 8
10/20/15 18:56:34 (pid:7603) Buf::write(): condor_write() failed
10/20/15 19:07:50 (pid:7603) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9620, fd is 8
10/20/15 19:07:50 (pid:7603) Buf::write(): condor_write() failed
10/20/15 19:19:06 (pid:7603) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9620, fd is 8
10/20/15 19:19:06 (pid:7603) Buf::write(): condor_write() failed
10/20/15 19:30:22 (pid:7603) condor_write(): Socket closed when trying to write 4096 bytes to collector lcggwms02.gridpp.rl.ac.uk:9620, fd is 8
10/20/15 19:30:22 (pid:7603) Buf::write(): condor_write() failed
----------------------------------------------------------
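
Those repeated condor_write failures look as if the glidein keeps losing its connection to the collector. A quick and rough reachability check, run from the same machine, would at least separate "port blocked" from "connection being dropped"; host and port are taken from the log above, everything else is guesswork.

# Rough reachability check for the Condor collector named in the StartdLog above.
import socket

HOST, PORT = "lcggwms02.gridpp.rl.ac.uk", 9620
try:
    with socket.create_connection((HOST, PORT), timeout=10):
        print(f"TCP connect to {HOST}:{PORT} succeeded")
except OSError as exc:
    print(f"TCP connect to {HOST}:{PORT} failed: {exc}")
# If the connect succeeds but the glidein still logs "Socket closed when trying
# to write", something in between (NAT timeout, proxy, local firewall) is more
# likely to be dropping idle connections than the port being blocked outright.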

And this is last 5 minutes from http://localhost:52386/logs/run-1/glide_ETgbjq/ProcLog:
----------------------------------------------------------
10/20/15 19:31:00 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:00 : gathering usage data for family with root pid 11118
10/20/15 19:31:01 : taking a snapshot...
10/20/15 19:31:01 : ...snapshot complete
10/20/15 19:31:05 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:05 : gathering usage data for family with root pid 11118
10/20/15 19:31:10 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:10 : gathering usage data for family with root pid 11118
10/20/15 19:31:15 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:15 : gathering usage data for family with root pid 11118
10/20/15 19:31:16 : taking a snapshot...
10/20/15 19:31:16 : ...snapshot complete
10/20/15 19:31:20 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:20 : gathering usage data for family with root pid 11118
10/20/15 19:31:25 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:25 : gathering usage data for family with root pid 11118
10/20/15 19:31:30 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:30 : gathering usage data for family with root pid 11118
10/20/15 19:31:31 : taking a snapshot...
10/20/15 19:31:31 : ...snapshot complete
10/20/15 19:31:35 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:35 : gathering usage data for family with root pid 11118
10/20/15 19:31:40 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:40 : gathering usage data for family with root pid 11118
10/20/15 19:31:45 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:45 : gathering usage data for family with root pid 11118
10/20/15 19:31:46 : taking a snapshot...
10/20/15 19:31:46 : ProcAPI: new boottime = 1445350682; old_boottime = 1445350682; /proc/stat boottime = 1445350682; /proc/uptime boottime = 1445350682
10/20/15 19:31:46 : ...snapshot complete
10/20/15 19:31:50 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:50 : gathering usage data for family with root pid 11118
10/20/15 19:31:55 : PROC_FAMILY_GET_USAGE
10/20/15 19:31:55 : gathering usage data for family with root pid 11118
10/20/15 19:32:00 : PROC_FAMILY_GET_USAGE
10/20/15 19:32:00 : gathering usage data for family with root pid 11118
10/20/15 19:32:01 : taking a snapshot...
10/20/15 19:32:01 : ...snapshot complete
10/20/15 19:32:05 : PROC_FAMILY_GET_USAGE
10/20/15 19:32:05 : gathering usage data for family with root pid 11118
10/20/15 19:32:10 : PROC_FAMILY_GET_USAGE
10/20/15 19:32:10 : gathering usage data for family with root pid 11118
10/20/15 19:32:15 : PROC_FAMILY_GET_USAGE
10/20/15 19:32:15 : gathering usage data for family with root pid 11118
10/20/15 19:32:16 : taking a snapshot...
10/20/15 19:32:16 : ...snapshot complete
10/20/15 19:32:20 : PROC_FAMILY_GET_USAGE
10/20/15 19:32:20 : gathering usage data for family with root pid 11118
10/20/15 19:32:25 : PROC_FAMILY_GET_USAGE
10/20/15 19:32:25 : gathering usage data for family with root pid 11118
10/20/15 19:32:30 : PROC_FAMILY_GET_USAGE
10/20/15 19:32:30 : gathering usage data for family with root pid 11118
10/20/15 19:32:31 : taking a snapshot...
10/20/15 19:32:31 : ...snapshot complete
10/20/15 19:32:35 : PROC_FAMILY_GET_USAGE
10/20/15 19:32:35 : gathering usage data for family with root pid 11118
10/20/15 19:32:40 : PROC_FAMILY_GET_USAGE
10/20/15 19:32:40 : gathering usage data for family with root pid 11118
10/20/15 19:32:45 : PROC_FAMILY_GET_USAGE
10/20/15 19:32:45 : gathering usage data for family with root pid 11118
10/20/15 19:32:46 : taking a snapshot...
10/20/15 19:32:46 : ProcAPI: new boottime = 1445350682; old_boottime = 1445350682; /proc/stat boottime = 1445350682; /proc/uptime boottime = 1445350682
10/20/15 19:32:46 : ...snapshot complete
10/20/15 19:32:50 : PROC_FAMILY_GET_USAGE
10/20/15 19:32:50 : gathering usage data for family with root pid 11118
10/20/15 19:32:55 : PROC_FAMILY_GET_USAGE
10/20/15 19:32:55 : gathering usage data for family with root pid 11118
10/20/15 19:33:00 : PROC_FAMILY_GET_USAGE
10/20/15 19:33:00 : gathering usage data for family with root pid 11118
10/20/15 19:33:01 : taking a snapshot...
10/20/15 19:33:01 : ...snapshot complete
10/20/15 19:33:05 : PROC_FAMILY_GET_USAGE
10/20/15 19:33:05 : gathering usage data for family with root pid 11118
10/20/15 19:33:10 : PROC_FAMILY_GET_USAGE
10/20/15 19:33:10 : gathering usage data for family with root pid 11118
10/20/15 19:33:15 : PROC_FAMILY_GET_USAGE
10/20/15 19:33:15 : gathering usage data for family with root pid 11118
10/20/15 19:33:16 : taking a snapshot...
10/20/15 19:33:16 : ...snapshot complete
10/20/15 19:33:20 : PROC_FAMILY_GET_USAGE
10/20/15 19:33:20 : gathering usage data for family with root pid 11118
10/20/15 19:33:25 : PROC_FAMILY_GET_USAGE
10/20/15 19:33:25 : gathering usage data for family with root pid 11118
10/20/15 19:33:30 : PROC_FAMILY_GET_USAGE
10/20/15 19:33:30 : gathering usage data for family with root pid 11118
10/20/15 19:33:31 : taking a snapshot...
10/20/15 19:33:31 : ...snapshot complete
10/20/15 19:33:35 : PROC_FAMILY_GET_USAGE
10/20/15 19:33:35 : gathering usage data for family with root pid 11118
10/20/15 19:33:40 : PROC_FAMILY_GET_USAGE
10/20/15 19:33:40 : gathering usage data for family with root pid 11118
10/20/15 19:33:45 : PROC_FAMILY_GET_USAGE
10/20/15 19:33:45 : gathering usage data for family with root pid 11118
10/20/15 19:33:46 : taking a snapshot...
10/20/15 19:33:46 : ProcAPI: new boottime = 1445350682; old_boottime = 1445350682; /proc/stat boottime = 1445350682; /proc/uptime boottime = 1445350682
10/20/15 19:33:46 : ...snapshot complete
10/20/15 19:33:50 : PROC_FAMILY_GET_USAGE
10/20/15 19:33:50 : gathering usage data for family with root pid 11118
10/20/15 19:33:55 : PROC_FAMILY_GET_USAGE
10/20/15 19:33:55 : gathering usage data for family with root pid 11118
10/20/15 19:34:00 : PROC_FAMILY_GET_USAGE
10/20/15 19:34:00 : gathering usage data for family with root pid 11118
10/20/15 19:34:01 : taking a snapshot...
10/20/15 19:34:01 : ...snapshot complete
10/20/15 19:34:05 : PROC_FAMILY_GET_USAGE
10/20/15 19:34:05 : gathering usage data for family with root pid 11118
10/20/15 19:34:10 : PROC_FAMILY_GET_USAGE
10/20/15 19:34:10 : gathering usage data for family with root pid 11118
10/20/15 19:34:15 : PROC_FAMILY_GET_USAGE
10/20/15 19:34:15 : gathering usage data for family with root pid 11118
10/20/15 19:34:16 : taking a snapshot...
10/20/15 19:34:16 : ...snapshot complete
10/20/15 19:34:20 : PROC_FAMILY_GET_USAGE
10/20/15 19:34:20 : gathering usage data for family with root pid 11118
10/20/15 19:34:25 : PROC_FAMILY_GET_USAGE
10/20/15 19:34:25 : gathering usage data for family with root pid 11118
10/20/15 19:34:30 : PROC_FAMILY_GET_USAGE
10/20/15 19:34:30 : gathering usage data for family with root pid 11118
10/20/15 19:34:31 : taking a snapshot...
10/20/15 19:34:31 : ...snapshot complete
10/20/15 19:34:35 : PROC_FAMILY_GET_USAGE
10/20/15 19:34:35 : gathering usage data for family with root pid 11118
10/20/15 19:34:40 : PROC_FAMILY_GET_USAGE
10/20/15 19:34:40 : gathering usage data for family with root pid 11118
10/20/15 19:34:45 : PROC_FAMILY_GET_USAGE
10/20/15 19:34:45 : gathering usage data for family with root pid 11118
10/20/15 19:34:46 : taking a snapshot...
10/20/15 19:34:46 : ProcAPI: new boottime = 1445350682; old_boottime = 1445350682; /proc/stat boottime = 1445350682; /proc/uptime boottime = 1445350682
10/20/15 19:34:46 : ...snapshot complete
10/20/15 19:34:50 : PROC_FAMILY_GET_USAGE
10/20/15 19:34:50 : gathering usage data for family with root pid 11118
10/20/15 19:34:55 : PROC_FAMILY_GET_USAGE
10/20/15 19:34:55 : gathering usage data for family with root pid 11118
10/20/15 19:35:00 : PROC_FAMILY_GET_USAGE
10/20/15 19:35:00 : gathering usage data for family with root pid 11118
10/20/15 19:35:01 : taking a snapshot...
10/20/15 19:35:01 : ...snapshot complete
10/20/15 19:35:05 : PROC_FAMILY_GET_USAGE
10/20/15 19:35:05 : gathering usage data for family with root pid 11118
10/20/15 19:35:10 : PROC_FAMILY_GET_USAGE
10/20/15 19:35:10 : gathering usage data for family with root pid 11118
10/20/15 19:35:15 : PROC_FAMILY_GET_USAGE
10/20/15 19:35:15 : gathering usage data for family with root pid 11118
10/20/15 19:35:16 : taking a snapshot...
10/20/15 19:35:16 : ...snapshot complete
10/20/15 19:35:20 : PROC_FAMILY_GET_USAGE
10/20/15 19:35:20 : gathering usage data for family with root pid 11118
10/20/15 19:35:25 : PROC_FAMILY_GET_USAGE
10/20/15 19:35:25 : gathering usage data for family with root pid 11118
10/20/15 19:35:31 : PROC_FAMILY_GET_USAGE
10/20/15 19:35:31 : gathering usage data for family with root pid 11118
10/20/15 19:35:31 : taking a snapshot...
10/20/15 19:35:31 : ...snapshot complete
10/20/15 19:35:36 : PROC_FAMILY_GET_USAGE
10/20/15 19:35:36 : gathering usage data for family with root pid 11118
10/20/15 19:35:41 : PROC_FAMILY_GET_USAGE
10/20/15 19:35:41 : gathering usage data for family with root pid 11118
10/20/15 19:35:46 : taking a snapshot...
10/20/15 19:35:46 : ProcAPI: new boottime = 1445350682; old_boottime = 1445350682; /proc/stat boottime = 1445350682; /proc/uptime boottime = 1445350683
10/20/15 19:35:46 : ...snapshot complete
10/20/15 19:35:46 : PROC_FAMILY_GET_USAGE
10/20/15 19:35:46 : gathering usage data for family with root pid 11118
10/20/15 19:35:51 : PROC_FAMILY_GET_USAGE
10/20/15 19:35:51 : gathering usage data for family with root pid 11118
10/20/15 19:35:56 : PROC_FAMILY_GET_USAGE
10/20/15 19:35:56 : gathering usage data for family with root pid 11118
10/20/15 19:35:56 : PROC_FAMILY_GET_USAGE
10/20/15 19:35:56 : gathering usage data for family with root pid 11118
10/20/15 19:36:01 : taking a snapshot...
10/20/15 19:36:01 : ...snapshot complete
10/20/15 19:36:01 : PROC_FAMILY_GET_USAGE
10/20/15 19:36:01 : gathering usage data for family with root pid 11118
10/20/15 19:36:06 : PROC_FAMILY_GET_USAGE
10/20/15 19:36:06 : gathering usage data for family with root pid 11118
10/20/15 19:36:11 : PROC_FAMILY_GET_USAGE
10/20/15 19:36:11 : gathering usage data for family with root pid 11118
10/20/15 19:36:16 : taking a snapshot...
10/20/15 19:36:16 : ...snapshot complete
10/20/15 19:36:16 : PROC_FAMILY_GET_USAGE
10/20/15 19:36:16 : gathering usage data for family with root pid 11118
10/20/15 19:36:21 : PROC_FAMILY_GET_USAGE
10/20/15 19:36:21 : gathering usage data for family with root pid 11118
10/20/15 19:36:26 : PROC_FAMILY_GET_USAGE
10/20/15 19:36:26 : gathering usage data for family with root pid 11118
10/20/15 19:36:31 : taking a snapshot...
10/20/15 19:36:31 : ...snapshot complete
10/20/15 19:36:31 : PROC_FAMILY_GET_USAGE
10/20/15 19:36:31 : gathering usage data for family with root pid 11118
10/20/15 19:36:36 : PROC_FAMILY_GET_USAGE
10/20/15 19:36:36 : gathering usage data for family with root pid 11118
10/20/15 19:36:41 : PROC_FAMILY_GET_USAGE
10/20/15 19:36:41 : gathering usage data for family with root pid 11118
10/20/15 19:36:46 : PROC_FAMILY_GET_USAGE
10/20/15 19:36:46 : gathering usage data for family with root pid 11118
10/20/15 19:36:46 : taking a snapshot...
10/20/15 19:36:46 : ProcAPI: new boottime = 1445350682; old_boottime = 1445350682; /proc/stat boottime = 1445350682; /proc/uptime boottime = 1445350683
10/20/15 19:36:46 : ...snapshot complete
10/20/15 19:36:51 : PROC_FAMILY_GET_USAGE
10/20/15 19:36:51 : gathering usage data for family with root pid 11118
10/20/15 19:36:56 : PROC_FAMILY_GET_USAGE
10/20/15 19:36:56 : gathering usage data for family with root pid 11118
10/20/15 19:37:01 : taking a snapshot...
10/20/15 19:37:01 : ...snapshot complete
10/20/15 19:37:01 : PROC_FAMILY_GET_USAGE
10/20/15 19:37:01 : gathering usage data for family with root pid 11118
10/20/15 19:37:06 : PROC_FAMILY_GET_USAGE
10/20/15 19:37:06 : gathering usage data for family with root pid 11118
10/20/15 19:37:11 : PROC_FAMILY_GET_USAGE
10/20/15 19:37:11 : gathering usage data for family with root pid 11118
10/20/15 19:37:16 : taking a snapshot...
10/20/15 19:37:16 : ...snapshot complete
10/20/15 19:37:16 : PROC_FAMILY_GET_USAGE
10/20/15 19:37:16 : gathering usage data for family with root pid 11118
10/20/15 19:37:21 : PROC_FAMILY_GET_USAGE
10/20/15 19:37:21 : gathering usage data for family with root pid 11118
10/20/15 19:37:26 : PROC_FAMILY_GET_USAGE
10/20/15 19:37:26 : gathering usage data for family with root pid 11118
10/20/15 19:37:31 : taking a snapshot...
10/20/15 19:37:31 : ...snapshot complete
10/20/15 19:37:31 : PROC_FAMILY_GET_USAGE
10/20/15 19:37:31 : gathering usage data for family with root pid 11118
10/20/15 19:37:36 : PROC_FAMILY_GET_USAGE
10/20/15 19:37:36 : gathering usage data for family with root pid 11118
10/20/15 19:37:41 : PROC_FAMILY_GET_USAGE
10/20/15 19:37:41 : gathering usage data for family with root pid 11118
10/20/15 19:37:46 : taking a snapshot...
10/20/15 19:37:46 : ProcAPI: new boottime = 1445350682; old_boottime = 1445350682; /proc/stat boottime = 1445350682; /proc/uptime boottime = 1445350682
10/20/15 19:37:46 : ...snapshot complete
10/20/15 19:37:46 : PROC_FAMILY_GET_USAGE

Yeti
Joined: 29 May 15 · Posts: 147 · Credit: 2,842,484 · RAC: 0
Message 1272 - Posted: 20 Oct 2015, 18:55:17 UTC - in response to Message 1271.  

My machine has been sitting here for hours now and seems to be looping through something endlessly:

Thanks to Microsoft: they published a patch that forced my desktop to reboot, and now I'm back crunching.

ivan (Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist)
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,304,147 · RAC: 3,524
Message 1273 - Posted: 20 Oct 2015, 19:19:11 UTC - in response to Message 1269.  

Ok. Job 35 is running here and, apart from much downloading to start with, is running OK.

Probably re-synching changed files in cvmfs.

Seems reasonable. It seems to have downloaded about 500MB. The next one, 94, started (well, fairly) quickly. When it's done, I'll put this box back to bed.

Maybe I was a bit too eager; I shut down after seeing exit status 0 appear in dashboard, but it's been stuck in "post processing" for over half an hour now whilst many other jobs have come and gone. Is the post-processing step done on the volunteer machine? Dashboard shows "wnpostproc", which implies that it is; if so, I cut it off in its prime, so to speak.

As far as I understand it, post-processing is done on the server, given that before it's done the job output file says "waiting for postproc" and the postproc log says "queued for postproc". In any event, output for 35 showed up on the databridge at 1407 UTC, so Condor, which feeds Dashboard, was being slow to report.

m (Volunteer tester)
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0
Message 1274 - Posted: 20 Oct 2015, 19:46:49 UTC - in response to Message 1273.  
Last modified: 20 Oct 2015, 19:56:32 UTC


Seems reasonable. It seems to have downloaded about 500MB. The next one, 94, started (well, fairly) quickly. When it's done, I'll put this box back to bed.

Maybe I was a bit too eager; I shut down after seeing exit status 0 appear in dashboard, but it's been stuck in "post processing" for over half an hour now whilst many other jobs have come and gone. Is the post-processing step done on the volunteer machine? Dashboard shows "wnpostproc", which implies that it is; if so, I cut it off in its prime, so to speak.

As far as I understand it, post-processing is done on the server, given that before it's done the job output file says "waiting for postproc" and the postproc log says "queued for postproc". In any event, output for 35 showed up on the databridge at 1407 UTC, so Condor, which feeds Dashboard, was being slow to report.

Thanks, Ivan, but the job in question is 94 and is one of four which seem similarly stuck.
The others are 72, 104 and 152. I can't get to any of the job logs from dashboard - just times out, doesn't ask for any credentials.

ivan (Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist)
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,304,147 · RAC: 3,524
Message 1275 - Posted: 21 Oct 2015, 7:38:53 UTC - in response to Message 1274.  

Thanks, Ivan, but the job in question is 94 and is one of four which seem similarly stuck.
The others are 72, 104 and 152. I can't get to any of the job logs from dashboard - just times out, doesn't ask for any credentials.

Ah, sorry, my eyes aren't what they used to be. (Omnes: They used to be my ears!) Let's see; 94 returned a result at 0353 Wed, 72 at 0350, 104 at 0715, and 152 at 1611 Tues. I've not been able to get job logs from Dashboard either; they get served up from Condor -- its Web server may need tickling -- and I'd expect they come from the output files I see, which aren't available until after post-processing. Hmm, yes,
http://lcggwms02.gridpp.rl.ac.uk/mon/cms005/151020_120153:ireid_crab_CMS_at_Home_TTbar_Phedex14/job_out.94.0.txt
That's some sort of a redirect to my home directory. I'll ask Andrew to check it.
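
A quick way to see what that URL actually answers with, without following any redirect, is something like the sketch below; it only inspects the first response, so a hang until the timeout would mean the request isn't reaching the web server at all.

# Fetch only the status line and headers of the job_out URL quoted above,
# without following redirects.
import http.client

HOST = "lcggwms02.gridpp.rl.ac.uk"
PATH = ("/mon/cms005/151020_120153:ireid_crab_CMS_at_Home_TTbar_Phedex14"
        "/job_out.94.0.txt")

conn = http.client.HTTPConnection(HOST, 80, timeout=15)
try:
    conn.request("GET", PATH)
    resp = conn.getresponse()
    print(resp.status, resp.reason)
    location = resp.getheader("Location")
    if location:
        print("redirects to:", location)  # would confirm the home-directory redirect
finally:
    conn.close()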

m (Volunteer tester)
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0
Message 1276 - Posted: 21 Oct 2015, 9:19:06 UTC - in response to Message 1275.  

Thanks, Ivan, but the job in question is 94 and is one of four which seem similarly stuck.
The others are 72, 104 and 152. I can't get to any of the job logs from dashboard - just times out, doesn't ask for any credentials.

Ah, sorry, my eyes aren't what they used to be. (Omnes: They used to be my ears!) Let's see; 94 returned a result at 0353 Wed, 72 at 0350, 104 at 0715, and 152 at 1611 Tues. I've not been able to get job logs from Dashboard either; they get served up from Condor -- its Web server may need tickling -- and I'd expect they come from the output files I see, which aren't available until after post-processing. Hmm, yes,
http://lcggwms02.gridpp.rl.ac.uk/mon/cms005/151020_120153:ireid_crab_CMS_at_Home_TTbar_Phedex14/job_out.94.0.txt
That's some sort of a redirect to my home directory. I'll ask Andrew to check it.

Something very strange is going on... maybe the workings of a cache or cleanup script somewhere. Job 94 now shows as "finished", as do 72 and 104, whilst 152 is back to "running". No retries. The only job of mine is 94, and the start and finish times shown differ (and are wrong) by about +13h from those shown when it was in the "wnpostproc" state, which were correct. As I remember, job 72 has been similarly affected.
Maybe the times aren't supposed to refer to the actual processing, which the user doesn't care about, but to the availability of a result, which they do. BUT that only really applies to the finish time, so why is the start time altered as well?
Dashboard Strikes Again (must remember: no article, capital D, capital D, capital D...)

ivan (Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist)
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,304,147 · RAC: 3,524
Message 1277 - Posted: 21 Oct 2015, 9:59:32 UTC - in response to Message 1276.  

I long ago gave up trying to decipher the workings of Dashboard...

ivan (Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist)
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,304,147 · RAC: 3,524
Message 1278 - Posted: 21 Oct 2015, 10:52:00 UTC - in response to Message 1275.  

http://lcggwms02.gridpp.rl.ac.uk/mon/cms005/151020_120153:ireid_crab_CMS_at_Home_TTbar_Phedex14/job_out.94.0.txt
That's some sort of a redirect to my home directory. I'll ask Andrew to check it.

Turns out it's not easily activated -- port 80 doesn't get through their firewall for a start, so let's not pursue it at this time.

m (Volunteer tester)
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0
Message 1279 - Posted: 21 Oct 2015, 11:15:41 UTC - in response to Message 1278.  

OK, thanks.

Jim1348
Joined: 17 Aug 15 · Posts: 17 · Credit: 228,358 · RAC: 0
Message 1280 - Posted: 21 Oct 2015, 13:38:13 UTC
Last modified: 21 Oct 2015, 13:40:36 UTC

In running the new CMS-dev jobs on my i7-4790 (Win7 64-bit, ASRock Z97 motherboard), I am starting to see WHEA Event ID 19, which is an "Internal parity error".

This machine is not overclocked in any way, and I have never seen this error before in any circumstance.
However, I am now also running two vLHC jobs, which I hadn't done alongside the CMS-dev ones before, so it may be the combination that is triggering the error.

It may not really be an error, but more of a bug, since it has all the same symptoms as this: https://communities.vmware.com/thread/471348 (a rough decode of the MciStat value, after the event details below, also suggests it is only a corrected error).
At present it is occurring about every half hour to one and a half hours, and the processor ID varies. It has not caused any operational difficulties or crashes so far.

In case anyone is interested, the full event is:

Log Name: System
Source: Microsoft-Windows-WHEA-Logger
Date: 10/21/2015 7:59:49 AM
Event ID: 19
Task Category: None
Level: Warning
Keywords:
User: LOCAL SERVICE
Computer: i7-4790-PC
Description:
A corrected hardware error has occurred.

Reported by component: Processor Core
Error Source: Corrected Machine Check
Error Type: Internal parity error
Processor ID: 1

The details view of this entry contains further information.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
<Provider Name="Microsoft-Windows-WHEA-Logger" Guid="{C26C4F3C-3F66-4E99-8F8A-39405CFED220}" />
<EventID>19</EventID>
<Version>0</Version>
<Level>3</Level>
<Task>0</Task>
<Opcode>0</Opcode>
<Keywords>0x8000000000000000</Keywords>
<TimeCreated SystemTime="2015-10-21T11:59:49.190491900Z" />
<EventRecordID>17477</EventRecordID>
<Correlation ActivityID="{9635EFCA-D3B3-4861-B150-A061BC3C0D5E}" />
<Execution ProcessID="1520" ThreadID="2156" />
<Channel>System</Channel>
<Computer>i7-4790-PC</Computer>
<Security UserID="S-1-5-19" />
</System>
<EventData>
<Data Name="ErrorSource">1</Data>
<Data Name="ApicId">1</Data>
<Data Name="MCABank">0</Data>
<Data Name="MciStat">0x90000040000f0005</Data>
<Data Name="MciAddr">0x0</Data>
<Data Name="MciMisc">0x0</Data>
<Data Name="ErrorType">12</Data>
<Data Name="TransactionType">256</Data>
<Data Name="Participation">256</Data>
<Data Name="RequestType">256</Data>
<Data Name="MemorIO">256</Data>
<Data Name="MemHierarchyLvl">256</Data>
<Data Name="Timeout">256</Data>
<Data Name="OperationType">256</Data>
<Data Name="Channel">256</Data>
<Data Name="Length">864</Data>
<Data Name="RawData">435045521002FFFFFFFF0300020000000200000060030000303B0B00150A0F140000000000000000000000000000000000000000000000000000000000000000BDC407CF89B7184EB3C41F732CB57131B18BCE2DD7BD0E45B9AD9CF4EBD4F890A688C9ADD60BD10100000000000000000000000000000000000000000000000058010000C00000000102000001000000ADCC7698B447DB4BB65E16F193C4F3DB0000000000000000000000000000000002000000000000000000000000000000000000000000000018020000400000000102000000000000B0A03EDC44A19747B95B53FA242B6E1D0000000000000000000000000000000002000000000000000000000000000000000000000000000058020000080100000102000000000000011D1E8AF94257459C33565E5CC3F7E80000000000000000000000000000000002000000000000000000000000000000000000000000000057010000000000000002080000000000C30603000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000000000000000000000000000000000000000000000000000000000000000000003000000000000000100000000000000C306030000081001FFFBFA7FFFFBEBBF000000000000000000000000000000000000000000000000000000000000000001000000010000009E8324FCF70BD10101000000000000000000000000000000000000000000000005000F0040000090000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000</Data>
</EventData>
</Event>
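
For what it's worth, the MciStat value in the event data can be picked apart with the generic x86 machine-check status layout; the bit positions below are my reading of that layout rather than anything CMS- or VirtualBox-specific, so treat it as a sketch.

# Rough decode of the MciStat value from the WHEA event above, using the
# generic x86 MCi_STATUS bit layout (an assumption; nothing project-specific).
mci_stat = 0x90000040000f0005

flags = {
    "VAL (valid)": 63,
    "OVER (overflow)": 62,
    "UC (uncorrected)": 61,
    "EN (reporting enabled)": 60,
    "PCC (processor context corrupt)": 57,
}
for name, bit in flags.items():
    print(f"{name:32s}: {(mci_stat >> bit) & 1}")
print(f"MCA error code (bits 15:0): 0x{mci_stat & 0xFFFF:04x}")
# VAL and EN are set while UC and PCC are clear, i.e. the hardware corrected the
# error itself -- consistent with the "corrected hardware error" wording above,
# and with this being an annoyance rather than a real problem.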

Crystal Pellet (Volunteer tester)
Joined: 13 Feb 15 · Posts: 1188 · Credit: 861,475 · RAC: 15
Message 1322 - Posted: 24 Oct 2015, 11:24:50 UTC

Now that the last batch of 2,000 jobs from task 151020_120153:ireid_crab_CMS_at_Home_TTbar_Phedex14 is (almost) done, some figures:

Still running: 1, and only 11 failed.
In WNPostProc state: 2 (don't know why post-processing takes that long for those 2).

Success: 1,986

Of those successful jobs:

28 needed 3 attempts to finish and 249 succeeded after 2 attempts.
IMO that number of extra attempts is excessive.
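
Just to put a number on "excessive", some back-of-the-envelope arithmetic on the figures above (plain Python, nothing project-specific):

# Back-of-the-envelope retry overhead from the figures quoted above.
successes          = 1986
needed_two_tries   = 249   # one wasted attempt each
needed_three_tries = 28    # two wasted attempts each

wasted = needed_two_tries * 1 + needed_three_tries * 2
total_attempts = successes + wasted

print(f"wasted attempts: {wasted}")                                   # 305
print(f"attempts per successful job: {total_attempts / successes:.3f}")
print(f"share of jobs needing a retry: {(needed_two_tries + needed_three_tries) / successes:.1%}")
# Roughly 14% of the successful jobs needed at least one retry, and about 13% of
# all attempts spent on them were thrown away -- before counting the 11 outright failures.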

ivan (Volunteer moderator · Project administrator · Project developer · Project tester · Project scientist)
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,304,147 · RAC: 3,524
Message 1326 - Posted: 24 Oct 2015, 12:56:40 UTC - in response to Message 1322.  
Last modified: 24 Oct 2015, 12:58:45 UTC

Now that the last batch of 2,000 jobs from task 151020_120153:ireid_crab_CMS_at_Home_TTbar_Phedex14 is (almost) done, some figures:

Still running: 1, and only 11 failed.
In WNPostProc state: 2 (don't know why post-processing takes that long for those 2).

Success: 1,986

Of those successful jobs:

28 needed 3 attempts to finish and 249 succeeded after 2 attempts.
IMO that number of extra attempts is excessive.

I would agree. You just reminded me that I never got a reply from the chap with the excessive failures;
I'll have to try e-mail rather than PM.
If you check the failures and sort by IP, he had ~90 of them.

m (Volunteer tester)
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0
Message 1328 - Posted: 24 Oct 2015, 14:58:09 UTC - in response to Message 1322.  
Last modified: 24 Oct 2015, 15:04:04 UTC


Of those successful jobs:

28 needed 3 attempts to finish and 249 succeeded after 2 attempts.
IMO that number of extra attempts is excessive.

Has anybody carefully followed what happens to those jobs that are abandoned when the host is shut down, rebooted or whatever? Could they show up like this:-

"N/A / Error return without specification"? Many initial failures do. Maybe Condor puts the "error" bit in because some timeout or heartbeat fails; there certainly wouldn't be a "specification".

They all require at least one retry. At the moment there will probably be a higher rate of these events because people are "poking around", looking for problems and suchlike. Many hosts (mine...) don't run continuously, and that tends towards one extra attempt per host per day. IMO you can't reasonably require volunteers to run machines continuously, although I realise many do.