Message boards : News : No new jobs
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

I just checked your record of jobs and tasks. Your tasks are taking 5-7 days to complete! In fact, your last task ran from 18 Oct 2015, 11:25:39 UTC to 23 Oct 2015, 22:17:34 UTC but only successfully returned one job, processed between Oct 20 19:39:15 GMT and Oct 20 20:58:24 GMT. Do you leave your computer switched off most of the time? This sort of processing is not well suited to part-time computing, as network sockets, etc. get lost when the task is disturbed.
Joined: 15 Feb 15 · Posts: 10 · Credit: 16,387 · RAC: 0

Yes, my computer is not running 24/7, but it is usually running for more than 2 or 3 hours a day. So it is good to know that my way of participating in BOINC does not suit this project. I will choose another project then.

Marcus
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

> Yes, my computer is not running 24/7, but it is usually running for more than 2 or 3 hours a day. So it is good to know that my way of participating in BOINC does not suit this project. I will choose another project then.

OK, Marcus. Thanks for trying.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

Job queue is nearly empty!
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

> Job queue is nearly empty!

Yep, just noticed. It would have lasted until tomorrow if Laurence et al. hadn't fixed that pesky bug!
Joined: 29 May 15 · Posts: 147 · Credit: 2,842,484 · RAC: 0

What does this mean?

10/28/15 18:27:33 (pid:22064) Create_Process succeeded, pid=22068
10/28/15 19:07:32 (pid:22064) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9623
10/28/15 19:07:32 (pid:22064) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
10/28/15 19:08:33 (pid:22064) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#108698
10/28/15 19:12:45 (pid:22064) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:12:45 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:17:45 (pid:22064) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:17:45 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:22:46 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:22:46 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:27:46 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:27:46 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:32:46 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:32:46 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:37:47 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:37:47 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:42:47 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:42:47 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:47:48 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:47:48 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:52:48 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:52:48 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:55:52 (pid:22064) Process exited, pid=22068, status=0
10/28/15 19:55:52 (pid:22064) condor_write(): Socket closed when trying to write 617 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:55:52 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:55:53 (pid:22064) condor_write(): Socket closed when trying to write 190 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:55:53 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:55:53 (pid:22064) Failed to send job exit status to shadow
10/28/15 19:55:53 (pid:22064) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
10/28/15 19:55:53 (pid:22064) Returning from CStarter::JobReaper()
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

> What does this mean?

I'll ask Andrew to check his logs.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

You don't happen to know what the external IP address was for the machine at the time? Our database shows several IPs for your machines, spread over two providers.

[Added] From a later message, one theory is that the machine's IP address changed. [/Added]
Joined: 13 Feb 15 · Posts: 1188 · Credit: 862,257 · RAC: 13

> You don't happen to know what the external IP address was for the machine at the time? Our database shows several IPs for your machines, spread over two providers.

The question wasn't meant for me, but perhaps I'm allowed to interrupt. Although I have a static IP, I regularly see job numbers sent to my machine in the first attempt with a WNIp that doesn't belong to me.
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0

> You don't happen to know what the external IP address was for the machine at the time? Our database shows several IPs for your machines, spread over two providers.

If I may add to this: I see this too, although not often. I think that these jobs have been started but "abandoned" by the host whose public IP you see. It's the other side of this. We need to treat IPs with care.
Joined: 13 Feb 15 · Posts: 1188 · Credit: 862,257 · RAC: 13

> If I may add to this: I see this too, although not often. I think that these ...

In that case, one would expect to see that I'm the second attempt. I agree that showing a user's (static) IP to everyone on the web could be a reason for me to detach this project from my project list.
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0

No, it doesn't show up as a retry. Assuming you complete the job successfully, the final Dashboard details show the original host IP with your timestamps. Perhaps we should make this a separate thread.

Edit: and, yes, we are showing our IPs to the world+dog.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

Still no resolution to the current stage-out failures, so I'll not be submitting any new jobs until there is an explanation. More news tomorrow, I hope.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

It looks like the jobs are succeeding but are being resubmitted, and the failures come when the second copy of a job tries to write a result file that already exists. I'm hoping the experts can sort it out...
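To picture the failure mode described above, here is a minimal sketch of a stage-out step that refuses to overwrite an existing result file; a resubmitted copy of an already-successful job would fail exactly there. The function name and file handling are invented for illustration and are not the project's actual stage-out code.

```python
# Sketch only: a toy illustration of the suspected stage-out collision,
# not the real CMS/WMAgent stage-out code. Names here are invented.
import shutil
from pathlib import Path

def stage_out(local_result: Path, storage_dir: Path) -> None:
    """Copy a finished job's result file to storage, refusing to overwrite."""
    destination = storage_dir / local_result.name
    if destination.exists():
        # A resubmitted copy of an already-successful job lands here:
        # the first copy's output is already in place, so this one fails.
        raise FileExistsError(f"stage-out target already exists: {destination}")
    shutil.copy2(local_result, destination)
```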
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

OK, good news. After several days of being unable to submit jobs, I finally got a batch through this morning. These are shorter jobs (25 events) while I check to see if we still have that mysterious rash of failures from last week. Thanks to the CERN crew for finding and fixing several problems.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

> OK, good news. After several days of being unable to submit jobs, I finally got a batch through this morning. These are shorter jobs (25 events) while I check to see if we still have that mysterious rash of failures from last week.

I'm not particularly happy with the failure rate in the Jobs graphs. I suspect we still have the post-processing problem. I'll know better when I analyse the logs tomorrow. I'll probably continue for a while with these short jobs, to minimise time lost in possible unnecessary retries.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

We are starting to run low on jobs.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

> We are starting to run low on jobs.

Yes, I know. I'll submit more later. The failure rate has been high over the last few days, due in part to one particular machine which had a strange error. I had the owner abort the task and start a new one -- same problem. Then I had him reset the project, and since then (around 0000 GMT) there's only been one result from it, a failure, but just one in 12 hours rather than the previous three or four per hour. There must be another as well, as the transient failure rates are still high. We really need to abort tasks if there is an error return in a job, if only to slow down the error rate!

[Edit] Have heard back from the owner -- newly installed security software was blocking access to the "frontier" (conditions database) server, so the jobs couldn't retrieve data on the detector's configuration. Crossed fingers... (Error code 66 for future reference.) [/Edit]
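As a rough aid for anyone hitting the same symptom, the sketch below shows the kind of simple connectivity test that would reveal security software blocking the conditions-database ("Frontier") server. The hostname and port are placeholders, not the project's real endpoint.

```python
# Sketch only: a quick connectivity check a volunteer could adapt to see
# whether local security software is blocking the conditions-database
# ("Frontier") server. Host and port below are placeholders.
import socket

FRONTIER_HOST = "frontier.example.org"  # placeholder hostname
FRONTIER_PORT = 8000                    # placeholder port

def frontier_reachable(host: str = FRONTIER_HOST, port: int = FRONTIER_PORT,
                       timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the given host/port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("Frontier reachable:", frontier_reachable())
```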
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

Thanks ivan. It would be nice to have a simple way to see how many successful/failed jobs a host actually has; tracing it on Dashboard is somewhat cumbersome. I am running 96-hour CMS tasks, as each job takes about 7-8 hours.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

I agree, but I wouldn't put it as a high priority just yet. We're going to need a brainstorming session soon, in the New Year now I guess, to thrash out a few improvements to the project.