Message boards : News : No new jobs
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

I just checked your record of jobs and tasks. Your tasks are taking 5-7 days to complete! In fact, your last task ran from 18 Oct 2015, 11:25:39 UTC to 23 Oct 2015, 22:17:34 UTC but only successfully returned one job, processed between Oct 20 19:39:15 GMT and Oct 20 20:58:24 GMT. Do you leave your computer switched off most of the time? This sort of processing is not well suited to part-time computing, as network sockets, etc. get lost when the task is disturbed.
Joined: 15 Feb 15 · Posts: 10 · Credit: 16,387 · RAC: 0

Yes, my computer is not running 24/7, but it is usually running for more than 2 or 3 hours a day. So it is good to know that my way of participating in BOINC does not suit this project. I will choose another project then.

Marcus
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

> Yes, my computer is not running 24/7, but it is usually running for more than 2 or 3 hours a day. So it is good to know that my way of participating in BOINC does not suit this project. I will choose another project then.

OK, Marcus. Thanks for trying.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

Job queue is nearly empty!
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

> Job queue is nearly empty!

Yep, just noticed. It would have lasted until tomorrow if Laurence et al. hadn't fixed that pesky bug!
Joined: 29 May 15 · Posts: 147 · Credit: 2,842,484 · RAC: 0

What does this mean?

10/28/15 18:27:33 (pid:22064) Create_Process succeeded, pid=22068
10/28/15 19:07:32 (pid:22064) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9623
10/28/15 19:07:32 (pid:22064) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
10/28/15 19:08:33 (pid:22064) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#108698
10/28/15 19:12:45 (pid:22064) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:12:45 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:17:45 (pid:22064) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:17:45 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:22:46 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:22:46 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:27:46 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:27:46 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:32:46 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:32:46 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:37:47 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:37:47 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:42:47 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:42:47 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:47:48 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:47:48 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:52:48 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:52:48 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:55:52 (pid:22064) Process exited, pid=22068, status=0
10/28/15 19:55:52 (pid:22064) condor_write(): Socket closed when trying to write 617 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:55:52 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:55:53 (pid:22064) condor_write(): Socket closed when trying to write 190 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:55:53 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:55:53 (pid:22064) Failed to send job exit status to shadow
10/28/15 19:55:53 (pid:22064) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
10/28/15 19:55:53 (pid:22064) Returning from CStarter::JobReaper()
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

> What does this mean?

I'll ask Andrew to check his logs.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

You don't happen to know what the external IP address was for the machine at the time? Our database shows several IPs for your machines, spread over two providers.

[Added] From a later message, one theory is that the machine's IP address changed. [/Added]
Joined: 13 Feb 15 · Posts: 1188 · Credit: 862,257 · RAC: 13

> You don't happen to know what the external IP address was for the machine at the time? Our database shows several IPs for your machines, spread over two providers.

The question wasn't meant for me, but perhaps I'm allowed to interrupt. Although I have a static IP, I regularly see job numbers sent to my machine in the first attempt with a WNIp that doesn't belong to me.
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0

> You don't happen to know what the external IP address was for the machine at the time? Our database shows several IPs for your machines, spread over two providers.

If I may add to this: I see this too, although not often. I think that these jobs have been started but "abandoned" by the host whose public IP you see. It's the other side of this. We need to treat IPs with care.
Joined: 13 Feb 15 · Posts: 1188 · Credit: 862,257 · RAC: 13

> If I may add to this: I see this too, although not often. I think that these ...

In that case, one would expect to see that I'm the second attempt. I agree that showing a user's (static) IP to everyone on the web could be a reason for me to detach this project from my project list.
Joined: 20 Mar 15 · Posts: 243 · Credit: 886,442 · RAC: 0

No, it doesn't show up as a retry. Assuming you complete the job successfully, the final Dashboard details show the original host IP with your timestamps. Perhaps we should make this a separate thread.

Edit: and, yes, we are showing our IPs to the world+dog.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

Still no resolution to the current stage-out failures, so I'll not be submitting any new jobs until there is an explanation. More news tomorrow, I hope.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

It looks like the jobs are succeeding but are being resubmitted, and the failures come when the second copy of a job tries to write a result file that already exists. I'm hoping the experts can sort it out...
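To picture the failure mode described above, here is a minimal sketch of a stage-out step that refuses to overwrite an existing result file; a resubmitted copy of an already-successful job would fail exactly there. The function name and file handling are invented for illustration and are not the project's actual stage-out code.

```python
# Sketch only: a toy illustration of the suspected stage-out collision,
# not the real CMS/WMAgent stage-out code. Names here are invented.
import shutil
from pathlib import Path

def stage_out(local_result: Path, storage_dir: Path) -> None:
    """Copy a finished job's result file to storage, refusing to overwrite."""
    destination = storage_dir / local_result.name
    if destination.exists():
        # A resubmitted copy of an already-successful job lands here:
        # the first copy's output is already in place, so this one fails.
        raise FileExistsError(f"stage-out target already exists: {destination}")
    shutil.copy2(local_result, destination)
```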
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

OK, good news. After several days of being unable to submit jobs, I finally got a batch through this morning. These are shorter jobs (25 events) while I check to see if we still have that mysterious rash of failures from last week. Thanks to the CERN crew for finding and fixing several problems.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

> OK, good news. After several days of being unable to submit jobs, I finally got a batch through this morning. These are shorter jobs (25 events) while I check to see if we still have that mysterious rash of failures from last week.

I'm not particularly happy with the failure rate in the Jobs graphs. I suspect we still have the post-processing problem. I'll know better when I analyse the logs tomorrow. I'll probably continue for a while with these short jobs, to minimise time lost in possible unnecessary retries.
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

We are starting to run low on jobs.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

> We are starting to run low on jobs.

Yes, I know. I'll submit more later. The failure rate has been high over the last few days, due in part to one particular machine which had a strange error. I had the owner abort the task and start a new one -- same problem. Then I had him reset the project, and since then (around 0000 GMT) there's only been one result from it, a failure, but just one in 12 hours rather than the previous three or four per hour. There must be another as well, as the transient failure rates are still high. We really need to abort tasks if there is an error return in a job, if only to slow down the error rate!

[Edit] Have heard back from the owner -- newly installed security software was blocking access to the "frontier" (conditions database) server, so the jobs couldn't retrieve data on the detector's configuration. Crossed fingers... (Error code 66 for future reference.) [/Edit]
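As a rough aid for anyone hitting the same symptom, the sketch below shows the kind of simple connectivity test that would reveal security software blocking the conditions-database ("Frontier") server. The hostname and port are placeholders, not the project's real endpoint.

```python
# Sketch only: a quick connectivity check a volunteer could adapt to see
# whether local security software is blocking the conditions-database
# ("Frontier") server. Host and port below are placeholders.
import socket

FRONTIER_HOST = "frontier.example.org"  # placeholder hostname
FRONTIER_PORT = 8000                    # placeholder port

def frontier_reachable(host: str = FRONTIER_HOST, port: int = FRONTIER_PORT,
                       timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to the given host/port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    print("Frontier reachable:", frontier_reachable())
```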
Joined: 16 Aug 15 · Posts: 966 · Credit: 1,211,816 · RAC: 0

Thanks ivan. It would be nice to have a simple way to see how many successful/failed jobs a host actually has; tracing it on Dashboard is somewhat cumbersome. I am running 96-hour CMS tasks, as each job takes about 7-8 hours.
Joined: 20 Jan 15 · Posts: 1139 · Credit: 8,310,612 · RAC: 62

I agree, but I wouldn't put it as a high priority just yet. We're going to need a brainstorming session soon, in the New Year now I guess, to thrash out a few improvements to the project.