Message boards : News : No new jobs

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 75
Message 1324 - Posted: 24 Oct 2015, 12:46:02 UTC - in response to Message 1321.  

I just checked your record of jobs and tasks. Your tasks are taking 5-7 days to complete! In fact your last task ran from 18 Oct 2015, 11:25:39 UTC to 23 Oct 2015, 22:17:34 UTC but only successfully returned one job, processed between Oct 20 19:39:15 GMT and Oct 20 20:58:24 GMT. Do you leave your computer switched off most of the time? This sort of processing is not well suited to part-time computing, as network sockets, etc. get lost when it is interrupted.
newman

Joined: 15 Feb 15
Posts: 10
Credit: 16,387
RAC: 0
Message 1325 - Posted: 24 Oct 2015, 12:54:03 UTC - in response to Message 1324.  

Yes, my computer is not running 24/7, but it certainly runs more than 2 or 3 hours a day most of the time. So it is good to hear that my way of participating in BOINC does not suit this project. I will choose another project then.


Marcus
ivan
Message 1327 - Posted: 24 Oct 2015, 13:00:16 UTC - in response to Message 1325.  

OK, Marcus. Thanks for trying.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1357 - Posted: 28 Oct 2015, 16:36:06 UTC

Job queue is nearly empty!
ivan
Message 1358 - Posted: 28 Oct 2015, 16:38:42 UTC - in response to Message 1357.  

Yep, just noticed. It would have lasted until tomorrow if Laurence et al. hadn't fixed that pesky bug!
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 1361 - Posted: 28 Oct 2015, 19:19:00 UTC

What does this mean?

10/28/15 18:27:33 (pid:22064) Create_Process succeeded, pid=22068
10/28/15 19:07:32 (pid:22064) CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9623
10/28/15 19:07:32 (pid:22064) CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
10/28/15 19:08:33 (pid:22064) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#108698
10/28/15 19:12:45 (pid:22064) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:12:45 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:17:45 (pid:22064) condor_write(): Socket closed when trying to write 562 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:17:45 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:22:46 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:22:46 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:27:46 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:27:46 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:32:46 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:32:46 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:37:47 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:37:47 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:42:47 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:42:47 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:47:48 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:47:48 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:52:48 (pid:22064) condor_write(): Socket closed when trying to write 563 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:52:48 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:55:52 (pid:22064) Process exited, pid=22068, status=0
10/28/15 19:55:52 (pid:22064) condor_write(): Socket closed when trying to write 617 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:55:52 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:55:53 (pid:22064) condor_write(): Socket closed when trying to write 190 bytes to <130.246.180.120:9818>, fd is 11
10/28/15 19:55:53 (pid:22064) Buf::write(): condor_write() failed
10/28/15 19:55:53 (pid:22064) Failed to send job exit status to shadow
10/28/15 19:55:53 (pid:22064) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
10/28/15 19:55:53 (pid:22064) Returning from CStarter::JobReaper()
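
For anyone wanting to summarise a log like this rather than read it line by line, here is a minimal Python sketch. It assumes only the line layout visible in the excerpt (date, time, pid, message); the function name `summarise_log` and the returned keys are made up for illustration and are not part of HTCondor or of the project's tooling.

```python
import re
from datetime import datetime

# Layout assumed from the excerpt: "MM/DD/YY HH:MM:SS (pid:N) message"
LINE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")


def summarise_log(text):
    """Count socket-closed errors and time the gap before the job exited."""
    socket_closed = 0
    first_failure = None
    job_exit = None
    for raw in text.splitlines():
        match = LINE.match(raw.strip())
        if match is None:
            continue  # skip anything that is not a log line
        stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
        message = match.group(3)
        if "Socket closed" in message:
            socket_closed += 1
            if first_failure is None:
                first_failure = stamp
        elif message.startswith("Process exited") and job_exit is None:
            job_exit = stamp
    gap = None
    if first_failure is not None and job_exit is not None:
        # Seconds between the first failed write and the job process exiting
        gap = (job_exit - first_failure).total_seconds()
    return {"socket_closed": socket_closed, "seconds_before_exit": gap}
```

Run on the excerpt above, this should count 11 "Socket closed" lines and a gap of 2587 seconds between the first failed write (19:12:45) and the process exit (19:55:52). Going only by the messages themselves, the job process finished with status 0 but its exit status could not be sent back.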
ivan
Message 1363 - Posted: 28 Oct 2015, 19:55:48 UTC - in response to Message 1361.  

I'll ask Andrew to check his logs.
ivan
Message 1366 - Posted: 29 Oct 2015, 10:06:51 UTC - in response to Message 1361.  
Last modified: 29 Oct 2015, 10:11:53 UTC

You don't happen to know what the external IP address was for the machine at the time? Our database shows several IPs for your machines, spread over two providers.

[Added] From a later message, one theory is having a changed IP address. [/Added]
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 15
Message 1371 - Posted: 29 Oct 2015, 10:55:28 UTC - in response to Message 1366.  

The question was not meant for me, but perhaps I may interject.
Although I have a static IP, I regularly see job numbers sent to my machine on the 1st attempt with a WNIp that does not belong to me.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 0
Message 1372 - Posted: 29 Oct 2015, 13:13:45 UTC - in response to Message 1371.  
Last modified: 29 Oct 2015, 13:17:23 UTC

If I may add to this: I see this too, although not often. I think that these jobs have been started but "abandoned" by the host whose public IP you see.
It's the other side of this. We need to treat IPs with care.
Crystal Pellet
Message 1374 - Posted: 29 Oct 2015, 13:33:42 UTC - in response to Message 1372.  

In that case one would expect to see that I'm the 2nd attempt.

I agree that showing a user's (static) IP to everyone on the web could be a reason for me to detach this project from my project list.
m
Message 1375 - Posted: 29 Oct 2015, 13:37:12 UTC - in response to Message 1374.  
Last modified: 29 Oct 2015, 14:04:59 UTC

No, it doesn't show up as a retry. Assuming you complete the job successfully,
the final Dashboard details show the original host IP with your timestamps.
Perhaps we should make this a separate thread.

Edit: and, yes, we are showing our IPs to the world+dog.
ivan
Message 1437 - Posted: 8 Nov 2015, 16:01:05 UTC

Still no resolution to the current stage-out failures so I'll not be submitting any new jobs until there is an explanation. More news tomorrow, I hope.
ivan
Message 1438 - Posted: 9 Nov 2015, 12:48:38 UTC - in response to Message 1437.  

It's looking like the jobs are succeeding, but are resubmitted, and the failures come when the second job tries to write a result file that already exists. I'm hoping the experts can sort it out...
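
To illustrate the failure mode described here, the following Python sketch shows one way a stage-out step could tolerate a repeated job. This is not the project's actual stage-out code: the function names and return values are hypothetical, and it assumes a resubmitted job produces a byte-identical result file, which may not hold in practice.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def stage_out(result: Path, destination: Path) -> str:
    """Copy result to destination, tolerating a repeat of the same job.

    Returns "written" for a fresh copy and "already-present" when an
    identical file is already there (hypothetical status strings).
    Raises FileExistsError if a different file occupies the destination.
    """
    if destination.exists():
        if file_digest(destination) == file_digest(result):
            return "already-present"  # duplicate job: treat as success
        raise FileExistsError(f"{destination} exists with different content")
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(result, destination)
    return "written"
```

With a check like this, the second copy of a job would report "already-present" instead of failing, while a genuinely conflicting file would still raise an error.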
ivan
Message 1439 - Posted: 12 Nov 2015, 12:30:59 UTC - in response to Message 1438.  

OK, good news. After several days of being unable to submit jobs, I finally got a batch through this morning. These are shorter jobs (25 events) while I check to see if we still have that mysterious rash of failures from last week.

Thanks to the CERN crew for finding and fixing several problems.
ivan
Message 1446 - Posted: 12 Nov 2015, 21:56:26 UTC - in response to Message 1439.  

I'm not particularly happy with the failure rate in the Jobs graphs. I suspect we still have the post-processing problem. I'll know better when I analyse the logs tomorrow. I'll probably continue for a while with these short jobs, to minimise time lost in possible unnecessary retries.
Rasputin42
Message 1507 - Posted: 13 Dec 2015, 12:12:01 UTC

We are starting to run low on jobs.
ivan
Message 1508 - Posted: 13 Dec 2015, 12:29:50 UTC - in response to Message 1507.  
Last modified: 13 Dec 2015, 12:48:13 UTC

Yes, I know. I'll submit more later. The failure rate has been high over the last few days, due in part to one particular machine which had a strange error. I had the owner abort the task and start a new one -- same problem. Then I had him reset the project and since then (around 0000 GMT) there's only been one result from it, a failure, but just one in 12 hours rather than the previous three or four per hour. There must be another problem machine as well, as the transient failure rates are still high. We really need to abort tasks if there is an error return in a job, if only to slow down the error rate!

[Edit] Have heard back from the owner -- newly installed security software was blocking access to the "frontier" (conditions database) server so the jobs couldn't retrieve data on the detector's configuration. Crossed fingers... (Error code 66 for future reference.) [/Edit]
Rasputin42
Message 1509 - Posted: 13 Dec 2015, 12:50:16 UTC

Thanks ivan,
It would be nice to have a simple way to see how many successful/failed jobs a host actually has.
Tracing it on the Dashboard is somewhat cumbersome.
I am running 96-hour CMS tasks, as each job takes about 7-8h.
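
As a rough idea of what such a per-host summary might look like, here is a small Python sketch. It assumes job records are available as (host, exit code) pairs with 0 meaning success; that record format and the `summarise` helper are hypothetical, not something the Dashboard is known to export.

```python
from collections import Counter, defaultdict


def summarise(records):
    """Tally jobs per host.

    records: iterable of (host, exit_code) pairs, exit code 0 = success
    (an assumed format). Returns {host: (successful, failed)}.
    """
    tally = defaultdict(Counter)
    for host, exit_code in records:
        tally[host]["ok" if exit_code == 0 else "failed"] += 1
    return {host: (c["ok"], c["failed"]) for host, c in tally.items()}


# Made-up sample data; host names are placeholders.
jobs = [("host-a", 0), ("host-b", 66), ("host-b", 66), ("host-a", 0), ("host-b", 0)]
print(summarise(jobs))  # {'host-a': (2, 0), 'host-b': (1, 2)}
```

A table like this would make a single misbehaving machine stand out quickly.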
ivan
Message 1510 - Posted: 13 Dec 2015, 15:50:34 UTC - in response to Message 1509.  

I agree, but I wouldn't put it as a high priority just yet.
We're going to need a barnstorming session soon, in the New Year now I guess, to thrash out a few improvements to the project.


©2024 CERN