Message boards : CMS Application : 3h lost, why?

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2449 - Posted: 20 Mar 2016, 9:20:18 UTC

I looked at my last task.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=122137
The last job in the run finished at 06.34 local.
The VM closed at 09.32 local.
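
As a quick check of the "3h lost" in the thread title, here is a minimal sketch that works out the gap between the two local times quoted above. The helper name `idle_gap` and the `%H.%M` format string are illustrative assumptions, and it assumes both readings fall on the same day.

```python
from datetime import datetime, timedelta


def idle_gap(last_job_end: str, vm_closed: str, fmt: str = "%H.%M") -> timedelta:
    """Return the time between two same-day clock readings."""
    # Both strings are parsed onto the same default date, so the
    # subtraction only reflects the difference in time of day.
    return datetime.strptime(vm_closed, fmt) - datetime.strptime(last_job_end, fmt)


gap = idle_gap("06.34", "09.32")
print(gap)  # 2:58:00
```

So the VM sat for just under three hours after the last job finished before it closed.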

3/20/2016 3:09:25 AM | vLHCathome-dev | Sending scheduler request: To fetch work.
3/20/2016 3:09:25 AM | vLHCathome-dev | Requesting new tasks for CPU
3/20/2016 3:09:28 AM | vLHCathome-dev | Scheduler request completed: got 0 new tasks
3/20/2016 3:09:28 AM | vLHCathome-dev | No tasks sent
3/20/2016 3:09:28 AM | vLHCathome-dev | No tasks are available for CMS Simulation
3/20/2016 3:09:28 AM | vLHCathome-dev | This computer has reached a limit on tasks in progress
3/20/2016 5:53:57 AM | | Project communication failed: attempting access to reference site
3/20/2016 5:54:10 AM | | BOINC can't access Internet - check network connection or proxy configuration.
3/20/2016 7:00:51 AM | vLHCathome-dev | Sending scheduler request: To fetch work.
3/20/2016 7:00:51 AM | vLHCathome-dev | Requesting new tasks for CPU
3/20/2016 7:00:55 AM | vLHCathome-dev | Scheduler request completed: got 0 new tasks
3/20/2016 7:00:55 AM | vLHCathome-dev | No tasks sent
3/20/2016 7:00:55 AM | vLHCathome-dev | No tasks are available for CMS Simulation
3/20/2016 7:00:55 AM | vLHCathome-dev | This computer has reached a limit on tasks in progress
3/20/2016 9:32:36 AM | vLHCathome-dev | Message from task: 0
3/20/2016 9:32:37 AM | vLHCathome-dev | Computation for task wu_1458221027_1318_0 finished
3/20/2016 9:32:41 AM | vLHCathome-dev | Sending scheduler request: To report completed tasks.
3/20/2016 9:32:41 AM | vLHCathome-dev | Reporting 1 completed tasks
3/20/2016 9:32:41 AM | vLHCathome-dev | Requesting new tasks for CPU
3/20/2016 9:32:44 AM | vLHCathome-dev | Scheduler request completed: got 1 new tasks
PDW

Joined: 20 May 15
Posts: 217
Credit: 5,443,353
RAC: 11,848
Message 2451 - Posted: 20 Mar 2016, 11:11:20 UTC - in response to Message 2449.  

Looks like a local issue. I checked mine: it was happily processing all through that time period and has just completed and reported back with no problem.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2452 - Posted: 20 Mar 2016, 11:16:00 UTC - in response to Message 2451.  

It had an internet outage from about 05.50 to 06.00 local.
This was during the last job.
There is an issue with brief internet disconnects.
I mentioned this before, but as usual, no answer.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 42
Message 2456 - Posted: 20 Mar 2016, 22:43:20 UTC - in response to Message 2452.  

Please can you remind us of the issue?
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2459 - Posted: 21 Mar 2016, 7:53:51 UTC

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 42
Message 2464 - Posted: 21 Mar 2016, 10:10:58 UTC - in response to Message 2459.  
Last modified: 21 Mar 2016, 12:41:39 UTC

Working on it.


Edit: Done
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2488 - Posted: 21 Mar 2016, 19:19:34 UTC - in response to Message 2464.  

No change. I suggest setting Claim_Worklife to the default (1200 sec).


I have tested a 10 min disconnect from the internet.

It failed; the connection was not reestablished. The job was finished and uploaded, but it is stuck in WNPostproc (where it is going to time out and be rerun by a different host).

03/18/16 16:35:48 (pid:7659) condor_write(): Socket closed when trying to write 192 bytes to <130.246.180.120:9818>, fd is 8
03/18/16 16:35:48 (pid:7659) Buf::write(): condor_write() failed
03/18/16 16:35:48 (pid:7659) Failed to send job exit status to shadow
03/18/16 16:35:48 (pid:7659) JobExit() failed, waiting for job lease to expire or for a reconnect attempt
03/18/16 16:35:48 (pid:7659) Returning from CStarter::JobReaper()
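
For anyone wanting to check their own log for the same symptom, here is a minimal sketch that scans log lines for the "Failed to send job exit status to shadow" message. The function name `find_exit_status_failures` and the regular expression are hypothetical; the line layout (`MM/DD/YY HH:MM:SS (pid:N) message`) is inferred only from the five lines quoted above and may not cover every line in a real log.

```python
import re

# Assumed layout, taken from the quoted lines: timestamp, "(pid:N)", message.
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) \(pid:(\d+)\) (.*)$")

MARKER = "Failed to send job exit status to shadow"


def find_exit_status_failures(lines):
    """Return (timestamp, pid) for every line reporting the failed exit-status send."""
    hits = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and MARKER in match.group(3):
            hits.append((match.group(1), int(match.group(2))))
    return hits


sample = [
    "03/18/16 16:35:48 (pid:7659) Buf::write(): condor_write() failed",
    "03/18/16 16:35:48 (pid:7659) Failed to send job exit status to shadow",
    "03/18/16 16:35:48 (pid:7659) JobExit() failed, waiting for job lease to expire or for a reconnect attempt",
]
print(find_exit_status_failures(sample))  # [('03/18/16 16:35:48', 7659)]
```

Comparing the timestamps it returns against the times of a known disconnect shows whether the job exit coincided with the outage.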

Previously, recovering from such a disconnect was possible; now it is not.
Getting a new job is also a problem. Even 2h after the end of the disconnect, no new work has been started.



©2024 CERN