Message boards : LHCb Application : Errors
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4458 - Posted: 5 Dec 2016, 15:21:18 UTC

I am getting these errors:



12/05/16 16:12:01 (pid:4505) ReliSock::put_file_with_permissions(): Failed to stat file '/var/lib/condor/execute/dir_4505/pilot.out': No such file or directory (errno: 2, si_error: 1)
12/05/16 16:12:02 (pid:4505) DoUpload: (Condor error code 13, subcode 2) STARTER at 10.0.2.15 failed to send file(s) to <188.184.94.254:9618>: error reading from /var/lib/condor/execute/dir_4505/pilot.out: (errno 2) No such file or directory; SHADOW failed to receive file(s) from <95.222.151.3:52834>
12/05/16 16:12:02 (pid:4505) JICShadow::notifyJobTermination(): Sending mock terminate event.
12/05/16 16:12:02 (pid:4505) JIC::transferOutput() failed, waiting for job lease to expire or for a reconnect attempt
12/05/16 16:12:02 (pid:4505) Returning from CStarter::JobReaper()
12/05/16 16:12:02 (pid:4505) Got SIGQUIT. Performing fast shutdown.
12/05/16 16:12:02 (pid:4505) ShutdownFast all jobs.
12/05/16 16:12:02 (pid:4505) Lost connection to shadow, waiting 86300 secs for reconnect
12/05/16 16:12:02 (pid:4505) Failed to send job exit status to shadow
12/05/16 16:12:02 (pid:4505) **** condor_starter (condor_STARTER) pid 4505 EXITING WITH STATUS 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 4459 - Posted: 5 Dec 2016, 16:10:56 UTC - in response to Message 4458.  

Thanks for reporting this. We have noticed these errors too and are investigating.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4463 - Posted: 6 Dec 2016, 11:33:54 UTC

It appears to be working again, as of about an hour ago.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 4464 - Posted: 6 Dec 2016, 12:12:38 UTC - in response to Message 4463.  

Yes, we have been having some trouble with the LHCb submission recently. On top of that, it looks like an LHCb software version was bumped and the application was removed from CVMFS. The error you saw occurred because the pilots failed, so their output (pilot.out) was not available for upload.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4466 - Posted: 6 Dec 2016, 18:04:48 UTC

Tasks are failing after about 3.5 h with the heartbeat reported missing.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=291697
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 4470 - Posted: 6 Dec 2016, 22:26:13 UTC - in response to Message 4466.  
Last modified: 6 Dec 2016, 22:26:56 UTC

There is this interesting post on the topic. Once the consolidation is stable, we can focus on things like this.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4471 - Posted: 7 Dec 2016, 12:37:02 UTC

How can it decide after 13 h that the heartbeat is missing?

The only way to get a valid result is to shut a task down manually before it reaches the end.


http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=291794
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 4478 - Posted: 12 Dec 2016, 8:36:35 UTC - in response to Message 4471.  

The VM freezes. This could be related to the post I linked earlier. If the system is loaded so heavily that the VM's internal clock slows down, the heartbeat could slow to the point where the external monitor thinks the VM is dead.
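
To put rough numbers on the slow-clock idea, here is a minimal sketch. It assumes the guest writes a heartbeat on a fixed interval measured by its own clock and that the host declares the VM dead once the gap exceeds a fixed timeout. The function names and the values 60 s and 600 s are hypothetical and are not taken from the actual wrapper or monitor.

```python
def host_gap_seconds(guest_interval_s: float, clock_rate: float) -> float:
    """Host-side seconds between heartbeats.

    clock_rate is how fast the guest clock runs relative to the host
    (1.0 = real time, 0.5 = half speed). Hypothetical model only.
    """
    if clock_rate <= 0:
        # A frozen guest clock never produces another heartbeat.
        return float("inf")
    return guest_interval_s / clock_rate


def looks_dead(guest_interval_s: float, clock_rate: float, timeout_s: float) -> bool:
    """True if the host-side gap between heartbeats exceeds the timeout."""
    return host_gap_seconds(guest_interval_s, clock_rate) > timeout_s


if __name__ == "__main__":
    # Assumed values: 60 s guest interval, 600 s host timeout.
    for rate in (1.0, 0.5, 0.05):
        gap = host_gap_seconds(60, rate)
        print(f"rate={rate}: gap={gap:.0f} s, dead={looks_dead(60, rate, 600)}")
```

Under these assumed numbers, a guest clock running at 5% of real time stretches a 60 s heartbeat to roughly 1200 s on the host, which is past a 600 s timeout even though the VM is still making progress.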
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4480 - Posted: 12 Dec 2016, 12:33:23 UTC - in response to Message 4478.  

Would it not be better to check the heartbeat file for ANY change?
Even if the file is not up to date, a change since the last check should be sufficient to declare that the VM is still alive.
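
A rough sketch of what such a change-based check could look like, purely as an illustration. It assumes the monitor can stat the heartbeat file from outside the VM. The class name, the `max_unchanged_checks` parameter and the use of modification time plus size as the "signature" are my own assumptions, not how the current monitor works.

```python
import os
from dataclasses import dataclass, field
from typing import Optional, Tuple

Signature = Tuple[int, int]  # (mtime in nanoseconds, size in bytes)


def file_signature(path: str) -> Optional[Signature]:
    """Return (mtime_ns, size) for the file, or None if it does not exist."""
    try:
        st = os.stat(path)
    except FileNotFoundError:
        return None
    return (st.st_mtime_ns, st.st_size)


@dataclass
class ChangeBasedMonitor:
    """Hypothetical monitor: alive if the file changed since the last check."""
    path: str
    max_unchanged_checks: int = 3  # assumed tolerance, not a real setting
    _last: Optional[Signature] = field(default=None, init=False)
    _unchanged: int = field(default=0, init=False)

    def check(self) -> bool:
        """Return True while the VM is still considered alive."""
        sig = file_signature(self.path)
        if sig is not None and sig != self._last:
            # Any change counts, however old the timestamp looks.
            self._last = sig
            self._unchanged = 0
            return True
        self._unchanged += 1
        return self._unchanged < self.max_unchanged_checks
```

The check never compares the file's timestamp with the host's current time, so a slow guest clock cannot make a live VM look stale. The trade-off is that a VM is only declared dead after several consecutive checks with no change at all.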
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4481 - Posted: 12 Dec 2016, 13:03:32 UTC - in response to Message 4478.  
Last modified: 12 Dec 2016, 13:05:19 UTC

Disregard this post.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4485 - Posted: 14 Dec 2016, 9:34:59 UTC

The "LHCb jobs" graphs are not working.

©2024 CERN