Message boards : Number crunching : Heartbeat

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4473 - Posted: 8 Dec 2016, 7:01:50 UTC

I have disabled this function and it works fine.
My suggestion is to disable it by default, as it causes more problems than it cures.
Comments?
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 431
Message 4474 - Posted: 8 Dec 2016, 9:06:29 UTC - in response to Message 4473.  

I have never had problems with VMs stopping because a heartbeat did not arrive on time, and my system is really not underloaded ;)

Maybe the reason is that my VMs always run with priority 'below normal'.
Laurence
Project administrator
Project developer
Project tester

Joined: 12 Sep 14
Posts: 1064
Credit: 327,252
RAC: 130
Message 4477 - Posted: 12 Dec 2016, 8:33:20 UTC - in response to Message 4473.  

The heartbeat is there as a protection mechanism for hanging/frozen VMs and is working well. There are a few false positives and this is something that we will investigate soon.
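
For readers unfamiliar with the idea, here is a minimal sketch of a heartbeat-file check. It is not the actual wrapper code. It assumes the guest periodically touches a file and the host treats the VM as hung when that file is missing or older than a timeout. The names `heartbeat_age` and `vm_looks_hung` and the 300-second timeout are hypothetical, chosen only for illustration.

```python
import os
import tempfile
import time


def heartbeat_age(path, now=None):
    """Return seconds since the heartbeat file was last modified.

    Returns None when the file does not exist. `now` is an optional
    epoch timestamp so the check can be exercised without waiting.
    """
    try:
        mtime = os.stat(path).st_mtime
    except FileNotFoundError:
        return None
    if now is None:
        now = time.time()
    return now - mtime


def vm_looks_hung(path, timeout_s, now=None):
    """Hypothetical rule: hung if the heartbeat is missing or too old."""
    age = heartbeat_age(path, now)
    return age is None or age > timeout_s


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        hb = os.path.join(tmp, "heartbeat")
        print(vm_looks_hung(hb, timeout_s=300))  # heartbeat file missing
        open(hb, "w").close()                    # simulate one guest "beat"
        print(vm_looks_hung(hb, timeout_s=300))  # file was just written
```

With a rule like this, a guest that is merely slow to update the file, for example on a heavily loaded host, cannot be told apart from a frozen one. That is one way false positives could arise.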
maeax

Joined: 22 Apr 16
Posts: 664
Credit: 1,791,620
RAC: 3,116
Message 5223 - Posted: 29 Oct 2017, 6:33:58 UTC

2017-10-29 02:11:28 (11632): Guest Log: [INFO] Job finished in slot1 with 200.
2017-10-29 02:11:32 (11632): Guest Log: [INFO] New Job Starting in slot1
2017-10-29 02:11:32 (11632): Guest Log: [INFO] Condor JobID: 4937643.52 in slot1
2017-10-29 02:11:44 (11632): Guest Log: [INFO] Starting pilot in slot1
2017-10-29 02:16:15 (11632): VM Heartbeat file specified, but missing heartbeat.
2017-10-29 02:16:15 (11632): Powering off VM.
2017-10-29 02:21:27 (11632): VM did not power off when requested.
2017-10-29 02:21:27 (11632): VM was NOT successfully terminated.
2017-10-29 02:21:27 (11632): Deregistering VM. (boinc_cb5acb2e8204d4d1, slot#3)

A heartbeat error across all the projects!
Is it because of the change from summer time to CET?
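
To illustrate why a clock change could matter in principle, here is a small sketch. It does not show what the wrapper or the VM actually does, and nothing in this thread confirms the cause. It only shows that subtracting local wall-clock readings taken on either side of the CEST to CET switch gives a wrong elapsed time, while converting to UTC first does not. The two timestamps and the function names are made up for the example.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets for the two sides of the 29 Oct 2017 switch.
CEST = timezone(timedelta(hours=2), "CEST")  # summer time, UTC+2
CET = timezone(timedelta(hours=1), "CET")    # standard time, UTC+1


def naive_elapsed(earlier, later):
    """Elapsed time using only the wall-clock readings (offsets dropped)."""
    return later.replace(tzinfo=None) - earlier.replace(tzinfo=None)


def true_elapsed(earlier, later):
    """Elapsed time after converting both readings to UTC."""
    return later.astimezone(timezone.utc) - earlier.astimezone(timezone.utc)


# Hypothetical readings: a heartbeat written just before the switch and a
# check made just after it, when local clocks have gone back one hour.
last_beat = datetime(2017, 10, 29, 2, 55, tzinfo=CEST)
check = datetime(2017, 10, 29, 2, 5, tzinfo=CET)

if __name__ == "__main__":
    print("wall-clock difference:", naive_elapsed(last_beat, check))
    print("real elapsed time:    ", true_elapsed(last_beat, check))
```

If the log above is in Central European local time, its 02:11 to 02:21 timestamps fall inside the hour that occurs twice on that night, so any component comparing local times rather than UTC could be off by an hour.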
Magic Quantum Mechanic

Joined: 8 Apr 15
Posts: 750
Credit: 11,603,490
RAC: 1,713
Message 5224 - Posted: 29 Oct 2017, 7:47:07 UTC

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=366806

I got one of those on a task that was just over 12 hours run time.

(once in a while I get those over at LHC too)
maeax

Joined: 22 Apr 16
Posts: 664
Credit: 1,791,620
RAC: 3,116
Message 5225 - Posted: 29 Oct 2017, 9:00:33 UTC

ATLAS did not crash with the heartbeat error this time, but CMS, LHCb and Theory did, on both -dev and production, on my computer.



©2024 CERN