Message boards : Theory Application : New Version v2.4

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 0
Message 3960 - Posted: 5 Aug 2016, 11:33:50 UTC

This new version provides HTCondor 8.4.8 using the standard CernVM production distribution. It should finally solve the suspend/resume problems and is half the size of the previous image.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 25
Message 3977 - Posted: 6 Aug 2016, 9:50:06 UTC - in response to Message 3960.  

Resuming all 4 saved VMs after a suspend period of over 30,000 seconds of inactivity went fine.
All saved jobs finished fine after the resume, and new jobs started.

After the last job finished (with the elapsed time over 12 hours, followed by the 5 minutes of idle time), there were some Error remarks in StartLog.
They look like cleanup dust:

08/06/16 10:37:52 Got activate_claim request from shadow (188.184.187.167)
08/06/16 10:37:52 Remote job ID is 1461227.0
08/06/16 10:37:52 Got universe "VANILLA" (5) from request classad
08/06/16 10:37:52 State change: claim-activation protocol successful
08/06/16 10:37:52 Changing activity: Idle -> Busy
08/06/16 11:31:36 Called deactivate_claim_forcibly()
08/06/16 11:31:36 Starter pid 36909 exited with status 0
08/06/16 11:31:36 State change: starter exited
08/06/16 11:31:36 Changing activity: Busy -> Idle
08/06/16 11:31:36 State change: START is false
08/06/16 11:31:36 Changing state and activity: Claimed/Idle -> Preempting/Vacating
08/06/16 11:31:36 State change: No preempting claim, returning to owner
08/06/16 11:31:36 Changing state and activity: Preempting/Vacating -> Owner/Idle
08/06/16 11:31:36 State change: IS_OWNER is false
08/06/16 11:31:36 Changing state: Owner -> Unclaimed
08/06/16 11:31:36 Error: can't find resource with ClaimId (<10.0.2.15:14809>#1470431871#1#...) for 444 (ACTIVATE_CLAIM)
08/06/16 11:31:36 Error: can't find resource with ClaimId (<10.0.2.15:14809>#1470431871#1#...) -- perhaps this claim was already removed?
08/06/16 11:31:36 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
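
For anyone who wants to pull the same information out of their own StartLog, here is a minimal Python sketch. It assumes the "MM/DD/YY HH:MM:SS message" line layout seen in the excerpt above. The function names (`parse_startlog`, `error_entries`, `busy_seconds`) are made up for illustration and are not part of HTCondor. The sample lines are copied from the log above.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt: "MM/DD/YY HH:MM:SS message"
LINE_RE = re.compile(r"^(\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) (.*)$")


def parse_startlog(text):
    """Return (timestamp, message) pairs for every line matching the layout."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%m/%d/%y %H:%M:%S")
            entries.append((stamp, match.group(2)))
    return entries


def error_entries(entries):
    """Keep only the entries whose message starts with 'Error:'."""
    return [(stamp, msg) for stamp, msg in entries if msg.startswith("Error:")]


def busy_seconds(entries):
    """Seconds between the first 'Idle -> Busy' and the next 'Busy -> Idle'."""
    start = None
    for stamp, msg in entries:
        if msg == "Changing activity: Idle -> Busy" and start is None:
            start = stamp
        elif msg == "Changing activity: Busy -> Idle" and start is not None:
            return (stamp - start).total_seconds()
    return None


SAMPLE = """\
08/06/16 10:37:52 Changing activity: Idle -> Busy
08/06/16 11:31:36 Changing activity: Busy -> Idle
08/06/16 11:31:36 Error: problem finding resource for 403 (DEACTIVATE_CLAIM)
"""

entries = parse_startlog(SAMPLE)
print(len(error_entries(entries)))  # 1
print(busy_seconds(entries))        # 3224.0
```

In the excerpt, all three Error lines carry the same timestamp as the Busy -> Idle change, which is what the filter makes easy to see.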
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 334,882
RAC: 0
Message 3992 - Posted: 7 Aug 2016, 19:27:45 UTC - in response to Message 3977.  

The CLAIM_WORKLIFE value on the server has been increased from 1200s to 86400s. This will hopefully remove that error message. As the image has not changed and seems fine, it will be released to production.
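
As a rough illustration of what the larger value changes: the HTCondor manual describes CLAIM_WORKLIFE as the number of seconds after which a claim stops accepting additional jobs. The Python sketch below is a simplified model of that rule only, not HTCondor's actual code. The helper name `claim_accepts_new_job` and the one-hour claim age are hypothetical.

```python
OLD_WORKLIFE_S = 1200    # previous server value, from the post above
NEW_WORKLIFE_S = 86400   # new server value (24 hours)


def claim_accepts_new_job(claim_age_s, worklife_s):
    """Simplified model: a claim takes further jobs only while it is
    younger than the configured worklife."""
    return claim_age_s < worklife_s


# Hypothetical claim that has already been held for one hour
claim_age_s = 3600
print(claim_accepts_new_job(claim_age_s, OLD_WORKLIFE_S))  # False
print(claim_accepts_new_job(claim_age_s, NEW_WORKLIFE_S))  # True
```

Under this model, a claim older than 20 minutes would previously have been closed after its current job, while with the new value it can keep taking jobs for up to a day.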
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4005 - Posted: 8 Aug 2016, 16:33:05 UTC
Last modified: 8 Aug 2016, 16:37:12 UTC

I have a VM that is clearly running, but the VirtualBox Manager shows it as "powered off". How can that be?

This might be the reason for the "heartbeat" problem.

Shutting down BOINC fully put the VM into the "saved" state.

Restarting BOINC started the VM normally.
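
One way to cross-check what the GUI shows is to ask VirtualBox from the command line. The Python sketch below assumes `VBoxManage` is on the PATH and reads the `VMState` line from `VBoxManage showvminfo <name> --machinereadable`. The function names and the VM name are placeholders for illustration. VirtualBox keeps a separate VM registry per user account, so the answer may depend on which account runs the command.

```python
import subprocess


def parse_vm_state(machinereadable_text):
    """Extract the value of the VMState key from --machinereadable output."""
    for line in machinereadable_text.splitlines():
        key, sep, value = line.partition("=")
        if sep and key.strip() == "VMState":
            return value.strip().strip('"')
    return None


def query_vm_state(vm_name):
    """Run VBoxManage for one VM and return its state string, e.g. 'running'."""
    result = subprocess.run(
        ["VBoxManage", "showvminfo", vm_name, "--machinereadable"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vm_state(result.stdout)


# Usage (placeholder VM name, requires VirtualBox to be installed):
#   print(query_vm_state("my-boinc-vm"))
```

If the command line reports "running" while the GUI says "powered off", the two are probably not looking at the same VM registry.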