Message boards : CMS Application : Task Composition

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2523 - Posted: 23 Mar 2016, 11:39:31 UTC

Even with the new version, it still has only one job per run.

Glidein_stderr:

max wall time, 64800
WARNING: job max time is bigger than max_walltime, lowering it.
job max time, 63644
calculated retire time, 666
using default retire spread, 66
Retire time set to 637
Die time set to 64281
ID: 2523 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2524 - Posted: 23 Mar 2016, 12:13:25 UTC - in response to Message 2521.  

I can't see an obvious error but I might be onto it. In any case this will all change once we have one run.
ID: 2524 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2525 - Posted: 23 Mar 2016, 12:14:40 UTC - in response to Message 2523.  

We are waiting on an update to the configuration on the Condor server at RAL. The admins there have to juggle a lot of things, so we just have to be patient.
ID: 2525 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2528 - Posted: 23 Mar 2016, 14:12:27 UTC - in response to Message 2525.  
Last modified: 23 Mar 2016, 14:12:36 UTC

All good things come to those who wait. An update should be available in a few hours.
ID: 2528 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2543 - Posted: 24 Mar 2016, 10:07:12 UTC
Last modified: 24 Mar 2016, 10:11:28 UTC

The intended 12h per task is not working.

A task exits after a few minutes:
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=124813

Or it has "runs" lasting from 2h10min to 3h47min (my observation), with a maximum of 3 runs per task.

In any case: It is not working!

Longest-running task: 10.5 hours. Exit reason:
Guest Log: [INFO] No more jobs. Shutting down!
ID: 2543 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 2545 - Posted: 24 Mar 2016, 10:55:57 UTC - in response to Message 2543.  

Did you manage to grab the glidein stderr? That looks similar to our little "Oops!" yesterday.
ID: 2545 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2546 - Posted: 24 Mar 2016, 11:06:01 UTC

No, sorry.

Here is the output from one of the currently running tasks.

-----------------------------------------------------
used param defined retire time, 43200
using default retire spread, 4320
Retire time set to 41386
Die time set to 170986

Unless you wanted to see something else?
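
The numbers in the two glidein stderr excerpts posted in this thread fit a simple pattern. The sketch below is only an inference from those logged values, not something taken from the glidein scripts. It assumes that the default retire spread is one tenth of the retire time, that the final retire time is the retire time minus some amount no larger than the spread, and that the die time is the final retire time plus the job max time. The function names are made up for illustration.

```python
def default_spread(retire_time: int) -> int:
    """Assumed rule: the default spread is a tenth of the retire time."""
    return retire_time // 10


def retire_within_spread(retire_time: int, spread: int, retire_set: int) -> bool:
    """Check that the logged 'Retire time set to' value lies inside the spread."""
    return retire_time - spread <= retire_set <= retire_time


def implied_job_max_time(retire_set: int, die_set: int) -> int:
    """Assumed rule: die time = final retire time + job max time."""
    return die_set - retire_set


# Values from Message 2523 (job max time was logged as 63644).
print(default_spread(666))                    # 66
print(retire_within_spread(666, 66, 637))     # True
print(implied_job_max_time(637, 64281))       # 63644

# Values from this post (job max time is not in the excerpt).
print(default_spread(43200))                  # 4320
print(retire_within_spread(43200, 4320, 41386))  # True
print(implied_job_max_time(41386, 170986) / 3600)  # 36.0 hours
```

If these assumptions hold, the task in this post would have an implied job max time of 129600 s (36 h), a value the excerpt itself does not print.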
ID: 2546 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2547 - Posted: 24 Mar 2016, 11:53:13 UTC
Last modified: 24 Mar 2016, 12:11:29 UTC

I suggest doubling the values of glidein_max_tail and glidein_max_idle.


After starting up, the glidein will wait for Glidein_Max_Idle for its initial matching. If it is idle for longer than this, it will assume no jobs are available and will shut down.



EDIT: All tasks that ended too early had "No more jobs" as the exit reason.
That would match exactly the condition mentioned above.
ID: 2547 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2548 - Posted: 24 Mar 2016, 12:39:50 UTC - in response to Message 2543.  
Last modified: 24 Mar 2016, 12:40:13 UTC

Here are the relevant lines from the StartLog of that task.

03/24/16 10:30:24 (pid:7665) Changing state: Unclaimed -> Claimed
03/24/16 10:30:28 (pid:7665) The DaemonShutdown expression "(DynamicSlot =!= True) && ((((GLIDEIN_ToDie =!= UNDEFINED) && (CurrentTime > GLIDEIN_ToDie)) || ( (Slot1_Activity == "Idle") && (((GLIDEIN_Max_Idle =!= UNDEFINED) && ifThenElse((Slot1_JobStarts =!= UNDEFINED), ((Slot1_JobStarts =!= UNDEFINED) && (Slot1_SelfMonitorAge =!= UNDEFINED) && (GLIDEIN_Max_Idle =!= UNDEFINED) && (Slot1_JobStarts == 0) && (Slot1_SelfMonitorAge > GLIDEIN_Max_Idle)), ((Slot1_TotalTimeUnclaimedIdle =!= UNDEFINED) && (GLIDEIN_Max_Idle =!= UNDEFINED) && ((PartitionableSlot =!= True) || (TotalSlots =?=1)) && (Slot1_TotalTimeUnclaimedIdle > GLIDEIN_Max_Idle)))) || ((GLIDEIN_Max_Tail =!= UNDEFINED) && ifThenElse((Slot1_JobStarts =!= UNDEFINED), ((Slot1_JobStarts =!= UNDEFINED) && (Slot1_ExpectedMachineGracefulDrainingCompletion =!= UNDEFINED) && (GLIDEIN_Max_Tail =!= UNDEFINED) && (Slot1_JobStarts > 0) && ((CurrentTime - Slot1_ExpectedMachineGracefulDrainingCompletion) > GLIDEIN_Max_Tail) ), ((Slot1_TotalTimeUnclaimedIdle =!= UNDEFINED) && (GLIDEIN_Max_Tail =!= UNDEFINED) && (Slot1_TotalTimeClaimedBusy =!= UNDEFINED) && ((PartitionableSlot =!= True) || (TotalSlots =?=1)) && (Slot1_TotalTimeUnclaimedIdle > GLIDEIN_Max_Tail)))) || ((GLIDEIN_ToRetire =!= UNDEFINED) && (CurrentTime > GLIDEIN_ToRetire ))) )))" evaluated to TRUE: starting graceful shutdown
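
The DaemonShutdown expression is hard to read on one line, so here is a simplified Python paraphrase of its structure. This is a sketch under stated assumptions, not the glidein's actual logic: it only follows the branch where Slot1_JobStarts is defined, it ignores the DynamicSlot and partitionable-slot checks, and it represents an undefined attribute as None. The function and parameter names are illustrative.

```python
from typing import Optional


def shutdown_reason(now: int, activity: str, job_starts: int,
                    monitor_age: int, drain_done: Optional[int],
                    to_die: Optional[int] = None,
                    to_retire: Optional[int] = None,
                    max_idle: Optional[int] = None,
                    max_tail: Optional[int] = None) -> Optional[str]:
    """Return which clause would trigger a shutdown, or None."""
    # The die time applies regardless of slot activity.
    if to_die is not None and now > to_die:
        return "die"
    # Every other clause requires the slot to be idle.
    if activity != "Idle":
        return None
    # Never ran a job and has been up longer than Max_Idle.
    if max_idle is not None and job_starts == 0 and monitor_age > max_idle:
        return "max_idle"
    # Ran at least one job and has been drained longer than Max_Tail.
    if (max_tail is not None and drain_done is not None
            and job_starts > 0 and now - drain_done > max_tail):
        return "max_tail"
    # Past the retire time while idle.
    if to_retire is not None and now > to_retire:
        return "retire"
    return None


# Hypothetical values, chosen only to exercise each clause.
print(shutdown_reason(5000, "Idle", 0, 1300, None, max_idle=1200))  # max_idle
print(shutdown_reason(5000, "Idle", 2, 4000, 4500, max_tail=400))   # max_tail
print(shutdown_reason(5000, "Busy", 2, 4000, 4500, to_retire=100))  # None
```

Read this way, the retire, idle and tail checks only fire while the slot is idle, whereas the die time fires even in the middle of a job.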
ID: 2548 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2549 - Posted: 24 Mar 2016, 12:46:49 UTC - in response to Message 2547.  

I am going to revert to the configuration that we had last week, when we were running for 6 hours, and remove one of the checks that is clearly not working.

We shouldn't really be testing these tuning values like we are here. After 4th April, the new Test4Theory application will be uploaded here, which also uses HTCondor but without the glidein. We can then essentially deploy a standard batch node and experiment with the values.
ID: 2549 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2550 - Posted: 24 Mar 2016, 12:53:32 UTC - in response to Message 2548.  

Thanks.
Well, hard to tell. However, it is worth a shot, as a lot of values were doubled to accommodate the change from 6h to 12h.

Unless you have a better idea?

Alternatively, revert back to 6h, as this worked quite well.
ID: 2550 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2551 - Posted: 24 Mar 2016, 12:56:01 UTC - in response to Message 2549.  

OK, I have just reverted the configuration to what we introduced on 4th March.

CCB_HEARTBEAT_INTERVAL = 0
NOT_RESPONDING_TIMEOUT = 10800
GLIDEIN_Retire_Time = 21600 (6h, the default)
GLIDEIN_MAX_WALLTIME = unset
CLAIM_WORKLIFE = ifThenElse(DynamicSlot =?= true,3600,-1)
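
The CLAIM_WORKLIFE line is a ClassAd conditional. A rough Python equivalent is sketched below, assuming that `=?=` only matches when DynamicSlot is defined and exactly true, and representing an undefined attribute as None. The function names are illustrative, and the helper for converting seconds to hours is only there to make the other values easier to read.

```python
from typing import Optional


def claim_worklife(dynamic_slot: Optional[bool]) -> int:
    """Mirror ifThenElse(DynamicSlot =?= true, 3600, -1)."""
    # An undefined (None) or false DynamicSlot falls through to -1.
    return 3600 if dynamic_slot is True else -1


def hours(seconds: int) -> float:
    """Convert a configuration value in seconds to hours."""
    return seconds / 3600


print(claim_worklife(True))   # 3600
print(claim_worklife(False))  # -1
print(claim_worklife(None))   # -1
print(hours(21600))           # 6.0  (GLIDEIN_Retire_Time)
print(hours(10800))           # 3.0  (NOT_RESPONDING_TIMEOUT)
```

As far as I understand HTCondor, a negative CLAIM_WORKLIFE means the claim has no time limit for accepting further jobs, so only dynamic slots would be capped at one hour here.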
ID: 2551 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2554 - Posted: 24 Mar 2016, 14:37:46 UTC - in response to Message 2551.  

There is a problem with the update.
ID: 2554 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2558 - Posted: 24 Mar 2016, 20:08:26 UTC
Last modified: 24 Mar 2016, 20:37:38 UTC

Something is not right (EDIT: only virtualLHC tasks):

3 runs are done, one after the other. Only the 3rd run is actually doing a job.

Cron_stderr:
touch: cannot touch `/home/boinc/shared/heartbeat': No such file or directory
rm: cannot remove `/home/boinc/CMSRun/glide_Tjhi9N/main': Directory not empty
/cvmfs/cms.cern.ch/CMS@Home/agent/CMSJobAgent.sh: line 244: [: CMSRun/glide_BZZ8yu: binary operator expected
(the line above appears 15 times in total)
ID: 2558 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2559 - Posted: 24 Mar 2016, 20:19:13 UTC
Last modified: 24 Mar 2016, 20:20:13 UTC

Is that correct?

glidein_stderr:

GLIDECLIENT_Group_Hold=False
GLIDECLIENT_Group_PREEMPT=False
GLIDECLIENT_Rank=1
GLIDECLIENT_Group_Rank=1
GLIDEIN_CLAIM_WORKLIFE_DYNAMIC=3600
GLIDEIN_CLAIM_WORKLIFE=-1
GLIDEIN_Factory="factory_service"
GLIDEIN_Name="v3_2_7"
GLIDEIN_CredentialIdentifier="GLIDEIN_CredentialIdentifier "


The HTCondor default for CLAIM_WORKLIFE is 1200 sec.
ID: 2559 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2565 - Posted: 25 Mar 2016, 14:42:09 UTC
Last modified: 25 Mar 2016, 14:42:55 UTC

I just had a task finish at exactly 18h runtime.
It was in the middle of a job when it shut down.
No reason for shutdown in the log.

The log does not show the last run jobs.
It was working on the 10th job in total (4th in this run).

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=125039
ID: 2565 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 2566 - Posted: 25 Mar 2016, 15:33:44 UTC - in response to Message 2565.  

So the max wall time of 64800 takes effect before vboxwrapper's job_duration of 64800;
otherwise "VM Completion File Detected" would have been written into the result.
ID: 2566 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2567 - Posted: 25 Mar 2016, 15:54:01 UTC - in response to Message 2566.  

In any case it is a termination, not a proper shutdown.
ID: 2567 · Rating: 0