Message boards :
CMS Application :
Task Composition
Author | Message |
---|---|
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Even with the new version, it still has only one job per run. glidein_stderr: max wall time, 64800 |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129 |
I can't see an obvious error but I might be onto it. In any case this will all change once we have one run. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129 |
We are waiting on an update to the configuration on the Condor server at RAL. The admins there have to juggle a lot of things, so we just have to be patient. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129 |
All good things come to those who wait. An update should be available in a few hours. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
The intended 12h per task is not working. Either a task exits after a few minutes: http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=124813 or it has "runs" lasting from 2h10min to 3h47min (my observation), with a maximum of 3 runs per task. In any case, it is not working! Longest-running task: 10.5 hours. Exit reason: Guest Log: [INFO] No more jobs. Shutting down! |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,937,121 RAC: 3,148 |
Did you manage to grab the glidein stderr? That looks similar to our little "Oops!" yesterday. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
No, sorry. Here is the output from one of the currently running tasks. ----------------------------------------------------- used param defined retire time, 43200 using default retire spread, 4320 Retire time set to 41386 Die time set to 170986 Unless you wanted to see something else? |
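As a quick sanity check on the numbers in that log, here is a small Python sketch that works out the gaps between them. The variable names are made up for illustration, and the interpretation (a random offset of up to the retire spread being subtracted from the configured retire time) is an assumption on my part, not something the log states.

```python
# Values copied from the glidein log quoted above (all in seconds).
retire_param = 43200    # "used param defined retire time"
retire_spread = 4320    # "using default retire spread"
retire_time = 41386     # "Retire time set to"
die_time = 170986       # "Die time set to"

# Assumption: the effective retire time is the parameter minus a random
# offset no larger than the spread.
offset = retire_param - retire_time
within_spread = 0 <= offset <= retire_spread

# Gap between retiring (no new jobs accepted) and the hard die time.
grace_seconds = die_time - retire_time

print(f"offset from configured retire time: {offset} s")      # 1814 s
print(f"offset within spread: {within_spread}")               # True
print(f"retire -> die gap: {grace_seconds / 3600:g} h")       # 36 h
```

If that reading is right, the glidein stops taking new jobs after roughly 11.5 hours and is only killed outright 36 hours after that.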
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I suggest doubling the values of GLIDEIN_Max_Tail and GLIDEIN_Max_Idle. After starting up, the glidein will wait for GLIDEIN_Max_Idle for its initial matching. If it is idle for longer than this, it will assume no jobs are available and will shut down. EDIT: All tasks that ended too early had "No more jobs" as the exit reason. That would match exactly the condition mentioned above. |
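To make the idle rule described above concrete, here is a rough Python sketch of the decision. It is not the glidein's actual code: the function name, its arguments and the example numbers are all hypothetical, and treating the "tail" limit as the idle allowance after at least one job has run is my reading of the setting.

```python
def should_shut_down(idle_seconds: float, has_run_job: bool,
                     max_idle: float, max_tail: float) -> bool:
    """Return True if a (hypothetical) glidein should give up waiting.

    Before the first job is matched, the max_idle limit applies.
    Assumption: once a job has run, the shorter max_tail limit applies.
    """
    limit = max_tail if has_run_job else max_idle
    return idle_seconds > limit


# Illustrative limits only, not the project's real settings.
MAX_IDLE = 1200
MAX_TAIL = 400

# Freshly started glidein that has waited 25 minutes without a match.
print(should_shut_down(1500, False, MAX_IDLE, MAX_TAIL))  # True

# Same wait with both limits doubled, as suggested above.
print(should_shut_down(1500, False, 2 * MAX_IDLE, 2 * MAX_TAIL))  # False
```

With short limits, any delay in matching on the server side is enough to trigger the shutdown, which would fit the "No more jobs" exits.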
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129 |
Here is the relevant line from the StartLog of that task. 03/24/16 10:30:24 (pid:7665) Changing state: Unclaimed -> Claimed |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129 |
I am going to revert to the configuration that we had last week, when we were running for 6 hours, and remove one of the checks that is clearly not working. We shouldn't really be testing these tuning values the way we are here. After 4th April, the new Test4Theory application will be uploaded here, which also uses HTCondor but without the glidein. We can then essentially deploy a standard batch node and experiment with the values. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks. Well, hard to tell. However, it is worth a shot, as a lot of values were doubled to accommodate the change from 6h to 12h. Unless you have a better idea? Alternatively, revert to 6h, as this worked quite well. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129 |
OK, I have just reverted the configuration to what we introduced on 4th March. CCB_HEARTBEAT_INTERVAL = 0 NOT_RESPONDING_TIMEOUT = 10800 GLIDEIN_Retire_Time = 21600 (6h, the default) GLIDEIN_MAX_WALLTIME = unset CLAIM_WORKLIFE = ifThenElse(DynamicSlot =?= true,3600,-1) |
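For anyone who does not think in seconds, this small Python sketch converts the time-based values from the reverted configuration into hours. The dictionary and helper names are made up for illustration; only the numbers come from the post above.

```python
# Time-based settings from the reverted configuration, in seconds.
reverted = {
    "NOT_RESPONDING_TIMEOUT": 10800,
    "GLIDEIN_Retire_Time": 21600,
    "CLAIM_WORKLIFE (dynamic slots)": 3600,
}


def to_hours(seconds: int) -> float:
    """Convert a duration in seconds to hours."""
    return seconds / 3600


hours = {name: to_hours(value) for name, value in reverted.items()}

for name, value in hours.items():
    print(f"{name}: {value:g} h")
```

The same helper gives 12 h for the 43200 retire time and 18 h for the 64800 max wall time mentioned elsewhere in the thread.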
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129 |
There is a problem with the update. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Something is not right (EDIT: only virtualLHC tasks): 3 runs are done, one after the other. Only the 3rd run is actually doing a job. Cron_stderr: touch: cannot touch `/home/boinc/shared/heartbeat': No such file or directory rm: cannot remove `/home/boinc/CMSRun/glide_Tjhi9N/main': Directory not empty /cvmfs/cms.cern.ch/CMS@Home/agent/CMSJobAgent.sh: line 244: [: CMSRun/glide_BZZ8yu: binary operator expected (the same line is repeated many more times) |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Is that correct? glidein_stderr: GLIDECLIENT_Group_Hold=False GLIDECLIENT_Group_PREEMPT=False GLIDECLIENT_Rank=1 GLIDECLIENT_Group_Rank=1 GLIDEIN_CLAIM_WORKLIFE_DYNAMIC=3600 GLIDEIN_CLAIM_WORKLIFE=-1 GLIDEIN_Factory="factory_service" GLIDEIN_Name="v3_2_7" GLIDEIN_CredentialIdentifier="GLIDEIN_CredentialIdentifier " default is 1200 sec |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I just had a task finish at exactly 18h runtime. It was in the middle of a job when it shut down. No reason for the shutdown is given in the log. The log does not show the last jobs that were run. It was working on the 10th job in total (4th in this run). http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=125039 |
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,977 RAC: 1,466 |
So the max wall time of 64800 is taking effect before vboxwrapper's job_duration of 64800; otherwise "VM Completion File Detected" would have been written into the result. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
In any case it is a termination, not a proper shutdown. |