Message boards : CMS Application : Task Composition
Author | Message |
---|---|
Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 |
Moving the discussion from the workplan thread to here. The question is how best to structure the CMS tasks. Let's start with the task length, i.e. the VM lifetime.
|
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
12h |
Joined: 13 Feb 15 Posts: 1217 Credit: 906,817 RAC: 1,469 |
12h |
Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 |
OK. Does anyone see any advantage of running more than one glidein? i.e. should we make that one glidein last 12h? |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
No and yes. |
Joined: 13 Feb 15 Posts: 1217 Credit: 906,817 RAC: 1,469 |
Only 1 glidein for 1 task. However, over the last few days, when we had only 1 glidein for 6 hours, I've seen situations where a 2nd and 3rd were started. Maybe a 2nd and 3rd are started only after connection problems during the original glidein's lifetime or after the VM has been suspended for several hours. |
Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 |
We can then try to have one 12h task which runs only one glidein. The length of the jobs submitted is down to the submitter (currently Ivan) and has to take into consideration what will work best, e.g. large output files will cause stage-out errors. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Maybe the "accidental" task end after one run can be turned into a "deliberate" one. |
Joined: 13 Feb 15 Posts: 1217 Credit: 906,817 RAC: 1,469 |
Has the composition changed to 1 job in 1 run? At least, after a job has finished, condor_STARTD and condor_STARTER exit with code 0 and condor_MASTER with exit code 99. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I reported this already. http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=139&postid=2489#2489 |
Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 |
Not sure what the cause is yet. http://www.uscms.org/SoftwareComputing/Grid/WMS/glideinWMS/doc.prd/factory/custom_vars.html#lifetime
GLIDEIN_Retire_Time = 12h
GLIDEIN_MAX_WALLTIME = 18h |
Joined: 13 Feb 15 Posts: 1217 Credit: 906,817 RAC: 1,469 |
Has something changed? The VM has now been running for 9 hours and 25 minutes. The last 3 jobs (runs) started after the 6-hour deadline we had before.
09:11:02 +0100 2016-03-22 [INFO] Starting CMS Application - Run 1
09:11:02 +0100 2016-03-22 [INFO] Reading the BOINC volunteer's information
09:11:02 +0100 2016-03-22 [INFO] Volunteer: Crystal Pellet (1784) Host: 29799
09:11:02 +0100 2016-03-22 [INFO] VMID: e9a40930-863c-4e95-b27a-44abf7940b9c
09:11:02 +0100 2016-03-22 [INFO] Requesting an X509 credential from CMS-Dev
09:11:03 +0100 2016-03-22 [INFO] Requesting an X509 credential from LHC@home
subject : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784/CN=1388044655
issuer : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784
identity : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 130:00:01 (5.4 days)
09:11:04 +0100 2016-03-22 [INFO] Downloading glidein
09:11:05 +0100 2016-03-22 [INFO] Running glidein (check logs)
10:58:01 +0100 2016-03-22 [INFO] CMS glidein Run 1 ended
Copying 99047 bytes file:///home/boinc/wu_1458514252_2051_0_1.tgz => https://data-bridge-test.cern.ch/myfed/moutputs/wu_1458514252_2051_0_1.tgz
Bandwidth: 13628
Short exit status: 0
10:59:02 +0100 2016-03-22 [INFO] Starting CMS Application - Run 2
10:59:02 +0100 2016-03-22 [INFO] Reading the BOINC volunteer's information
10:59:02 +0100 2016-03-22 [INFO] Volunteer: Crystal Pellet (1784) Host: 29799
10:59:02 +0100 2016-03-22 [INFO] VMID: e9a40930-863c-4e95-b27a-44abf7940b9c
10:59:02 +0100 2016-03-22 [INFO] Requesting an X509 credential from CMS-Dev
10:59:02 +0100 2016-03-22 [INFO] Requesting an X509 credential from LHC@home
subject : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784/CN=1317519394
issuer : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784
identity : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 129:51:27 (5.4 days)
10:59:03 +0100 2016-03-22 [INFO] Downloading glidein
10:59:03 +0100 2016-03-22 [INFO] Running glidein (check logs)
12:09:01 +0100 2016-03-22 [INFO] CMS glidein Run 2 ended
Copying 91999 bytes file:///home/boinc/wu_1458514252_2051_0_2.tgz => https://data-bridge-test.cern.ch/myfed/moutputs/wu_1458514252_2051_0_2.tgz
Bandwidth: 12968
Short exit status: 0
12:10:02 +0100 2016-03-22 [INFO] Starting CMS Application - Run 3
12:10:02 +0100 2016-03-22 [INFO] Reading the BOINC volunteer's information
12:10:02 +0100 2016-03-22 [INFO] Volunteer: Crystal Pellet (1784) Host: 29799
12:10:02 +0100 2016-03-22 [INFO] VMID: e9a40930-863c-4e95-b27a-44abf7940b9c
12:10:02 +0100 2016-03-22 [INFO] Requesting an X509 credential from CMS-Dev
12:10:02 +0100 2016-03-22 [INFO] Requesting an X509 credential from LHC@home
subject : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784/CN=1170323471
issuer : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784
identity : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 130:01:39 (5.4 days)
12:10:03 +0100 2016-03-22 [INFO] Downloading glidein
12:10:03 +0100 2016-03-22 [INFO] Running glidein (check logs)
14:00:01 +0100 2016-03-22 [INFO] CMS glidein Run 3 ended
Copying 100319 bytes file:///home/boinc/wu_1458514252_2051_0_3.tgz => https://data-bridge-test.cern.ch/myfed/moutputs/wu_1458514252_2051_0_3.tgz
Bandwidth: 14051
Short exit status: 0
14:01:01 +0100 2016-03-22 [INFO] Starting CMS Application - Run 4
14:01:01 +0100 2016-03-22 [INFO] Reading the BOINC volunteer's information
14:01:01 +0100 2016-03-22 [INFO] Volunteer: Crystal Pellet (1784) Host: 29799
14:01:02 +0100 2016-03-22 [INFO] VMID: e9a40930-863c-4e95-b27a-44abf7940b9c
14:01:02 +0100 2016-03-22 [INFO] Requesting an X509 credential from CMS-Dev
14:01:02 +0100 2016-03-22 [INFO] Requesting an X509 credential from LHC@home
subject : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784/CN=1030334115
issuer : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784
identity : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 129:53:09 (5.4 days)
14:01:03 +0100 2016-03-22 [INFO] Downloading glidein
14:01:03 +0100 2016-03-22 [INFO] Running glidein (check logs)
15:15:01 +0100 2016-03-22 [INFO] CMS glidein Run 4 ended
Copying 94286 bytes file:///home/boinc/wu_1458514252_2051_0_4.tgz => https://data-bridge-test.cern.ch/myfed/moutputs/wu_1458514252_2051_0_4.tgz
Bandwidth: 13442
Short exit status: 0
15:16:01 +0100 2016-03-22 [INFO] Starting CMS Application - Run 5
15:16:01 +0100 2016-03-22 [INFO] Reading the BOINC volunteer's information
15:16:01 +0100 2016-03-22 [INFO] Volunteer: Crystal Pellet (1784) Host: 29799
15:16:01 +0100 2016-03-22 [INFO] VMID: e9a40930-863c-4e95-b27a-44abf7940b9c
15:16:01 +0100 2016-03-22 [INFO] Requesting an X509 credential from CMS-Dev
15:16:02 +0100 2016-03-22 [INFO] Requesting an X509 credential from LHC@home
subject : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784/CN=1371725943
issuer : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784
identity : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 129:57:19 (5.4 days)
15:16:02 +0100 2016-03-22 [INFO] Downloading glidein
15:16:03 +0100 2016-03-22 [INFO] Running glidein (check logs)
16:30:01 +0100 2016-03-22 [INFO] CMS glidein Run 5 ended
Copying 93421 bytes file:///home/boinc/wu_1458514252_2051_0_5.tgz => https://data-bridge-test.cern.ch/myfed/moutputs/wu_1458514252_2051_0_5.tgz
Bandwidth: 13632
Short exit status: 0
16:31:01 +0100 2016-03-22 [INFO] Starting CMS Application - Run 6
16:31:02 +0100 2016-03-22 [INFO] Reading the BOINC volunteer's information
16:31:02 +0100 2016-03-22 [INFO] Volunteer: Crystal Pellet (1784) Host: 29799
16:31:02 +0100 2016-03-22 [INFO] VMID: e9a40930-863c-4e95-b27a-44abf7940b9c
16:31:02 +0100 2016-03-22 [INFO] Requesting an X509 credential from CMS-Dev
16:31:02 +0100 2016-03-22 [INFO] Requesting an X509 credential from LHC@home
subject : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784/CN=982353893
issuer : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784
identity : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 129:53:00 (5.4 days)
16:31:03 +0100 2016-03-22 [INFO] Downloading glidein
16:31:03 +0100 2016-03-22 [INFO] Running glidein (check logs)
18:26:01 +0100 2016-03-22 [INFO] CMS glidein Run 6 ended
Copying 95622 bytes file:///home/boinc/wu_1458514252_2051_0_6.tgz => https://data-bridge-test.cern.ch/myfed/moutputs/wu_1458514252_2051_0_6.tgz
Bandwidth: 13855
Short exit status: 0
18:27:01 +0100 2016-03-22 [INFO] Starting CMS Application - Run 7
18:27:01 +0100 2016-03-22 [INFO] Reading the BOINC volunteer's information
18:27:01 +0100 2016-03-22 [INFO] Volunteer: Crystal Pellet (1784) Host: 29799
18:27:01 +0100 2016-03-22 [INFO] VMID: e9a40930-863c-4e95-b27a-44abf7940b9c
18:27:01 +0100 2016-03-22 [INFO] Requesting an X509 credential from CMS-Dev
18:27:01 +0100 2016-03-22 [INFO] Requesting an X509 credential from LHC@home
subject : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784/CN=110729394
issuer : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784
identity : /O=Volunteer Computing/O=CERN/CN=CrystalPellet 1784
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 129:43:34 (5.4 days)
18:27:02 +0100 2016-03-22 [INFO] Downloading glidein
18:27:02 +0100 2016-03-22 [INFO] Running glidein (check logs) |
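For anyone who wants to check the run lengths without reading the whole log, here is a minimal sketch. It assumes the task log is available with one entry per line in the "HH:MM:SS +ZZZZ YYYY-MM-DD [INFO] message" layout of the excerpt above, and that all entries fall on the same day. The function name run_durations and the file name task.log are made up for illustration.

```shell
# Print the wall-clock duration of every completed glidein run in a task log.
# Assumption: one log entry per line, timestamps in the first field,
# and no run crossing midnight.
run_durations() {
  awk '
    # Convert HH:MM:SS to seconds since midnight.
    function secs(t,   p) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }

    # "... Starting CMS Application - Run N": remember the start time of run N.
    /Starting CMS Application - Run/ { start[$NF] = secs($1) }

    # "... CMS glidein Run N ended": report the elapsed time of run N.
    /CMS glidein Run [0-9]+ ended/ {
      run = $(NF-1)
      if (run in start) {
        d = secs($1) - start[run]
        printf "Run %s: %dh%02dm\n", run, int(d/3600), int((d%3600)/60)
      }
    }
  '
}

# Usage (task.log is a hypothetical file name):
#   run_durations < task.log
```

Applied to the seven runs above, this should report each completed run as lasting between roughly one and two hours, so no single glidein came anywhere near a 12h lifetime.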
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
They tried to change the length of a "run" or task from 6h to 12h. It does not work correctly, but they are not doing anything about it (like changing it back). |
Joined: 12 Sep 14 Posts: 1128 Credit: 339,230 RAC: 19 |
It is probably due to setting GLIDEIN_MAX_WALLTIME=18h, which was not set before. I have requested that it be unset and will update the change log once that has been done. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I just had a task finish at just over 24h run time. 17 runs (jobs), but the task log only shows 9. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"I just had a task finish at just over 24h run time." I guess you mean at beta; you've had no tasks here today. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Correct. But it is exactly the same, isn't it? http://lhcathome2.cern.ch/vLHCathome/result.php?resultid=5508253 |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"Correct." Tja. All the logs I have for you are status 0. Your task log shows one job per task up until 09:27 (nine jobs) and then nothing for the rest of the task until it finished at 21:32. Now in fact, I have job logs for you from that missing period:
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.311.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 09:27:31 GMT 2016 on 8879-82695-14135 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.413.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 10:52:36 GMT 2016 on 8879-82695-14135 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.412.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 11:00:08 GMT 2016 on 8879-82695-28417 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.553.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 12:12:45 GMT 2016 on 8879-82695-14135 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.560.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 12:22:00 GMT 2016 on 8879-82695-28417 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.674.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 13:28:14 GMT 2016 on 8879-82695-14135 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.698.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 13:44:58 GMT 2016 on 8879-82695-28417 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.804.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 14:54:14 GMT 2016 on 8879-82695-14135 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.824.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 15:10:18 GMT 2016 on 8879-82695-28417 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.909.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 16:17:10 GMT 2016 on 8879-82695-14135 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.501.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 16:59:01 GMT 2016 on 8879-82695-9870 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.1015.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 17:43:16 GMT 2016 on 8879-82695-14135 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.1079.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 18:28:00 GMT 2016 on 8879-82695-9870 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.1181.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 18:57:52 GMT 2016 on 8879-82695-14135 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.1254.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 19:51:10 GMT 2016 on 8879-82695-9870 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.1302.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 20:21:53 GMT 2016 on 8879-82695-14135 with (short) status 0 ========
160321_213045:ireid_crab_CMS_at_Home_MinBias_250ev10K/job_out.1386.0.txt:======== gWMS-CMSRunAnalysis.sh FINISHING at Tue Mar 22 21:13:39 GMT 2016 on 8879-82695-9870 with (short) status 0 ========
but some of them are rather short. Ah! We're interleaving outputs from two VMs, 8879-82695-14135 and 8879-82695-28417. I don't see any problem at our end yet; it seems like some output logs weren't excerpted into the stderr log. |
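To separate interleaved VMs in a listing like the one above, a small sketch that counts finished jobs per VM ID. It assumes grep-style lines, one per line, each containing "on <VM ID> with (short) status"; jobs_per_vm and job_excerpts.txt are hypothetical names used only for illustration.

```shell
# Count FINISHING lines per VM ID.
# Assumption: the VM ID is the word between "on" and "with" on each line.
jobs_per_vm() {
  awk '
    {
      # Scan the fields for the "on <vm-id> with" pattern.
      for (i = 1; i < NF; i++)
        if ($i == "on" && $(i+2) == "with") count[$(i+1)]++
    }
    # Emit "<vm-id> <number of jobs>"; sort afterwards for a stable order.
    END { for (vm in count) print vm, count[vm] }
  ' | sort
}

# Usage (job_excerpts.txt is a hypothetical file name):
#   jobs_per_vm < job_excerpts.txt
```

Grouping this way makes it easier to see which jobs belong to the task in question before comparing against the runs shown in the task's stderr log.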
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks Ivan. I kind of thought it was some kind of logging problem. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"Thanks Ivan." If I were a betting man, I'd put my money on a subtle bug in
outputs=$(find /var/www/html/logs/run-${i}/$(basename ${glidein})/ -name _condor_stdout -type f -printf '%T@ %p\n' | sort -n | cut -d " " -f2)
but I'm no bash expert. Is there something in there that only matches single-digit job results? ...since only jobs 1-9 were reported... |
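One way to test that suspicion is to run the same pipeline against a scratch directory with a two-digit run number. The sketch below wraps the quoted pipeline in a function; the name list_outputs and the idea of passing the log root as a parameter (instead of the fixed /var/www/html/logs) are assumptions for illustration, and -printf requires GNU find. The loop that sets ${i} is not shown in the thread, so this only exercises the find line itself.

```shell
# List the _condor_stdout files of one run, oldest first.
# $1 = log root (hypothetical parameter), $2 = run number, $3 = glidein path.
list_outputs() {
  root=$1
  i=$2
  glidein=$3
  # %T@ prints the modification time in seconds, %p the path;
  # sort -n orders by time and cut keeps only the path.
  # Note: cut -d " " -f2 truncates any path that contains a space.
  find "${root}/run-${i}/$(basename "${glidein}")/" -name _condor_stdout -type f \
       -printf '%T@ %p\n' | sort -n | cut -d " " -f2
}

# Usage with a scratch tree (all names below are made up):
#   tmp=$(mktemp -d)
#   mkdir -p "$tmp/run-10/glidein_x/execute/dir_1"
#   touch "$tmp/run-10/glidein_x/execute/dir_1/_condor_stdout"
#   list_outputs "$tmp" 10 /home/boinc/glidein_x
```

If the file for run 10 is listed, the find line itself is probably not what limits the report to jobs 1-9, and the code that generates ${i} or consumes ${outputs} would be the next place to look.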
©2025 CERN