Thread 'New Version v47.90'

Author	Message
Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1129 Credit: 339,231 RAC: 0	Message 4544 - Posted: 20 Dec 2016, 10:11:24 UTC This new version should exclude the Merge jobs. Please report any jobs that have higher than normal bandwidth usage > 100MB ID: 4544 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 519 Credit: 400,710 RAC: 1	Message 4545 - Posted: 20 Dec 2016, 10:56:29 UTC With a little help of "uncle squid" the first VM started up quickly and got 2 cmsRun jobs. If they need the same time to run than the non-dev jobs I expect them to finish in 100+ minutes. 2016-12-20 11:34:57 (20845): vboxwrapper (7.7.26196): starting 2016-12-20 11:34:58 (20845): Setting Memory Size for VM. (3072MB) 2016-12-20 11:34:58 (20845): Setting CPU Count for VM. (2) 2016-12-20 11:36:04 (20845): Guest Log: [DEBUG] Testing network connection to cern.ch on port 80 2016-12-20 11:36:04 (20845): Guest Log: [DEBUG] Connection to cern.ch 80 port [tcp/http] succeeded! 2016-12-20 11:36:04 (20845): Guest Log: [DEBUG] 0 2016-12-20 11:36:04 (20845): Guest Log: [DEBUG] Testing CVMFS connection to lhchomeproxy.cern.ch on port 3125 2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] Connection to lhchomeproxy.cern.ch 3125 port [tcp/a13-an] succeeded! 2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] 0 2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] Testing VCCS connection to vccs1.cern.ch on port 443 2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] Connection to vccs1.cern.ch 443 port [tcp/https] succeeded! 2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] 0 2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] Testing connection to Condor server on port 9618 2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] Connection to vccondor01.cern.ch 9618 port [tcp/condor] succeeded! 2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] 0 2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] Probing CVMFS ... 2016-12-20 11:36:06 (20845): Guest Log: Probing /cvmfs/grid.cern.ch... OK 2016-12-20 11:36:08 (20845): Guest Log: Probing /cvmfs/cms.cern.ch... OK 2016-12-20 11:36:13 (20845): Guest Log: [DEBUG] HTCondor ping 2016-12-20 11:36:13 (20845): Guest Log: [DEBUG] 0 2016-12-20 11:37:26 (20845): Guest Log: [INFO] Condor JobID: 2674 in slot1 2016-12-20 11:37:26 (20845): Guest Log: [INFO] Condor JobID: 2679 in slot2 ID: 4545 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 519 Credit: 400,710 RAC: 1	Message 4549 - Posted: 20 Dec 2016, 13:25:38 UTC A quick look on the download and upload numbers. Downloads: Seem roughly the same compared to the jobs of the last version Uploads: The job results take 60-85 MB/job now Last version had 15-20 MB/job This will hurt users with slower internet connection and should be revised. ID: 4549 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1129 Credit: 339,231 RAC: 0	Message 4551 - Posted: 20 Dec 2016, 13:56:48 UTC - in response to Message 4549. A quick look on the download and upload numbers. Downloads: Seem roughly the same compared to the jobs of the last version Uploads: The job results take 60-85 MB/job now Last version had 15-20 MB/job This will hurt users with slower internet connection and should be revised. As long as you are not getting the merge jobs that download 1.5G of input data, join them and upload the output, Ivan can tune the other jobs to control the upload size. ID: 4551 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 4552 - Posted: 20 Dec 2016, 15:24:29 UTC And the running.log is not filled with all the redundant text any more. >Good. ID: 4552 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1141 Credit: 8,310,612 RAC: 0	Message 4555 - Posted: 20 Dec 2016, 16:11:55 UTC - in response to Message 4551. A quick look on the download and upload numbers. Downloads: Seem roughly the same compared to the jobs of the last version Uploads: The job results take 60-85 MB/job now Last version had 15-20 MB/job This will hurt users with slower internet connection and should be revised. As long as you are not getting the merge jobs that download 1.5G of input data, join them and upload the output, Ivan can tune the other jobs to control the upload size. Yes, I'll certainly try to revise this before we go "production". It's a little bit harder for me to get the logs for the WMAgent jobs, and most of them are running on Laurence's "little" cluster rather than on -dev volunteer's machines -- he may have more modern systems than some of you! Over the next couple of weeks I'll be trying to find out a) how to run the workflow(s) directly on my servers, to better judge an optimum job size, and b) how to specify other workflows, similar to the lambda -> p + nu we've been running with CRAB3 the last several months. Expect changes as we go forward -- we'll try to keep them for the better! ID: 4555 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1141 Credit: 8,310,612 RAC: 0	Message 4556 - Posted: 20 Dec 2016, 16:12:47 UTC - in response to Message 4552. And the running.log is not filled with all the redundant text any more. >Good. I may try to dig out how they achieve this! ID: 4556 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1223 Credit: 942,613 RAC: 295	Message 4558 - Posted: 20 Dec 2016, 16:20:37 UTC - in response to Message 4552. And the running.log is not filled with all the redundant text any more. >Good. It seems that every Event is written now and not every 1,000th. A bit overdone, I think with probably 200,000 events with the current tasks. ID: 4558 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1141 Credit: 8,310,612 RAC: 0	Message 4559 - Posted: 20 Dec 2016, 16:23:42 UTC - in response to Message 4558. And the running.log is not filled with all the redundant text any more. >Good. It seems that every Event is written now and not every 1,000th. A bit overdone, I think with probably 200,000 events with the current tasks. Yes, I changed that myself in the lambda workflow; I'm digging into where to specify it with the new system. ID: 4559 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1223 Credit: 942,613 RAC: 295	Message 4561 - Posted: 20 Dec 2016, 20:28:35 UTC - in response to Message 4559. And the running.log is not filled with all the redundant text any more. >Good. It seems that every Event is written now and not every 1,000th. A bit overdone, I think with probably 200,000 events with the current tasks. Yes, I changed that myself in the lambda workflow; I'm digging into where to specify it with the new system. It seems these tasks have 4,000 events, thus not too much logging for now. ID: 4561 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1141 Credit: 8,310,612 RAC: 0	Message 4562 - Posted: 20 Dec 2016, 20:55:33 UTC - in response to Message 4561. And the running.log is not filled with all the redundant text any more. >Good. It seems that every Event is written now and not every 1,000th. A bit overdone, I think with probably 200,000 events with the current tasks. Yes, I changed that myself in the lambda workflow; I'm digging into where to specify it with the new system. It seems these tasks have 4,000 events, thus not too much logging for now. Yes, I'm still getting to grips with how the numbers work here. It looks like these jobs have a programmed "filter efficiency" of 0.02715, compared to the 0.00003 or so of the older campaign, so out of 4000 events in a job, about 108 "useful" events will be returned, compared to ~6 beforehand. Thus the output files are bigger (but not proportionally since most data is returned in ROOT structures, histograms and TTrees). The JSON script I was given to work with asked for 3,000,000 events, so 3,000,000/108.6 = 27,625 total jobs. I'm finding the monitoring for this almost as frustrating as Dashboard, but I'm gradually figuring it out. First estimate is another two days to go in this batch, but most of the jobs will be processed by Laurence's cluster, and hopefully now all the merge jobs will be going to it to conserve your bandwidth. Do keep us informed of your experiences and opinions. ID: 4562 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 4594 - Posted: 30 Dec 2016, 17:03:24 UTC Last modified: 30 Dec 2016, 17:19:29 UTC I have noticed, that sometimes jobs within the same task have the same Condor Job ID. Is it actually running the same job with 2 cores? These jobs run somewhat faster than jobs with different ids. 2016-12-29 23:26:56 (26050): Guest Log: [INFO] New Job Starting in slot2 2016-12-29 23:26:56 (26050): Guest Log: [INFO] Condor JobID: 9298 in slot2 2016-12-29 23:27:00 (26050): Guest Log: [INFO] New Job Starting in slot1 2016-12-29 23:27:00 (26050): Guest Log: [INFO] Condor JobID: 9298 in slot1 2016-12-29 23:31:10 (26050): Guest Log: [INFO] WMAgent_JobID = 108075 in slot1 2016-12-29 23:31:11 (26050): Guest Log: [INFO] WMAgent_JobID = 108076 in slot2 2016-12-29 23:46:40 (26050): Guest Log: [INFO] New Job Starting in slot4 2016-12-29 23:46:40 (26050): Guest Log: [INFO] Condor JobID: 9310 in slot4 2016-12-29 23:46:43 (26050): Guest Log: [INFO] New Job Starting in slot3 2016-12-29 23:46:43 (26050): Guest Log: [INFO] Condor JobID: 9310 in slot3 2016-12-29 23:46:52 (26050): Guest Log: [INFO] WMAgent_JobID = 108156 in slot4 2016-12-29 23:46:52 (26050): Guest Log: [INFO] WMAgent_JobID = 108155 in slot3 However, slots with the same Condor job id do not end at the same time. 2016-12-30 02:04:20 (26050): Guest Log: [INFO] Job finished in slot4 with unknown exit code. 2016-12-30 02:06:15 (26050): Guest Log: [INFO] New Job Starting in slot4 2016-12-30 02:06:15 (26050): Guest Log: [INFO] Condor JobID: 9374 in slot4 2016-12-30 02:06:20 (26050): Guest Log: [INFO] Job finished in slot1 with unknown exit code. 2016-12-30 02:06:30 (26050): Guest Log: [INFO] WMAgent_JobID = 108478 in slot4 2016-12-30 02:08:24 (26050): Guest Log: [INFO] New Job Starting in slot1 2016-12-30 02:08:24 (26050): Guest Log: [INFO] Condor JobID: 9375 in slot1 2016-12-30 02:08:33 (26050): Guest Log: [INFO] WMAgent_JobID = 108486 in slot1 2016-12-30 02:11:05 (26050): Guest Log: [INFO] Job finished in slot3 with unknown exit code. 2016-12-30 02:13:46 (26050): Guest Log: [INFO] Job finished in slot2 with unknown exit code. ID: 4594 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1141 Credit: 8,310,612 RAC: 0	Message 4595 - Posted: 31 Dec 2016, 12:50:49 UTC - in response to Message 4594. I have noticed, that sometimes jobs within the same task have the same Condor Job ID. Is it actually running the same job with 2 cores? These jobs run somewhat faster than jobs with different ids. 2016-12-29 23:26:56 (26050): Guest Log: [INFO] New Job Starting in slot2 2016-12-29 23:26:56 (26050): Guest Log: [INFO] Condor JobID: 9298 in slot2 2016-12-29 23:27:00 (26050): Guest Log: [INFO] New Job Starting in slot1 2016-12-29 23:27:00 (26050): Guest Log: [INFO] Condor JobID: 9298 in slot1 2016-12-29 23:31:10 (26050): Guest Log: [INFO] WMAgent_JobID = 108075 in slot1 2016-12-29 23:31:11 (26050): Guest Log: [INFO] WMAgent_JobID = 108076 in slot2 2016-12-29 23:46:40 (26050): Guest Log: [INFO] New Job Starting in slot4 2016-12-29 23:46:40 (26050): Guest Log: [INFO] Condor JobID: 9310 in slot4 2016-12-29 23:46:43 (26050): Guest Log: [INFO] New Job Starting in slot3 2016-12-29 23:46:43 (26050): Guest Log: [INFO] Condor JobID: 9310 in slot3 2016-12-29 23:46:52 (26050): Guest Log: [INFO] WMAgent_JobID = 108156 in slot4 2016-12-29 23:46:52 (26050): Guest Log: [INFO] WMAgent_JobID = 108155 in slot3 However, slots with the same Condor job id do not end at the same time. 2016-12-30 02:04:20 (26050): Guest Log: [INFO] Job finished in slot4 with unknown exit code. 2016-12-30 02:06:15 (26050): Guest Log: [INFO] New Job Starting in slot4 2016-12-30 02:06:15 (26050): Guest Log: [INFO] Condor JobID: 9374 in slot4 2016-12-30 02:06:20 (26050): Guest Log: [INFO] Job finished in slot1 with unknown exit code. 2016-12-30 02:06:30 (26050): Guest Log: [INFO] WMAgent_JobID = 108478 in slot4 2016-12-30 02:08:24 (26050): Guest Log: [INFO] New Job Starting in slot1 2016-12-30 02:08:24 (26050): Guest Log: [INFO] Condor JobID: 9375 in slot1 2016-12-30 02:08:33 (26050): Guest Log: [INFO] WMAgent_JobID = 108486 in slot1 2016-12-30 02:11:05 (26050): Guest Log: [INFO] Job finished in slot3 with unknown exit code. 2016-12-30 02:13:46 (26050): Guest Log: [INFO] Job finished in slot2 with unknown exit code. I think you are seeing Condor "subjobs" (without going searching, I don't recall the exact term). In our CRAB jobs there is one job per condor "group" (ditto), with a suffix .0 that isn't reported. With the WMAgent jobs, each "group" gas heveral "subjobs", e.g. here's an excerpt from condor_q on Laurence's Condor server: 10485.0 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo 10485.1 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo 10485.2 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo 10485.3 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo 10485.4 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo 10485.5 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo 10485.6 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo 10485.7 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo 10485.8 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo 10485.9 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo i.e. there are 10 "subjobs" in each "group", .0 to .9 but we don't show this extra information. ID: 4595 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1141 Credit: 8,310,612 RAC: 0	Message 4596 - Posted: 31 Dec 2016, 13:06:17 UTC - in response to Message 4595. Last modified: 31 Dec 2016, 13:23:17 UTC That line appears to come from /cvmfs/grid.cern.ch/vc/bin/job-wrapper log_info "Condor JobID: $(grep ^ClusterId ${_CONDOR_SCRATCH_DIR}/.job.ad \| cut -d '=' -f2) in ${_CONDOR_SLOT}" >> /var/www/html/logs/stdout.log Perhaps Laurence could modify it to add the extra suffix, it looks like it's available in the .job.ad as "ProcID" or the full thing in "GlobalJobId". ID: 4596 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 4597 - Posted: 31 Dec 2016, 18:42:22 UTC i.e. there are 10 "subjobs" in each "group", .0 to .9 but we don't show this extra information. Thanks, Ivan. How does this translate to jobs listed on dashboard? If there are up to 10 sub-jobs, what is shown? Or is it no longer really possible to view the outcome of a single job? ID: 4597 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1141 Credit: 8,310,612 RAC: 0	Message 4598 - Posted: 31 Dec 2016, 23:52:10 UTC - in response to Message 4597. i.e. there are 10 "subjobs" in each "group", .0 to .9 but we don't show this extra information. Thanks, Ivan. How does this translate to jobs listed on dashboard? If there are up to 10 sub-jobs, what is shown? Or is it no longer really possible to view the outcome of a single job? OK, so the proper terminology seems to be jobs and clusters, rather than subjobs and groups, so I'm fairly sure that what Dashboard reports as a job is a job. In the Dashboard interactive view, for the CRAB3 jobs it gives the familiar integer number for IDdinTask but for the WMAgent jobs that column has entries like 008371f2-cf0b-11e6-bebc-02163e018309-0_0. So it may be harder to translate to a JobID for tracing logs, but I'm sure Dashboard is reporting individual jobs, not integrated clusters. ID: 4598 · Rating: 0 · rate: / Reply Quote

Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1129 Credit: 339,231 RAC: 0	Message 4600 - Posted: 2 Jan 2017, 21:01:57 UTC - in response to Message 4596. Perhaps Laurence could modify it to add the extra suffix, it looks like it's available in the .job.ad as "ProcID" or the full thing in "GlobalJobId". Appended the ProcId. If you would prefer the GlobalJobId, just let me know. ID: 4600 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1141 Credit: 8,310,612 RAC: 0	Message 4601 - Posted: 3 Jan 2017, 1:02:02 UTC - in response to Message 4600. Perhaps Laurence could modify it to add the extra suffix, it looks like it's available in the .job.ad as "ProcID" or the full thing in "GlobalJobId". Appended the ProcId. If you would prefer the GlobalJobId, just let me know. Cheers, Laurence. I think they're equivalent. Main thing is to prevent the sort of confusion that started this conversation. ID: 4601 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0	Message 4602 - Posted: 3 Jan 2017, 13:59:15 UTC Thanks, guys. At least i can tell them apart, now. ID: 4602 · Rating: 0 · rate: / Reply Quote

Development for LHC@home