Message boards : CMS Application : New Version v47.90
Message board moderation

To post messages, you must log in.

AuthorMessage
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,885
RAC: 273
Message 4544 - Posted: 20 Dec 2016, 10:11:24 UTC

This new version should exclude the Merge jobs. Please report any jobs that have higher than normal bandwidth usage > 100MB
ID: 4544 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 555
Message 4545 - Posted: 20 Dec 2016, 10:56:29 UTC

With a little help of "uncle squid" the first VM started up quickly and got 2 cmsRun jobs.
If they need the same time to run than the non-dev jobs I expect them to finish in 100+ minutes.

2016-12-20 11:34:57 (20845): vboxwrapper (7.7.26196): starting
2016-12-20 11:34:58 (20845): Setting Memory Size for VM. (3072MB)
2016-12-20 11:34:58 (20845): Setting CPU Count for VM. (2)
2016-12-20 11:36:04 (20845): Guest Log: [DEBUG] Testing network connection to cern.ch on port 80
2016-12-20 11:36:04 (20845): Guest Log: [DEBUG] Connection to cern.ch 80 port [tcp/http] succeeded!
2016-12-20 11:36:04 (20845): Guest Log: [DEBUG] 0
2016-12-20 11:36:04 (20845): Guest Log: [DEBUG] Testing CVMFS connection to lhchomeproxy.cern.ch on port 3125
2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] Connection to lhchomeproxy.cern.ch 3125 port [tcp/a13-an] succeeded!
2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] 0
2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] Testing VCCS connection to vccs1.cern.ch on port 443
2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] Connection to vccs1.cern.ch 443 port [tcp/https] succeeded!
2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] 0
2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] Connection to vccondor01.cern.ch 9618 port [tcp/condor] succeeded!
2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] 0
2016-12-20 11:36:06 (20845): Guest Log: [DEBUG] Probing CVMFS ...
2016-12-20 11:36:06 (20845): Guest Log: Probing /cvmfs/grid.cern.ch... OK
2016-12-20 11:36:08 (20845): Guest Log: Probing /cvmfs/cms.cern.ch... OK
2016-12-20 11:36:13 (20845): Guest Log: [DEBUG] HTCondor ping
2016-12-20 11:36:13 (20845): Guest Log: [DEBUG] 0
2016-12-20 11:37:26 (20845): Guest Log: [INFO] Condor JobID: 2674 in slot1
2016-12-20 11:37:26 (20845): Guest Log: [INFO] Condor JobID: 2679 in slot2
ID: 4545 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Avatar

Send message
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 555
Message 4549 - Posted: 20 Dec 2016, 13:25:38 UTC

A quick look on the download and upload numbers.

Downloads:
Seem roughly the same compared to the jobs of the last version

Uploads:
The job results take 60-85 MB/job now
Last version had 15-20 MB/job
This will hurt users with slower internet connection and should be revised.
ID: 4549 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,885
RAC: 273
Message 4551 - Posted: 20 Dec 2016, 13:56:48 UTC - in response to Message 4549.  

A quick look on the download and upload numbers.

Downloads:
Seem roughly the same compared to the jobs of the last version

Uploads:
The job results take 60-85 MB/job now
Last version had 15-20 MB/job
This will hurt users with slower internet connection and should be revised.


As long as you are not getting the merge jobs that download 1.5G of input data, join them and upload the output, Ivan can tune the other jobs to control the upload size.
ID: 4551 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4552 - Posted: 20 Dec 2016, 15:24:29 UTC

And the running.log is not filled with all the redundant text any more.

>Good.
ID: 4552 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4555 - Posted: 20 Dec 2016, 16:11:55 UTC - in response to Message 4551.  

A quick look on the download and upload numbers.

Downloads:
Seem roughly the same compared to the jobs of the last version

Uploads:
The job results take 60-85 MB/job now
Last version had 15-20 MB/job
This will hurt users with slower internet connection and should be revised.


As long as you are not getting the merge jobs that download 1.5G of input data, join them and upload the output, Ivan can tune the other jobs to control the upload size.

Yes, I'll certainly try to revise this before we go "production". It's a little bit harder for me to get the logs for the WMAgent jobs, and most of them are running on Laurence's "little" cluster rather than on -dev volunteer's machines -- he may have more modern systems than some of you! Over the next couple of weeks I'll be trying to find out a) how to run the workflow(s) directly on my servers, to better judge an optimum job size, and b) how to specify other workflows, similar to the lambda -> p + nu we've been running with CRAB3 the last several months.
Expect changes as we go forward -- we'll try to keep them for the better!
ID: 4555 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4556 - Posted: 20 Dec 2016, 16:12:47 UTC - in response to Message 4552.  

And the running.log is not filled with all the redundant text any more.

>Good.

I may try to dig out how they achieve this!
ID: 4556 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1178
Credit: 810,202
RAC: 2,083
Message 4558 - Posted: 20 Dec 2016, 16:20:37 UTC - in response to Message 4552.  

And the running.log is not filled with all the redundant text any more.

>Good.

It seems that every Event is written now and not every 1,000th.
A bit overdone, I think with probably 200,000 events with the current tasks.
ID: 4558 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4559 - Posted: 20 Dec 2016, 16:23:42 UTC - in response to Message 4558.  

And the running.log is not filled with all the redundant text any more.

>Good.

It seems that every Event is written now and not every 1,000th.
A bit overdone, I think with probably 200,000 events with the current tasks.

Yes, I changed that myself in the lambda workflow; I'm digging into where to specify it with the new system.
ID: 4559 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Crystal Pellet
Volunteer tester

Send message
Joined: 13 Feb 15
Posts: 1178
Credit: 810,202
RAC: 2,083
Message 4561 - Posted: 20 Dec 2016, 20:28:35 UTC - in response to Message 4559.  

And the running.log is not filled with all the redundant text any more.

>Good.

It seems that every Event is written now and not every 1,000th.
A bit overdone, I think with probably 200,000 events with the current tasks.

Yes, I changed that myself in the lambda workflow; I'm digging into where to specify it with the new system.

It seems these tasks have 4,000 events, thus not too much logging for now.
ID: 4561 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4562 - Posted: 20 Dec 2016, 20:55:33 UTC - in response to Message 4561.  

And the running.log is not filled with all the redundant text any more.

>Good.

It seems that every Event is written now and not every 1,000th.
A bit overdone, I think with probably 200,000 events with the current tasks.

Yes, I changed that myself in the lambda workflow; I'm digging into where to specify it with the new system.

It seems these tasks have 4,000 events, thus not too much logging for now.

Yes, I'm still getting to grips with how the numbers work here. It looks like these jobs have a programmed "filter efficiency" of 0.02715, compared to the 0.00003 or so of the older campaign, so out of 4000 events in a job, about 108 "useful" events will be returned, compared to ~6 beforehand. Thus the output files are bigger (but not proportionally since most data is returned in ROOT structures, histograms and TTrees). The JSON script I was given to work with asked for 3,000,000 events, so 3,000,000/108.6 = 27,625 total jobs.
I'm finding the monitoring for this almost as frustrating as Dashboard, but I'm gradually figuring it out. First estimate is another two days to go in this batch, but most of the jobs will be processed by Laurence's cluster, and hopefully now all the merge jobs will be going to it to conserve your bandwidth.

Do keep us informed of your experiences and opinions.
ID: 4562 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4594 - Posted: 30 Dec 2016, 17:03:24 UTC
Last modified: 30 Dec 2016, 17:19:29 UTC

I have noticed, that sometimes jobs within the same task have the same Condor Job ID.
Is it actually running the same job with 2 cores?
These jobs run somewhat faster than jobs with different ids.

2016-12-29 23:26:56 (26050): Guest Log: [INFO] New Job Starting in slot2
2016-12-29 23:26:56 (26050): Guest Log: [INFO] Condor JobID: 9298 in slot2
2016-12-29 23:27:00 (26050): Guest Log: [INFO] New Job Starting in slot1
2016-12-29 23:27:00 (26050): Guest Log: [INFO] Condor JobID: 9298 in slot1
2016-12-29 23:31:10 (26050): Guest Log: [INFO] WMAgent_JobID = 108075 in slot1
2016-12-29 23:31:11 (26050): Guest Log: [INFO] WMAgent_JobID = 108076 in slot2
2016-12-29 23:46:40 (26050): Guest Log: [INFO] New Job Starting in slot4
2016-12-29 23:46:40 (26050): Guest Log: [INFO] Condor JobID: 9310 in slot4
2016-12-29 23:46:43 (26050): Guest Log: [INFO] New Job Starting in slot3
2016-12-29 23:46:43 (26050): Guest Log: [INFO] Condor JobID: 9310 in slot3
2016-12-29 23:46:52 (26050): Guest Log: [INFO] WMAgent_JobID = 108156 in slot4
2016-12-29 23:46:52 (26050): Guest Log: [INFO] WMAgent_JobID = 108155 in slot3



However, slots with the same Condor job id do not end at the same time.

2016-12-30 02:04:20 (26050): Guest Log: [INFO] Job finished in slot4 with unknown exit code.
2016-12-30 02:06:15 (26050): Guest Log: [INFO] New Job Starting in slot4
2016-12-30 02:06:15 (26050): Guest Log: [INFO] Condor JobID: 9374 in slot4
2016-12-30 02:06:20 (26050): Guest Log: [INFO] Job finished in slot1 with unknown exit code.
2016-12-30 02:06:30 (26050): Guest Log: [INFO] WMAgent_JobID = 108478 in slot4
2016-12-30 02:08:24 (26050): Guest Log: [INFO] New Job Starting in slot1
2016-12-30 02:08:24 (26050): Guest Log: [INFO] Condor JobID: 9375 in slot1
2016-12-30 02:08:33 (26050): Guest Log: [INFO] WMAgent_JobID = 108486 in slot1
2016-12-30 02:11:05 (26050): Guest Log: [INFO] Job finished in slot3 with unknown exit code.
2016-12-30 02:13:46 (26050): Guest Log: [INFO] Job finished in slot2 with unknown exit code.
ID: 4594 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4595 - Posted: 31 Dec 2016, 12:50:49 UTC - in response to Message 4594.  

I have noticed, that sometimes jobs within the same task have the same Condor Job ID.
Is it actually running the same job with 2 cores?
These jobs run somewhat faster than jobs with different ids.

2016-12-29 23:26:56 (26050): Guest Log: [INFO] New Job Starting in slot2
2016-12-29 23:26:56 (26050): Guest Log: [INFO] Condor JobID: 9298 in slot2
2016-12-29 23:27:00 (26050): Guest Log: [INFO] New Job Starting in slot1
2016-12-29 23:27:00 (26050): Guest Log: [INFO] Condor JobID: 9298 in slot1
2016-12-29 23:31:10 (26050): Guest Log: [INFO] WMAgent_JobID = 108075 in slot1
2016-12-29 23:31:11 (26050): Guest Log: [INFO] WMAgent_JobID = 108076 in slot2
2016-12-29 23:46:40 (26050): Guest Log: [INFO] New Job Starting in slot4
2016-12-29 23:46:40 (26050): Guest Log: [INFO] Condor JobID: 9310 in slot4
2016-12-29 23:46:43 (26050): Guest Log: [INFO] New Job Starting in slot3
2016-12-29 23:46:43 (26050): Guest Log: [INFO] Condor JobID: 9310 in slot3
2016-12-29 23:46:52 (26050): Guest Log: [INFO] WMAgent_JobID = 108156 in slot4
2016-12-29 23:46:52 (26050): Guest Log: [INFO] WMAgent_JobID = 108155 in slot3



However, slots with the same Condor job id do not end at the same time.

2016-12-30 02:04:20 (26050): Guest Log: [INFO] Job finished in slot4 with unknown exit code.
2016-12-30 02:06:15 (26050): Guest Log: [INFO] New Job Starting in slot4
2016-12-30 02:06:15 (26050): Guest Log: [INFO] Condor JobID: 9374 in slot4
2016-12-30 02:06:20 (26050): Guest Log: [INFO] Job finished in slot1 with unknown exit code.
2016-12-30 02:06:30 (26050): Guest Log: [INFO] WMAgent_JobID = 108478 in slot4
2016-12-30 02:08:24 (26050): Guest Log: [INFO] New Job Starting in slot1
2016-12-30 02:08:24 (26050): Guest Log: [INFO] Condor JobID: 9375 in slot1
2016-12-30 02:08:33 (26050): Guest Log: [INFO] WMAgent_JobID = 108486 in slot1
2016-12-30 02:11:05 (26050): Guest Log: [INFO] Job finished in slot3 with unknown exit code.
2016-12-30 02:13:46 (26050): Guest Log: [INFO] Job finished in slot2 with unknown exit code.

I think you are seeing Condor "subjobs" (without going searching, I don't recall the exact term). In our CRAB jobs there is one job per condor "group" (ditto), with a suffix .0 that isn't reported. With the WMAgent jobs, each "group" gas heveral "subjobs", e.g. here's an excerpt from condor_q on Laurence's Condor server:

10485.0 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.1 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.2 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.3 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.4 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.5 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.6 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.7 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.8 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.9 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo

i.e. there are 10 "subjobs" in each "group", .0 to .9 but we don't show this extra information.
ID: 4595 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4596 - Posted: 31 Dec 2016, 13:06:17 UTC - in response to Message 4595.  
Last modified: 31 Dec 2016, 13:23:17 UTC

That line appears to come from /cvmfs/grid.cern.ch/vc/bin/job-wrapper
log_info "Condor JobID: $(grep ^ClusterId ${_CONDOR_SCRATCH_DIR}/.job.ad | cut -d '=' -f2) in ${_CONDOR_SLOT}" >> /var/www/html/logs/stdout.log

Perhaps Laurence could modify it to add the extra suffix, it looks like it's available in the .job.ad as "ProcID" or the full thing in "GlobalJobId".
ID: 4596 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4597 - Posted: 31 Dec 2016, 18:42:22 UTC

i.e. there are 10 "subjobs" in each "group", .0 to .9 but we don't show this extra information.



Thanks, Ivan.

How does this translate to jobs listed on dashboard?

If there are up to 10 sub-jobs, what is shown? Or is it no longer really possible to view the outcome of a single job?
ID: 4597 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4598 - Posted: 31 Dec 2016, 23:52:10 UTC - in response to Message 4597.  

i.e. there are 10 "subjobs" in each "group", .0 to .9 but we don't show this extra information.

Thanks, Ivan.

How does this translate to jobs listed on dashboard?

If there are up to 10 sub-jobs, what is shown? Or is it no longer really possible to view the outcome of a single job?

OK, so the proper terminology seems to be jobs and clusters, rather than subjobs and groups, so I'm fairly sure that what Dashboard reports as a job is a job.
In the Dashboard interactive view, for the CRAB3 jobs it gives the familiar integer number for IDdinTask but for the WMAgent jobs that column has entries like 008371f2-cf0b-11e6-bebc-02163e018309-0_0. So it may be harder to translate to a JobID for tracing logs, but I'm sure Dashboard is reporting individual jobs, not integrated clusters.
ID: 4598 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile Laurence
Project administrator
Project developer
Project tester
Avatar

Send message
Joined: 12 Sep 14
Posts: 1064
Credit: 325,885
RAC: 273
Message 4600 - Posted: 2 Jan 2017, 21:01:57 UTC - in response to Message 4596.  


Perhaps Laurence could modify it to add the extra suffix, it looks like it's available in the .job.ad as "ProcID" or the full thing in "GlobalJobId".


Appended the ProcId. If you would prefer the GlobalJobId, just let me know.
ID: 4600 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Avatar

Send message
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4601 - Posted: 3 Jan 2017, 1:02:02 UTC - in response to Message 4600.  


Perhaps Laurence could modify it to add the extra suffix, it looks like it's available in the .job.ad as "ProcID" or the full thing in "GlobalJobId".


Appended the ProcId. If you would prefer the GlobalJobId, just let me know.

Cheers, Laurence. I think they're equivalent. Main thing is to prevent the sort of confusion that started this conversation.
ID: 4601 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote
Rasputin42
Volunteer tester

Send message
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4602 - Posted: 3 Jan 2017, 13:59:15 UTC

Thanks, guys.
At least i can tell them apart, now.
ID: 4602 · Rating: 0 · rate: Rate + / Rate - Report as offensive     Reply Quote

Message boards : CMS Application : New Version v47.90


©2024 CERN