Message boards :
CMS Application :
New Version v47.90
Author | Message |
---|---|
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
This new version should exclude the Merge jobs. Please report any jobs with higher than normal bandwidth usage (> 100 MB). |
Send message Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0 |
With a little help from "uncle squid" the first VM started up quickly and got 2 cmsRun jobs. If they need about the same time to run as the non-dev jobs, I expect them to finish in 100+ minutes. 2016-12-20 11:34:57 (20845): vboxwrapper (7.7.26196): starting |
Send message Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0 |
A quick look at the download and upload numbers. Downloads: seem roughly the same compared to the jobs of the last version. Uploads: the job results take 60-85 MB/job now; the last version had 15-20 MB/job. This will hurt users with slower internet connections and should be revised. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
A quick look at the download and upload numbers. As long as you are not getting the merge jobs that download 1.5 GB of input data, join them, and upload the output, Ivan can tune the other jobs to control the upload size. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
And the running.log is not filled with all the redundant text any more. Good. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
A quick look at the download and upload numbers. Yes, I'll certainly try to revise this before we go "production". It's a little bit harder for me to get the logs for the WMAgent jobs, and most of them are running on Laurence's "little" cluster rather than on -dev volunteers' machines -- he may have more modern systems than some of you! Over the next couple of weeks I'll be trying to find out a) how to run the workflow(s) directly on my servers, to better judge an optimum job size, and b) how to specify other workflows, similar to the lambda -> p + nu we've been running with CRAB3 the last several months. Expect changes as we go forward -- we'll try to keep them for the better! |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
And the running.log is not filled with all the redundant text any more. I may try to dig out how they achieve this! |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 25 |
And the running.log is not filled with all the redundant text any more. It seems that every event is written now, not every 1,000th. A bit overdone, I think, with probably 200,000 events in the current tasks. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
And the running.log is not filled with all the redundant text any more. Yes, I changed that myself in the lambda workflow; I'm digging into where to specify it with the new system. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 25 |
And the running.log is not filled with all the redundant text any more. It seems these tasks have 4,000 events, thus not too much logging for now. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
And the running.log is not filled with all the redundant text any more. Yes, I'm still getting to grips with how the numbers work here. It looks like these jobs have a programmed "filter efficiency" of 0.02715, compared to the 0.00003 or so of the older campaign, so out of 4000 events in a job, about 108 "useful" events will be returned, compared to ~6 beforehand. Thus the output files are bigger (but not proportionally since most data is returned in ROOT structures, histograms and TTrees). The JSON script I was given to work with asked for 3,000,000 events, so 3,000,000/108.6 = 27,625 total jobs. I'm finding the monitoring for this almost as frustrating as Dashboard, but I'm gradually figuring it out. First estimate is another two days to go in this batch, but most of the jobs will be processed by Laurence's cluster, and hopefully now all the merge jobs will be going to it to conserve your bandwidth. Do keep us informed of your experiences and opinions. |
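The job arithmetic in that post can be sketched as a quick shell check (a sketch only; the event counts and filter efficiency are taken from the post, and the final ceiling reflects that a fractional job still has to be submitted as a whole one):

```shell
# Back-of-envelope check of the batch arithmetic quoted above.
# All figures come from the post; nothing here is measured.
events_per_job=4000
filter_eff=0.02715
total_events=3000000

# Useful events per job: 4000 * 0.02715
useful=$(awk -v n="$events_per_job" -v f="$filter_eff" \
    'BEGIN { printf "%.1f", n * f }')

# Total jobs needed, rounded up to cover the full request.
jobs=$(awk -v t="$total_events" -v u="$useful" \
    'BEGIN { j = t / u; if (j > int(j)) j = int(j) + 1; print j }')

echo "useful events/job: $useful"    # 108.6
echo "total jobs needed: $jobs"      # 27625
```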
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I have noticed that sometimes jobs within the same task have the same Condor Job ID. Is it actually running the same job with 2 cores? These jobs run somewhat faster than jobs with different IDs.
2016-12-29 23:26:56 (26050): Guest Log: [INFO] New Job Starting in slot2
However, slots with the same Condor Job ID do not end at the same time.
2016-12-30 02:04:20 (26050): Guest Log: [INFO] Job finished in slot4 with unknown exit code. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
I have noticed that sometimes jobs within the same task have the same Condor Job ID. I think you are seeing Condor "subjobs" (without going searching, I don't recall the exact term). In our CRAB jobs there is one job per Condor "group" (ditto), with a suffix .0 that isn't reported. With the WMAgent jobs, each "group" has several "subjobs"; e.g. here's an excerpt from condor_q on Laurence's Condor server:
10485.0 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.1 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.2 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.3 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.4 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.5 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.6 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.7 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.8 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
10485.9 cmst1 12/31 12:36 0+00:00:00 I 70000 0.0 submit.sh ireid_Mo
i.e. there are 10 "subjobs" in each "group", .0 to .9, but we don't show this extra information. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
That line appears to come from /cvmfs/grid.cern.ch/vc/bin/job-wrapper:
log_info "Condor JobID: $(grep ^ClusterId ${_CONDOR_SCRATCH_DIR}/.job.ad | cut -d '=' -f2) in ${_CONDOR_SLOT}" >> /var/www/html/logs/stdout.log
Perhaps Laurence could modify it to add the extra suffix; it looks like it's available in the .job.ad as "ProcId", or the full thing in "GlobalJobId". |
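A minimal sketch of that suggested change, assuming the standard Condor job-ad attributes ClusterId and ProcId; the .job.ad file below is fabricated for illustration (in the wrapper it would be the real ${_CONDOR_SCRATCH_DIR}/.job.ad):

```shell
# Sketch: append ProcId to the logged JobID so two slots running
# different procs of the same cluster are distinguishable.
# The job ad here is fabricated; attribute names are standard Condor ones.
ad=$(mktemp)
cat > "$ad" <<'EOF'
ClusterId = 10485
ProcId = 3
EOF

cluster=$(grep '^ClusterId' "$ad" | cut -d '=' -f2 | tr -d ' ')
proc=$(grep '^ProcId' "$ad" | cut -d '=' -f2 | tr -d ' ')
msg="Condor JobID: ${cluster}.${proc}"
echo "$msg"    # Condor JobID: 10485.3
rm -f "$ad"
```

In the real wrapper only the grep/cut fragment inside log_info would change; the temporary file exists just to make this runnable.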
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
i.e. there are 10 "subjobs" in each "group", .0 to .9 but we don't show this extra information. Thanks, Ivan. How does this translate to jobs listed on dashboard? If there are up to 10 sub-jobs, what is shown? Or is it no longer really possible to view the outcome of a single job? |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
i.e. there are 10 "subjobs" in each "group", .0 to .9 but we don't show this extra information. OK, so the proper terminology seems to be jobs and clusters, rather than subjobs and groups, so I'm fairly sure that what Dashboard reports as a job is a job. In the Dashboard interactive view, for the CRAB3 jobs it gives the familiar integer number for IDinTask, but for the WMAgent jobs that column has entries like 008371f2-cf0b-11e6-bebc-02163e018309-0_0. So it may be harder to translate to a JobID for tracing logs, but I'm sure Dashboard is reporting individual jobs, not integrated clusters. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Appended the ProcId. If you would prefer the GlobalJobId, just let me know. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 123 |
Cheers, Laurence. I think they're equivalent. Main thing is to prevent the sort of confusion that started this conversation. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, guys. At least I can tell them apart now. |
©2024 CERN