Message boards : CMS Application : Batch Progress

Rasputin42
Volunteer tester

Message 3515 - Posted: 28 May 2016, 19:10:35 UTC
Last modified: 28 May 2016, 19:12:48 UTC

I prefer 99.68% SUCCESS. We should try to understand what happened to the 32 jobs that failed.


Actually, 184 jobs are missing, so there may be more failures.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Message 3516 - Posted: 28 May 2016, 19:17:37 UTC - in response to Message 3509.  

Looks like I can't use a calculator:
[homepc01:Downloads] > grep root 00*.html|wc
9811 127543 3214777

ivan
Message 3518 - Posted: 28 May 2016, 19:48:04 UTC - in response to Message 3514.  

If you can give me some examples of missing files I can investigate.

I'll see what I can dig out, Laurence. It may be a bit messy, but that's what bash, awk and python are for...

[homepc01:Downloads] > cat jobs.awk
{ split($0,a,"_"); split(a[2],b,"."); print b[1];}
[homepc01:Downloads] > cat jobs2.awk
BEGIN {i=1;}
{ while (i != ($1+0)) {print i; i=i+1;}
i=i+1;}
[homepc01:Downloads] > grep root 00*.html|gawk -f jobs.awk|sort -n|gawk -f jobs2.awk>missing.jobs
[homepc01:Downloads] > wc missing.jobs
189 189 916 missing.jobs

So that's the list of missing jobs on the data-bridge. Now to get the list of Condor failures...
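
For anyone who would rather do the same gap-finding in Python, here is a minimal sketch of the idea behind jobs.awk and jobs2.awk. It assumes output file names of the form `prefix_NUMBER.root` and that the batch size is known; the function names, the sample file names and the `total_jobs` parameter are made up for illustration and are not part of the original scripts.

```python
import re

def job_number(name):
    """Return the job number from a name like 'prefix_123.root', else None."""
    match = re.search(r"_(\d+)\.", name)
    return int(match.group(1)) if match else None

def missing_jobs(names, total_jobs):
    """Return the sorted job numbers in 1..total_jobs with no output file."""
    present = {n for n in map(job_number, names) if n is not None}
    return [n for n in range(1, total_jobs + 1) if n not in present]

# Hypothetical file names, for illustration only.
names = ["out_1.root", "out_2.root", "out_4.root", "out_4.root", "out_7.root"]
print(missing_jobs(names, total_jobs=8))  # [3, 5, 6, 8]
```

Two differences from the awk version are worth noting. jobs2.awk only reports gaps below the highest job number it sees, so a missing final job would go unnoticed, and a repeated job number would keep its while loop running because i has already moved past $1. The set-based sketch avoids both, at the cost of needing the batch size up front.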

ivan
Message 3519 - Posted: 28 May 2016, 19:56:20 UTC - in response to Message 3518.  

Ah, that was easy. :-)
[cms005@lcggwms02:~] > grep -B 1 'NodeStatus = 6' 160518_203523:ireid_crab_CMS_at_Home_TTbar_50ev_prodB/node_state.txt|grep Job>condor.fails
[cms005@lcggwms02:~] > wc condor.fails
32 96 620 condor.fails

Now a bit of hand editing...
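
The grep above, plus the hand editing, could probably be folded into one step. The sketch below is a rough Python equivalent under an assumption taken from the grep command itself: that each `NodeStatus = 6` line is directly preceded by a line of the form `Node = "JobN";`. The function name and the sample text are hypothetical, and the real node_state.txt may well contain other lines.

```python
import re

def failed_job_numbers(text, status=6):
    """Return job numbers whose Node line directly precedes 'NodeStatus = <status>'."""
    status_re = re.compile(rf"NodeStatus = {status}\b")
    node_re = re.compile(r'Node = "Job(\d+)"')
    failed = []
    previous = ""
    for line in text.splitlines():
        if status_re.search(line):
            # Same idea as grep -B 1: look at the line just before the match.
            match = node_re.search(previous)
            if match:
                failed.append(int(match.group(1)))
        previous = line
    return failed

# Hypothetical excerpt, for illustration only; the real file layout may differ.
sample = """\
  Node = "Job11";
  NodeStatus = 5;
  Node = "Job12";
  NodeStatus = 6;
  Node = "Job13";
  NodeStatus = 6;
"""
print(failed_job_numbers(sample))  # [12, 13]
```

This returns plain integers, so the result can be compared directly with a list of missing job numbers.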

ivan
Message 3520 - Posted: 28 May 2016, 20:09:23 UTC - in response to Message 3519.  

So, here's the list of jobs Condor thought succeeded, but whose output isn't on the data-bridge:

11 209 300 413 475 498 677 824 1205 1313 1358 1421 1485 1493 1652 1970
1977 2000 2280 2297 2397 2410 2509 2535 2588 2618 2637 2703 2824 2838
2898 2943 2960 3109 3185 3232 3252 3295 3462 3688 3725 3776 3889 3971
4214 4285 4782 5212 5355 5362 5487 5732 5761 5799 5870 5937 5962 5973
5989 5999 6030 6039 6050 6055 6056 6061 6066 6068 6078 6079 6084 6085
6086 6088 6090 6095 6096 6116 6119 6126 6132 6137 6274 6281 6388 6609
6659 6660 6801 6839 6899 6985 7126 7208 7216 7295 7434 7455 7495 7506
7583 7615 7805 8072 8131 8162 8212 8502 8513 8571 8611 8613 8620 8708
8729 8784 8786 8820 8867 8877 8885 8961 8997 8999 9029 9077 9114 9190
9191 9193 9222 9292 9317 9334 9370 9403 9406 9412 9429 9430 9444 9453
9462 9468 9489 9632 9665 9682 9694 9697 9701 9774 9847 9849 9882 9899
9958 9967 9985

In addition, Condor thinks these two jobs failed, but their output is on the data-bridge (DB):

9927 9964
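
The cross-check behind these two lists is just set arithmetic. Here is a small sketch, assuming both lists have already been reduced to plain job numbers; the function name, dictionary keys and sample numbers are illustrative only.

```python
def reconcile(missing_on_bridge, condor_failed):
    """Split job numbers by how the data-bridge and Condor views disagree."""
    missing = set(missing_on_bridge)
    failed = set(condor_failed)
    return {
        "succeeded_but_missing": sorted(missing - failed),
        "failed_but_present": sorted(failed - missing),
        "failed_and_missing": sorted(missing & failed),
    }

# Hypothetical job numbers, for illustration only.
result = reconcile(missing_on_bridge=[3, 5, 6, 8], condor_failed=[5, 8, 9])
print(result["succeeded_but_missing"])  # [3, 6]
print(result["failed_but_present"])     # [9]
print(result["failed_and_missing"])     # [5, 8]
```

The counts in this thread are consistent with that picture: 189 jobs are missing from the data-bridge and Condor reports 32 failures, 2 of which do have output, so 30 jobs are both failed and missing. That leaves 189 - 30 = 159 jobs that Condor thought succeeded but whose output is missing, which matches the number of entries in the list above.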

ivan
Message 3537 - Posted: 4 Jun 2016, 20:39:04 UTC
Last modified: 4 Jun 2016, 20:49:49 UTC

Seems a server problem has just arisen at CERN, as I was about to submit a new batch of CRAB jobs. It's being worked on but no estimate of how long it will take. We'll be short of jobs until it's fixed, unfortunately.

It might cause Dashboard to report errors, too -- we seem to be starting an uptick in errors on the graphs.

ivan
Message 3538 - Posted: 5 Jun 2016, 12:39:26 UTC - in response to Message 3537.  
Last modified: 5 Jun 2016, 12:39:38 UTC

In the meantime, CERN IT are deploying an alternative server.

ivan
Message 3539 - Posted: 5 Jun 2016, 16:44:50 UTC - in response to Message 3537.  
Last modified: 5 Jun 2016, 16:45:12 UTC

I'm pleased to report that either the server has been fixed, or a replacement configured. Jobs are available again.