Message boards : CMS Application : Batch Progress
Author | Message |
---|---|
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I prefer 99.68% SUCCESS. We should try to understand what happened to the 32 jobs that failed. Actually, 184 jobs are missing, so there may be more failures. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
Looks like I can't use a calculator: [homepc01:Downloads] > grep root 00*.html|wc 9811 127543 3214777 |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
If you can give me some examples of missing files I can investigate. [homepc01:Downloads] > cat jobs.awk { split($0,a,"_"); split(a[2],b,"."); print b[1];} [homepc01:Downloads] > cat jobs2.awk BEGIN {i=1;} { while (i != ($1+0)) {print i; i=i+1;} i=i+1;} [homepc01:Downloads] > grep root 00*.html|gawk -f jobs.awk|sort -n|gawk -f jobs2.awk>missing.jobs [homepc01:Downloads] > wc missing.jobs 189 189 916 missing.jobs So that's the list of missing jobs on the data-bridge. Now to get the list of Condor failures... |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
Ah, that was easy. :-) [cms005@lcggwms02:~] > grep -B 1 'NodeStatus = 6' 160518_203523:ireid_crab_CMS_at_Home_TTbar_50ev_prodB/node_state.txt|grep Job>condor.fails [cms005@lcggwms02:~] > wc condor.fails 32 96 620 condor.fails Now a bit of hand editing... |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
So, here's the list of jobs Condor thought succeeded, but whose output isn't on the data-bridge: 11 209 300 413 475 498 677 824 1205 1313 1358 1421 1485 1493 1652 1970 1977 2000 2280 2297 2397 2410 2509 2535 2588 2618 2637 2703 2824 2838 2898 2943 2960 3109 3185 3232 3252 3295 3462 3688 3725 3776 3889 3971 4214 4285 4782 5212 5355 5362 5487 5732 5761 5799 5870 5937 5962 5973 5989 5999 6030 6039 6050 6055 6056 6061 6066 6068 6078 6079 6084 6085 6086 6088 6090 6095 6096 6116 6119 6126 6132 6137 6274 6281 6388 6609 6659 6660 6801 6839 6899 6985 7126 7208 7216 7295 7434 7455 7495 7506 7583 7615 7805 8072 8131 8162 8212 8502 8513 8571 8611 8613 8620 8708 8729 8784 8786 8820 8867 8877 8885 8961 8997 8999 9029 9077 9114 9190 9191 9193 9222 9292 9317 9334 9370 9403 9406 9412 9429 9430 9444 9453 9462 9468 9489 9632 9665 9682 9694 9697 9701 9774 9847 9849 9882 9899 9958 9967 9985 In addition, Condor thinks these two jobs failed, but their output is on the DB: 9927 9964 |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
Seems a server problem has just arisen at CERN, as I was about to submit a new batch of CRAB jobs. It's being worked on but no estimate of how long it will take. We'll be short of jobs until it's fixed, unfortunately. It might cause Dashboard to report errors, too -- we seem to be starting an uptick in errors on the graphs. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
In the meantime, CERN IT are deploying an alternative server. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
I'm pleased to report that either the server has been fixed, or a replacement configured. Jobs are available again. |
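
For anyone wanting to repeat the bookkeeping from the jobs.awk / jobs2.awk posts above without awk, here is a rough Python sketch of the same idea: pull the job number out of each output file name, list the numbers that never show up, then compare that list against the Condor failures. The function names (`job_number`, `missing_jobs`, `reconcile`) and the sample file names are made up for illustration, and the `something_<N>.root` pattern is only an assumption based on how jobs.awk splits on "_" and ".". It is not the script that was actually run.

```python
import re


def job_number(line):
    """Roughly mimic jobs.awk: text between the first '_' and the next '.'.

    Assumes names shaped like 'something_<N>.root' (hypothetical pattern).
    Returns None when no leading digits are found.
    """
    fields = line.split("_")
    if len(fields) < 2:
        return None
    head = fields[1].split(".")[0]
    match = re.match(r"\d+", head)
    return int(match.group()) if match else None


def missing_jobs(present, total=None):
    """Return the job numbers, counting from 1, that are absent from `present`.

    Without `total` this only reports gaps below the highest number seen,
    which is also all that jobs2.awk can report. Pass `total` (the batch
    size) to catch jobs missing from the tail end as well.
    """
    seen = {n for n in present if n is not None}
    last = total if total is not None else max(seen, default=0)
    return [n for n in range(1, last + 1) if n not in seen]


def reconcile(missing, condor_failed):
    """Split jobs into the three groups discussed in the thread."""
    missing, condor_failed = set(missing), set(condor_failed)
    return {
        # Condor reported a failure and there is no output: expected.
        "failed_and_missing": sorted(missing & condor_failed),
        # Condor reported success but the output never arrived.
        "succeeded_but_missing": sorted(missing - condor_failed),
        # Condor reported a failure but the output is there anyway.
        "failed_but_present": sorted(condor_failed - missing),
    }


if __name__ == "__main__":
    # Tiny made-up batch of 8 jobs; the names are illustrative only.
    lines = ["out_1.root", "out_2.root", "out_5.root", "out_6.root"]
    present = [job_number(line) for line in lines]
    gaps = missing_jobs(present, total=8)
    print(gaps)
    print(reconcile(gaps, condor_failed=[4, 6]))
```

Using sets means the input does not need to be sorted first, and a job number that appears twice is harmless, whereas the loop in jobs2.awk would never terminate on a repeated number.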