Message boards : CMS Application : Batch Progress

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 3475 - Posted: 23 May 2016, 11:33:48 UTC

[cms005@lcggwms02:~] > cat stats.sh
#!/bin/bash
grep 'NodeStatus ' $1/node_state.txt|sort|uniq -c
Mon May 23 12:30:46

[cms005@lcggwms02:~] > ./stats.sh 160518_203523:ireid_crab_CMS_at_Home_TTbar_50ev_prodB
4173 NodeStatus = 1; /* "STATUS_READY" */
1075 NodeStatus = 3; /* "STATUS_SUBMITTED" */
4745 NodeStatus = 5; /* "STATUS_DONE" */
7 NodeStatus = 6; /* "STATUS_ERROR" */
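
For anyone who would rather do the same tally in Python, here is a minimal sketch of what stats.sh does. It assumes the lines in node_state.txt look exactly like the ones in the output above; the function name and the sample data are made up for illustration.

```python
import re
from collections import Counter

# Matches lines such as: NodeStatus = 5; /* "STATUS_DONE" */
# (format assumed from the stats.sh output shown in this thread)
NODE_STATUS = re.compile(r'NodeStatus = (\d+); /\* "(\w+)" \*/')


def count_node_states(lines):
    """Tally nodes per state, like grep 'NodeStatus ' | sort | uniq -c."""
    counts = Counter()
    for line in lines:
        match = NODE_STATUS.search(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample; in practice, pass the open node_state.txt file instead.
    sample = [
        'NodeStatus = 5; /* "STATUS_DONE" */',
        'NodeStatus = 1; /* "STATUS_READY" */',
        'NodeStatus = 5; /* "STATUS_DONE" */',
    ]
    for (code, name), n in sorted(count_node_states(sample).items()):
        print(f'{n:7d} NodeStatus = {code}; /* "{name}" */')
```

To run it against a real batch, something like `count_node_states(open(path))` should work, since a file object yields its lines one at a time.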

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3476 - Posted: 23 May 2016, 12:29:24 UTC - in response to Message 3475.  

Thanks for the info.
How can there be fewer "submitted" than "Ready" or "Done"?

ivan
Message 3479 - Posted: 23 May 2016, 14:05:19 UTC - in response to Message 3476.  

> Thanks for the info.
> How can there be fewer "submitted" than "Ready" or "Done"?

When I send a batch (10,000 jobs in this case), all the jobs start out as "ready". Then up to ~1,000 are moved into the queue and become "submitted". As jobs are taken up by processes, more jobs move into the queue to replace them, going from "ready" to "submitted"; I think jobs that are re-queued for a retry are also "submitted". I'm not sure what state running jobs are in, but probably also "submitted", since the sum of "idle" and "running" is about the number submitted. And of course jobs that succeed get to "done", while those that fail three tries (or hit other errors) go into "error". Note that the four categories add up to the total number in the batch.
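
The "categories add up to the batch total" point can be checked directly. A small sketch, using the counts from the 23 May stats.sh output at the top of the thread; the helper name is just for illustration.

```python
# Counts copied from the 23 May stats.sh output in this thread.
counts = {
    "STATUS_READY": 4173,
    "STATUS_SUBMITTED": 1075,
    "STATUS_DONE": 4745,
    "STATUS_ERROR": 7,
}
BATCH_SIZE = 10_000  # jobs in the batch, as stated above


def unaccounted(counts, batch_size):
    """Jobs in the batch that do not appear in any state (should be 0)."""
    return batch_size - sum(counts.values())


print(unaccounted(counts, BATCH_SIZE))  # prints 0
```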

Rasputin42
Message 3480 - Posted: 23 May 2016, 15:57:17 UTC - in response to Message 3479.  

Thanks for the explanation Ivan.
I thought "submitted" meant the same as in Dashboard.

ivan
Message 3481 - Posted: 23 May 2016, 17:58:08 UTC - in response to Message 3480.  

No, that's from Condor itself. Confusing terminology, I must admit...

Rasputin42
Message 3482 - Posted: 23 May 2016, 18:59:54 UTC

How is the proxy lease?
Does it not need a refresh soon?

ivan
Message 3483 - Posted: 23 May 2016, 20:56:54 UTC - in response to Message 3482.  

> How is the proxy lease?
> Does it not need a refresh soon?

160518_203523, so due on the 25th. Current stats:
3776 NodeStatus = 1; /* "STATUS_READY" */
1084 NodeStatus = 3; /* "STATUS_SUBMITTED" */
1 NodeStatus = 4; /* "STATUS_POSTRUN" */
5132 NodeStatus = 5; /* "STATUS_DONE" */
7 NodeStatus = 6; /* "STATUS_ERROR" */
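
A rough sketch of the date arithmetic behind "due on the 25th". It assumes the leading part of the task name is a YYMMDD_HHMMSS submission timestamp and that the proxy lasts 7 days; the 7-day figure is only inferred from this post, not taken from any documentation, and the function names are made up.

```python
from datetime import datetime, timedelta

# Assumption: 7-day proxy lifetime, inferred from "160518... so due on the 25th".
ASSUMED_PROXY_LIFETIME = timedelta(days=7)


def submission_time(task_name):
    """Parse the leading YYMMDD_HHMMSS stamp of a task name (assumed format)."""
    stamp = task_name.split(":", 1)[0]
    return datetime.strptime(stamp, "%y%m%d_%H%M%S")


def proxy_due(task_name, lifetime=ASSUMED_PROXY_LIFETIME):
    """Estimate when the proxy for this task needs renewing."""
    return submission_time(task_name) + lifetime


task = "160518_203523:ireid_crab_CMS_at_Home_TTbar_50ev_prodB"
print(proxy_due(task).date())  # prints 2016-05-25
```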

ivan
Message 3484 - Posted: 23 May 2016, 20:59:06 UTC

Previous large batch:
[cms005@lcggwms02:~] > ./stats.sh 160509_200134:ireid_crab_CMS_at_Home_TTbar_50ev_prodA
9881 NodeStatus = 5; /* "STATUS_DONE" */
119 NodeStatus = 6; /* "STATUS_ERROR" */

ivan
Message 3488 - Posted: 25 May 2016, 16:40:18 UTC

Current batch:
[cms005@lcggwms02:~] > ./stats.sh 160518_203523:ireid_crab_CMS_at_Home_TTbar_50ev_prodB
1771 NodeStatus = 1; /* "STATUS_READY" */
1095 NodeStatus = 3; /* "STATUS_SUBMITTED" */
7126 NodeStatus = 5; /* "STATUS_DONE" */
8 NodeStatus = 6; /* "STATUS_ERROR" */


Proxy renewed today. :-)
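
With two snapshots of the same batch (23 May 20:56 UTC and this one), a completion rate and a rough finish time can be estimated. This is only a back-of-envelope sketch: it assumes the rate stays constant, which volunteer computing does not guarantee, and it uses the post times as the snapshot times. The function names are made up.

```python
from datetime import datetime, timedelta


def completion_rate(t0, done0, t1, done1):
    """Jobs reaching STATUS_DONE per hour between two snapshots."""
    hours = (t1 - t0).total_seconds() / 3600
    return (done1 - done0) / hours


def estimated_finish(t1, remaining, rate_per_hour):
    """Naive finish estimate, assuming the rate stays constant."""
    return t1 + timedelta(hours=remaining / rate_per_hour)


# Snapshot times are the post times (UTC); counts are from the posts.
t0, done0 = datetime(2016, 5, 23, 20, 56, 54), 5132
t1, done1 = datetime(2016, 5, 25, 16, 40, 18), 7126
remaining = 1771 + 1095  # READY + SUBMITTED in the later snapshot

rate = completion_rate(t0, done0, t1, done1)
print(f"{rate:.1f} jobs/hour, finish around {estimated_finish(t1, remaining, rate)}")
```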

Rasputin42
Message 3489 - Posted: 25 May 2016, 17:20:51 UTC

Thanks, Ivan!

ivan
Message 3504 - Posted: 27 May 2016, 15:30:33 UTC

[cms005@lcggwms02:~] > ./stats.sh 160518_203523:ireid_crab_CMS_at_Home_TTbar_50ev_prodB
617 NodeStatus = 3; /* "STATUS_SUBMITTED" */
9356 NodeStatus = 5; /* "STATUS_DONE" */
27 NodeStatus = 6; /* "STATUS_ERROR" */


I'll have to submit a new batch sometime over the (long) weekend.

ivan
Message 3505 - Posted: 27 May 2016, 22:47:33 UTC

New batch (5,000 x 50 TTbar events) submitted. Let's see if Dashboard is fully operational again...

ivan
Message 3507 - Posted: 28 May 2016, 10:15:05 UTC - in response to Message 3504.  

Final status:
[cms005@lcggwms02:~] > ./stats.sh 160518_203523:ireid_crab_CMS_at_Home_TTbar_50ev_prodB
9968 NodeStatus = 5; /* "STATUS_DONE" */
32 NodeStatus = 6; /* "STATUS_ERROR" */

Rasputin42
Message 3508 - Posted: 28 May 2016, 12:02:50 UTC - in response to Message 3507.  

Congrats!
Best one yet.

Only 0.32% ERROR.
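
The percentage is just errors over finished jobs. A quick sketch comparing this batch with the previous large one, using the final counts posted in this thread; the helper name is made up.

```python
def error_rate_percent(done, errors):
    """Share of finished jobs (done + error) that ended in error, in percent."""
    return 100.0 * errors / (done + errors)


# Final (done, error) counts as posted in this thread.
batches = {"prodA": (9881, 119), "prodB": (9968, 32)}

for name, (done, errors) in batches.items():
    print(f"{name}: {error_rate_percent(done, errors):.2f}% error")
# prints:
# prodA: 1.19% error
# prodB: 0.32% error
```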

ivan
Message 3509 - Posted: 28 May 2016, 17:06:33 UTC - in response to Message 3508.  

> Congrats!
> Best one yet.
>
> Only 0.32% ERROR.

Looks like we lost a few in the system, though -- I only count 9,816 result files on the data-bridge. Some may yet turn up, but it's doubtful.

Rasputin42
Message 3510 - Posted: 28 May 2016, 17:23:04 UTC

> Looks like we lost a few in the system, though -- I only count 9,816 result files on the data-bridge. Some may yet turn up, but it's doubtful.


Does that include the error jobs?

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 3511 - Posted: 28 May 2016, 18:36:06 UTC - in response to Message 3508.  

I prefer 99.68% SUCCESS. We should try to understand what happened to the 32 jobs that failed.

Laurence
Message 3512 - Posted: 28 May 2016, 18:38:17 UTC - in response to Message 3509.  

If you can give me some examples of missing files I can investigate.

ivan
Message 3513 - Posted: 28 May 2016, 19:01:05 UTC - in response to Message 3510.  

>> Looks like we lost a few in the system, though -- I only count 9,816 result files on the data-bridge. Some may yet turn up, but it's doubtful.
>
> Does that include the error jobs?

Yes, I presume they didn't make it all the way through for some reason or another.

ivan
Message 3514 - Posted: 28 May 2016, 19:02:23 UTC - in response to Message 3512.  
Last modified: 28 May 2016, 19:02:57 UTC

> If you can give me some examples of missing files I can investigate.

I'll see what I can dig out, Laurence. It may be a bit messy, but that's what bash, awk and python are for...
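
One way the digging could go in Python is a plain set difference between the job numbers expected and the job numbers seen in the result file names. Everything specific here is an assumption: that jobs are numbered 1..N, and that the job number can be pulled out of the file name with a regex. The `_<n>.root` pattern and the file names below are hypothetical placeholders, not the real data-bridge naming.

```python
import re


def missing_job_ids(expected_ids, filenames, pattern=r"_(\d+)\.root$"):
    """Return expected job IDs that have no matching result file.

    The default pattern is hypothetical; adjust it to the real naming scheme.
    """
    rx = re.compile(pattern)
    found = set()
    for name in filenames:
        match = rx.search(name)
        if match:
            found.add(int(match.group(1)))
    return sorted(set(expected_ids) - found)


# Made-up listing standing in for the data-bridge contents.
files = ["out_1.root", "out_2.root", "out_4.root"]
print(missing_job_ids(range(1, 6), files))  # prints [3, 5]
```

For the real batch, `expected_ids` would be `range(1, 10001)` and `filenames` the actual listing from the data-bridge.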