Expect errors eventually
| Author | Message |
|---|---|
ivan Joined: 20 Jan 15 Posts: 1153 Credit: 8,310,612 RAC: 0 |
|
Laurence CERN Joined: 12 Sep 14 Posts: 1150 Credit: 342,328 RAC: 0 |
This makes sense. You can see that although there are many job failures, this does not translate into walltime lost. Those jobs are probably failing in the server so not even sent to the VM. The walltime plot shows a few failures, probably because the finished job could not contact the server. |
Laurence CERN Joined: 12 Sep 14 Posts: 1150 Credit: 342,328 RAC: 0 |
It just means that once the condor client in the VM has matched a job, it will keep getting jobs without having to be re-matched. Sort of like keeping a session open with connections. It makes things more efficient, and since we are not doing complex scheduling with priorities and our VMs only last for 12 h, it should not have any negative effects. It should, however, avoid the "claim expired" error on suspend/resume. |
Laurence CERN Joined: 12 Sep 14 Posts: 1150 Credit: 342,328 RAC: 0 |
Discussions like these should now be in the CMS application topic. |
|
Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 51 |
Maybe you should put a big note somewhere to stop this from happening again. |
ivan Joined: 20 Jan 15 Posts: 1153 Credit: 8,310,612 RAC: 0 |
|
|
Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 51 |
"This makes sense. You can see that although there are many job failures, this does not translate into walltime lost. Those jobs are probably failing in the server so not even sent to the VM. The walltime plot shows a few failures, probably because the finished job could not contact the server." Maybe not walltime lost, but nearly all of the failures ran for 20 min. This adds up to quite a lot of wasted time. |
©2025 CERN