Thread 'Expect errors eventually'

Author	Message
ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2440 - Posted: 17 Mar 2016, 20:40:06 UTC - in response to Message 2439. That's getting beyond my pay-grade, I'm afraid! :-)

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 2441 - Posted: 17 Mar 2016, 20:44:15 UTC - in response to Message 2435. This makes sense. You can see that although there are many job failures, this does not translate into lost walltime. Those jobs are probably failing on the server, so they are not even sent to the VM. The walltime plot shows a few failures, probably because the finished job could not contact the server.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 2442 - Posted: 17 Mar 2016, 20:53:05 UTC - in response to Message 2439. It just means that once the condor client in the VM has matched a job, it will keep getting jobs without having to be re-matched. It is sort of like keeping a session open across connections. It makes things more efficient, and as we are not doing complex scheduling with priorities and our VMs will only last for 12 h, it should not have any negative effects. It should, however, avoid the "claim expired" error with suspend/resume.
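As a rough illustration of the claim reuse described above, here is a toy Python model. It is not HTCondor code and makes no claim about how the project is actually configured; `count_matches` and its parameters are hypothetical names invented for this sketch. It only counts how many matchmaking rounds one VM needs to run a given number of jobs, with and without keeping its claim.

```python
def count_matches(n_jobs: int, reuse_claim: bool) -> int:
    """Toy model: matchmaking rounds needed for one VM to run n_jobs.

    Assumption for illustration only: a VM that holds a claim can start
    the next job directly; a VM without a claim must be matched first.
    """
    matches = 0
    claimed = False
    for _ in range(n_jobs):
        if not claimed:
            # No claim held: the VM has to go through matchmaking.
            matches += 1
            claimed = True
        # (the job would run here)
        if not reuse_claim:
            # Claim is given up after every job, forcing a re-match.
            claimed = False
    return matches


if __name__ == "__main__":
    print("with claim reuse:   ", count_matches(10, reuse_claim=True))
    print("without claim reuse:", count_matches(10, reuse_claim=False))
```

In this model a VM that keeps its claim is matched once for its whole lifetime, while a VM that releases the claim after every job is matched once per job.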

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 2443 - Posted: 17 Mar 2016, 20:53:42 UTC - in response to Message 2442. Discussions like these should now be in the CMS application topic.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2444 - Posted: 17 Mar 2016, 21:28:16 UTC Maybe you should put a big note somewhere to stop this from happening again.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2445 - Posted: 17 Mar 2016, 21:48:59 UTC - in response to Message 2444. Quote: "Maybe you should put a big note somewhere to stop this from happening again." +1. I'll ask RAL.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2446 - Posted: 17 Mar 2016, 22:06:54 UTC - in response to Message 2441. Quote: "This makes sense. You can see that although there are many job failures, this does not translate into walltime lost. Those jobs are probably failing in the server so not even sent to the VM. The walltime plot shows a few failures, probably because the finished job could not contact the server." Maybe not walltime lost as such, but nearly all of the failed jobs ran for 20 min. This adds up to quite a lot of wasted time.

Development for LHC@home