Message boards : Number crunching : Expect errors eventually

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 2440 - Posted: 17 Mar 2016, 20:40:06 UTC - in response to Message 2439.  

That's getting beyond my pay-grade, I'm afraid! :-)
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2441 - Posted: 17 Mar 2016, 20:44:15 UTC - in response to Message 2435.  

This makes sense. You can see that although there are many job failures, this does not translate into walltime lost. Those jobs are probably failing on the server, so they are never even sent to the VM. The walltime plot shows a few failures, probably because the finished job could not contact the server.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2442 - Posted: 17 Mar 2016, 20:53:05 UTC - in response to Message 2439.  

It just means that once the Condor client in the VM has matched a job, it will keep getting jobs without having to be re-matched, sort of like keeping a session open across connections. It makes things more efficient, and since we are not doing complex scheduling with priorities and our VMs only last for 12 h, it should not have any negative effects. It should, however, avoid the "claim expired" error with suspend/resume.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 129
Message 2443 - Posted: 17 Mar 2016, 20:53:42 UTC - in response to Message 2442.  

Discussions like these should now be in the CMS application topic.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2444 - Posted: 17 Mar 2016, 21:28:16 UTC

Maybe you should put a big note somewhere to stop this from happening again.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 2445 - Posted: 17 Mar 2016, 21:48:59 UTC - in response to Message 2444.  

Rasputin42 wrote: "Maybe you should put a big note somewhere to stop this from happening again."

+1
I'll ask RAL.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2446 - Posted: 17 Mar 2016, 22:06:54 UTC - in response to Message 2441.  

Laurence wrote: "This makes sense. You can see that although there are many job failures, this does not translate into walltime lost. Those jobs are probably failing on the server, so they are never even sent to the VM. The walltime plot shows a few failures, probably because the finished job could not contact the server."


Maybe not walltime lost, but nearly all of the failures ran for 20 min. This adds up to quite a lot of wasted time.


©2024 CERN