Message boards : News : New jobs available
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 116
Message 1330 - Posted: 26 Oct 2015, 11:40:18 UTC

OK, the database fix appears to have been applied; I have been able to submit a new batch of jobs.
Phil

Joined: 9 Apr 15
Posts: 57
Credit: 230,221
RAC: 0
Message 1331 - Posted: 26 Oct 2015, 13:34:43 UTC

Yep, done 2 already.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 116
Message 1332 - Posted: 26 Oct 2015, 16:35:14 UTC - in response to Message 1331.  

Yep, done 2 already.

I make it 3 now. :-)
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 361
Message 1362 - Posted: 28 Oct 2015, 19:19:09 UTC - in response to Message 1328.  
Last modified: 28 Oct 2015, 19:20:22 UTC


Of those successful jobs:

28 needed 3 attempts to finish and 249 succeeded after 2 attempts.
IMO that number of extra attempts is excessive.

Has anybody carefully followed what happens to those jobs that are abandoned when the host is shut down, rebooted or whatever? Could they show up like this:

"N/A / Error return without specification"? Many initial failures do. Maybe Condor puts the "error" bit in because some timeout or heartbeat fails; there certainly wouldn't be a "specification".

They all require at least one retry. At the moment there will probably be a higher rate of these events because people are "poking around", looking for problems and suchlike. Many hosts (mine...) don't run continuously, which tends towards one extra attempt per host per day. IMO you can't reasonably require volunteers to run machines continuously, although I realise many do.

I've followed up two jobs from the 50-event task, 281 and 305. Both were
started but abandoned when the hosts were shut down. Dashboard shows my IP but
the (presumably) final host's start and stop times. I can't see any indication
on Dashboard that the jobs weren't run to completion on the hosts which
originally picked them up. No retries are shown. So, at least for successful
jobs, the high retry rate isn't due to these "abandoned" jobs - they show up as
normally successful jobs.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 320
Message 1409 - Posted: 2 Nov 2015, 9:54:41 UTC
Last modified: 2 Nov 2015, 10:06:36 UTC

From the batch you submitted today, 151102_084842:ireid_crab_CMS_at_Home_TTbar_50ev_3, the first 3 returned jobs failed with 8001 / CMS exception (CMSSW).
All 3 came from the same IP and ran for (too) short a time to be true.

Edit: the 4th job finished with a success code.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 116
Message 1413 - Posted: 2 Nov 2015, 14:30:53 UTC - in response to Message 1409.  

From the batch you submitted today, 151102_084842:ireid_crab_CMS_at_Home_TTbar_50ev_3, the first 3 returned jobs failed with 8001 / CMS exception (CMSSW).
All 3 came from the same IP and ran for (too) short a time to be true.

Edit: the 4th job finished with a success code.

I see three job logs with 8001 errors (7, 17, and 23), and they do all come from the same machine. It has 13 errors in all today. I'll see what I can find.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1868 - Posted: 5 Feb 2016, 0:42:53 UTC

Some of our computers are re-assigned to another backfill batch of 200 jobs.
Looks like they are failing, again.

©2024 CERN