Message boards : Number crunching : Expect errors eventually
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1340 - Posted: 26 Oct 2015, 22:11:36 UTC

No-one's commented yet, but some of you may have noticed that this batch of jobs is longer than the last. That's because I want to explore whether we do have problems with longer jobs or larger result files. There has been some suggestion in the past that bigger files may lead to transfer time-outs. So, I'm increasing the number of events in each job to try to provoke failures, to get some idea of where our limits are. I'll probably push on until we get 5-10% failures (i.e. approaching Atlas's failure rate), but at present this batch is looking hopeful.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 101
Message 1341 - Posted: 27 Oct 2015, 10:11:24 UTC - in response to Message 1340.  

Thanks for the warning. Could this explain the unexplained "validate errors" that
are causing some disquiet over there, one wonders. Atlas themselves don't say
much.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1342 - Posted: 27 Oct 2015, 10:13:29 UTC - in response to Message 1341.  

Thanks for the warning. Could this explain the unexplained "validate errors" that
are causing some disquiet over there, one wonders. Atlas themselves don't say
much.
No, I can't comment. I know next-to-nothing about the Atlas project.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 1343 - Posted: 27 Oct 2015, 10:26:34 UTC - in response to Message 1340.  

No-one's commented yet, but some of you may have noticed that this batch of jobs is longer than the last. That's because I want to explore whether we do have problems with longer jobs or larger result files. There has been some suggestion in the past that bigger files may lead to transfer time-outs. So, I'm increasing the number of events in each job to try to provoke failures, to get some idea of where our limits are. I'll probably push on until we get 5-10% failures (i.e. approaching Atlas's failure rate), but at present this batch is looking hopeful.

Already noticed the increase from 25 to 50 events, but on average an event is processed a bit faster in this batch, so the total runtime per job is about 160% of a job from the previous batch.

Is the result file about 66MB and transferred to you without compression?
No problem here - was transferred within ~32 seconds.
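For scale, the figures above imply an effective transfer rate of roughly 2 MB/s; a quick back-of-envelope check (units taken as quoted in the post):

```python
# Effective upload rate from the figures above (66 MB result file, ~32 s transfer).
size_mb = 66
seconds = 32
rate = size_mb / seconds
print(f"~{rate:.1f} MB/s")  # ~2.1 MB/s
```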
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 1345 - Posted: 27 Oct 2015, 12:29:34 UTC

If necessary, you could make it selectable whether a user wants to process only small jobs or even larger ones.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1347 - Posted: 27 Oct 2015, 14:59:14 UTC - in response to Message 1343.  

Already noticed the increase from 25 to 50 events, but on average an event is processed a bit faster in this batch, so the total runtime per job is about 160% of a job from the previous batch.

Is the result file about 66MB and transferred to you without compression?
No problem here - was transferred within ~32 seconds.

Yes to the first; not sure about the second, I've not seen anything explicit but it may be a default (not much point in compressing though, the ROOT files are already compressed).
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1348 - Posted: 27 Oct 2015, 15:12:36 UTC

Is there a chance to make it possible to use 2 or more cores?
I have tried through app_info.xml, but it really only uses one, even though the VirtualBox VM has two cores assigned.
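For reference, the usual place to ask BOINC for more cores under the anonymous platform is the app_version block of app_info.xml. A sketch follows; the app name, version number, and plan-class name are guesses to be replaced with the project's real values, and the usual file_info/file_ref entries are omitted for brevity. Note that for VirtualBox apps the VM's own job description must also allow multiple cores, which may be why only one is used even when two are assigned:

```xml
<app_info>
  <app>
    <name>CMS_dev</name>                  <!-- hypothetical: use the project's real app name -->
  </app>
  <app_version>
    <app_name>CMS_dev</app_name>
    <version_num>100</version_num>        <!-- guess: match the deployed version -->
    <avg_ncpus>2</avg_ncpus>              <!-- cores BOINC budgets for the task -->
    <plan_class>vbox64_mt</plan_class>    <!-- guess: a multi-threaded vbox plan class, if the project defines one -->
  </app_version>
</app_info>
```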
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1349 - Posted: 27 Oct 2015, 16:31:13 UTC

I had a very long delay between jobs.
One hour between the end of event 50 and event 1 of the next job.

Normal is about 20 minutes.

Which log might contain any info on that? None of the console windows showed anything unusual.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1350 - Posted: 27 Oct 2015, 21:27:15 UTC - in response to Message 1349.  
Last modified: 27 Oct 2015, 21:28:48 UTC

I had a very long delay between jobs.
One hour between the end of event 50 and event 1 of the next job.

Normal is about 20 minutes.

Which log might contain any info on that? None of the console windows showed anything unusual.

There's a glide-in log, isn't there? I can't run jobs at home, due to my poor ADSL, so I have to rely on memory. Check to see if it was within one glide-in instance (which calls several successive jobs) or between glide-ins. The logs I get back are some combination of Condor and application logs, on a per-job basis so I don't think they have much inter-job information. We seem to be running smoothly at the moment, so it might have been a network blockage somewhere.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1351 - Posted: 27 Oct 2015, 22:01:52 UTC - in response to Message 1350.  

The glidein download started 6 min before the first event of the following run.
This would explain 6 of the extra 40 min it took to start the next job.

If there was a network outage of some kind, wouldn't the retries have been logged?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1352 - Posted: 27 Oct 2015, 22:53:42 UTC - in response to Message 1351.  

If there was a network outage of some kind, wouldn't the retries have been logged?
No idea, really; I'm not a network expert. But I have seen connections hang for some time and come back up when the physical link is restored. As I recall, Linux is (or was) far more resilient than Windows in this respect.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1353 - Posted: 27 Oct 2015, 23:10:11 UTC - in response to Message 1352.  

The delay happened before 17:12.
If it helps, look at the bold section (or show it to someone).
I will keep an eye on it to see if this happens again.
It is not a serious issue, as it keeps running, but if you (or someone) has the time...


Run-1 master.log

10/27/15 11:10:05 (pid:7632) ******************************************************
10/27/15 11:10:05 (pid:7632) ** condor_master (CONDOR_MASTER) STARTING UP
10/27/15 11:10:05 (pid:7632) ** /home/boinc/CMSRun/glide_6VklMu/main/condor/sbin/condor_master
10/27/15 11:10:05 (pid:7632) ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
10/27/15 11:10:05 (pid:7632) ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
10/27/15 11:10:05 (pid:7632) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
10/27/15 11:10:05 (pid:7632) ** $CondorPlatform: x86_64_RedHat5 $
10/27/15 11:10:05 (pid:7632) ** PID = 7632
10/27/15 11:10:05 (pid:7632) ** Log last touched time unavailable (No such file or directory)
10/27/15 11:10:05 (pid:7632) ******************************************************
10/27/15 11:10:05 (pid:7632) Using config source: /home/boinc/CMSRun/glide_6VklMu/condor_config
10/27/15 11:10:05 (pid:7632) config Macros = 213, Sorted = 213, StringBytes = 10574, TablesBytes = 7708
10/27/15 11:10:05 (pid:7632) CLASSAD_CACHING is OFF
10/27/15 11:10:05 (pid:7632) Daemon Log is logging: D_ALWAYS D_ERROR
10/27/15 11:10:05 (pid:7632) DaemonCore: command socket at <10.0.2.15:56373?noUDP>
10/27/15 11:10:05 (pid:7632) DaemonCore: private command socket at <10.0.2.15:56373>
10/27/15 11:10:05 (pid:7632) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9622 as ccbid 130.246.180.120:9622#108616
10/27/15 11:10:05 (pid:7632) Master restart (GRACEFUL) is watching /home/boinc/CMSRun/glide_6VklMu/main/condor/sbin/condor_master (mtime:1445940567)
10/27/15 11:10:05 (pid:7632) Started DaemonCore process "/home/boinc/CMSRun/glide_6VklMu/main/condor/sbin/condor_startd", pid and pgroup = 7635
10/27/15 17:12:47 (pid:7632) Got SIGTERM. Performing graceful shutdown.
10/27/15 17:12:47 (pid:7632) Sent SIGTERM to STARTD (pid 7635)
10/27/15 17:12:49 (pid:7632) AllReaper unexpectedly called on pid 7635, status 25344.
10/27/15 17:12:49 (pid:7632) The STARTD (pid 7635) exited with status 99 (daemon will not restart automatically)
10/27/15 17:12:49 (pid:7632) All daemons are gone. Exiting.
10/27/15 17:12:49 (pid:7632) **** condor_master (condor_MASTER) pid 7632 EXITING WITH STATUS 0


Run-2 master log

10/27/15 17:15:13 (pid:27729) ******************************************************
10/27/15 17:15:13 (pid:27729) ** condor_master (CONDOR_MASTER) STARTING UP
10/27/15 17:15:13 (pid:27729) ** /home/boinc/CMSRun/glide_I5t966/main/condor/sbin/condor_master
10/27/15 17:15:13 (pid:27729) ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
10/27/15 17:15:13 (pid:27729) ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
10/27/15 17:15:13 (pid:27729) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
10/27/15 17:15:13 (pid:27729) ** $CondorPlatform: x86_64_RedHat5 $
10/27/15 17:15:13 (pid:27729) ** PID = 27729
10/27/15 17:15:13 (pid:27729) ** Log last touched time unavailable (No such file or directory)
10/27/15 17:15:13 (pid:27729) ******************************************************
10/27/15 17:15:13 (pid:27729) Using config source: /home/boinc/CMSRun/glide_I5t966/condor_config
10/27/15 17:15:13 (pid:27729) config Macros = 213, Sorted = 213, StringBytes = 10579, TablesBytes = 7708
10/27/15 17:15:13 (pid:27729) CLASSAD_CACHING is OFF
10/27/15 17:15:13 (pid:27729) Daemon Log is logging: D_ALWAYS D_ERROR
10/27/15 17:15:13 (pid:27729) DaemonCore: command socket at <10.0.2.15:35944?noUDP>
10/27/15 17:15:13 (pid:27729) DaemonCore: private command socket at <10.0.2.15:35944>
10/27/15 17:15:14 (pid:27729) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#108541
10/27/15 17:15:14 (pid:27729) Master restart (GRACEFUL) is watching /home/boinc/CMSRun/glide_I5t966/main/condor/sbin/condor_master (mtime:1445962467)
10/27/15 17:15:14 (pid:27729) Started DaemonCore process "/home/boinc/CMSRun/glide_I5t966/main/condor/sbin/condor_startd", pid and pgroup = 27732
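For what it's worth, the two excerpts already bound part of the answer: Run-1's condor_master got SIGTERM at 17:12:47 and Run-2's was registering with the CCB server by 17:15:14, so the glide-in turnover itself took under three minutes. The bulk of the missing hour must therefore lie elsewhere, before Run-1's shutdown. A quick check of the gap, using timestamps copied from the logs above:

```python
from datetime import datetime

# Timestamps copied from the two master.log excerpts above.
FMT = "%m/%d/%y %H:%M:%S"
run1_exit = datetime.strptime("10/27/15 17:12:49", FMT)
run2_start = datetime.strptime("10/27/15 17:15:13", FMT)

gap_s = (run2_start - run1_exit).total_seconds()
print(f"gap between glide-ins: {gap_s:.0f} s")  # 144 s
```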
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1360 - Posted: 28 Oct 2015, 16:53:04 UTC

OK, I've submitted another 1,000 jobs, twice as large again (i.e. 100 events per job).
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1364 - Posted: 28 Oct 2015, 22:25:11 UTC - in response to Message 1360.  

OK, I've submitted another 1,000 jobs, twice as large again (i.e. 100 events per job).

Looks like ~128 MB per job in <= 2 hours for most machines. Does that sound right?
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 101
Message 1365 - Posted: 28 Oct 2015, 23:29:06 UTC - in response to Message 1364.  
Last modified: 28 Oct 2015, 23:59:54 UTC

OK, I've submitted another 1,000 jobs, twice as large again (i.e. 100 events per job).

Looks like ~128 MB per job in <= 2 hours for most machines. Does that sound right?

Only done one so far (74). I think 136 MB and 2h 50min wall time from Dashboard, but
it took nearly an hour to get the job. I don't know how to find out how long the
upload actually took, but speed is normally ca. 900k, so say 2.5 mins.

Edit. The following job (20) has already failed "stageout 60311" twice so it will
be interesting to see if it works here.... but I'll be asleep by then.
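The 2.5-minute guess above checks out, reading "900k" as roughly 900 kB/s (an assumption about the poster's units):

```python
# 136 MB at ~900 kB/s; figures from the post above, units assumed,
# and the kB-vs-KiB distinction ignored at this precision.
size_kb = 136 * 1000
rate_kb_per_s = 900
upload_s = size_kb / rate_kb_per_s
print(f"~{upload_s:.0f} s, i.e. about {upload_s / 60:.1f} min")  # ~151 s, about 2.5 min
```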
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 101
Message 1368 - Posted: 29 Oct 2015, 10:16:46 UTC - in response to Message 1365.  
Last modified: 29 Oct 2015, 10:17:42 UTC

OK, I've submitted another 1,000 jobs, twice as large again (i.e. 100 events per job).

Looks like ~128 MB per job in <= 2 hours for most machines. Does that sound right?

Only done one so far (74). I think 136 MB and 2h 50min wall time from Dashboard, but
it took nearly an hour to get the job. I don't know how to find out how long the
upload actually took, but speed is normally ca. 900k, so say 2.5 mins.

Edit. The following job (20) has already failed "stageout 60311" twice so it will
be interesting to see if it works here.... but I'll be asleep by then.

Job 20 finished here OK. Each try is shown on Dashboard with a different IP,
presumably the public IP of the machine on which it ran.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 1370 - Posted: 29 Oct 2015, 10:40:56 UTC - in response to Message 1364.  

OK, I've submitted another 1,000 jobs, twice as large again (i.e. 100 events per job).

Looks like ~128 MB per job in <= 2 hours for most machines. Does that sound right?

The first job in a newly created VM always takes a bit longer.
On my i7 (not the fastest), it took 3 hours and 17 minutes.
At the completion of the job, 129.92 MB was uploaded within 75 seconds.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1373 - Posted: 29 Oct 2015, 13:33:23 UTC

We are getting a lot of stage-out errors. The precise reason isn't evident in the logs to my not-very-practised eye. There were some network problems at CERN last night but it seems the higher error rate is continuing.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1381 - Posted: 30 Oct 2015, 0:42:22 UTC - in response to Message 1373.  

We are getting a lot of stage-out errors. The precise reason isn't evident in the logs to my not-very-practised eye. There were some network problems at CERN last night but it seems the higher error rate is continuing.

I think it's obvious we've hit a limit at this point, even before any careful analysis (Laurence has a new student to look at things like this; perhaps he can do the actual culling of statistics, as they are all there in Dashboard). Next batch I'll step back to 75 ev/job to fill in an intermediate point -- the apparent step-up from 50 to 100 was a bit unexpected.
Profile Tern

Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1382 - Posted: 30 Oct 2015, 18:37:19 UTC

My task 68516, work unit 63658, took 3 days to complete - it spent most of its time in the "Waiting to run" state in spite of CMS-Dev being my highest-priority BOINC project. Looking at the log, it is mostly

2015-10-27 21:46:47 (608): Status Report: virtualbox.exe/vboxheadless.exe is no longer running.

It looks like when this happens the task attempts a retry MANY hours after the failure; I don't understand VBox at all (I can't get it to run on my Mac, where I do know what I'm doing, unlike Windows...).

It would be nice if I were either notified that something was wrong so it could be fixed on my end, or if the task would just abort with an error, rather than keeping that CPU from running CMS indefinitely (until reboot?). That machine (Phoenix) was overheating with a GPU problem, fixed yesterday, or it might still be cycling the same task...


©2024 CERN