Message boards : Number crunching : Expect errors eventually
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1340 - Posted: 26 Oct 2015, 22:11:36 UTC

No-one's commented yet, but some of you may have noticed that this batch of jobs is longer than the last. That's because I want to explore whether we do have problems with longer jobs or larger result files. There has been some suggestion in the past that bigger files may lead to transfer time-outs. So, I'm increasing the number of events in each job to try to provoke failures, to get some idea of where our limits are. I'll probably push on until we get 5-10% failures (i.e. approaching Atlas's failure rate), but at present this batch is looking hopeful.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 101
Message 1341 - Posted: 27 Oct 2015, 10:11:24 UTC - in response to Message 1340.  

Thanks for the warning. Could this explain the unexplained "validate errors" that
are causing some disquiet over there, one wonders. Atlas themselves don't say
much.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1342 - Posted: 27 Oct 2015, 10:13:29 UTC - in response to Message 1341.  

Thanks for the warning. Could this explain the unexplained "validate errors" that
are causing some disquiet over there, one wonders. Atlas themselves don't say
much.
No, I can't comment. I know next-to-nothing about the Atlas project.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 1343 - Posted: 27 Oct 2015, 10:26:34 UTC - in response to Message 1340.  

No-one's commented yet, but some of you may have noticed that this batch of jobs is longer than the last. That's because I want to explore whether we do have problems with longer jobs or larger result files. There has been some suggestion in the past that bigger files may lead to transfer time-outs. So, I'm increasing the number of events in each job to try to provoke failures, to get some idea of where our limits are. I'll probably push on until we get 5-10% failures (i.e. approaching Atlas's failure rate), but at present this batch is looking hopeful.

Already noticed the increase from 25 to 50 events, but on average an event is processed a bit faster in this batch, so the total runtime per job is about 160% of a job from the previous batch.

Is the result file about 66MB and transferred to you without compression?
No problem here - was transferred within ~32 seconds.
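For scale, the figures above imply an effective transfer rate of roughly 2 MB/s; a quick back-of-envelope check (units taken as quoted in the post):

```python
# Effective upload rate from the figures above (66 MB result file, ~32 s transfer).
size_mb = 66
seconds = 32
rate = size_mb / seconds
print(f"~{rate:.1f} MB/s")  # ~2.1 MB/s
```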
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 1345 - Posted: 27 Oct 2015, 12:29:34 UTC

If necessary, you could make it selectable whether a user wants to process only small jobs or even larger ones.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1347 - Posted: 27 Oct 2015, 14:59:14 UTC - in response to Message 1343.  

Already noticed the increase from 25 to 50 events, but on average an event is processed a bit faster in this batch, so the total runtime per job is about 160% of a job from the previous batch.

Is the result file about 66MB and transferred to you without compression?
No problem here - was transferred within ~32 seconds.

Yes to the first; not sure about the second, I've not seen anything explicit but it may be a default (not much point in compressing though, the ROOT files are already compressed).
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1348 - Posted: 27 Oct 2015, 15:12:36 UTC

Is there a chance to make it possible to use 2 or more cores?
I have tried through app_info.xml, but it really only uses one, even though the VirtualBox VM has two cores assigned.
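For reference, the usual place to ask BOINC for more cores under the anonymous platform is the app_version block of app_info.xml. A sketch follows; the app name, version number, and plan-class name are guesses to be replaced with the project's real values, and the usual file_info/file_ref entries are omitted for brevity. Note that for VirtualBox apps the VM's own job description must also allow multiple cores, which may be why only one is used even when two are assigned:

```xml
<app_info>
  <app>
    <name>CMS_dev</name>                  <!-- hypothetical: use the project's real app name -->
  </app>
  <app_version>
    <app_name>CMS_dev</app_name>
    <version_num>100</version_num>        <!-- guess: match the deployed version -->
    <avg_ncpus>2</avg_ncpus>              <!-- cores BOINC budgets for the task -->
    <plan_class>vbox64_mt</plan_class>    <!-- guess: a multi-threaded vbox plan class, if the project defines one -->
  </app_version>
</app_info>
```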
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1349 - Posted: 27 Oct 2015, 16:31:13 UTC

I had a very long delay between jobs.
One hour between the end of event 50 and event 1 of the next job.

Normal is about 20 minutes.

Which log might contain any info on that? None of the console windows showed anything unusual.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1350 - Posted: 27 Oct 2015, 21:27:15 UTC - in response to Message 1349.  
Last modified: 27 Oct 2015, 21:28:48 UTC

I had a very long delay between jobs.
One hour between the end of event 50 and event 1 of the next job.

Normal is about 20 minutes.

Which log might contain any info on that? None of the console windows showed anything unusual.

There's a glide-in log, isn't there? I can't run jobs at home, due to my poor ADSL, so I have to rely on memory. Check to see if it was within one glide-in instance (which calls several successive jobs) or between glide-ins. The logs I get back are some combination of Condor and application logs, on a per-job basis so I don't think they have much inter-job information. We seem to be running smoothly at the moment, so it might have been a network blockage somewhere.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1351 - Posted: 27 Oct 2015, 22:01:52 UTC - in response to Message 1350.  

The glidein download started 6 min before the first event of the following run.
This would explain 6 of the extra 40 min it took to start the next job.

If there was a network outage of some kind, wouldn't the retries have been logged?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1352 - Posted: 27 Oct 2015, 22:53:42 UTC - in response to Message 1351.  

If there was a network outage of some kind, wouldn't the retries have been logged?
No idea, really; I'm not a network expert. But I have seen connections hang for some time and come back up when the physical link is restored. As I recall, Linux is (or was) far more resilient than Windows in this respect.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1353 - Posted: 27 Oct 2015, 23:10:11 UTC - in response to Message 1352.  

The delay happened before 17:12.
If it helps, look at the bold section (or show it to someone).
I will keep an eye on it to see if this happens again.
It is not a serious issue, as it keeps running, but if you (or someone) has the time...


Run-1 master.log

10/27/15 11:10:05 (pid:7632) ******************************************************
10/27/15 11:10:05 (pid:7632) ** condor_master (CONDOR_MASTER) STARTING UP
10/27/15 11:10:05 (pid:7632) ** /home/boinc/CMSRun/glide_6VklMu/main/condor/sbin/condor_master
10/27/15 11:10:05 (pid:7632) ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
10/27/15 11:10:05 (pid:7632) ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
10/27/15 11:10:05 (pid:7632) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
10/27/15 11:10:05 (pid:7632) ** $CondorPlatform: x86_64_RedHat5 $
10/27/15 11:10:05 (pid:7632) ** PID = 7632
10/27/15 11:10:05 (pid:7632) ** Log last touched time unavailable (No such file or directory)
10/27/15 11:10:05 (pid:7632) ******************************************************
10/27/15 11:10:05 (pid:7632) Using config source: /home/boinc/CMSRun/glide_6VklMu/condor_config
10/27/15 11:10:05 (pid:7632) config Macros = 213, Sorted = 213, StringBytes = 10574, TablesBytes = 7708
10/27/15 11:10:05 (pid:7632) CLASSAD_CACHING is OFF
10/27/15 11:10:05 (pid:7632) Daemon Log is logging: D_ALWAYS D_ERROR
10/27/15 11:10:05 (pid:7632) DaemonCore: command socket at <10.0.2.15:56373?noUDP>
10/27/15 11:10:05 (pid:7632) DaemonCore: private command socket at <10.0.2.15:56373>
10/27/15 11:10:05 (pid:7632) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9622 as ccbid 130.246.180.120:9622#108616
10/27/15 11:10:05 (pid:7632) Master restart (GRACEFUL) is watching /home/boinc/CMSRun/glide_6VklMu/main/condor/sbin/condor_master (mtime:1445940567)
10/27/15 11:10:05 (pid:7632) Started DaemonCore process "/home/boinc/CMSRun/glide_6VklMu/main/condor/sbin/condor_startd", pid and pgroup = 7635
10/27/15 17:12:47 (pid:7632) Got SIGTERM. Performing graceful shutdown.
10/27/15 17:12:47 (pid:7632) Sent SIGTERM to STARTD (pid 7635)
10/27/15 17:12:49 (pid:7632) AllReaper unexpectedly called on pid 7635, status 25344.
10/27/15 17:12:49 (pid:7632) The STARTD (pid 7635) exited with status 99 (daemon will not restart automatically)
10/27/15 17:12:49 (pid:7632) All daemons are gone. Exiting.
10/27/15 17:12:49 (pid:7632) **** condor_master (condor_MASTER) pid 7632 EXITING WITH STATUS 0


Run-2 master log

10/27/15 17:15:13 (pid:27729) ******************************************************
10/27/15 17:15:13 (pid:27729) ** condor_master (CONDOR_MASTER) STARTING UP
10/27/15 17:15:13 (pid:27729) ** /home/boinc/CMSRun/glide_I5t966/main/condor/sbin/condor_master
10/27/15 17:15:13 (pid:27729) ** SubsystemInfo: name=MASTER type=MASTER(2) class=DAEMON(1)
10/27/15 17:15:13 (pid:27729) ** Configuration: subsystem:MASTER local:<NONE> class:DAEMON
10/27/15 17:15:13 (pid:27729) ** $CondorVersion: 8.2.3 Sep 30 2014 BuildID: 274619 $
10/27/15 17:15:13 (pid:27729) ** $CondorPlatform: x86_64_RedHat5 $
10/27/15 17:15:13 (pid:27729) ** PID = 27729
10/27/15 17:15:13 (pid:27729) ** Log last touched time unavailable (No such file or directory)
10/27/15 17:15:13 (pid:27729) ******************************************************
10/27/15 17:15:13 (pid:27729) Using config source: /home/boinc/CMSRun/glide_I5t966/condor_config
10/27/15 17:15:13 (pid:27729) config Macros = 213, Sorted = 213, StringBytes = 10579, TablesBytes = 7708
10/27/15 17:15:13 (pid:27729) CLASSAD_CACHING is OFF
10/27/15 17:15:13 (pid:27729) Daemon Log is logging: D_ALWAYS D_ERROR
10/27/15 17:15:13 (pid:27729) DaemonCore: command socket at <10.0.2.15:35944?noUDP>
10/27/15 17:15:13 (pid:27729) DaemonCore: private command socket at <10.0.2.15:35944>
10/27/15 17:15:14 (pid:27729) CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9620 as ccbid 130.246.180.120:9620#108541
10/27/15 17:15:14 (pid:27729) Master restart (GRACEFUL) is watching /home/boinc/CMSRun/glide_I5t966/main/condor/sbin/condor_master (mtime:1445962467)
10/27/15 17:15:14 (pid:27729) Started DaemonCore process "/home/boinc/CMSRun/glide_I5t966/main/condor/sbin/condor_startd", pid and pgroup = 27732
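For what it's worth, the two excerpts already bound part of the answer: Run-1's condor_master got SIGTERM at 17:12:47 and Run-2's was registering with the CCB server by 17:15:14, so the glide-in turnover itself took under three minutes. The bulk of the missing hour must therefore lie elsewhere, before Run-1's shutdown. A quick check of the gap, using timestamps copied from the logs above:

```python
from datetime import datetime

# Timestamps copied from the two master.log excerpts above.
FMT = "%m/%d/%y %H:%M:%S"
run1_exit = datetime.strptime("10/27/15 17:12:49", FMT)
run2_start = datetime.strptime("10/27/15 17:15:13", FMT)

gap_s = (run2_start - run1_exit).total_seconds()
print(f"gap between glide-ins: {gap_s:.0f} s")  # 144 s
```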
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1360 - Posted: 28 Oct 2015, 16:53:04 UTC

OK, I've submitted another 1,000 jobs, twice as large again (i.e. 100 events per job).
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1364 - Posted: 28 Oct 2015, 22:25:11 UTC - in response to Message 1360.  

OK, I've submitted another 1,000 jobs, twice as large again (i.e. 100 events per job).

Looks like ~128 MB per job in <= 2 hours for most machines. Does that sound right?
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 101
Message 1365 - Posted: 28 Oct 2015, 23:29:06 UTC - in response to Message 1364.  
Last modified: 28 Oct 2015, 23:59:54 UTC

OK, I've submitted another 1,000 jobs, twice as large again (i.e. 100 events per job).

Looks like ~128 MB per job in <= 2 hours for most machines. Does that sound right?

Only done one so far (74). I think 136 MB and 2h 50min wall time from Dashboard, but
it took nearly an hour to get the job. I don't know how to find out how long the
upload actually took, but speed is normally ca. 900k, so say 2.5 mins.

Edit. The following job (20) has already failed "stageout 60311" twice so it will
be interesting to see if it works here.... but I'll be asleep by then.
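The 2.5-minute guess above checks out, reading "900k" as roughly 900 kB/s (an assumption about the poster's units):

```python
# 136 MB at ~900 kB/s; figures from the post above, units assumed,
# and the kB-vs-KiB distinction ignored at this precision.
size_kb = 136 * 1000
rate_kb_per_s = 900
upload_s = size_kb / rate_kb_per_s
print(f"~{upload_s:.0f} s, i.e. about {upload_s / 60:.1f} min")  # ~151 s, about 2.5 min
```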
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 101
Message 1368 - Posted: 29 Oct 2015, 10:16:46 UTC - in response to Message 1365.  
Last modified: 29 Oct 2015, 10:17:42 UTC

OK, I've submitted another 1,000 jobs, twice as large again (i.e. 100 events per job).

Looks like ~128 MB per job in <= 2 hours for most machines. Does that sound right?

Only done one so far (74). I think 136 MB and 2h 50min wall time from Dashboard, but
it took nearly an hour to get the job. I don't know how to find out how long the
upload actually took, but speed is normally ca. 900k, so say 2.5 mins.

Edit. The following job (20) has already failed "stageout 60311" twice so it will
be interesting to see if it works here.... but I'll be asleep by then.

Job 20 finished here OK. Each try is shown on Dashboard with a different IP,
presumably the public IP of the machine on which it ran.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 1370 - Posted: 29 Oct 2015, 10:40:56 UTC - in response to Message 1364.  

OK, I've submitted another 1,000 jobs, twice as large again (i.e. 100 events per job).

Looks like ~128 MB per job in <= 2 hours for most machines. Does that sound right?

The first job in a newly created VM always takes a bit longer.
On my i7 (not the fastest), it took 3 hours and 17 minutes.
At the completion of the job, 129.92 MB was uploaded within 75 seconds.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1373 - Posted: 29 Oct 2015, 13:33:23 UTC

We are getting a lot of stage-out errors. The precise reason isn't evident in the logs to my not-very-practised eye. There were some network problems at CERN last night but it seems the higher error rate is continuing.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1381 - Posted: 30 Oct 2015, 0:42:22 UTC - in response to Message 1373.  

We are getting a lot of stage-out errors. The precise reason isn't evident in the logs to my not-very-practised eye. There were some network problems at CERN last night but it seems the higher error rate is continuing.

I think it's obvious we've hit a limit at this point, even before any careful analysis (Laurence has a new student to look at things like this; perhaps he can do the actual culling of statistics, as they are all there in Dashboard). Next batch I'll step back to 75 ev/job to fill in an intermediate point -- the apparent step-up from 50 to 100 was a bit unexpected.
Profile Tern

Joined: 21 Sep 15
Posts: 89
Credit: 383,017
RAC: 0
Message 1382 - Posted: 30 Oct 2015, 18:37:19 UTC

My task 68516, work unit 63658, took 3 days to complete - it spent most of its time in the "Waiting to run" state in spite of CMS-Dev being my highest-priority BOINC project. Looking at the log, it is mostly

2015-10-27 21:46:47 (608): Status Report: virtualbox.exe/vboxheadless.exe is no longer running.

It looks like when this happens the task attempts a retry MANY hours after the failure; I don't understand VBox at all (I can't get it to run on my Mac, where I do know what I'm doing, unlike Windows...).

It would be nice if I were either notified that something was wrong so it could be fixed on my end, or if the task would just abort with an error, rather than keeping that CPU from running CMS indefinitely (until reboot?). That machine (Phoenix) was overheating with a GPU problem, fixed yesterday, or it might still be cycling the same task...


©2024 CERN