Message boards :
Number crunching :
Expect errors eventually
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

The Condor server crashed overnight. It's been restarted, but it appears not all services are up yet. You may not get jobs for a while until it's fully healthy. 11:50 -- completed jobs are coming in again.
Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0

Okay, can't prove it yet, but VirtualBox 5.0.10 SEEMS to have solved the hypervisor problem. CMS ran on all my Windows hosts overnight without any timeouts waiting for me this morning. :-) And on the Mac... IT GOT WORK! Don't know for sure if it was the change to 5.0.10, or the fact that I had VBox actually RUNNING overnight when CMS decided to fetch work, but regardless, there is a task actually running on the Mac finally. (#1!)

Now oddly enough, vboxheadless shows to be running under boinc userid, but there is no VM shown in the VBox window under _my_ userid. Windows DOES show boinc tasks there. I've been afraid to quit out of VBox, even though it seems to be two different processes, just in case it would kill the CMS task. Maybe on the next one.
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

Good to hear. I tried 5.0.10 on my work Windows 7 box today, and still got the dreaded message after 10 minutes that it failed to enter a running state in a timely manner. I'm starting to wonder if it's the Kaspersky anti-virus; I've got as much of its "real time" checking turned off as possible, but it still manages to give me grief from time to time.
Joined: 13 Feb 15 Posts: 1188 Credit: 854,498 RAC: 159

> Now oddly enough, vboxheadless shows to be running under boinc userid, but there is no VM shown in the VBox window under _my_ userid. Windows DOES show boinc tasks there. I've been afraid to quit out of VBox, even though it seems to be two different processes, just in case it would kill the CMS task. Maybe on the next one.

This is expected behaviour when you've installed BOINC in protected execution mode (service mode).
Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0

> Good to hear. I tried 5.0.10 on my work Windows 7 box today, and still got the dreaded message after 10 minutes that it failed to enter a running state in a timely manner. I'm starting to wonder if it's the Kaspersky anti-virus; I've got as much of its "real time" checking turned off as possible, but it still manages to give me grief from time to time.

Not Kaspersky, as I'm not running it -- or any non-built-in antivirus, since these hosts are only connected to the net for BOINC and do no web browsing. That's what the Mac is for! :-)
Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0

> Now oddly enough, vboxheadless shows to be running under boinc userid, but there is no VM shown in the VBox window under _my_ userid. Windows DOES show boinc tasks there. I've been afraid to quit out of VBox, even though it seems to be two different processes, just in case it would kill the CMS task. Maybe on the next one.

Unless that's the default (only choice?) on Mac, I don't think I selected service mode. I know I was offered the choice on Windows and didn't take it. Of course, the Mac is the only machine I have that isn't just for BOINC, Photoshop, media center, or NAS RAID storage, so the Windows boxes are all single-user. Shrug. It doesn't bother me either way; I'm NOT a fan of VBox and will probably not run it once CMS goes to production mode. I didn't know what I was getting into when I agreed to run it here. My policy is that, to get me to sign up, a project MUST have a Mac app (with rare exceptions). I hadn't dealt with the "Mac but through VBox" issue before, but until it becomes the only choice, I'll modify my policy to require a "Mac native" app. Comes from the years I spent as a Mac developer. :-)
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

> Good to hear. I tried 5.0.10 on my work Windows7 box today, and still got the dreaded message after 10 minutes that it failed to enter a running state in a timely manner. I'm starting to wonder if it's the Kaspersky anti-virus, I've got as much of its "real time" checking turned off as possible, but it still manages to give me grief from time to time.

Tried again this afternoon, after turning off the remaining active elements of Kaspersky, but no joy.
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

Right, more news about what this thread was about originally, before we ran into problems... At least we now know a few more things to keep our eyes on as we continue. The current batch of 25-event jobs will run out about mid-day tomorrow. There are also a couple of very short batches to run, submitted to check whether I can now use "production" CRAB3 servers rather than the "preprod" one we've been using up until now. The submissions went fine; I just need to see that the actual running and stage-out go as planned. So when we're down to just a few jobs in the queue, assuming everything is still in the green, I'll revisit the larger 100-event jobs we were running when everything started to go pear-shaped (or, as The Register has it, a Total Inability To Support Usual Performance). Stay tuned for further news.
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

Retournons à nos moutons¹... The 100-event jobs have gone better than I expected. It looks like about 3% fail due to over-running, etc., but we seem to be saved by the retry mechanism. 3 of the original batch of 100 are still outstanding, but I see them in the Condor queue on the server; Dashboard currently has 37 failures marked in the present 1,000-job batch, but I expect that to fall when the retries cut in. I'll push the boat out and try a 200-event batch when the pending queue gets down to just a few. Then I really do expect failures.

¹For those who don't speak French: literally, "Let us return to our sheep"; colloquially, "Back to the original subject."
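If job failures are roughly independent with probability p ≈ 3%, three attempts reduce the residual failure rate to about p³ ≈ 0.003%, which is why the Dashboard failure count should fall as retries cut in. A sketch of that retry loop (an illustrative model, not the actual CRAB3/Condor scheduler code):

```python
def run_with_retries(job, max_attempts=3):
    """Resubmit a job until it succeeds or the attempt budget runs out.

    Hypothetical model of the retry behaviour described above; `job` is
    any callable taking the attempt number and returning True on success.
    """
    for attempt in range(1, max_attempts + 1):
        if job(attempt):
            return attempt  # number of the attempt that succeeded
    return None  # retries exhausted: this counts as a real failure

# A job that fails twice, then succeeds on its third attempt:
flaky = lambda attempt: attempt == 3
print(run_with_retries(flaky))  # 3
```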
Joined: 13 Feb 15 Posts: 1188 Credit: 854,498 RAC: 159

> I'll push the boat out and try a 200-event batch when the pending queue gets down to just a few. Then I really do expect failures.

A new kind of failure this morning?: exit code 134 -- Abort (ANSI) or IOT trap (4.2 BSD) -- after about 20 minutes' run time.
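For context on that code: 134 follows the usual shell convention of reporting 128 + signal number, so it means the process died on signal 6 (SIGABRT, historically the "IOT trap" on 4.2 BSD, and what glibc raises on abort()). A small decoder:

```python
import signal

def describe_exit(status: int) -> str:
    """Decode a shell-style exit status: values above 128 conventionally
    mean the process was killed by signal (status - 128)."""
    if status > 128:
        sig = signal.Signals(status - 128)
        return f"terminated by {sig.name} (signal {sig.value})"
    return f"exited normally with code {status}"

# 134 = 128 + 6, i.e. SIGABRT:
print(describe_exit(134))  # terminated by SIGABRT (signal 6)
```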
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

> New kind of failures this morning?: 134 / Abort (ANSI) or IOT trap (4.2 BSD) after about 20 minutes run time.

Did you catch the Job-ID for that? The only completed logs I have for you are both successes.
Joined: 13 Feb 15 Posts: 1188 Credit: 854,498 RAC: 159

> New kind of failures this morning?: 134 / Abort (ANSI) or IOT trap (4.2 BSD) after about 20 minutes run time.

They weren't my failures, but I saw a lot of them on the Dashboard. Example IDs: 228 (2nd attempt), 236, 278 (2nd attempt), 386 (2nd attempt), 407, 449, 451, etc.
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

OK, thanks for the report. All one host: some sort of segmentation violation on the first event, with a traceback longer than a basketballer's arm. The task in question was sent 18/11, so I guess the machine was switched off for a while and lost connectivity or corrupted the VM. I'll PM him and ask him to abort and reset. So that accounts for 40 of our failures in this batch [Edit: actually 52, the host had another failure mode...]. Another 15 or so are due to a host failing to read the conditions database on start-up, but he's a vLHC user so I'm loth to press the point. As I said, I expected failures.
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

OK, for this particular workflow the "sweet spot" seems to be around 50 events per job. So I've submitted a batch of 5,000 of these, because I need to get on with my real work without constant interruptions. :-) I may not be terribly visible until next week, unless a disaster looms. (I've just spent two days being continually interrupted by a failing oxygen-level sensor alarm in the hydrogen fuel-cell lab next door -- a very high-pitched continuous tone (luckily I'm relatively deaf at those frequencies, too much long-distance high-speed motorcycle touring...) -- so I've not got much done yet this week.)
Joined: 13 Feb 15 Posts: 1188 Credit: 854,498 RAC: 159

It seems you launched a new batch: 151130_085756:ireid_crab_CMS_at_Home_MinBias_prod2

The first one of that batch I got (job 42) deceased:

    Job Running time in seconds: 6
    Job runtime is less than 20minutes. Sleeping 1194

The 2nd is running now, with 250 events to do, but those MinBias events seem to process fast.
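Those log lines are consistent with a wrapper that pads short runs out to a 20-minute minimum: 20 × 60 − 6 = 1194 seconds of sleep. The logic is presumably something like this (a hypothetical reconstruction, not the actual wrapper script):

```python
MIN_RUNTIME_SECONDS = 20 * 60  # jobs shorter than this get padded

def pad_seconds(elapsed_seconds: int) -> int:
    """How long the wrapper sleeps after a job that finished early."""
    return max(0, MIN_RUNTIME_SECONDS - elapsed_seconds)

print(pad_seconds(6))     # 1194, exactly the "Sleeping 1194" in the log
print(pad_seconds(1500))  # 0 -- a 25-minute job needs no padding
```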
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

> It seems you launched a new batch: 151130_085756:ireid_crab_CMS_at_Home_MinBias_prod2

Yes; I did a short one yesterday, and woke up this morning to find only one left in the queue...

> The first one of that batch I got (job 42) deceased:

There's a zero-length log-file for job 42.0; 42.1.log is 47 bytes, which indicates it hasn't been retried yet: "Job output has not been processed by post-job."

> 2nd is running now with 250 events to do, but those MinBias events seem to process fast.

Indeed, the MinBias (i.e. background) events have far fewer particles to process than a TTbar event -- unless of course the roll of the dice throws up a TTbar event!
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

The current dip in "jobs finished" in the Job Activity graphs appears to be a failure by Dashboard to process the final status report and move jobs from the WNPostProc state to Success. The result files arrived OK. [Later] The backlog seems to be slowly catching up; the dip is getting narrower. [/Later]
Joined: 15 Apr 15 Posts: 38 Credit: 227,251 RAC: 0

Just curious to see your timing goal to start production work? These latest 250-event jobs seem to be going well for me. Thanks!
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

> Just curious to see your timing goal to start production work? These latest 250 unit jobs seem to be going well for me.

Good question. Personally I'd hoped to be at it by now, but there have been setbacks. In the last week we've "lost" a lot of the results (~20%) which just haven't made it to the data-bridge, although the log files claim a successful lcg-copy transfer and Dashboard reports it as such. I've been too busy with other things to delve into it too deeply; basically, I have to make a log of these instances and then send that to the CERN experts to see if they can make correlations with other logs -- which is time-consuming.

The other thing I'm waiting on is a better method of submitting batch jobs -- called WMAgent -- rather than the CRAB3 I currently use. It's supposed to be coming RSN (Real Soon Now, © Jerry Pournelle) and will allow the "production" team to submit properly-requested work-flows to the system. Which will hopefully take some pressure off me!

Understand that I'm just the public face of the project; I can make work to be sent out, and try to catch problems and report them to CERN and CMS IT, who know how it all works, but I've no real control over the software and have to wait for responses from people who admittedly have other tasks to occupy them as well. For example, some queries in these forums I have to leave unanswered because I Just Don't Know (® me). It looks like my next major report to the Collaboration won't be until the New Year, so don't expect significant changes before then.

Good to hear the current batch is successful for you. We're going to run out about mid-day tomorrow, my time. The median time is around 1-1.5 hours, so I'll submit a round of 500-event jobs to increase the efficiency (start-up time per job is a dominant factor); the result files will go up to ~140 MB, but from past experience we seem to be able to handle that reasonably well.

There was some discussion yesterday on whether CMS should implement a different model -- an "event server" rather than a "job server", i.e. the job starts and then requests events as it finishes the previous lot, rather than the current model where each job starts, processes a given number of events, and then stops. I mention this for completeness; I don't expect it to be implemented anytime soon, perhaps not even before I retire. :-( Which might be April 2017.
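The efficiency argument for larger jobs can be put in numbers: with a fixed per-job start-up cost, the fraction of wall time spent on actual event processing grows with events per job. The figures below are made up for illustration, not measured CMS@Home timings:

```python
def job_efficiency(events_per_job: int, seconds_per_event: float,
                   startup_seconds: float) -> float:
    """Fraction of a job's wall time spent processing events rather than
    on fixed per-job start-up (VM boot, software download, etc.)."""
    compute = events_per_job * seconds_per_event
    return compute / (compute + startup_seconds)

# Hypothetical figures: 10 s/event, 10 minutes of start-up per job.
# Efficiency rises from ~45% at 50 events to ~89% at 500 events.
for n in (50, 250, 500):
    print(n, round(job_efficiency(n, 10.0, 600.0), 3))
```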
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 1

> I'll submit a round of 500-event jobs to increase the efficiency (start-up time per job is a dominant factor);

If I might make a point on behalf of those of us whose hosts don't run continuously (there will be more of us in a production environment, with the project open to "all comers"; BOINC is, after all, supposed to use "spare time"): until such time as the project stops abandoning work in progress when the host reboots, please remember that the longer jobs take to run, the more time will be wasted in this way. At some point (about six hours for me), a host won't get any useful work done at all.
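That six-hour figure follows from a simple model: if a job abandoned at shutdown must restart from scratch, then a job longer than the host's uptime window can never complete at all. A sketch, with made-up numbers rather than measured BOINC behaviour:

```python
def useful_fraction(job_hours: float, uptime_hours: float) -> float:
    """Fraction of an uptime window spent on jobs that actually finish,
    assuming any job still running at shutdown is abandoned
    (illustrative model of the behaviour described above)."""
    if job_hours > uptime_hours:
        return 0.0  # the job can never finish within one session
    completed_jobs = int(uptime_hours // job_hours)
    return (completed_jobs * job_hours) / uptime_hours

print(useful_fraction(1.5, 6.0))  # 1.0: four 1.5 h jobs fill the window
print(useful_fraction(4.0, 6.0))  # ~0.67: one job finishes, 2 h wasted
print(useful_fraction(8.0, 6.0))  # 0.0: no useful work at all
```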
©2024 CERN