Message boards :
Number crunching :
Expect errors eventually
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

The Condor server crashed overnight. It's been restarted, but it appears not all services are up yet. You may not get jobs for a while until it's fully healthy. 11:50 -- completed jobs are coming in again.
Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0

Okay, can't prove it yet, but VirtualBox 5.0.10 SEEMS to have solved the hypervisor problem. CMS ran on all my Windows hosts overnight without any timeouts waiting for me this morning. :-) And on the Mac... IT GOT WORK! Don't know for sure if it was the change to 5.0.10, or the fact that I had VBox actually RUNNING overnight when CMS decided to fetch work, but regardless, there is a task actually running on the Mac finally. (#1!)

Now oddly enough, vboxheadless shows to be running under boinc userid, but there is no VM shown in the VBox window under _my_ userid. Windows DOES show boinc tasks there. I've been afraid to quit out of VBox, even though it seems to be two different processes, just in case it would kill the CMS task. Maybe on the next one.
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

Good to hear. I tried 5.0.10 on my work Windows 7 box today, and still got the dreaded message after 10 minutes that it failed to enter a running state in a timely manner. I'm starting to wonder if it's the Kaspersky anti-virus; I've got as much of its "real time" checking turned off as possible, but it still manages to give me grief from time to time.
Joined: 13 Feb 15 Posts: 1188 Credit: 854,498 RAC: 159

> Now oddly enough, vboxheadless shows to be running under boinc userid, but there is no VM shown in the VBox window under _my_ userid. Windows DOES show boinc tasks there. I've been afraid to quit out of VBox, even though it seems to be two different processes, just in case it would kill the CMS task. Maybe on the next one.

This is expected behaviour when you've installed BOINC in protected execution mode (service mode).
Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0

> Good to hear. I tried 5.0.10 on my work Windows 7 box today, and still got the dreaded message after 10 minutes that it failed to enter a running state in a timely manner. I'm starting to wonder if it's the Kaspersky anti-virus; I've got as much of its "real time" checking turned off as possible, but it still manages to give me grief from time to time.

Not Kaspersky, as I'm not running it -- or any non-built-in antivirus, since these hosts are only connected to the net for BOINC and do no web browsing. That's what the Mac is for! :-)
Joined: 21 Sep 15 Posts: 89 Credit: 383,017 RAC: 0

> Now oddly enough, vboxheadless shows to be running under boinc userid, but there is no VM shown in the VBox window under _my_ userid. Windows DOES show boinc tasks there. I've been afraid to quit out of VBox, even though it seems to be two different processes, just in case it would kill the CMS task. Maybe on the next one.

Unless that's the default (only choice?) on Mac, I don't think I selected service mode. I know I was offered the choice on Windows and didn't take it. Of course, the Mac is the only machine I have that isn't just for BOINC, Photoshop, media center, or NAS RAID storage, so the Windows boxes are all single-user. Shrug. It doesn't bother me either way; I'm NOT a fan of VBox and will probably not run it once CMS goes to production mode. I didn't know what I was getting into when I agreed to run it here. My policy is that, to get me to sign up, a project MUST have a Mac app (with rare exceptions). I hadn't dealt with the "Mac but through VBox" issue before, but until it becomes the only choice, I'll modify my policy to require a "Mac native" app. Comes from the years I spent as a Mac developer. :-)
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

> Good to hear. I tried 5.0.10 on my work Windows7 box today, and still got the dreaded message after 10 minutes that it failed to enter a running state in a timely manner. I'm starting to wonder if it's the Kaspersky anti-virus, I've got as much of its "real time" checking turned off as possible, but it still manages to give me grief from time to time.

Tried again this afternoon, after turning off the remaining active elements of Kaspersky, but no joy.
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

Right, more news about what this thread was about originally, before we ran into problems... At least we now know a few more things to keep our eyes on as we continue. The current batch of 25-event jobs will run out about mid-day tomorrow. There are also a couple of very short batches to run, submitted to check whether I can now use "production" CRAB3 servers rather than the "preprod" one we've been using up until now. The submissions went fine; I just need to see that the actual running and stage-out go as planned. So when we're down to just a few jobs in the queue, assuming everything is still in the green, I'll revisit the larger 100-event jobs we were running when everything started to go pear-shaped (or, as The Register has it, a Total Inability To Support Usual Performance). Stay tuned for further news.
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

Retournons à nos moutons¹... The 100-event jobs have gone better than I expected. It looks like about 3% fail due to over-running, etc., but we seem to be saved by the retry mechanism. 3 of the original batch of 100 are still outstanding, but I see them in the Condor queue on the server; Dashboard currently has 37 failures marked in the present 1,000-job batch, but I expect that to fall when the retries cut in. I'll push the boat out and try a 200-event batch when the pending queue gets down to just a few. Then I really do expect failures.

¹For those who don't speak French: literally, "Let us return to our sheep"; colloquially, "Back to the original subject."
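If job failures are roughly independent with probability p ≈ 3%, three attempts reduce the residual failure rate to about p³ ≈ 0.003%, which is why the Dashboard failure count should fall as retries cut in. A sketch of that retry loop (an illustrative model, not the actual CRAB3/Condor scheduler code):

```python
def run_with_retries(job, max_attempts=3):
    """Resubmit a job until it succeeds or the attempt budget runs out.

    Hypothetical model of the retry behaviour described above; `job` is
    any callable taking the attempt number and returning True on success.
    """
    for attempt in range(1, max_attempts + 1):
        if job(attempt):
            return attempt  # number of the attempt that succeeded
    return None  # retries exhausted: this counts as a real failure

# A job that fails twice, then succeeds on its third attempt:
flaky = lambda attempt: attempt == 3
print(run_with_retries(flaky))  # 3
```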
Joined: 13 Feb 15 Posts: 1188 Credit: 854,498 RAC: 159

> I'll push the boat out and try a 200-event batch when the pending queue gets down to just a few. Then I really do expect failures.

A new kind of failure this morning?: exit code 134 -- Abort (ANSI) or IOT trap (4.2 BSD) -- after about 20 minutes' run time.
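For context on that code: 134 follows the usual shell convention of reporting 128 + signal number, so it means the process died on signal 6 (SIGABRT, historically the "IOT trap" on 4.2 BSD, and what glibc raises on abort()). A small decoder:

```python
import signal

def describe_exit(status: int) -> str:
    """Decode a shell-style exit status: values above 128 conventionally
    mean the process was killed by signal (status - 128)."""
    if status > 128:
        sig = signal.Signals(status - 128)
        return f"terminated by {sig.name} (signal {sig.value})"
    return f"exited normally with code {status}"

# 134 = 128 + 6, i.e. SIGABRT:
print(describe_exit(134))  # terminated by SIGABRT (signal 6)
```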
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

> New kind of failures this morning?: 134 / Abort (ANSI) or IOT trap (4.2 BSD) after about 20 minutes run time.

Did you catch the Job-ID for that? The only completed logs I have for you are both successes.
Joined: 13 Feb 15 Posts: 1188 Credit: 854,498 RAC: 159

> New kind of failures this morning?: 134 / Abort (ANSI) or IOT trap (4.2 BSD) after about 20 minutes run time.

They weren't my failures, but I saw a lot of them on the Dashboard. Example IDs: 228 (2nd attempt), 236, 278 (2nd attempt), 386 (2nd attempt), 407, 449, 451, etc.
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

OK, thanks for the report. All one host: some sort of segmentation violation on the first event, with a traceback longer than a basketballer's arm. The task in question was sent 18/11, so I guess the machine was switched off for a while and lost connectivity or corrupted the VM. I'll PM him and ask him to abort and reset. So that accounts for 40 of our failures in this batch [Edit: actually 52, the host had another failure mode...]. Another 15 or so are due to a host failing to read the conditions database on start-up, but he's a vLHC user so I'm loth to press the point. As I said, I expected failures.
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

OK, for this particular workflow the "sweet spot" seems to be around 50 events per job. So I've submitted a batch of 5,000 of these, because I need to get on with my real work without constant interruptions. :-) I may not be terribly visible until next week, unless a disaster looms. (I've just spent two days being continually interrupted by a failing oxygen-level sensor alarm in the hydrogen fuel-cell lab next door -- a very high-pitched continuous tone (luckily I'm relatively deaf at those frequencies, too much long-distance high-speed motorcycle touring...) -- so I've not got much done yet this week.)
Joined: 13 Feb 15 Posts: 1188 Credit: 854,498 RAC: 159

It seems you launched a new batch: 151130_085756:ireid_crab_CMS_at_Home_MinBias_prod2

The first one of that batch I got (job 42) deceased:

    Job Running time in seconds: 6
    Job runtime is less than 20minutes. Sleeping 1194

The 2nd is running now, with 250 events to do, but those MinBias events seem to process fast.
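Those log lines are consistent with a wrapper that pads short runs out to a 20-minute minimum: 20 × 60 − 6 = 1194 seconds of sleep. The logic is presumably something like this (a hypothetical reconstruction, not the actual wrapper script):

```python
MIN_RUNTIME_SECONDS = 20 * 60  # jobs shorter than this get padded

def pad_seconds(elapsed_seconds: int) -> int:
    """How long the wrapper sleeps after a job that finished early."""
    return max(0, MIN_RUNTIME_SECONDS - elapsed_seconds)

print(pad_seconds(6))     # 1194, exactly the "Sleeping 1194" in the log
print(pad_seconds(1500))  # 0 -- a 25-minute job needs no padding
```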
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

> It seems you launched a new batch: 151130_085756:ireid_crab_CMS_at_Home_MinBias_prod2

Yes; I did a short one yesterday, and woke up this morning to find only one left in the queue...

> The first one of that batch I got (job 42) deceased:

There's a zero-length log-file for job 42.0; 42.1.log is 47 bytes, which indicates it hasn't been retried yet: "Job output has not been processed by post-job."

> 2nd is running now with 250 events to do, but those MinBias events seem to process fast.

Indeed, the MinBias (i.e. background) events have far fewer particles to process than a TTbar event -- unless of course the roll of the dice throws up a TTbar event!
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

The current dip in "jobs finished" in the Job Activity graphs appears to be a failure by Dashboard to process the final status report and move jobs from the WNPostProc state to Success. The result files arrived OK. [Later] The backlog seems to be slowly catching up; the dip is getting narrower. [/Later]
Joined: 15 Apr 15 Posts: 38 Credit: 227,251 RAC: 0

Just curious to see your timing goal to start production work? These latest 250-event jobs seem to be going well for me. Thanks!
Joined: 20 Jan 15 Posts: 1136 Credit: 7,991,905 RAC: 800

> Just curious to see your timing goal to start production work? These latest 250 unit jobs seem to be going well for me.

Good question. Personally I'd hoped to be at it by now, but there have been setbacks. In the last week we've "lost" a lot of the results (~20%) which just haven't made it to the data-bridge, although the log files claim a successful lcg-copy transfer and Dashboard reports it as such. I've been too busy with other things to delve into it too deeply; basically, I have to make a log of these instances and then send that to the CERN experts to see if they can make correlations with other logs -- which is time-consuming.

The other thing I'm waiting on is a better method of submitting batch jobs -- called WMAgent -- rather than the CRAB3 I currently use. It's supposed to be coming RSN (Real Soon Now, © Jerry Pournelle) and will allow the "production" team to submit properly-requested work-flows to the system. Which will hopefully take some pressure off me!

Understand that I'm just the public face of the project; I can make work to be sent out, and try to catch problems and report them to CERN and CMS IT, who know how it all works, but I've no real control over the software and have to wait for responses from people who admittedly have other tasks to occupy them as well. For example, some queries in these forums I have to leave unanswered because I Just Don't Know (® me). It looks like my next major report to the Collaboration won't be until the New Year, so don't expect significant changes before then.

Good to hear the current batch is successful for you. We're going to run out about mid-day tomorrow, my time. The median time is around 1-1.5 hours, so I'll submit a round of 500-event jobs to increase the efficiency (start-up time per job is a dominant factor); the result files will go up to ~140 MB, but from past experience we seem to be able to handle that reasonably well.

There was some discussion yesterday on whether CMS should implement a different model -- an "event server" rather than a "job server", i.e. the job starts and then requests events as it finishes the previous lot, rather than the current model where each job starts, processes a given number of events, and then stops. I mention this for completeness; I don't expect it to be implemented anytime soon, perhaps not even before I retire. :-( Which might be April 2017.
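The efficiency argument for larger jobs can be put in numbers: with a fixed per-job start-up cost, the fraction of wall time spent on actual event processing grows with events per job. The figures below are made up for illustration, not measured CMS@Home timings:

```python
def job_efficiency(events_per_job: int, seconds_per_event: float,
                   startup_seconds: float) -> float:
    """Fraction of a job's wall time spent processing events rather than
    on fixed per-job start-up (VM boot, software download, etc.)."""
    compute = events_per_job * seconds_per_event
    return compute / (compute + startup_seconds)

# Hypothetical figures: 10 s/event, 10 minutes of start-up per job.
# Efficiency rises from ~45% at 50 events to ~89% at 500 events.
for n in (50, 250, 500):
    print(n, round(job_efficiency(n, 10.0, 600.0), 3))
```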
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 1

> I'll submit a round of 500-event jobs to increase the efficiency (start-up time per job is a dominant factor);

If I might make a point on behalf of those of us whose hosts don't run continuously (there will be more of us in a production environment, with the project open to "all comers"; BOINC is, after all, supposed to use "spare time"): until such time as the project stops abandoning work in progress when the host reboots, please remember that the longer jobs take to run, the more time will be wasted in this way. At some point (about six hours for me), a host won't get any useful work done at all.
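That six-hour figure follows from a simple model: if a job abandoned at shutdown must restart from scratch, then a job longer than the host's uptime window can never complete at all. A sketch, with made-up numbers rather than measured BOINC behaviour:

```python
def useful_fraction(job_hours: float, uptime_hours: float) -> float:
    """Fraction of an uptime window spent on jobs that actually finish,
    assuming any job still running at shutdown is abandoned
    (illustrative model of the behaviour described above)."""
    if job_hours > uptime_hours:
        return 0.0  # the job can never finish within one session
    completed_jobs = int(uptime_hours // job_hours)
    return (completed_jobs * job_hours) / uptime_hours

print(useful_fraction(1.5, 6.0))  # 1.0: four 1.5 h jobs fill the window
print(useful_fraction(4.0, 6.0))  # ~0.67: one job finishes, 2 h wasted
print(useful_fraction(8.0, 6.0))  # 0.0: no useful work at all
```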
©2024 CERN