Message boards :
News :
Some jobs again
Author | Message |
---|---|
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,920,398 RAC: 2,619 |
We can now submit jobs again. I'll submit a test batch overnight, and then try for a bigger test for the rest of the week. Feel free to start running tasks again, and report problems (and successes...) in the usual places. Thanks. |
Send message Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
Is this a problem or not? It has continued with the 6th, 7th and more records. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,920,398 RAC: 2,619 |
I think not, but I should consult the experts. There are occasions when (as I understand it) particle track generation goes off-kilter -- it's the nature of limited-precision floating-point numbers, amongst other uncertainties. I think the message is saying that the procedure that interpolates the magnetic field from the field maps is unable to determine the magnetic field at x=-82 cm, y=-298 cm and z=-640 cm. Z is along the beamline; X and Y are orthogonal to it. Since CMS is 21.6 m long (i.e. Z=+/- 10.8 m), that doesn't seem an infeasible coordinate. I'm surprised you've seen more of them; it should be rare. Let us know if it continues. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,920,398 RAC: 2,619 |
Well, that was quick! We already have 85 results returned from the 100 jobs I submitted in my test batch! As far as I can tell, 11 have yet to complete; the rest are being resubmitted twice more for reasons that are high on our bug-finding list. I'm not overly concerned about this as yet, as some will have struck problems and need re-running anyway. What I'm not sure about is whether, if a job is successfully run more than once, the timestamp I see is from the first or the last result to arrive. There are two things I want to compare in my first test (the abortive "challenge" from last week): a) the cumulative percentage of results received as a function of time, and b) a comparison of some key data-quality parameters between the CMS@Home and the normal CRAB results (they should be statistically the same, and in fact job-for-job identical for jobs with the same index number). The first figure usually approaches very close to 100% these days for the normal GRID, within a short time; the results I've seen for CMS@Home in the last week suggest that, even with the relatively limited number of volunteers, we are approaching completion results and times seen in the early days of the GRID, maybe 8 or 10 years ago. Interesting... |
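For anyone who wants to play with the first figure of merit, here is a minimal sketch of a "cumulative percentage of results received as a function of time" curve. It is not the project's actual monitoring code. The function names, the `keep` switch and the sample arrivals are all hypothetical, and the switch is only there to show how much the first-versus-last timestamp question can shift the curve when a job is returned more than once.

```python
from datetime import datetime, timedelta


def cumulative_percent(arrivals, n_submitted, keep="first"):
    """Build a completion curve from (job_index, arrival_time) pairs.

    A job that comes back more than once is counted only once, using either
    its first or its last arrival time (hypothetical `keep` switch).
    Returns a list of (arrival_time, percent_of_submitted_jobs_returned).
    """
    chosen = {}
    for job, ts in arrivals:
        if job not in chosen:
            chosen[job] = ts
        elif keep == "first":
            chosen[job] = min(chosen[job], ts)
        else:
            chosen[job] = max(chosen[job], ts)
    times = sorted(chosen.values())
    return [(ts, 100.0 * (i + 1) / n_submitted) for i, ts in enumerate(times)]


def time_to_reach(curve, start, percent):
    """Elapsed time until the curve first reaches `percent`, or None if it never does."""
    for ts, pct in curve:
        if pct >= percent:
            return ts - start
    return None


# Made-up example: 4 jobs submitted, job 1 returned twice, job 4 never returned.
start = datetime(2000, 1, 1)
arrivals = [
    (1, start + timedelta(hours=1)),
    (2, start + timedelta(hours=2)),
    (1, start + timedelta(hours=3)),
    (3, start + timedelta(hours=4)),
]

first_curve = cumulative_percent(arrivals, 4, keep="first")
last_curve = cumulative_percent(arrivals, 4, keep="last")
print(time_to_reach(first_curve, start, 50))  # 2 hours with first arrivals
print(time_to_reach(last_curve, start, 50))   # 3 hours with last arrivals
```

In this toy batch the time to reach 50% moves from two hours to three depending on which timestamp is kept, which is why the first-or-last question matters for the comparison.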
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 134 |
Just caught this. Looks like a problem somewhere. Edit. Further output:- It's now running OK. The following job is producing the same output. |
Send message Joined: 20 May 15 Posts: 217 Credit: 5,772,298 RAC: 17,481 |
The first figure usually approaches very close to 100% these days for the normal GRID, within a short time; the results I've seen for CMS@Home in the last week suggest that, even with the relatively limited number of volunteers, we are approaching completion results and times seen in the early days of the GRID, maybe 8 or 10 years ago. Interesting... So do you want me to turn some machines off or turn some more on? :-) |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,920,398 RAC: 2,619 |
Basically, I expect things like that. I'm pushing into more speculative areas because there'll always be someone who says, "But we've already done that, why repeat it?" As I've said earlier, some aspects of these simulations aren't perfect; the ones I'm particularly interested in are much more advanced. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,920,398 RAC: 2,619 |
The first figure usually approaches very close to 100% these days for the normal GRID, within a short time; the results I've seen for CMS@Home in the last week suggest that, even with the relatively limited number of volunteers, we are approaching completion results and times seen in the early days of the GRID, maybe 8 or 10 years ago. Interesting... OK, if all goes well now, I'll release a large number of jobs late tomorrow, of the order of 10X tonight's test (so, probably 1000 jobs -- maybe 2000). Since one of the figures of merit is how fast we can get to 90+% of submitted jobs being returned, yeah, more is better. But wait until I say "Go!" so you're not just spinning your wheels. Other things like running more than one VM per host still have to wait -- at the moment there are still sceptics to win around -- but hopefully I can get enough data to say a) if we can get this turnaround, imagine what it would be like with 10, 20, 50X more volunteers (x VMs), and b) the returned results are statistically indistinguishable from normal GRID jobs. After that potential burst, we'll probably go a bit quiet again, concentrating on procedures rather than throughput. At that stage you can obviously throttle back. And I guess at this stage I should reiterate -- thanks to all for participating; we appreciate it, and we hope you understand that we still operate under constraints and bugs yet to fix, yet to find. I can't put a timeline on when we can go into "production" -- in part that depends on the sceptics -- but I'll be disappointed if I'm still running tests into the New Year. Best, ivan |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,920,398 RAC: 2,619 |
Well, that was quick! We already have 85 results returned from the 100 jobs I submitted in my test batch! ...and as of now we have 100% results returned! In about 7.5 hours. Congratulations. You'll still get resubs for a while as we sort out why CRAB isn't telling Condor that jobs have successfully finished. |
Send message Joined: 20 May 15 Posts: 217 Credit: 5,772,298 RAC: 17,481 |
OK, if all goes well now, I'll release a large number of jobs late tomorrow, of the order of 10X tonight's test (so, probably 1000 jobs -- maybe 2000). Sounds promising. My wheels don't spin needlessly; I also run DPAD, which picks up the slack when CMS (or Atlas) isn't maxing out its cores :-) Not too concerned about the constraints and bugs yet to fix; more than happy with the level of support and interaction with the 'development team' in getting problems resolved. Something some of the other CERN teams could work on. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,920,398 RAC: 2,619 |
I think not, but I should consult the experts. There are occasions when (as I understand it) particle track generation goes off-kilter -- it's the nature of limited-precision floating-point numbers, amongst other uncertainties. I think the message is saying that the procedure that interpolates the magnetic field from the field maps is unable to determine the magnetic field at x=-82 cm, y=-298 cm and z=-640 cm. Z is along the beamline; X and Y are orthogonal to it. Since CMS is 21.6 m long (i.e. Z=+/- 10.8 m), that doesn't seem an infeasible coordinate. I'm surprised you've seen more of them; it should be rare. After a bit of searching, I don't think it's a problem for CMS@Home. The magnetic solenoid is 13 m long (internal diameter 6 m), so the point should be just inside one end. I'll go with my theory that floating-point inaccuracies led to a bizarre particle trajectory. |
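The arithmetic behind "just inside one end" can be checked with a few lines. This is only a sketch: it treats the solenoid as an ideal cylinder centred on the origin with its axis along Z, which is an assumption rather than the real detector geometry or field-map coverage. The 13 m length, 6 m internal diameter and the coordinates are the figures quoted in the post; the constant and function names are made up for illustration.

```python
import math

# Figures quoted in the post; the ideal-cylinder model is an assumption.
SOLENOID_LENGTH_CM = 1300.0
SOLENOID_INNER_DIAMETER_CM = 600.0


def locate(x_cm, y_cm, z_cm):
    """Place a point relative to an ideal cylinder centred on the origin, axis along Z."""
    r = math.hypot(x_cm, y_cm)  # radial distance from the beamline
    return {
        "r_cm": r,
        "within_length": abs(z_cm) <= SOLENOID_LENGTH_CM / 2,
        "within_bore": r <= SOLENOID_INNER_DIAMETER_CM / 2,
        "z_margin_cm": SOLENOID_LENGTH_CM / 2 - abs(z_cm),
    }


point = locate(-82.0, -298.0, -640.0)
print(point)
```

Under this simple model the point is 10 cm inside the end of the solenoid along Z, and its radial distance from the beamline comes out at roughly 309 cm, a little beyond the 300 cm inner radius. So it sits near the edge of the magnet volume in both directions rather than deep inside the bore.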
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,920,398 RAC: 2,619 |
Just a catch-up, there's a fuller explanation in Number Crunching: while I can now submit jobs again, we're chasing some bugs that make it inefficient to run lots of jobs. So, I'm running small batches of short jobs from time to time to get feedback for the developers. If you have a spare CPU to keep a VM task running, fine -- you may pick up a short job from time to time. Rest assured, I will let you know slightly in advance when I can release a large batch of jobs again. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,920,398 RAC: 2,619 |
I'm about to do a mini-blitz of short jobs so that Laurence and Hassen can do some monitoring. I'm hoping it will last into tomorrow, but not much further as there's a separate tack to be taken that requires me to be promptly monitoring as soon as the run starts. I'm not sure I can do that from home with the mini-blitz, so feel free to suck in as many of these jobs as you can, to get them finished overnight. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Ivan, for letting us know! |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,920,398 RAC: 2,619 |
Thanks, Ivan, for letting us know! There's a problem with the CRAB server submitting the jobs; it's failing with no error message. It's done this before, and usually 2 or 3 retries get through. I'll keep trying until midnight London time. [Edit] "Houston, we have submission!" Enjoy! [/Edit] [Edit^2] There seem to be about 47 active machines at the moment. :-) [/Edit^2] |
Send message Joined: 20 May 15 Posts: 217 Credit: 5,772,298 RAC: 17,481 |
Ooh, something is keeping my machines busy :-) |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,920,398 RAC: 2,619 |
Ooh, something is keeping my machines busy :-) Glad you like it; latest status has |
Send message Joined: 20 May 15 Posts: 217 Credit: 5,772,298 RAC: 17,481 |
The console is showing >90% for cmsrun, but on the one in front of me the logs are still all timed from early this afternoon! |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Did you do a refresh on your browser? |
Send message Joined: 20 May 15 Posts: 217 Credit: 5,772,298 RAC: 17,481 |
I looked at the console first and hadn't been to the logs today. But just to confirm, I have done a refresh and they are still only showing times before 2pm. |
©2024 CERN