Message boards : News : Some jobs again

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,920,398
RAC: 2,619
Message 839 - Posted: 25 Aug 2015, 16:37:29 UTC

We can now submit jobs again. I'll submit a test batch overnight, and then try for a bigger test for the rest of the week. Feel free to start running tasks again, and report problems (and successes...) in the usual places.
Thanks.
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 842 - Posted: 25 Aug 2015, 17:28:19 UTC - in response to Message 839.  

Is this a Problem or not:



It has continued with 6th, 7th and more records
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,920,398
RAC: 2,619
Message 843 - Posted: 25 Aug 2015, 20:11:31 UTC - in response to Message 842.  

I think not, but I should consult the experts. There are occasions when (as I understand it) particle track generation goes off-kilter -- it's the nature of limited-precision floating-point numbers, amongst other uncertainties. I think that's saying that the procedure that interpolates the magnetic field from the field-maps is unable to determine the magnetic field at x=-82 cm, y=-298 cm and z=-640 cm. Z is along the beamline, X & Y are orthogonal. Since CMS is 21.6 m long (i.e. Z=+/- 10.8 m) that doesn't seem an infeasible coordinate. I'm surprised you've seen more of them; it should be rare.
Let us know if it continues.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,920,398
RAC: 2,619
Message 846 - Posted: 25 Aug 2015, 20:39:37 UTC

Well, that was quick! We already have 85 results returned from the 100 jobs I submitted in my test batch! As far as I can tell, 11 have yet to complete; the rest are being resubmitted twice more for reasons that are high on our bug-finding list. I'm not overly concerned about this as yet, as some will have struck problems and need re-running anyway. What I'm not sure about is this: if a job gets successfully run more than once, is the timestamp I see from the first or the last result to arrive?
There are two things I want to compare in my first test (the abortive "challenge" from last week): a) The cumulative percentage of results received as a function of time, and b) A comparison of some key data-quality parameters between the CMS@Home and the normal CRAB results (they should be statistically the same, and in fact job-for-job identical for jobs with the same index number).
The first figure usually approaches very close to 100% these days for the normal GRID, within a short time; the results I've seen for CMS@Home in the last week suggest that, even with the relatively limited number of volunteers, we are approaching completion results and times seen in the early days of the GRID, maybe 8 or 10 years ago. Interesting...
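For anyone curious how the curve in (a) might be built, here is a minimal sketch, not the actual monitoring code. It assumes each returned result is a (job index, arrival time) pair and that only the first arrival per job index counts, which is one possible answer to the timestamp question above. The function names and the sample data are hypothetical.

```python
from datetime import datetime, timedelta


def first_arrivals(results):
    """Keep only the earliest arrival time seen for each job index."""
    earliest = {}
    for job_index, arrived in results:
        if job_index not in earliest or arrived < earliest[job_index]:
            earliest[job_index] = arrived
    return earliest


def cumulative_percentage(results, submitted_at, n_submitted):
    """Return (elapsed_hours, percent_returned) points, one per distinct job."""
    arrivals = sorted(first_arrivals(results).values())
    return [
        ((arrived - submitted_at).total_seconds() / 3600.0,
         100.0 * count / n_submitted)
        for count, arrived in enumerate(arrivals, start=1)
    ]


def hours_to_reach(curve, target_percent):
    """Elapsed hours at which the curve first reaches target_percent, else None."""
    for hours, percent in curve:
        if percent >= target_percent:
            return hours
    return None


# Hypothetical data: 4 jobs submitted, job 2 returned twice (a resubmission).
t0 = datetime(2015, 8, 25, 13, 0, 0)
results = [
    (1, t0 + timedelta(hours=1)),
    (2, t0 + timedelta(hours=2)),
    (2, t0 + timedelta(hours=5)),  # duplicate; the later arrival is ignored
    (3, t0 + timedelta(hours=3)),
    (4, t0 + timedelta(hours=6)),
]
curve = cumulative_percentage(results, t0, n_submitted=4)
for hours, percent in curve:
    print(f"{hours:4.1f} h  {percent:5.1f}%")
print(hours_to_reach(curve, 75.0))  # 3.0
```

Counting the last arrival instead would only need the comparison in first_arrivals flipped, so the two conventions are easy to compare on the same data.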
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 134
Message 847 - Posted: 25 Aug 2015, 21:03:57 UTC
Last modified: 25 Aug 2015, 21:50:48 UTC

Just caught this. Looks like a problem somewhere.



Edit. Further output:-



It's now running OK.

The following job is producing the same output.
PDW

Joined: 20 May 15
Posts: 217
Credit: 5,772,298
RAC: 17,481
Message 848 - Posted: 25 Aug 2015, 21:22:36 UTC - in response to Message 846.  

The first figure usually approaches very close to 100% these days for the normal GRID, within a short time; the results I've seen for CMS@Home in the last week suggest that, even with the relatively limited number of volunteers, we are approaching completion results and times seen in the early days of the GRID, maybe 8 or 10 years ago. Interesting...

So do you want me to turn some machines off or turn some more on ? :-)
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,920,398
RAC: 2,619
Message 849 - Posted: 25 Aug 2015, 22:55:05 UTC - in response to Message 847.  

Basically, I expect things like that. I'm pushing into more speculative areas because there'll always be someone who says, "But we've already done that, why repeat it?" As I've said earlier, some aspects of these simulations aren't perfect; the ones I'm particularly interested in are much more advanced.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,920,398
RAC: 2,619
Message 850 - Posted: 25 Aug 2015, 23:17:26 UTC - in response to Message 848.  

The first figure usually approaches very close to 100% these days for the normal GRID, within a short time; the results I've seen for CMS@Home in the last week suggest that, even with the relatively limited number of volunteers, we are approaching completion results and times seen in the early days of the GRID, maybe 8 or 10 years ago. Interesting...

So do you want me to turn some machines off or turn some more on ? :-)

OK, if all goes well now, I'll release a large number of jobs late tomorrow, of the order of 10X tonight's test (so, probably 1000 jobs -- maybe 2000).
Since one of the figures of merit is how fast we can get to 90+% of submitted jobs being returned, yeah, more is better. But wait until I say "Go!" so you're not just spinning your wheels.
Other things like running more than one VM per host still have to wait -- at the moment there are still sceptics to win around -- but hopefully I can get enough data to say a) if we can get this turnaround, imagine what it would be like with 10, 20, 50X more volunteers (x VMs), and b) the returned results are statistically indistinguishable from normal GRID jobs.
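To make point (b) concrete: since jobs with the same index number are expected to give identical results, the simplest check is a job-for-job comparison. The sketch below is only an illustration under that assumption. The dictionaries, the parameter values and the function name are hypothetical, and the real comparison would use actual data-quality parameters from the job outputs.

```python
import math


def compare_by_index(reference, candidate, rel_tol=1e-9):
    """Compare two {job_index: value} mappings job-for-job.

    Returns (mismatched, missing): indices present in both whose values
    differ, and indices present in reference but absent from candidate.
    """
    mismatched = sorted(
        idx for idx in reference.keys() & candidate.keys()
        if not math.isclose(reference[idx], candidate[idx], rel_tol=rel_tol)
    )
    missing = sorted(reference.keys() - candidate.keys())
    return mismatched, missing


# Hypothetical values of one data-quality parameter per job index.
grid_results = {1: 12.50, 2: 13.10, 3: 11.95, 4: 12.80}
volunteer_results = {1: 12.50, 2: 13.10, 3: 11.90}

mismatched, missing = compare_by_index(grid_results, volunteer_results)
print("mismatched:", mismatched)  # [3]
print("missing:", missing)        # [4]
```

If the job-for-job match turned out not to hold exactly, a statistical comparison of the two distributions would be the fallback.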
After that potential burst, we'll probably go a bit quiet again, concentrating on procedures rather than throughput. At that stage you can obviously throttle back.
And I guess at this stage I should reiterate -- Thanks to all for participating, we appreciate it and we hope you understand that we still operate under constraints and bugs yet to fix, yet to find. I can't put a timeline on when we can go into "production" -- in part that depends on the sceptics -- but I'll be disappointed if I'm still running tests into the New Year.

Best, ivan
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,920,398
RAC: 2,619
Message 851 - Posted: 25 Aug 2015, 23:25:38 UTC - in response to Message 846.  
Last modified: 25 Aug 2015, 23:26:07 UTC

Well, that was quick! We already have 85 results returned from the 100 jobs I submitted in my test batch!

...and as of now we have 100% results returned! In about 7.5 hours. Congratulations. You'll still get resubs for a while while we sort out why CRAB isn't telling Condor that jobs have successfully finished.
PDW

Joined: 20 May 15
Posts: 217
Credit: 5,772,298
RAC: 17,481
Message 852 - Posted: 25 Aug 2015, 23:30:24 UTC - in response to Message 850.  
Last modified: 25 Aug 2015, 23:31:24 UTC

OK, if all goes well now, I'll release a large number of jobs late tomorrow, of the order of 10X tonight's test (so, probably 1000 jobs -- maybe 2000).
Since one of the figures of merit is how fast we can get to 90+% of submitted jobs being returned, yeah, more is better. But wait until I say "Go!" so you're not just spinning your wheels.
Other things like running more than one VM per host still have to wait -- at the moment there are still sceptics to win around -- but hopefully I can get enough data to say a) if we can get this turnaround, imagine what it would be like with 10, 20, 50X more volunteers (x VMs), and b) the returned results are statistically indistinguishable from normal GRID jobs.
After that potential burst, we'll probably go a bit quiet again, concentrating on procedures rather than throughput. At that stage you can obviously throttle back.
And I guess at this stage I should reiterate -- Thanks to all for participating, we appreciate it and we hope you understand that we still operate under constraints and bugs yet to fix, yet to find. I can't put a timeline on when we can go into "production" -- in part that depends on the sceptics -- but I'll be disappointed if I'm still running tests into the New Year.

Best, ivan


Sounds promising.

My wheels don't spin needlessly, I also run DPAD which picks up the slack when either CMS (or Atlas) aren't maxing out their cores :-)

Not too concerned about the constraints and bugs yet to fix; more than happy with the level of support and interaction with the 'development team' in getting problems resolved. Something some of the other CERN teams could work on.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,920,398
RAC: 2,619
Message 853 - Posted: 26 Aug 2015, 9:22:36 UTC - in response to Message 843.  

I think not, but I should consult the experts. There are occasions when (as I understand it) particle track generation goes off-kilter -- it's the nature of limited-precision floating-point numbers, amongst other uncertainties. I think that's saying that the procedure that interpolates the magnetic field from the field-maps is unable to determine the magnetic field at x=-82 cm, y=-298 cm and z=-640 cm. Z is along the beamline, X & Y are orthogonal. Since CMS is 21.6 m long (i.e. Z=+/- 10.8 m) that doesn't seem an infeasible coordinate. I'm surprised you've seen more of them; it should be rare.

After a bit of searching, I don't think it's a problem for CMS@Home. The magnetic solenoid is 13 m long (internal diameter 6 m) so the point should be just inside one end. I'll go with my theory that floating-point inaccuracies led to a bizarre particle trajectory.
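As a rough sanity check of those figures, here is a small sketch that places the reported point relative to the solenoid. It treats the solenoid as an ideal cylinder centred on z = 0 with its axis along the beamline, which is an assumption on my part, and it has nothing to do with the real field-map interpolation code. The constant and function names are made up for the example.

```python
import math

# Dimensions quoted above, converted to centimetres.
SOLENOID_LENGTH_CM = 1300.0
SOLENOID_INNER_DIAMETER_CM = 600.0


def locate(x_cm, y_cm, z_cm):
    """Return (radial distance from the beamline, within length?, within bore radius?)."""
    r_cm = math.hypot(x_cm, y_cm)
    within_length = abs(z_cm) <= SOLENOID_LENGTH_CM / 2
    within_bore = r_cm <= SOLENOID_INNER_DIAMETER_CM / 2
    return r_cm, within_length, within_bore


r_cm, within_length, within_bore = locate(-82.0, -298.0, -640.0)
print(f"r = {r_cm:.1f} cm, within length: {within_length}, "
      f"within bore radius: {within_bore}")
# r = 309.1 cm, within length: True, within bore radius: False
```

On these numbers the point is 10 cm inside the end of the solenoid along z, while its radial distance of about 309 cm is slightly more than the 3 m internal radius, so it sits close to the boundary either way.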
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,920,398
RAC: 2,619
Message 858 - Posted: 27 Aug 2015, 0:18:30 UTC

Just a catch-up; there's a fuller explanation in Number Crunching. While I can now submit jobs again, we're chasing some bugs that make it inefficient to run lots of jobs. So, I'm running small batches of short jobs from time to time to get feedback for the developers. If you have a spare CPU to keep a VM task running, fine -- you may pick up a short job from time to time.
Rest assured, I will let you know slightly in advance when I can release a large batch of jobs again.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,920,398
RAC: 2,619
Message 860 - Posted: 27 Aug 2015, 20:19:42 UTC

I'm about to do a mini-blitz of short jobs so that Laurence and Hassen can do some monitoring. I'm hoping it will last into tomorrow, but not much further as there's a separate tack to be taken that requires me to be promptly monitoring as soon as the run starts. I'm not sure I can do that from home with the mini-blitz, so feel free to suck in as many of these jobs as you can, to get them finished overnight.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 861 - Posted: 27 Aug 2015, 20:27:30 UTC - in response to Message 860.  

Thanks, Ivan, for letting us know!
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,920,398
RAC: 2,619
Message 862 - Posted: 27 Aug 2015, 20:41:26 UTC - in response to Message 861.  
Last modified: 27 Aug 2015, 20:53:40 UTC

Thanks, Ivan, for letting us know!

There's a problem with the CRAB server submitting the jobs; it's failing with no error message. It's done this before, and usually 2 or 3 retries get it through. I'll keep trying until midnight London time.
[Edit] "Houston, we have submission!" Enjoy! [/Edit]
[Edit^2] There seem to be about 47 active machines at the moment. :-) [/Edit^2]
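For the curious, the "retry until it goes through" approach amounts to something like the sketch below. It is a generic illustration only: submit is a hypothetical stand-in for whatever actually performs the submission, not a real CRAB call, and the attempt limit and wait time are arbitrary.

```python
import time


def submit_with_retries(submit, max_attempts=3, wait_seconds=60.0, sleep=time.sleep):
    """Call submit() until it returns True or max_attempts is exhausted.

    Returns the number of attempts used on success, or None on failure.
    """
    for attempt in range(1, max_attempts + 1):
        if submit():
            return attempt
        if attempt < max_attempts:
            sleep(wait_seconds)  # pause before trying again
    return None


# Hypothetical stand-in: fails silently twice, then succeeds.
outcomes = iter([False, False, True])
attempts = submit_with_retries(lambda: next(outcomes), wait_seconds=0.0)
print(attempts)  # 3
```

A deadline (such as midnight) could replace the fixed attempt count by checking the clock at the top of the loop.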
PDW

Joined: 20 May 15
Posts: 217
Credit: 5,772,298
RAC: 17,481
Message 863 - Posted: 27 Aug 2015, 20:59:54 UTC - in response to Message 862.  

Ooh, something is keeping my machines busy :-)
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,920,398
RAC: 2,619
Message 864 - Posted: 27 Aug 2015, 21:04:24 UTC - in response to Message 863.  

Ooh, something is keeping my machines busy :-)

Glad you like it; latest status has 65, now 66, machines running. As I said, these will be relatively short jobs, for a fast processor probably less than 10 minutes each. Not greatly efficient, but we need to grab some logs.
PDW

Joined: 20 May 15
Posts: 217
Credit: 5,772,298
RAC: 17,481
Message 865 - Posted: 27 Aug 2015, 21:10:20 UTC - in response to Message 864.  

The console is showing >90% for cmsrun but the logs are still all timed from early this afternoon on the one in front of me !
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 866 - Posted: 27 Aug 2015, 21:11:46 UTC - in response to Message 865.  

Did you do a refresh on your browser?
PDW

Joined: 20 May 15
Posts: 217
Credit: 5,772,298
RAC: 17,481
Message 867 - Posted: 27 Aug 2015, 21:16:00 UTC - in response to Message 866.  

I looked at the console first and hadn't been to the logs today.

But just to confirm, I have done a refresh and they are still only showing times before 2pm.