Thread 'Some jobs again'

Author	Message
ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 839 - Posted: 25 Aug 2015, 16:37:29 UTC We can now submit jobs again. I'll submit a test batch overnight, and then try for a bigger test for the rest of the week. Feel free to start running tasks again, and report problems (and successes...) in the usual places. Thanks.

Yeti Joined: 29 May 15 Posts: 163 Credit: 3,575,014 RAC: 8,689	Message 842 - Posted: 25 Aug 2015, 17:28:19 UTC - in response to Message 839. Is this a Problem or not: It has continued with 6th, 7th and more records

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 843 - Posted: 25 Aug 2015, 20:11:31 UTC - in response to Message 842. I think not, but I should consult the experts. There are occasions when (as I understand it) particle track generation goes off-kilter -- it's the nature of limited-precision floating-point numbers, amongst other uncertainties. I think that's saying that the procedure that interpolates the magnetic field from the field-maps is unable to determine the magnetic field at x=-82 cm, y=-298 cm and z=-640 cm. Z is along the beamline, X & Y are orthogonal. Since CMS is 21.6 m long (i.e. Z=+/- 10.8 m) that doesn't seem an infeasible coordinate. I'm surprised you've seen more of them; it should be rare. Let us know if it continues.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 846 - Posted: 25 Aug 2015, 20:39:37 UTC Well, that was quick! We already have 85 results returned from the 100 jobs I submitted in my test batch! As far as I can tell, 11 have yet to complete, the rest are being resubmitted twice more for reasons that are high on our bug-finding list. I'm not overly concerned about this as yet, as some will have struck problems and need re-running anyway. What I'm not sure about is, if a job gets successfully run more than once, is the timestamp I see from the first or the last result to arrive? There are two things I want to compare in my first test (the abortive "challenge" from last week): a) the cumulative percentage of results received as a function of time, and b) a comparison of some key data-quality parameters between the CMS@Home and the normal CRAB results (they should be statistically the same, and in fact job-for-job identical for jobs with the same index number). The first figure usually approaches very close to 100% these days for the normal GRID, within a short time; the results I've seen for CMS@Home in the last week suggest that, even with the relatively limited number of volunteers, we are approaching completion results and times seen in the early days of the GRID, maybe 8 or 10 years ago. Interesting...
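
Metric (a) in the post above, the cumulative percentage of results received over time, can be sketched roughly as follows. This is only an illustration: the (job_index, hours_since_submission) record layout, the function names and the sample data are assumptions, not the project's actual monitoring code. The `keep` parameter reflects the open question in the post about whether the first or the last arrival is counted when a job runs more than once.

```python
from typing import Dict, List, Optional, Tuple


def cumulative_percentage(
    results: List[Tuple[int, float]],
    n_submitted: int,
    keep: str = "first",
) -> List[Tuple[float, float]]:
    """Return (time, percent of submitted jobs returned) points, sorted by time.

    results: hypothetical (job_index, hours_since_submission) records; a job
    may appear more than once if it was resubmitted and ran again.
    keep: count the "first" or the "last" arrival for each job index.
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")
    pick = min if keep == "first" else max
    arrival: Dict[int, float] = {}
    for job, t in results:
        arrival[job] = t if job not in arrival else pick(arrival[job], t)
    times = sorted(arrival.values())
    return [(t, 100.0 * (i + 1) / n_submitted) for i, t in enumerate(times)]


def time_to_percent(
    curve: List[Tuple[float, float]], percent: float
) -> Optional[float]:
    """First time at which the curve reaches `percent`, or None if it never does."""
    for t, p in curve:
        if p >= percent:
            return t
    return None


# Made-up records for four submitted jobs; job 2 was returned twice.
records = [(1, 0.5), (2, 1.0), (2, 3.0), (3, 2.0), (4, 2.5)]
first = cumulative_percentage(records, n_submitted=4, keep="first")
last = cumulative_percentage(records, n_submitted=4, keep="last")
print(time_to_percent(first, 90.0), time_to_percent(last, 90.0))
```

With duplicated results the two conventions can give different times to reach a given percentage, so whichever one the timestamps follow needs to be known before curves from different runs are compared.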

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 847 - Posted: 25 Aug 2015, 21:03:57 UTC Last modified: 25 Aug 2015, 21:50:48 UTC Just caught this. Looks like a problem somewhere. Edit. Further output:- It's now running OK. The following job is producing the same output.

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 848 - Posted: 25 Aug 2015, 21:22:36 UTC - in response to Message 846. [quote]The first figure usually approaches very close to 100% these days for the normal GRID, within a short time; the results I've seen for CMS@Home in the last week suggest that, even with the relatively limited number of volunteers, we are approaching completion results and times seen in the early days of the GRID, maybe 8 or 10 years ago. Interesting...[/quote] So do you want me to turn some machines off or turn some more on ? :-)

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 849 - Posted: 25 Aug 2015, 22:55:05 UTC - in response to Message 847. Basically, I expect things like that. I'm pushing into more speculative areas because there'll always be someone who says, "But we've already done that, why repeat it?" As I've said earlier, some aspects of these simulations aren't perfect; the ones I'm particularly interested in are much more advanced.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 850 - Posted: 25 Aug 2015, 23:17:26 UTC - in response to Message 848. [quote][quote]The first figure usually approaches very close to 100% these days for the normal GRID, within a short time; the results I've seen for CMS@Home in the last week suggest that, even with the relatively limited number of volunteers, we are approaching completion results and times seen in the early days of the GRID, maybe 8 or 10 years ago. Interesting...[/quote] So do you want me to turn some machines off or turn some more on ? :-)[/quote] OK, if all goes well now, I'll release a large number of jobs late tomorrow, of the order of 10X tonight's test (so, probably 1000 jobs -- maybe 2000). Since one of the figures of merit is how fast we can get to 90+% of submitted jobs being returned, yeah, more is better. But wait until I say "Go!" so you're not just spinning your wheels. Other things like running more than one VM per host still have to wait -- at the moment there are still sceptics to win around -- but hopefully I can get enough data to say a) if we can get this turnaround, imagine what it would be like with 10, 20, 50X more volunteers (x VMs), and b) the returned results are statistically indistinguishable from normal GRID jobs. After that potential burst, we'll probably go a bit quiet again, concentrating on procedures rather than throughput. At that stage you can obviously throttle back. And I guess at this stage I should reiterate -- Thanks to all for participating, we appreciate it and we hope you understand that we still operate under constraints and bugs yet to fix, yet to find. I can't put a timeline on when we can go into "production" -- in part that depends on the sceptics -- but I'll be disappointed if I'm still running tests into the New Year. Best, ivan
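
Point (b) above, comparing volunteer results with normal GRID results, could be checked in many ways. The sketch below is one possibility and not necessarily what the CMS team uses: it assumes a single hypothetical per-job quantity stored in dictionaries keyed by job index, lists the jobs that are not identical between the two sets, and computes Welch's t statistic for the two samples. All names and numbers are made up for illustration.

```python
import math
from statistics import mean, variance
from typing import Dict, List, Sequence


def mismatched_jobs(
    volunteer: Dict[int, float], grid: Dict[int, float], tol: float = 0.0
) -> List[int]:
    """Job indices present in both sets whose values differ by more than tol."""
    common = volunteer.keys() & grid.keys()
    return sorted(j for j in common if abs(volunteer[j] - grid[j]) > tol)


def welch_t(a: Sequence[float], b: Sequence[float]) -> float:
    """Welch's t statistic for two independent samples (unequal variances allowed).

    Needs at least two values per sample and a non-zero combined variance.
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


# Made-up values of some data-quality parameter, keyed by job index.
volunteer = {1: 10.0, 2: 12.0, 3: 11.0}
grid = {1: 10.0, 2: 12.0, 3: 11.5}

different = mismatched_jobs(volunteer, grid)
t = welch_t(list(volunteer.values()), list(grid.values()))
print(different, round(t, 3))
```

A small absolute value of t only says the sample means are compatible; the job-for-job check is the stricter test, since jobs with the same index number are expected to be identical.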

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 851 - Posted: 25 Aug 2015, 23:25:38 UTC - in response to Message 846. Last modified: 25 Aug 2015, 23:26:07 UTC [quote]Well, that was quick! We already have 85 results returned from the 100 jobs I submitted in my test batch![/quote] ...and as of now we have 100% results returned! In about 7.5 hours. Congratulations. You'll still get resubs for a while as we sort out why CRAB isn't telling Condor that jobs have successfully finished.

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 852 - Posted: 25 Aug 2015, 23:30:24 UTC - in response to Message 850. Last modified: 25 Aug 2015, 23:31:24 UTC [quote]OK, if all goes well now, I'll release a large number of jobs late tomorrow, of the order of 10X tonight's test (so, probably 1000 jobs -- maybe 2000). Since one of the figures of merit is how fast we can get to 90+% of submitted jobs being returned, yeah, more is better. But wait until I say "Go!" so you're not just spinning your wheels. Other things like running more than one VM per host still have to wait -- at the moment there are still sceptics to win around -- but hopefully I can get enough data to say a) if we can get this turnaround, imagine what it would be like with 10, 20, 50X more volunteers (x VMs), and b) the returned results are statistically indistinguishable from normal GRID jobs. After that potential burst, we'll probably go a bit quiet again, concentrating on procedures rather than throughput. At that stage you can obviously throttle back. And I guess at this stage I should reiterate -- Thanks to all for participating, we appreciate it and we hope you understand that we still operate under constraints and bugs yet to fix, yet to find. I can't put a timeline on when we can go into "production" -- in part that depends on the sceptics -- but I'll be disappointed if I'm still running tests into the New Year. Best, ivan[/quote] Sounds promising. My wheels don't spin needlessly; I also run DPAD, which picks up the slack when either CMS (or Atlas) aren't maxing out their cores :-) Not too concerned about the constraints and bugs yet to fix, more than happy with the level of support and interaction with the 'development team' in getting problems resolved. Something some of the other CERN teams could work on.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 853 - Posted: 26 Aug 2015, 9:22:36 UTC - in response to Message 843. [quote]I think not, but I should consult the experts. There are occasions when (as I understand it) particle track generation goes off-kilter -- it's the nature of limited-precision floating-point numbers, amongst other uncertainties. I think that's saying that the procedure that interpolates the magnetic field from the field-maps is unable to determine the magnetic field at x=-82 cm, y=-298 cm and z=-640 cm. Z is along the beamline, X & Y are orthogonal. Since CMS is 21.6 m long (i.e. Z=+/- 10.8 m) that doesn't seem an infeasible coordinate. I'm surprised you've seen more of them; it should be rare.[/quote] After a bit of searching, I don't think it's a problem for CMS@Home. The magnetic solenoid is 13 m long (internal diameter 6 m) so the point should be just inside one end. I'll go with my theory that floating-point inaccuracies led to a bizarre particle trajectory.
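
The position check in the post above is simple arithmetic and can be sketched as below. The dimensions are the ones quoted in the thread (detector 21.6 m long, solenoid 13 m long with 6 m internal diameter); treating the solenoid as an ideal cylinder centred on the origin, and the function and constant names, are assumptions made for illustration.

```python
import math

# Dimensions quoted in the thread, converted to cm.
# Modelling the solenoid as an ideal cylinder centred at z = 0 is an assumption.
DETECTOR_HALF_LENGTH_CM = 2160.0 / 2   # CMS is 21.6 m long
SOLENOID_HALF_LENGTH_CM = 1300.0 / 2   # solenoid is 13 m long
SOLENOID_INNER_RADIUS_CM = 600.0 / 2   # internal diameter 6 m


def locate(x_cm: float, y_cm: float, z_cm: float) -> dict:
    """Report where a point sits relative to the simplified geometry."""
    r = math.hypot(x_cm, y_cm)  # radial distance from the beamline (z axis)
    return {
        "r_cm": r,
        "within_detector_length": abs(z_cm) <= DETECTOR_HALF_LENGTH_CM,
        "within_solenoid_length": abs(z_cm) <= SOLENOID_HALF_LENGTH_CM,
        "within_solenoid_bore": r <= SOLENOID_INNER_RADIUS_CM,
    }


# The coordinate mentioned in the thread.
point = locate(-82.0, -298.0, -640.0)
print(point)
```

With these numbers the point lies 10 cm inside the end of the solenoid along z, while its radial distance of roughly 309 cm is close to the 300 cm inner radius, so in this simplified picture it sits right at the edge of the bore.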

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 858 - Posted: 27 Aug 2015, 0:18:30 UTC Just a catch-up, there's a fuller explanation in Number Crunching: while I can now submit jobs again, we're chasing some bugs that make it inefficient to run lots of jobs. So, I'm running small batches of short jobs from time to time to get feedback for the developers. If you have a spare CPU to keep a VM task running, fine -- you may pick up a short job from time to time. Rest assured, I will let you know slightly in advance when I can release a large batch of jobs again.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 860 - Posted: 27 Aug 2015, 20:19:42 UTC I'm about to do a mini-blitz of short jobs so that Laurence and Hassen can do some monitoring. I'm hoping it will last into tomorrow, but not much further as there's a separate tack to be taken that requires me to be promptly monitoring as soon as the run starts. I'm not sure I can do that from home with the mini-blitz, so feel free to suck in as many of these jobs as you can, to get them finished overnight.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 861 - Posted: 27 Aug 2015, 20:27:30 UTC - in response to Message 860. Thanks, Ivan, for letting us know!

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 862 - Posted: 27 Aug 2015, 20:41:26 UTC - in response to Message 861. Last modified: 27 Aug 2015, 20:53:40 UTC [quote]Thanks, Ivan, for letting us know![/quote] There's a problem with the CRAB server submitting the jobs; it's failing with no error message. It's done this before; usually 2 or 3 retries get through. I'll keep trying until midnight London time. [Edit] "Houston, we have submission!" Enjoy! [/Edit] [Edit^2] There seem to be about 47 active machines at the moment. :-) [/Edit^2]

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 863 - Posted: 27 Aug 2015, 20:59:54 UTC - in response to Message 862. Ooh, something is keeping my machines busy :-)

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25	Message 864 - Posted: 27 Aug 2015, 21:04:24 UTC - in response to Message 863. [quote]Ooh, something is keeping my machines busy :-)[/quote] Glad you like it; latest status has 66 machines running. As I said, these will be relatively short jobs, for a fast processor probably less than 10 minutes each. Not greatly efficient, but we need to grab some logs.

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 865 - Posted: 27 Aug 2015, 21:10:20 UTC - in response to Message 864. The console is showing >90% for cmsrun but the logs are still all timed from early this afternoon on the one in front of me !

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 866 - Posted: 27 Aug 2015, 21:11:46 UTC - in response to Message 865. Did you do a refresh on your browser?

PDW Joined: 20 May 15 Posts: 217 Credit: 6,294,052 RAC: 0	Message 867 - Posted: 27 Aug 2015, 21:16:00 UTC - in response to Message 866. I looked at the console first and hadn't been to the logs today. But just to confirm, I have done a refresh and they are still only showing times before 2pm.

Development for LHC@home