Message boards : Number crunching : CMS doesn't crunch
Joined: 29 May 15 · Posts: 158 · Credit: 2,914,375 · RAC: 2,086

My ALT+F4 from the laptop: [screenshot]
Joined: 13 Feb 15 · Posts: 1252 · Credit: 995,923 · RAC: 45

Sorry, but one more: Interesting. I noticed several days ago that ALT+F4 and ALT+F5 suddenly didn't show anything where they did before, but I thought it was on purpose and not a bug . . .
Joined: 29 May 15 · Posts: 158 · Credit: 2,914,375 · RAC: 2,086

I will give this a try when I see the box crunching again. Thanks!
Joined: 9 Apr 15 · Posts: 57 · Credit: 230,221 · RAC: 0

> I noticed several days ago that ALT+F4 and ALT+F5 suddenly didn't show anything where they did before, but I thought it was on purpose and not a bug . . .

That (and the http logs) changed when Condor was introduced instead of the previous mechanism.
Joined: 13 Feb 15 · Posts: 1252 · Credit: 995,923 · RAC: 45

> I noticed several days ago that ALT+F4 and ALT+F5 suddenly didn't show anything where they did before, but I thought it was on purpose and not a bug . . .

ALT+F4 now shows the contents of the local machine log (runGlideinout), and ALT+F5 now shows info about the SCRAM setup and the cmsRun, which is also available in the machine log cmsRun-stdout.log; however, the display seems to be ahead of the contents of the log. It looks like the logfile contains the info about the last cmsRun.
Joined: 29 May 15 · Posts: 158 · Credit: 2,914,375 · RAC: 2,086

Yeah, I found cmsRun-stdout.log, and in it you can see:

```
Begin processing the 98th record. Run 1, Event 18298, LumiSection 183 at 15-Aug-2015 11:48:08.481 CEST
Begin processing the 99th record. Run 1, Event 18299, LumiSection 183 at 15-Aug-2015 11:48:09.688 CEST
Begin processing the 100th record. Run 1, Event 18300, LumiSection 183 at 15-Aug-2015 11:48:13.858 CEST
```
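For anyone who would rather monitor progress from a script than watch the console, something along these lines could pull the latest event out of cmsRun-stdout.log. This is a minimal sketch based only on the log lines quoted above; the log's location (it lives inside the VM, so it would need a shared folder or to be run in the guest) and the exact line format on other batches are assumptions.

```python
import re

# Matches the progress lines quoted above, e.g.
#   Begin processing the 98th record. Run 1, Event 18298,
#   LumiSection 183 at 15-Aug-2015 11:48:08.481 CEST
# (format inferred from this thread; other cmsRun versions may differ)
PROGRESS = re.compile(
    r"Begin processing the (\d+)\w* record\."
    r"\s*Run (\d+), Event (\d+), LumiSection (\d+) at (.+)"
)

def last_progress(path="cmsRun-stdout.log"):
    """Return the newest (record, run, event, lumisection, timestamp)."""
    latest = None
    with open(path, errors="replace") as log:
        for line in log:
            match = PROGRESS.search(line)
            if match:
                latest = match.groups()
    return latest

if __name__ == "__main__":
    hit = last_progress()
    if hit:
        print("record %s: run %s, event %s, lumi %s at %s" % hit)
    else:
        print("no 'Begin processing' lines found yet")
```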
Joined: 29 May 15 · Posts: 158 · Credit: 2,914,375 · RAC: 2,086

Okay, one more: When I came back to my computers today, the desktop was crunching CMS, but the laptop was sitting there idle. I played around a bit, but it didn't get back to fetching (or getting?) any work. As the WU had already run for 12 hours, I reset the whole project, and now my laptop looks as if it is fetching work again.
Joined: 29 May 15 · Posts: 158 · Credit: 2,914,375 · RAC: 2,086

Now both computers are out of work, so I guess the work queue is empty.
Joined: 20 Jan 15 · Posts: 1152 · Credit: 8,310,612 · RAC: 0

> Now both computers are out of work.

OK, I'll check.

[Edit] Yes, the well was dry. I've submitted another batch of 500 jobs; these will take 4x longer than the last lot, so hopefully that'll satisfy your hunger for a day or two. I just have to wait a little while until Condor creates all the instances, and then change their characteristics to suit our requirements. Happy crunching! [/Edit]
Joined: 20 Mar 15 · Posts: 243 · Credit: 901,716 · RAC: 211

> ...I've submitted another batch of 500 jobs; these will take 4x longer than the last lot...

Oh, dear. I hope this doesn't mean that each individual CMS job will take longer. I'm sure that many hosts, like mine, don't run continuously, and it seems that CMS jobs aren't saved when the host is shut down. Recent wrapper versions allow VMs to be saved, and vLHC ones are, so it can be done, unless they're about to mess things up. So, unless the host runs long enough to complete a CMS job in one go, as it were, it doesn't actually get anywhere. Is this how it works? If so, are there plans to implement a "VM saving" arrangement? I've a nagging thought that a lot of time (and electricity) is being wasted.
Joined: 29 May 15 · Posts: 158 · Credit: 2,914,375 · RAC: 2,086

Each job took 8 to 40 minutes in the last batch.
Joined: 20 Jan 15 · Posts: 1152 · Credit: 8,310,612 · RAC: 0

> ...I've submitted another batch of 500 jobs; these will take 4x longer than the last lot...

Hmm, I must admit I'm not exactly sure about the implications of what you're describing. I'd always thought that task states were saved; if that's changed, I wasn't aware of it. On reflection, I'm sure they are; we used to checkpoint regularly, but changed to a "save state when closed" model -- as long as BOINC is explicitly closed rather than the machine shut down peremptorily beneath it.

To reiterate my best understanding of how we operate: each participating machine gets one (to be increased later, preferably owner-specified) "task" that runs for ~24 hours. Each task polls for "jobs" (which are what I have recently generated) which run within the task; as each job finishes, it reports the results and requests a new job. After the task's lifetime is over, it closes down and, if BOINC allows, a new task begins.

Because there is unavoidable overhead in starting up each task, and also in starting/ending each job, it's obviously more efficient to have longer jobs -- the ultimate would be to aim for ~23-hour jobs, but the disparity in host machines makes that impracticable. Some simple subset would probably suffice: say, 5 hours for the fastest machine would allow four completed jobs per task, while a slower machine would return three or two; you're never going to achieve 100% efficiency! I must also admit I'm hazy about what happens if a job is unfinished when the task wants to terminate; does it just terminate, or does it wait for the job to complete? I need some input from our CERN collaborators here.

On a practical note, because some people care about metrics: to the best of my knowledge, we allocate credits based on the number of tasks run, not the number of jobs processed. I should note that we are also considering some tangible "prizes" for successful contributors -- more tangible than SETI@Home's toasters! T-shirts, baseball caps, perhaps a visit to CERN itself -- it's still a bit up in the air, but watch out for it!

About now I should add some philosophy -- why run CMS@Home at all? After all, we have a perfectly good WLCG "GRID" to run thousands of jobs on millions of cores around the world. The thing is, these are for well-known phenomena; IIRC there are some jobs that don't get a look-in, e.g. the guidelines don't allow jobs where the expected result is less than one in 10^8 (as someone who has run more Monte Carlo simulations than most of humanity, I quibble with that but, whatever...). So one aim of CMS@Home, as presented to me as an inducement to become its public spokesman, is to allow these kinds of "blue-skies" explorations to be carried out. When we go mainstream, expect the type of jobs to change.
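For readers who find code clearer than prose, the task/job model described in this post boils down to a loop like the one below. This is purely an illustrative sketch of the described behaviour; every name in it is a hypothetical stand-in for the Condor-based machinery inside the VM, not a real CMS@Home API.

```python
import time

def run_task(fetch_job, run_job, report_result, lifetime=24 * 3600):
    """Illustrative task loop: keep pulling jobs until the lifetime is up.

    All four arguments are hypothetical stand-ins for the machinery the
    thread describes; lifetime defaults to the ~24 hours mentioned above.
    """
    started = time.time()
    completed = 0
    while time.time() - started < lifetime:
        job = fetch_job()          # ask the server for the next job
        if job is None:
            time.sleep(60)         # no work available; poll again shortly
            continue
        result = run_job(job)      # e.g. a multi-hour cmsRun simulation
        report_result(result)      # report the result, then ask for more
        completed += 1
    # Open question from the post above: a job still running at this
    # point is either allowed to finish or abandoned -- unresolved here.
    return completed
```

The efficiency argument falls straight out of this loop: the fixed start-up cost of the task (and of each job) is amortised over however many jobs complete within the lifetime, so 5-hour jobs give a fast machine four completed jobs per task, a slower one three or two.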
Joined: 29 May 15 · Posts: 158 · Credit: 2,914,375 · RAC: 2,086

Ivan, a very good description, thank you very much for this.

> Hmm, I must admit I'm not exactly sure about the implications of what you're describing. I'd always thought that task states were saved; if that's changed, I wasn't aware of it. On reflection, I'm sure they are; we used to checkpoint regularly, but changed to a "save state when closed" model -- as long as BOINC is explicitly closed rather than the machine shut down peremptorily beneath it.

I'm sure there are some glitches at the moment, but as it works fine at vLHC, it will be no big thing to eliminate them.

> On a practical note, because some people care about metrics: to the best of my knowledge, we allocate credits based on the number of tasks run, not the number of jobs processed.

It would be a milestone if you could change this; at the moment, a lot of people only care about their BOINC credits and never take a look at what is really happening inside the VM. So a lot of people get credit without ever having crunched a real job for the project. I have seen users complaining at ATLAS about why they don't get credit; they were sitting behind a firewall and their VMs couldn't reach CERN through it. ATLAS can recognize this, but vLHC doesn't (or didn't), so at vLHC they get the feeling that all is fine.
Joined: 20 Mar 15 · Posts: 243 · Credit: 901,716 · RAC: 211

Thanks, Ivan.

> I'd always thought that task states were saved; if that's changed, I wasn't aware of it. On reflection, I'm sure they are; we used to checkpoint regularly, but changed to a "save state when closed" model -- as long as BOINC is explicitly closed rather than the machine shut down peremptorily beneath it.

As I understand it, there are three ways to shut things down, apart from simply turning the power off:

1. Shut down the PC (Windows Start/Shut down/OK or whatever), leaving the OS to close everything in an orderly manner. Probably what most "normal" volunteers would do.
2. Close BOINC from the GUI (Advanced/Shut down connected client/...).
3. Use the boinccmd --quit command.

From memory, they all leave the VM in a different state. Maybe it depends on whether the task is still in memory (the "leave tasks in memory" option). Here, things are normally shut down using boinccmd, run as a Linux cron job or Windows scheduled task at a specific time; the PC is then shut down by another job a few minutes later. I can't readily see what the state of the VM is, 'cos everything starts up again when the PC boots, but using the vLHC console (2?) their events resume seamlessly. There doesn't now seem to be an equivalent real-time display from CMS, but from what I've seen, it appears that CMS goes through the start-up process, or a lot of it, again. I could easily be wrong; most of what scrolls by doesn't make as much sense as I would like, and I, for one, would really like a "real-time" display of job progress, as with vLHC. In any case, if jobs run for less than an hour, it's not really a problem.

> I must also admit I'm hazy about what happens if a job is unfinished when the task wants to terminate; does it just terminate, or does it wait for the job to complete?

Probably abandoned; this is what vLHC does.

> On a practical note, because some people care about metrics: to the best of my knowledge, we allocate credits based on the number of tasks run, not the number of jobs processed.

That seems OK; after all, we make resources available, and if the project chooses to waste them by taking an hour to start work or abandoning work in progress at some arbitrary point, that shouldn't be the volunteers' problem. If this were changed, say to jobs successfully completed, then credit-hunting volunteers would switch to projects where all their resources were translated into credit, wouldn't they? That would then bring pressure on the project to make better use of donated resources... maybe.

> About now I should add some philosophy -- why run CMS@Home at all? [...]

Is that question for us volunteers, or for you, the users of the "product"? For us, there are probably nearly as many reasons and combinations as volunteers. Once someone tried to find this out from the, then, T4T crunchers; I don't know if there was a result.

Now, where have those sheep gone?
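For reference, the scheduled shutdown described in the post above (boinccmd first, power-off a few minutes later) could be driven by a script along these lines. `boinccmd --quit` is a real BOINC client command; the five-minute grace period and the Linux `shutdown` invocation are illustrative assumptions.

```python
import subprocess
import time

# Ask the local BOINC client to exit cleanly; this is what gives the
# wrapper a chance to save the VM state before the machine goes down.
subprocess.run(["boinccmd", "--quit"], check=True)

# Give the client time to save the multi-GB VM image -- the thread
# suggests an impatient shutdown is exactly what loses work in progress.
time.sleep(300)

# Then power off (Linux shown; a Windows scheduled task would call the
# equivalent shutdown command). Requires appropriate privileges.
subprocess.run(["shutdown", "-h", "now"], check=True)
```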
Joined: 13 Feb 15 · Posts: 1252 · Credit: 995,923 · RAC: 45

> Yes, the well was dry. I've submitted another batch of 500 jobs; these will take 4x longer than the last lot...

The first of those longer jobs I got ran for 4 hours minus 7 minutes. I had to extend BOINC's task lifetime to 90,857 seconds instead of the default 86,400 seconds (24 hrs), because otherwise the job would have been killed by BOINC ending its task.

To your previous post: when a task is suspended ("leave application in memory" off) or BOINC is stopped, the virtual machine is saved. However, I have already seen several times that the state of the CMS VM was not saved, but stopped. Maybe saving the >3 GB VM takes too long and BOINC is not patient enough. Resuming such a task means a reboot of the VM, and all work done before is lost.

When you want to run longer tasks in the future, consider creating a BOINC task for every CMS job. I know the overhead, but you could create a slightly bigger master VM with the most-needed files in it. With BOINC killing a job every 24 hours, volunteers have to accept that, on average, about 10% of the CPU time is wasted.
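The ~10% figure above is consistent with a simple expected-value estimate: whatever job is in flight when the 24-hour task is killed is, on average, about half done, so roughly half a job's worth of CPU time is discarded per task. A back-of-envelope check, using the ~4-hour job length reported in this post (this ignores start-up overhead and assumes the unfinished job is abandoned rather than allowed to complete):

```python
# Rough estimate of CPU time wasted when BOINC kills the task mid-job,
# using numbers from this thread: ~4-hour jobs, 86,400 s task lifetime.
job_length = 4 * 3600        # seconds per job (reported above)
task_lifetime = 86_400       # default BOINC task lifetime in seconds

# On average the job in flight at the cutoff is about half finished,
# so ~half a job's worth of CPU time is thrown away per task.
wasted_fraction = (job_length / 2) / task_lifetime
print(f"expected wasted fraction: {wasted_fraction:.1%}")   # ~8.3%
```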