Message boards : Number crunching : CMS doesn't crunch

Yeti
Joined: 29 May 15
Posts: 158
Credit: 2,914,375
RAC: 2,086
Message 573 - Posted: 14 Aug 2015, 20:07:59 UTC

My ALT/F4 from the Laptop:


ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1152
Credit: 8,310,612
RAC: 0
Message 574 - Posted: 14 Aug 2015, 20:59:41 UTC - in response to Message 573.  

Can't debug/verify from home, unfortunately -- less than 3 Mbps on my broadband at the moment!

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1252
Credit: 995,923
RAC: 45
Message 575 - Posted: 14 Aug 2015, 21:33:51 UTC - in response to Message 572.  

Sorry, but one more:

The VMs seem to differ from host to host.


Interesting.

I noticed that several days ago ALT+F4 and ALT+F5 suddenly didn't show anything where they did before, but thought it was on purpose and not a bug . . .

Yeti
Joined: 29 May 15
Posts: 158
Credit: 2,914,375
RAC: 2,086
Message 576 - Posted: 14 Aug 2015, 21:55:54 UTC - in response to Message 571.  


Maybe you'll find some information in the CMS@home machine logs, accessible by BOINC Manager.
Highlight the CMS-dev task and press the 'Show graphics' button on the left.

I will give this a try when I see the box crunching again

Thanks

Phil
Joined: 9 Apr 15
Posts: 57
Credit: 230,221
RAC: 0
Message 578 - Posted: 14 Aug 2015, 22:52:25 UTC - in response to Message 575.  
Last modified: 14 Aug 2015, 22:56:03 UTC

I noticed that several days ago ALT+F4 and ALT+F5 suddenly didn't show anything where they did before, but thought it was on purpose and not a bug . . .

That (and the http logs) changed when Condor was introduced instead of the previous mechanism.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1252
Credit: 995,923
RAC: 45
Message 582 - Posted: 15 Aug 2015, 7:32:26 UTC - in response to Message 578.  
Last modified: 15 Aug 2015, 7:34:47 UTC

I noticed that several days ago ALT+F4 and ALT+F5 suddenly didn't show anything where they did before, but thought it was on purpose and not a bug . . .

That (and the http logs) changed when Condor was introduced instead of the previous mechanism.

ALT+F4 now shows the contents of the localhost machine log runGlideinout, and
ALT+F5 now shows info about the SCRAM setup and the cmsRun, which is also available in the machine log cmsRun-stdout.log, but the display seems to be ahead of the contents of the log.
It looks like the log file contains the info about the last cmsRun.

Yeti
Joined: 29 May 15
Posts: 158
Credit: 2,914,375
RAC: 2,086
Message 583 - Posted: 15 Aug 2015, 13:44:24 UTC - in response to Message 576.  


Maybe you'll find some information in the CMS@home machine logs, accessible by BOINC Manager.
Highlight the CMS-dev task and press the 'Show graphics' button on the left.

I will give this a try when I see the box crunching again

Thanks

Yeah, I found cmsRun-stdout.log, and this can be seen in it:

Begin processing the 98th record. Run 1, Event 18298, LumiSection 183 at 15-Aug-2015 11:48:08.481 CEST
Begin processing the 99th record. Run 1, Event 18299, LumiSection 183 at 15-Aug-2015 11:48:09.688 CEST
Begin processing the 100th record. Run 1, Event 18300, LumiSection 183 at 15-Aug-2015 11:48:13.858 CEST
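
For anyone who wants to see how long each event takes, the lines above can be read with a few lines of Python. This is only a sketch: the regular expression is inferred from the three sample lines quoted here and is not an official description of the cmsRun log format, the function names are made up for illustration, and the month abbreviation is assumed to be English (as in "Aug").

```python
import re
from datetime import datetime

# Pattern inferred from the sample lines in this post (an assumption, not a spec).
# The trailing time-zone label (e.g. CEST) is captured but not interpreted.
RECORD_RE = re.compile(
    r"Begin processing the (\d+)(?:st|nd|rd|th) record\. "
    r"Run (\d+), Event (\d+), LumiSection (\d+) at "
    r"(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d+) (\w+)"
)


def parse_records(lines):
    """Return (record_number, event_number, timestamp) for every matching line."""
    records = []
    for line in lines:
        match = RECORD_RE.search(line)
        if match is None:
            continue  # skip anything that is not a "Begin processing" line
        stamp = datetime.strptime(match.group(5), "%d-%b-%Y %H:%M:%S.%f")
        records.append((int(match.group(1)), int(match.group(3)), stamp))
    return records


def seconds_between_records(records):
    """Elapsed seconds between consecutive records, rounded to milliseconds."""
    return [
        round((later[2] - earlier[2]).total_seconds(), 3)
        for earlier, later in zip(records, records[1:])
    ]


SAMPLE = [
    "Begin processing the 98th record. Run 1, Event 18298, LumiSection 183 at 15-Aug-2015 11:48:08.481 CEST",
    "Begin processing the 99th record. Run 1, Event 18299, LumiSection 183 at 15-Aug-2015 11:48:09.688 CEST",
    "Begin processing the 100th record. Run 1, Event 18300, LumiSection 183 at 15-Aug-2015 11:48:13.858 CEST",
]

records = parse_records(SAMPLE)
gaps = seconds_between_records(records)
print(gaps)
```

On these three lines the gaps come out at roughly 1.2 and 4.2 seconds, so the time per event clearly varies quite a bit.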

Yeti
Joined: 29 May 15
Posts: 158
Credit: 2,914,375
RAC: 2,086
Message 584 - Posted: 15 Aug 2015, 14:15:05 UTC

Okay, one more:

When I came back to the computers today, the desktop was crunching CMS, but the laptop was sitting there idle.

I played around a bit, but it didn't get back to fetching (or getting?) any work.

As the WU already had 12 hours, I reset the whole project, and now my laptop looks as if it is fetching work again.

Yeti
Joined: 29 May 15
Posts: 158
Credit: 2,914,375
RAC: 2,086
Message 586 - Posted: 15 Aug 2015, 21:27:32 UTC

Now both computers are out of work.

So I guess the work queue is now empty.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1152
Credit: 8,310,612
RAC: 0
Message 589 - Posted: 16 Aug 2015, 8:52:30 UTC - in response to Message 586.  
Last modified: 16 Aug 2015, 9:03:39 UTC

Now both computers are out of work.

So I guess the work queue is now empty.

OK, I'll check.

[Edit] Yes, the well was dry. I've submitted another batch of 500 jobs; these will take 4x longer than the last lot, so hopefully that'll satisfy your hunger for a day or two. I just have to wait a little while until Condor creates all the instances, and then change their characteristics to suit our requirements. Happy crunching! [/Edit]

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 901,716
RAC: 211
Message 590 - Posted: 16 Aug 2015, 9:34:03 UTC - in response to Message 589.  
Last modified: 16 Aug 2015, 9:39:57 UTC

...I've submitted another batch of 500 jobs, these will take 4x longer than the last lot...

Oh, dear. I hope that this doesn't mean that each individual CMS job will take longer. I'm sure that many hosts, like mine, don't run continuously. It seems that CMS jobs aren't saved when the host is shut down. Recent wrapper versions allow VMs to be saved and vLHC ones are, so it can be done, unless they're about to mess things up. So, unless the host runs long enough to complete a CMS job in one go, as it were, it doesn't actually get anywhere.
Is this how it works? If so, are there plans to implement a "VM saving" arrangement? I've a nagging thought that a lot of time (and electricity) is being wasted.

Yeti
Joined: 29 May 15
Posts: 158
Credit: 2,914,375
RAC: 2,086
Message 591 - Posted: 16 Aug 2015, 9:42:54 UTC - in response to Message 590.  

Each job took 8 to 40 minutes in the last batch.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1152
Credit: 8,310,612
RAC: 0
Message 592 - Posted: 16 Aug 2015, 10:44:50 UTC - in response to Message 590.  

...I've submitted another batch of 500 jobs, these will take 4x longer than the last lot...

Oh, dear. I hope that this doesn't mean that each individual CMS job will take longer. I'm sure that many hosts, like mine, don't run continuously. It seems that CMS jobs aren't saved when the host is shut down. Recent wrapper versions allow VMs to be saved and vLHC ones are, so it can be done, unless they're about to mess things up. So, unless the host runs long enough to complete a CMS job in one go, as it were, it doesn't actually get anywhere.
Is this how it works? If so, are there plans to implement a "VM saving" arrangement? I've a nagging thought that a lot of time (and electricity) is being wasted.

Hmm, I must admit I'm not exactly sure about the implications of what you're describing. I'd always thought that task states were saved; if that's changed I wasn't aware of it. On reflection, I'm sure they are; we used to checkpoint regularly, but changed to a "save state when closed" model -- as long as BOINC is explicitly closed rather than the machine shut down peremptorily beneath it.

To re-iterate my best understanding of how we operate: Each participating machine gets one (to be increased later, preferably owner-specified) "task" that runs for ~24 hours. Each task polls for "jobs" (which are what I have recently generated) which run within the task; as each job finishes it reports the results and requests a new job. After the task's lifetime is over, it closes down and if BOINC allows, a new task begins.
Because there is unavoidable overhead in starting up each task, and also in starting/ending each job, it's obviously more efficient to have longer jobs -- the ultimate would be to aim for ~23-hour jobs, but the disparity in host machines makes that impracticable. Some simple subset would probably suffice, say 5 hours for the fastest machine would allow four completed jobs/task, while a slower machine would return three or two; you're never going to achieve 100% efficiency!
I must also admit, I'm hazy about what happens if a job is unfinished when the task wants to terminate; does it just finish, or wait for the job to complete? I need some input from our CERN collaborators here.
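
The task/job arithmetic above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the one-hour start-up overhead is a placeholder and not a measured figure, the 7- and 10-hour job times stand in for hypothetical slower machines, and a job that is still running when the task ends is assumed to be lost. The function names are made up for illustration.

```python
def completed_jobs(task_lifetime_h, startup_overhead_h, job_h, per_job_overhead_h=0.0):
    """Number of whole jobs that fit into one task after the start-up overhead."""
    usable_h = task_lifetime_h - startup_overhead_h
    if usable_h <= 0:
        return 0
    # Only jobs that finish before the task ends are counted.
    return int(usable_h // (job_h + per_job_overhead_h))


def efficiency(task_lifetime_h, startup_overhead_h, job_h, per_job_overhead_h=0.0):
    """Fraction of the task lifetime spent on jobs that actually completed."""
    n = completed_jobs(task_lifetime_h, startup_overhead_h, job_h, per_job_overhead_h)
    return n * job_h / task_lifetime_h


TASK_LIFETIME_H = 24.0     # "~24 hours" per task, from the post above
STARTUP_OVERHEAD_H = 1.0   # assumption, for illustration only

for job_h in (5.0, 7.0, 10.0):  # fastest machine, then two hypothetical slower ones
    n = completed_jobs(TASK_LIFETIME_H, STARTUP_OVERHEAD_H, job_h)
    eff = efficiency(TASK_LIFETIME_H, STARTUP_OVERHEAD_H, job_h)
    print(f"{job_h:4.1f} h jobs: {n} completed, efficiency {eff:.0%}")
```

With these placeholder numbers a 5-hour job gives four completed jobs per task, while 7-hour and 10-hour jobs give three and two, which matches the rough picture described above.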

On a practical note, because some people care about metrics, to the best of my knowledge we allocate credits based on the number of tasks run, not the number of jobs processed. I should note that we are also considering some tangible "prizes" for successful contributors -- more tangible than SETI@Home's toasters! T-shirts, baseball caps, perhaps a visit to CERN itself -- it's still a bit up in the air, but watch out for it!

About now I should add some philosophy -- why run CMS@Home at all? After all, we have a perfectly good WLCG "GRID" to run thousands of jobs on millions of cores around the world. The thing is, these are for well-known phenomena; IIRC there are some jobs that don't get a look-in, e.g. the guidelines don't allow jobs where the expected result is less than one in 10^8 (as someone who has run more Monte Carlo simulations than most of humanity, I quibble with that but, whatever...). So one aim of CMS@Home, as presented to me as an inducement to become its public spokesman, is to allow these kinds of "blue-skies" explorations to be carried out. When we go mainstream, expect the type of jobs to change.

Yeti
Joined: 29 May 15
Posts: 158
Credit: 2,914,375
RAC: 2,086
Message 595 - Posted: 16 Aug 2015, 11:47:58 UTC

Ivan, a very good description, thank you very much for this.

Hmm, I must admit I'm not exactly sure about the implications of what you're describing. I'd always thought that task states were saved; if that's changed I wasn't aware of it. On reflection, I'm sure they are; we used to checkpoint regularly, but changed to a "save state when closed" model -- as long as BOINC is explicitly closed rather than the machine shut down peremptorily beneath it.


I'm sure there are some glitches at the moment, but as it works fine at vLHC it will be no big thing to eliminate them.

On a practical note, because some people care about metrics, to the best of my knowledge we allocate credits based on the number of tasks run, not the number of jobs processed


It would be a milestone if you could change this; at the moment a lot of people only care about their BOINC credits and never take a look at what is really happening inside the VM.

So a lot of people get credit without ever having really crunched a real job from the project. I have seen users at Atlas complaining that they don't get credit; they were sitting behind a firewall and their VMs couldn't reach CERN through it. Atlas can recognize this, but vLHC doesn't (or didn't), and so they get the feeling at vLHC that all is fine.

m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 901,716
RAC: 211
Message 599 - Posted: 16 Aug 2015, 13:10:49 UTC - in response to Message 592.  
Last modified: 16 Aug 2015, 13:21:28 UTC

Thanks, Ivan.

I'd always thought that task states were saved; if that's changed I wasn't aware of it. On reflection, I'm sure they are; we used to checkpoint regularly, but changed to a "save state when closed" model -- as long as BOINC is explicitly closed rather than the machine shut down peremptorily beneath it.

As I understand it, there are three ways to shut things down, apart from simply turning the power off.

1. Shut down the PC (As Windows Start/shutdown/OK or whatever) leaving the OS to close everything in an orderly manner. Probably what most "normal" volunteers would do.

2. Close BOINC from the GUI. (advanced/shut down connected client/...)

3. Use the boinccmd --quit command.

From memory, they all leave the VM in a different state. Maybe it depends on whether the task is still in memory (the "leave tasks in memory" option).

Here, things are normally shut down using boinccmd, run as a Linux cron job or Windows scheduled task at a specific time. The PC is then shut down by another job a few minutes later. I can't readily see what the state of the VM is 'cos everything starts up again when the PC boots, but using the vLHC console (2?) their events resume seamlessly.
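
For what it's worth, the scheduled shutdown described above could be wrapped in a small Python script. This is a minimal sketch and not the actual job used here: it assumes boinccmd is on the PATH and can reach the local client without extra arguments, and the function names are made up for illustration. The only boinccmd option used is --quit.

```python
import subprocess


def quit_command(boinccmd="boinccmd"):
    """Build the argument list that asks the local BOINC client to exit."""
    return [boinccmd, "--quit"]


def shut_down_client(run=subprocess.run, boinccmd="boinccmd"):
    """Ask the client to quit and report whether the command exited with status 0.

    `run` is injectable so the function can be tried out on a machine
    where BOINC is not installed.
    """
    result = run(quit_command(boinccmd), check=False)
    return result.returncode == 0


# Intended use from a cron job or Windows scheduled task (not executed here):
#     ok = shut_down_client()
# The PC itself would then be shut down by a separate job a few minutes later.
```

Whether the VM ends up saved or merely stopped after this is exactly the open question in this thread, so the script says nothing about that.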

There doesn't now seem to be an equivalent real time display from CMS, but from what I've seen it appears that CMS goes through the start-up process, or a lot of it, again. I could easily be wrong, most of what scrolls by doesn't make as much sense as I would like, and I, for one, would really like a "real time" display of job progress, as with vLHC. In any case, if jobs run for less than an hour, it's not really a problem.



I must also admit, I'm hazy about what happens if a job is unfinished when the task wants to terminate; does it just finish, or wait for the job to complete? I need some input from our CERN collaborators here.


Probably abandoned, this is what vLHC does.

On a practical note, because some people care about metrics, to the best of my knowledge we allocate credits based on the number of tasks run, not the number of jobs processed.

That seems OK, after all we make resources available, if the project chooses to waste them by taking an hour to start work or abandoning work in progress at some arbitrary point, that shouldn't be the volunteers' problem. If this was changed, say to jobs successfully completed, then credit hunting volunteers would switch to projects where all their resources were translated into credit, wouldn't they? That would then bring pressure on the project to make better use of donated resources... maybe.


About now I should add some philosophy -- why run CMS@Home at all? After all, we have a perfectly good WLCG "GRID" to run thousands of jobs on millions of cores around the world. The thing is, these are for well-known phenomena; IIRC there are some jobs that don't get a look-in, e.g. the guidelines don't allow jobs where the expected result is less than one in 10^8 (as someone who has run more Monte Carlo simulations than most of humanity, I quibble with that but, whatever...). So one aim of CMS@Home, as presented to me as an inducement to become its public spokesman, is to allow these kinds of "blue-skies" explorations to be carried out. When we go mainstream, expect the type of jobs to change.


Is that question for us volunteers or for you, the users of the "product"? For us, there are probably nearly as many reasons and combinations as volunteers.
Once someone tried to find this out from the, then, T4T crunchers; don't know if there was a result.

Now, where have those sheep gone?

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1252
Credit: 995,923
RAC: 45
Message 600 - Posted: 16 Aug 2015, 13:58:58 UTC - in response to Message 589.  

Yes, the well was dry. I've submitted another batch of 500 jobs, these will take 4x longer than the last lot...

The first of those longer jobs that I got ran for 4 hours minus 7 minutes.
I had to extend BOINC's task lifetime to 90,857 seconds instead of the default 86,400 seconds (24 hours), because otherwise the job would have been killed by BOINC ending its task.

To your previous post:
When a task is suspended (leave application in memory off) or BOINC is stopped, the Virtual Machine is saved.
However, I have already seen several times that the state of the CMS VM was not saved, but stopped. Maybe saving the >3 GB VM takes too long and BOINC is not patient enough.
Resuming such a task means a reboot of the VM, and all work done before is lost.
If you want to run longer tasks in the future, consider creating a BOINC task for every CMS job.
I know the overhead, but you could create a slightly bigger master VM with the most needed files in it.
With BOINC killing a job every 24 hours, volunteers have to accept that on average about 10% of the CPU time is wasted.
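
A quick back-of-the-envelope check of the numbers in this post, as a Python sketch. It assumes that the moment BOINC ends the task falls uniformly at random within a running job, so that on average half a job is lost; that is a simplification made for illustration and not a measurement, and the function name is made up.

```python
def expected_waste_fraction(job_s, task_lifetime_s):
    """Expected share of a task lost when the final job is killed part-way.

    Assumes the kill falls uniformly within a job, so half a job is lost on average.
    """
    return (job_s / 2) / task_lifetime_s


JOB_S = (4 * 60 - 7) * 60      # "4 hours minus 7 minutes" in seconds
DEFAULT_LIFETIME_S = 86400     # default task lifetime (24 hours)
EXTENDED_LIFETIME_S = 90857    # lifetime after the manual extension

extension_s = EXTENDED_LIFETIME_S - DEFAULT_LIFETIME_S
waste = expected_waste_fraction(JOB_S, DEFAULT_LIFETIME_S)

print(f"extension: {extension_s} s (about {extension_s / 60:.0f} minutes)")
print(f"expected waste from the killed job alone: {waste:.1%}")
```

Under that assumption the killed job alone accounts for roughly 8% of a 24-hour task; once the VM start-up time is added, a figure of about 10% looks plausible.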

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1152
Credit: 8,310,612
RAC: 0
Message 608 - Posted: 17 Aug 2015, 8:43:28 UTC - in response to Message 600.  

CP, I take your point. I'll make jobs a bit shorter in future.