Message boards : Number crunching : CMS doesn't crunch
Author | Message |
---|---|
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
Hi, as already mentioned in the news thread: I checked around and got the feeling that CMS should work and do something useful by now? I played around a bit: I cancelled the running job after 6 hours (it was still doing nothing at all), aborted the task and fetched a new WU. While the new VM was booting, I took the following snapshot: [screenshot] I don't know if this really matters, but so far (12 minutes in) the CMS VM still doesn't seem to be crunching anything. |
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
By the way, I'm behind a strict firewall, and you have set up a new configuration/technique to use. Does it work within the ports and IPs listed at http://lhcathome.web.cern.ch/faq ? |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
Are you getting tasks since I submitted more jobs? The screenshot you posted is familiar to me; it does continue after a while. There are lots of delays in the startup processes, mainly to do with downloading files, filling the cvmfs cache, etc. When you start throwing in time-out delays before switching to secondary servers, etc., it is a bit glacial at times. |
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
Just checked, but it looks as if nothing is crunching. I have to stop for now; we can continue tomorrow. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
Well, something seems to be running Condor jobs, but just a few of them so far. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
[quote]Well, something seems to be running Condor jobs, but just a few of them so far.[/quote] OK, they've been running overnight: [image] At the moment, I think there are only 45 Condor slots available; also, each job is retried twice after failure. For a more detailed analysis, here's a snapshot from CMS Dashboard (.pdf). |
Joined: 9 Apr 15 Posts: 57 Credit: 230,221 RAC: 0 |
[quote]Well, something seems to be running Condor jobs, but just a few of them so far.[/quote] I've got 2 machines today (Thursday) running cmsRun jobs. It took about 15 minutes from boot-up, and they are now doing jobs that take 30-50 minutes. |
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
Okay, so far I'm out of ideas :-( I have updated my VB to 4.3.30, aborted the old CMS-dev WU and downloaded a new one, but it is still the same. I gave it 30 to 40 minutes to start crunching, but nothing happens. I have opened a big hole in my firewall to check whether the firewall is blocking, but nothing helps :-( Are there still jobs in the queue? On which screen (ALT/F?) should I see a crunching WU? ALT/F1 says: Starting vmcontext_epilog ... bootlogd: no process killed. ALT/F2 says: sh ... CMSJobAgent.sh ... python ... wget. ALT/F3 says: normal task overview. ALT/F4 and ALT/F5: blank screen. ALT/F6 says: welcome to CERN Virtual Machine Version 3.3.0.20 ... localhost Login. ALT/F7 to ALT/F9: nothing. ALT/F10 says: welcome to CERN Virtual Machine Version 3.3.0.20 ... Instance pairing pin: |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
[quote]Okay, so far I'm out of ideas :-([/quote] Last time I looked there were 999 or so jobs in the queue. However, we only have about 45 slots to serve them out and there are about 90 machines asking for jobs... [quote]On which screen (ALT/F?) should I see a crunching WU?[/quote] That would usually be top; if you type 'u' then 'boinc [...] [quote]ALT/F6 says: welcome to CERN Virtual Machine Version 3.3.0.20[/quote] You can log in there if you can suss the password... :-) |
Joined: 9 Apr 15 Posts: 57 Credit: 230,221 RAC: 0 |
[quote]However, we only have about 45 slots to serve them out and there are about 90 machines asking for jobs...[/quote] My PS display shows either nothing or glidin_startup with "sleep" for the past few hours, and a load average of 0.00, 0.00, 0.01. [edit]Wow, as I type it's found another job; did you kick something?[/edit] |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
[quote][edit]Wow, as I type it's found another job; did you kick something?[/edit][/quote] No, you probably got lucky and struck a ready slot. I'll ask Andrew tomorrow if we can increase them. As I'm sure you know, we're really still in an alpha, or at least pre-beta, stage, and didn't expect this level of interest before we officially went public. I am pleased at the progress the development team has made lately, though. I have a meeting in four weeks where I'd like to present significant results from CMS@Home compared to normal GRID jobs. Hopefully, summer holidays permitting, we might make that milestone. We must also look at the "Server Status" page -- the job backlog there bears no resemblance to what I believe is the actual situation. Historical baggage not updated, perhaps? |
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
Okay, on my laptop I installed VB 4.3.28, and from home it worked immediately: I could crunch my first real CMS WU. On my desktop in the office, I reset the CMS project, but that didn't help. So I can narrow it down to two possible causes: VirtualBox 4.3.30 or the firewall. Tomorrow I will first try VB 4.3.28. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
[quote]you probably got lucky and struck a ready slot. I'll ask Andrew tomorrow if we can increase them. As I'm sure you know, we're really still in an alpha, or at least pre-beta, stage, and didn't expect this level of interest before we officially went public.[/quote] Looks like I'd misunderstood -- the number of job slots isn't fixed, so since running jobs < active tasks (according to ServerStatus) there must be other reasons for tasks to sit idle. :-( |
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
Okay, I think I finally got it! Take a guess: was it VB 4.3.30 or the firewall? Right, it was the firewall: a range of IPs and ports that are not listed in the official network statement from CERN. I will start a separate thread about this. I had to open several ports and IPs in the firewall, but now it looks as if both boxes are doing fine. Now I need an admin from CMS who can check whether my boxes really are doing fine now and whether you get back everything you need from them. Thanks in advance. By the way, it would be nice to get a link where we can check this ourselves, like "MCPLOTS Stats" from vLHC. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
[quote]Now I need an admin from CMS who can check whether my boxes really are doing fine now and whether you get back everything you need from them.[/quote] We're working on that; I'm told it will be "real soon now". [quote]By the way, it would be nice to get a link where we can check this ourselves, like "MCPLOTS Stats" from vLHC.[/quote] I guess we'll work on that too, but let's take things one step at a time. |
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
[quote][quote]Now I need an admin from CMS who can check whether my boxes really are doing fine now and whether you get back everything you need from them.[/quote] We're working on that; I'm told it will be "real soon now".[/quote] Could you check this once for me? I would like to make sure that I have found all the needed ports and IPs. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
[quote][quote]Now I need an admin from CMS who can check whether my boxes really are doing fine now and whether you get back everything you need from them.[/quote] We're working on that; I'm told it will be "real soon now".[/quote] The only way I know (at the moment) is to wait for your jobs to finish and look at the stderr; Laurence may know some magic incantations to identify jobs-in-progress. If your "top" window is showing cmsRun at ~100% CPU for 15 minutes or so (perhaps more depending on what jobs are in the queue) and then nothing for ~10 minutes, then running again, you are at least getting jobs and running them successfully. |
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
Okay, it seems we are out of jobs now. Both machines are sitting here idle. One more for the to-do list: I suspended BOINC on the laptop, shut it down and took it home with me again. At home I switched it on and resumed BOINC, but the VM seemed to have gone into limbo: all screens were black or did not react to ALT/Fx. I started it with a head (not headless), then sent CTRL-ALT-DEL and "Switch off via APC"; I don't know which command did it, but the VM shut itself down. Afterwards I could use it as normal in BOINC. |
Joined: 13 Feb 15 Posts: 1188 Credit: 877,474 RAC: 68 |
[quote][quote]Now I need an admin from CMS who can check whether my boxes really are doing fine now and whether you get back everything you need from them.[/quote] We're working on that; I'm told it will be "real soon now".[/quote] Maybe you'll find some information in the CMS@home machine logs, accessible from BOINC Manager: highlight the CMS-dev task and press the 'Show graphics' button on the left. |
Joined: 29 May 15 Posts: 147 Credit: 2,842,484 RAC: 0 |
Sorry, but one more thing: the VMs seem to differ from host to host. My laptop says on ALT/F1: starting VM_context Epilog ... bootlogd: no process killed, started CMS Job Agent. My desktop says only this on ALT/F1: starting VM_context Epilog ... bootlogd: no process killed. My laptop shows a full-screen log on ALT/F4, but my desktop shows nothing on ALT/F4. |
©2025 CERN