Message boards : Number crunching : Issues running jobs
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,946,836
RAC: 2,933
Message 615 - Posted: 17 Aug 2015, 12:00:13 UTC

OK, starting a new thread for people to report issues with running, and hopefully get feedback from the crew at the coalface as to what might be going on.

Currently, the Server Status page reports 100 active tasks; on the Condor machine I see 49 of "my" jobs running, so there must be about fifty tasks that have not picked up a job to run. If your task is one of them*, please feel free to report here and ask for help.

* To check, bring up the VM interface (you need the extension pack matching your VirtualBox version installed) and switch to the "top" display with ALT+F3. You can de-clutter the display considerably by pressing "u" (for user) and then typing "boinc" + carriage return, so that only boinc's processes are shown. If all is well, about 20 minutes after the task starts you should see a process called cmsRun near the top of the display, and after a little while it should be showing close to 100% CPU usage. It will run for some time (the current jobs are a little over-ambitious and may take several hours) before stopping to report results and download a new job, at which point you should see cmsRun working again. This cycle repeats until the task times out at around 24 hours.
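
For anyone who would rather script this check than watch the console, here is a minimal sketch in Python. It is not part of the project's tooling. It assumes you can capture one batch-mode snapshot of top from inside the VM (for example with `top -b -n 1 -u boinc`) and that the snapshot has the usual procps header row containing PID, %CPU and COMMAND columns. The 90% threshold and the sample text are made up for illustration.

```python
from typing import List, Tuple

# Made-up snapshot, shaped like `top -b -n 1 -u boinc` output (illustration only).
SAMPLE = """\
top - 12:00:13 up 25 min,  1 user,  load average: 1.02, 0.90, 0.55
Tasks:  95 total,   2 running,  93 sleeping,   0 stopped,   0 zombie

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 4321 boinc     20   0 1200m 700m  60m R 98.3 35.1  11:20.45 cmsRun
 4000 boinc     20   0  100m  10m   4m S  0.3  0.5   0:01.10 condor_startd
"""


def find_cmsrun(top_output: str, min_cpu: float = 90.0) -> List[Tuple[int, float, bool]]:
    """Return (pid, cpu_percent, is_busy) for each cmsRun row in a top snapshot.

    is_busy is True when the CPU percentage is at least min_cpu
    (an arbitrary threshold chosen for this sketch).
    """
    rows = []
    cpu_idx = None
    cmd_idx = None
    for line in top_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Locate the column header so we do not hard-code column positions.
        if fields[0] == "PID" and "%CPU" in fields and "COMMAND" in fields:
            cpu_idx = fields.index("%CPU")
            cmd_idx = fields.index("COMMAND")
            continue
        # Skip summary lines that come before the header, and short rows.
        if cpu_idx is None or len(fields) <= cmd_idx:
            continue
        if fields[cmd_idx] != "cmsRun":
            continue
        cpu = float(fields[cpu_idx].replace(",", "."))  # tolerate comma decimals
        rows.append((int(fields[0]), cpu, cpu >= min_cpu))
    return rows
```

An empty result means no cmsRun process was found in the snapshot, which is the situation worth reporting here if the task has been up for well over 20 minutes.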
PDW

Joined: 20 May 15
Posts: 217
Credit: 5,876,910
RAC: 16,233
Message 618 - Posted: 17 Aug 2015, 12:41:26 UTC - in response to Message 615.  

I can see the one in front of me (on my laptop) is running fine (>95% CPU usage). Do you need me to check all my machines, or only to report when I think there is a problem?
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 619 - Posted: 17 Aug 2015, 13:01:30 UTC - in response to Message 615.  

If your task is one of them*, please feel free to report here and ask for help.

Okay, I need help with this: http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=67

As long as this isn't fixed I cannot use my laptop, and I'm afraid I might lose a lot of crunching power by suspending tasks (my network is configured to suspend when normal tasks ask for CPU power).
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,116
Message 620 - Posted: 17 Aug 2015, 13:33:13 UTC - in response to Message 615.  

Currently, the Server Status page reports 100 active tasks; on the Condor machine I see 49 of "my" jobs running, so there must be about fifty tasks that have not picked up a job to run. If your task is one of them*, please feel free to report here and ask for help.

I suppose the 100 'active' tasks you mention come from the CMS-dev Server Status page (102 in progress at the moment).
Those 102 are not 'active' in the sense that they have already started on one of the BOINC clients.
Some may not have started because of a lower resource share compared to other BOINC projects, or because a user has suspended the task.
In any case, the client has a whole week to finish the 24-hour BOINC task.

And of course there are users who are not babysitting their machines and are not aware that a started CMS task isn't doing real work: it is not using CPU and the VM does not request or receive CMS jobs.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 621 - Posted: 17 Aug 2015, 13:37:20 UTC

I have 3 VBoxHeadless.exe processes running in Windows. One is close to 100% (of one core) utilization; the other 2 have zero run-time. The task has been running for 20h or so without any interruption.
This is no bug as such, but why are these extra 2 processes there?

Is there a way to modify/set the amount of RAM the VirtualBox VM uses? (I tried, but it is locked.)
Would a higher amount improve performance for the project?
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 622 - Posted: 17 Aug 2015, 13:46:36 UTC - in response to Message 621.  

I have 3 VBoxHeadless.exe processes running in Windows. One is close to 100% (of one core) utilization; the other 2 have zero run-time. The task has been running for 20h or so without any interruption.
This is no bug as such, but why are these extra 2 processes there?

I have 5x vBoxHeadless and I run 1x CMS and 4x Atlas

So, I guess the 2 sleeping vBoxHeadless are from other, suspended projects
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 624 - Posted: 17 Aug 2015, 14:11:50 UTC - in response to Message 622.  
Last modified: 17 Aug 2015, 14:12:52 UTC

I do not run any other projects that use VirtualBox (nor have I for the past few months).
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,116
Message 632 - Posted: 17 Aug 2015, 14:56:20 UTC - in response to Message 621.  

I have 3 VBoxHeadless.exe processes running in Windows. One is close to 100% (of one core) utilization; the other 2 have zero run-time. The task has been running for 20h or so without any interruption.
This is no bug as such, but why are these extra 2 processes there?

Is there a way to modify/set the amount of RAM the VirtualBox VM uses? (I tried, but it is locked.)
Would a higher amount improve performance for the project?

That's how Oracle VirtualBox has set up the running of virtual machines.
One VM uses 3 processes. The third (child) process does the real work and uses the most CPU.

It has to do with running processes in a more secure sandbox.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 633 - Posted: 17 Aug 2015, 15:05:12 UTC - in response to Message 632.  

Thanks for the info.
The two other instances have ZERO run-time, so they are doing absolutely nothing.
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,589
RAC: 87
Message 662 - Posted: 18 Aug 2015, 21:40:47 UTC - in response to Message 619.  

Yeti, solving the suspend/resume issue is high on our priority list.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,116
Message 667 - Posted: 19 Aug 2015, 6:51:43 UTC - in response to Message 615.  

The current cmsRun in the VM has already been running for 11 hours and 20 minutes and is consuming the 'normal' ~98% of the CPU.

I don't think that's right.

There is no output on the ALT+F4 and ALT+F5 screens, and the cmsRun-stdout.log in the Logs is from the very first finished job.
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 668 - Posted: 19 Aug 2015, 7:21:31 UTC - in response to Message 667.  

The current cmsRun in the VM has already been running for 11 hours and 20 minutes and is consuming the 'normal' ~98% of the CPU.

I don't think that's right.

There is no output on the ALT+F4 and ALT+F5 screens, and the cmsRun-stdout.log in the Logs is from the very first finished job.

This is normal behaviour on all my clients; the only way to see whether it is crunching is to switch to the ALT+F3 screen and look for cmsRun.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,116
Message 669 - Posted: 19 Aug 2015, 7:43:39 UTC - in response to Message 668.  

The current cmsRun in the VM has already been running for 11 hours and 20 minutes and is consuming the 'normal' ~98% of the CPU.

I don't think that's right.

There is no output on the ALT+F4 and ALT+F5 screens, and the cmsRun-stdout.log in the Logs is from the very first finished job.

This is normal behaviour on all my clients; the only way to see whether it is crunching is to switch to the ALT+F3 screen and look for cmsRun.

I think you misunderstood or did not read carefully. I have now been looking at the same cmsRun for 12 hours and 12 minutes.
Two days ago I had long-running jobs lasting about 3-4 hours; after ivan killed those longer ones, the jobs on my system normally lasted about 35 minutes (the equivalent of 200 records).
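
To put those numbers side by side, here is a small back-of-the-envelope sketch in Python. It uses only the figures quoted in this post (35 minutes for 200 records, and 12 hours 12 minutes for the stuck cmsRun); the function names are illustrative and not taken from any project tool.

```python
def seconds_per_record(job_minutes: float, records: int) -> float:
    """Average processing time per record, in seconds."""
    return job_minutes * 60.0 / records


def slowdown(observed_minutes: float, typical_minutes: float) -> float:
    """How many times longer than a typical job the observed job has run."""
    return observed_minutes / typical_minutes


TYPICAL_MINUTES = 35            # typical job length reported above
RECORDS = 200                   # records per typical job
OBSERVED_MINUTES = 12 * 60 + 12 # the cmsRun that is still running

per_record = seconds_per_record(TYPICAL_MINUTES, RECORDS)  # 10.5 s per record
factor = slowdown(OBSERVED_MINUTES, TYPICAL_MINUTES)       # roughly 21x
```

At about 10.5 seconds per record, a run that has lasted roughly 21 times the usual job length is well outside what a normal 200-record job would need.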
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 670 - Posted: 19 Aug 2015, 8:20:45 UTC - in response to Message 669.  

I think you misunderstood or did not read carefully. I have now been looking at the same cmsRun for 12 hours and 12 minutes.

Indeed I didn't realize this, sorry
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,946,836
RAC: 2,933
Message 671 - Posted: 19 Aug 2015, 8:27:00 UTC - in response to Message 669.  

The current cmsRun in the VM has already been running for 11 hours and 20 minutes and is consuming the 'normal' ~98% of the CPU.

I don't think that's right.

There is no output on the ALT+F4 and ALT+F5 screens, and the cmsRun-stdout.log in the Logs is from the very first finished job.

This is normal behaviour on all my clients; the only way to see whether it is crunching is to switch to the ALT+F3 screen and look for cmsRun.

I think you misunderstood or did not read carefully. I have now been looking at the same cmsRun for 12 hours and 12 minutes.
Two days ago I had long-running jobs lasting about 3-4 hours; after ivan killed those longer ones, the jobs on my system normally lasted about 35 minutes (the equivalent of 200 records).

I do see a job in the Condor system that has 0+12:45:50 runtime; nothing else is over 3h, so it does indeed look like something has gone amiss. Probably best to abort it so your CPU goes to better use.
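
The runtime above is in Condor's days+HH:MM:SS notation. As a minimal sketch of the comparison made here (flagging anything that has run longer than 3 hours), the following Python parses that notation and lists the outliers. The job labels in the usage note are hypothetical, and the 3-hour limit is simply the figure mentioned in this post, not a project setting.

```python
import re
from datetime import timedelta
from typing import Dict, List

# Matches runtimes such as "0+12:45:50" (days+HH:MM:SS).
_RUNTIME = re.compile(r"^(\d+)\+(\d{2}):(\d{2}):(\d{2})$")


def parse_runtime(text: str) -> timedelta:
    """Convert a days+HH:MM:SS runtime string into a timedelta."""
    match = _RUNTIME.match(text.strip())
    if match is None:
        raise ValueError(f"not a days+HH:MM:SS runtime: {text!r}")
    days, hours, minutes, seconds = (int(g) for g in match.groups())
    return timedelta(days=days, hours=hours, minutes=minutes, seconds=seconds)


def overlong(runtimes: Dict[str, str],
             limit: timedelta = timedelta(hours=3)) -> List[str]:
    """Return the labels of jobs whose runtime exceeds `limit`, sorted by label."""
    return sorted(job for job, rt in runtimes.items() if parse_runtime(rt) > limit)
```

For example, `overlong({"job-a": "0+12:45:50", "job-b": "0+02:10:00"})` returns `["job-a"]`, where `job-a` and `job-b` are placeholder labels rather than real job IDs.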
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,116
Message 673 - Posted: 19 Aug 2015, 8:39:10 UTC - in response to Message 671.  

I do see a job in the Condor system that has 0+12:45:50 runtime; nothing else is over 3h, so it does indeed look like something has gone amiss. Probably best to abort it so your CPU goes to better use.

Ok, thanks. Will reboot the VM.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,946,836
RAC: 2,933
Message 675 - Posted: 19 Aug 2015, 10:17:04 UTC - in response to Message 667.  
Last modified: 19 Aug 2015, 13:21:24 UTC

The current cmsRun in the VM has already been running for 11 hours and 20 minutes and is consuming the 'normal' ~98% of the CPU.

I don't think that's right.

There is no output on the ALT+F4 and ALT+F5 screens, and the cmsRun-stdout.log in the Logs is from the very first finished job.

I've just come across one like that too. When the first job was finishing there was nothing on tty4 and tty5, even though I was watching the stdout file growing as events were processed; when the second job started, the cmsRun-stdout.log didn't get overwritten. There was also a failure in the stage-out, but I'll go straight to Laurence with that.

[Edit] Firstly, there was no stage-out failure; my browser hadn't refreshed...
Secondly, the later jobs, which don't appear on the console or refresh the browser display, appear to continue to run. I tracked one within the VM and when it finished the output file appeared on the stage-out server. (Currently we have 85.5% successful completion of the first 1000 jobs of the present batch, and 91% for the second half.) [/Edit]

