1) Message boards : Web site : Frequent YOTD (Message 5247)
Posted 13 Nov 2017 by ritterm
Post:
Is there a particular reason that I'm named YOTD on what seems like almost a weekly basis? Not complaining, just curious... :-)
2) Message boards : CMS Application : Dip? (Message 4563)
Posted 21 Dec 2016 by ritterm
Post:
...looks like the server at RAL is down.

Again, perhaps? Over at LHC@Home, several recent CMS jobs failing with similar output:

2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] HTCondor ping
2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] 1
2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] 12/20/16 22:59:37 recognized DC_NOP as command name, using command 60011.
2016-12-20 22:59:37 (29181): Guest Log: 12/20/16 22:59:37 attempt to connect to <130.246.180.120:9623> failed: Connection refused (connect errno = 111).
2016-12-20 22:59:37 (29181): Guest Log: ERROR: failed to make connection to <130.246.180.120:9623>
2016-12-20 22:59:37 (29181): Guest Log: [ERROR] Could not ping HTCondor.
2016-12-20 22:59:37 (29181): Guest Log: [INFO] Shutting Down.
2016-12-20 22:59:37 (29181): VM Completion File Detected.
2016-12-20 22:59:37 (29181): VM Completion Message: Could not ping HTCondor.
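
For anyone wanting to check their own task output for the same failure, here is a minimal sketch in Python. It assumes the output has been saved as plain text in the same format as the lines above; the function name `find_connect_failures` and the regular expression are illustrative, not part of BOINC or HTCondor.

```python
import re

# Sample lines copied from the task output above.
LOG = """\
2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] HTCondor ping
2016-12-20 22:59:37 (29181): Guest Log: 12/20/16 22:59:37 attempt to connect to <130.246.180.120:9623> failed: Connection refused (connect errno = 111).
2016-12-20 22:59:37 (29181): Guest Log: [ERROR] Could not ping HTCondor.
"""

# Matches "attempt to connect to <host:port> failed: reason (connect errno = N)".
CONNECT_FAILED = re.compile(
    r"attempt to connect to <(?P<host>[\d.]+):(?P<port>\d+)> failed: "
    r"(?P<reason>.+?) \(connect errno = (?P<errno>\d+)\)"
)


def find_connect_failures(text):
    """Return (host, port, reason, errno) for each failed connection in the log."""
    failures = []
    for line in text.splitlines():
        match = CONNECT_FAILED.search(line)
        if match:
            failures.append(
                (match["host"], int(match["port"]), match["reason"], int(match["errno"]))
            )
    return failures


failures = find_connect_failures(LOG)
for host, port, reason, errno in failures:
    print(f"{host}:{port} -> {reason} (errno {errno})")
```

If every failing task reports the same host and port, that points at the server end rather than at the volunteer's machine.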
3) Message boards : ALICE Application : The ALICE Application (Message 3919)
Posted 1 Aug 2016 by ritterm
Post:
Just reporting on a couple of ALICE jobs that I've finished recently:

Task 230811 errored out apparently due to no work available [207 (0x000000CF) EXIT_NO_SUB_TASKS]

Task 230951 completed and validated with the following output:

2016-08-01 10:28:01 (620): Guest Log: [INFO] ALICE application starting. Check log files.
2016-08-01 10:29:56 (620): Guest Log: [INFO] New Job Starting in =slot1
2016-08-01 10:29:56 (620): Guest Log: [INFO] Condor JobID: 1391327 in =slot1
2016-08-01 10:44:40 (620): Guest Log: [INFO] Job finished in =slot1 with unknown exit code.
2016-08-01 10:50:01 (620): VM Completion File Detected.
2016-08-01 10:50:01 (620): VM Completion Message: Condor exited with return value N/A.

Is that output evidence of a "real" task (either test or otherwise)?
4) Message boards : ALICE Application : The ALICE Application (Message 3704)
Posted 14 Jul 2016 by ritterm
Post:
See the second post in this thread.

Oh, yes, I understand. Given the server status information (which, I admit, I may not fully understand), however, I just wanted to make sure I hadn't missed anything. I've run a few of the work-free tasks and look forward to crunching more when it's ready to go.
5) Message boards : ALICE Application : The ALICE Application (Message 3699)
Posted 14 Jul 2016 by ritterm
Post:
The server status page shows ALICE jobs running as long as 18 hours. Is there "real work" out there now?
6) Message boards : Number crunching : Host limited to only one task (Message 3693)
Posted 14 Jul 2016 by ritterm
Post:
We have put it back to what it was (5).

Okay, great. I see now that my cache is full up. Glad I wasn't losing my mind! :-) Thanks!
7) Message boards : Number crunching : Host limited to only one task (Message 3690)
Posted 14 Jul 2016 by ritterm
Post:
I have Computer 307 (Win7-64, AMD 8150, 16GB RAM) running vLHC-dev. I use an app_config file to limit total tasks to 2 and to run 2 of any app. Things had been going along fine for a few weeks where I'd be running 2 tasks and have a few in the queue. Sometime yesterday I noticed it was only running one vLHC-dev task, and when requesting new tasks it is getting:

1243 vLHCathome-dev 7/14/2016 9:35:08 AM Sending scheduler request: To fetch work.
1244 vLHCathome-dev 7/14/2016 9:35:08 AM Requesting new tasks for CPU
1245 vLHCathome-dev 7/14/2016 9:35:10 AM Scheduler request completed: got 0 new tasks
1246 vLHCathome-dev 7/14/2016 9:35:10 AM No tasks sent
1247 vLHCathome-dev 7/14/2016 9:35:10 AM No tasks are available for CMS Simulation
1248 vLHCathome-dev 7/14/2016 9:35:10 AM No tasks are available for LHCb Simulation
1249 vLHCathome-dev 7/14/2016 9:35:10 AM No tasks are available for Theory Simulation
1250 vLHCathome-dev 7/14/2016 9:35:10 AM No tasks are available for ALICE Simulation
1251 vLHCathome-dev 7/14/2016 9:35:10 AM This computer has reached a limit on tasks in progress

It looks to me like there are the usual number of available tasks and I don't see anything recent in the news and other forum posts that might be related to my experience. My project preferences are set to run all simulations except ATLAS. I've checked my app_config file, had the BOINC client re-read it (no syntax errors detected), and stopped/started the client a couple of times and seen no change. I'm afraid I'm missing something really simple.

Thanks,

MarkR
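
A rough way to tell "no work available" apart from "task limit reached" in an Event Log excerpt like the one above is sketched below in Python. It assumes the log was copied out of the BOINC Manager as plain text in the same column layout; the `summarize` helper and the `LINE` pattern are made up for illustration.

```python
import re

# A few lines copied from the Event Log excerpt above.
EVENT_LOG = """\
1245 vLHCathome-dev 7/14/2016 9:35:10 AM Scheduler request completed: got 0 new tasks
1246 vLHCathome-dev 7/14/2016 9:35:10 AM No tasks sent
1247 vLHCathome-dev 7/14/2016 9:35:10 AM No tasks are available for CMS Simulation
1251 vLHCathome-dev 7/14/2016 9:35:10 AM This computer has reached a limit on tasks in progress
"""

# Columns: sequence number, project name, date, time, AM/PM, message text.
LINE = re.compile(
    r"^(?P<seq>\d+) (?P<project>\S+) "
    r"(?P<stamp>\d+/\d+/\d+ \d+:\d+:\d+ [AP]M) (?P<msg>.*)$"
)

NO_WORK_PREFIX = "No tasks are available for "
LIMIT_TEXT = "reached a limit on tasks in progress"


def summarize(text):
    """Return (apps reporting no work, whether the in-progress limit was hit)."""
    apps_without_work = []
    limit_reached = False
    for raw in text.splitlines():
        match = LINE.match(raw)
        if not match:
            continue  # skip anything that is not an Event Log line
        msg = match["msg"]
        if msg.startswith(NO_WORK_PREFIX):
            apps_without_work.append(msg[len(NO_WORK_PREFIX):])
        elif LIMIT_TEXT in msg:
            limit_reached = True
    return apps_without_work, limit_reached


apps, limit_reached = summarize(EVENT_LOG)
print("Apps reporting no work:", apps)
print("In-progress limit reached:", limit_reached)
```

When the limit flag is set, the "No tasks are available" lines probably say less about the work supply than about the cap on tasks in progress.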
8) Message boards : News : Some jobs again (Message 859)
Posted 27 Aug 2015 by ritterm
Post:
If you have a spare CPU to keep a VM task running, fine...

Why, yes, I do. In fact, I have 4! :-)
9) Message boards : Number crunching : exceeded disk limit (Message 310)
Posted 4 May 2015 by ritterm
Post:
Based on posts in other project forums (MilkyWay, Einstein, WCG), this appears to be getting some attention from the BOINC developers. From David Anderson:

I looked at this and couldn't immediately see the problem.
The BOINC client deletes everything in a slot directory before using it for a new job.
If a deletion fails (e.g. because a file is in use by another app) it doesn't use
that slot directory.
I verified this by opening some Word docs in slot directories.

Notes:

* There's a "slot_debug" log flag for messages related to slot directories.
Unfortunately it doesn't print messages about failed file deletions; I'll add this.
* The "disk limit exceeded" errors refer to the per-job disk limit, not the user's
disk usage preferences; I'll change the message to clarify this.
* Apps aren't responsible for cleaning out their slot dirs; BOINC does this. It
may be that BOINC is failing to delete VM images because they're still in use by
the VirtualBox executive.

Bottom line: I'll need some more info to debug this.
If anyone is seeing this reproducibly, let me know.
Otherwise we'll release a client with more debugging output to help us investigate.

-- David
10) Message boards : Number crunching : exceeded disk limit (Message 309)
Posted 3 May 2015 by ritterm
Post:
CMS-dev workunits like to leave their disk images behind... Why this causes other projects to get that exceeded disk space error even if BOINC has lots of available space left I don't know, but it's very annoying...

Very annoying, indeed. I had this problem on two different hosts and trying to figure out what was going on was driving me crazy until someone pointed me to this thread.
11) Message boards : Number crunching : Tasks Stalled at 100% (Message 305)
Posted 29 Apr 2015 by ritterm
Post:
I've updated to VB 4.3.26 and have a new task running on the problem host...

...And it's now completed two tasks with no problems. Problem solved...hopefully!
12) Message boards : Number crunching : Tasks Stalled at 100% (Message 303)
Posted 27 Apr 2015 by ritterm
Post:
Maybe when you get the chance you should update your old version of VB...

...You are still running the old VirtualBox Version: 4.3.12

I've updated to VB 4.3.26 and have a new task running on the problem host.
13) Message boards : Number crunching : Tasks Stalled at 100% (Message 294)
Posted 25 Apr 2015 by ritterm
Post:
...I just checked that host and your other ones and they all were working and getting completed tasks and no errors.

You can't do better than that!

That's true, but...I think I ought to be able to do better than having to reboot one host every day after its task has "stalled out" in order to successfully complete it. ;-)

I know there can be all sorts of reasons why one host in particular has a problem, but it would be nice to know why. Maybe I'll never know, but that's why I posed the question here. :-)

Anyway, I'm taking a break from this project for a bit to work on some others prior to the BOINC Pentathlon.

Cheers!

MarkR
14) Message boards : Number crunching : Tasks Stalled at 100% (Message 292)
Posted 25 Apr 2015 by ritterm
Post:
I'm going to try suspending WCG tasks when the problem host nears the end of the CMS-dev task it's running now. If the task stalls, I'll try running only CMS-dev on the next one.

...and doing neither of those things made a difference. It seems as soon as the task reaches 100%, the VM powers down, but the task continues to stay in a running state.

I ran a few ATLAS tasks just to make sure the host didn't have a general problem with VM tasks and had no issues. I uninstalled the VM-ware and upgraded to the latest BOINC and re-installed the VM-ware. The next task stalled just like before.

Oh, well...no more CMS-dev tasks for this host.
15) Message boards : Number crunching : Task won't start (Message 283)
Posted 23 Apr 2015 by ritterm
Post:
So, it works. I don't know what helped it to get in gear...

Well, good news and bad news, I guess. I suspect it was the reboot clearing up something, but, who knows. Using VMs adds a layer of complexity -- which is not necessarily a bad thing -- and I've seen tasks running under them behave strangely...or not behave, at all. :D

Anyway, glad you are back in business!

Crunch on,

MarkR
16) Message boards : Number crunching : Task won't start (Message 281)
Posted 22 Apr 2015 by ritterm
Post:
I think the cpu_sched and/or cpu_sched_debug flags are what might be helpful...

And, maybe task_debug, too?
17) Message boards : Number crunching : Job queue empty!! (Message 280)
Posted 22 Apr 2015 by ritterm
Post:
ERROR:root:No message received! Nothing to do! . . .

I'm not sure I understand... Where is this error message coming from and what job queue are you saying is empty? I see plenty of tasks available on the server status page and my hosts haven't had any problem getting work, but maybe that's not what you are referring to.
18) Message boards : Number crunching : Task won't start (Message 279)
Posted 22 Apr 2015 by ritterm
Post:
Nothing specific from CMS in the Event Log. I could turn on all the debug flags and do the same sequence of actions...

If you end up wanting to try this, it turns out the newer BOINC manager has the ability to easily enable logging flags. The following text is from the Client Configuration portion of the BOINC client documentation:

Beginning with version 7.4.26, the BOINC Manager has a dialog to make it easier to edit the logging options. You can access this dialog by selecting Event Log Diagnostic Flags from the Advanced menu or by pressing CTRL+Shift+F simultaneously on the keyboard in either the Advanced View or the Simple View (Command+Shift+F on a Macintosh.)

I think the cpu_sched and/or cpu_sched_debug flags are what might be helpful. Follow the link to the documentation for details. You may find other flags to be helpful, as well.
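
For clients without that dialog, the same flags can be set by hand in the client's cc_config.xml. Below is a minimal Python sketch that generates such a file's contents for the two flags named above; the `build_cc_config` helper is hypothetical, and the output should be checked against the client configuration documentation before it is copied into the BOINC data directory.

```python
import xml.etree.ElementTree as ET


def build_cc_config(flags):
    """Build cc_config.xml text with each named log flag switched on."""
    root = ET.Element("cc_config")
    log_flags = ET.SubElement(root, "log_flags")
    for name in flags:
        # Each flag is its own element; 1 enables it.
        ET.SubElement(log_flags, name).text = "1"
    ET.indent(root)  # one tag per line; requires Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


xml_text = build_cc_config(["cpu_sched", "cpu_sched_debug"])
print(xml_text)
```

After saving the file, the client has to re-read its config files (or be restarted) before the extra messages show up in the Event Log.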
19) Message boards : Number crunching : Task won't start (Message 275)
Posted 22 Apr 2015 by ritterm
Post:
Sooooo... What is going on?

I see that you have ATLAS and vLHC credit, but does any of that belong to the host in question? Just looking to confirm that you've successfully run VM tasks from other projects on the problem host. If so, is it possible something has changed (e.g., virtualization got disabled in the BIOS somehow)? Will other VM tasks run on that host right now or any of your other hosts that might be having the same problem?

Are there any CMS-dev-related messages in your BOINC manager that provide any insight? I know it's not much help since I don't know off the top of my head how to do it, but you can enable certain debug messages (not sure if "debug" is the right term) that show reasons tasks do or don't run.

Since attaching to CMS-dev, have you re-booted the host to see if that has an effect?
20) Message boards : Number crunching : Tasks Stalled at 100% (Message 272)
Posted 22 Apr 2015 by ritterm
Post:
...It does depend on whether you run other projects at the same time or, like over at ATLAS, whether you try to run more than one task when you don't have enough RAM.

Two of your hosts only have 6 and 4GB RAM, so you can have problems if you try running a CMS-dev task along with other tasks on a host such as your quad-core with only 4GB RAM...

Thanks for the feedback, MAGIC.

I haven't had any problem (yet?) with my 4GB host and have completed two CMS-dev tasks. It is running 2 WCG CPU tasks alongside CMS-dev and has one core reserved to support the GPU running GPUGrid.

On my problem host, the one with 6GB RAM, the only other project I'm running is WCG. Looking in Windows Task Manager, I haven't seen it using more than about 45% of the available RAM while running CMS-dev and WCG tasks. Of course, I'm not sure I understand how/if VM resources are shown in WTM. For instance, it looks like the 3 WCG tasks are at 25% each, with system idle taking the other 25%, even though the CMS-dev task is running and showing progress. In addition, I don't see any VM resources taking much/any RAM.

I'm going to try suspending WCG tasks when the problem host nears the end of the CMS-dev task it's running now. If the task stalls, I'll try running only CMS-dev on the next one.




©2020 CERN