1) Message boards : CMS Application : *Almost* all of my CMS tasks are failing across all of my 5 hosts (Message 4700)
Posted 23 Feb 2017 by Profile Tuna Ertemalp
Post:
computezrmle wrote:
OK, let's start from scratch and focus on the CMS subproject on lhcathomedev.cern.ch.

Thanks!

computezrmle wrote:
Unfortunately I can't help to solve the STATUS_STACK_BUFFER_OVERRUN, but Crystal Pellet already made a comment and I would recheck the settings on the affected host.

The recommendation was to turn off Hyper-V, but Hyper-V was already turned off on all of my hosts. Besides, again, no other VM-based project (ATLAS, RNA, Duchamp, LHC@Home) is having this problem on the very same hosts; only CMS from lhcathome-dev is.
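(For reference, here is how I double-checked the Hyper-V state across hosts from a script — just a sketch; the helper functions are my own, and they assume PowerShell's `Get-WindowsOptionalFeature` cmdlet is available and run with admin rights:)

```python
# Sketch: confirm the Hyper-V state on a Windows host.
# parse_feature_state() and hyperv_state() are hypothetical helpers,
# not anything shipped by BOINC or the project.
import subprocess

def parse_feature_state(output: str) -> str:
    """Extract the 'State' value from Get-WindowsOptionalFeature output."""
    for line in output.splitlines():
        if line.strip().startswith("State"):
            return line.split(":", 1)[1].strip()
    return "Unknown"

def hyperv_state() -> str:
    """Query Windows for the Hyper-V optional feature's state."""
    out = subprocess.run(
        ["powershell", "-Command",
         "Get-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All"],
        capture_output=True, text=True).stdout
    return parse_feature_state(out)

# Usage (on Windows, elevated): print(hyperv_state())
# A VirtualBox host should report "Disabled".
```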

computezrmle wrote:
Regarding the EXIT_INIT_FAILURE:
It should be solved as this error was caused by the project and not by your local settings.

OK, so there is nothing for me to do on my hosts, then. I guess I'll leave my hosts set to NoNewTasks for lhcathome-dev and wait a few days/weeks until all the current CMS jobs in the lhcathome-dev queue have hopefully gone through to others; with luck the new tasks will use the same VM as the non-dev LHC@Home, and I won't see these errors again.

That said, I still find it a little curious that I might be the only one with this problem on all of my hosts. Maybe other users have hosts dedicated only to the dev project, and therefore don't see this issue?

Thanks for the help!
Tuna
2) Message boards : CMS Application : *Almost* all of my CMS tasks are failing across all of my 5 hosts (Message 4696)
Posted 23 Feb 2017 by Profile Tuna Ertemalp
Post:
Sorry, but I am not following. My apologies if I am missing something obvious...

As I explained:

- the non-dev project(s) are working just fine with almost no problems: almost ALL of their tasks succeed
- the dev project is the ONLY one returning errors: almost ALL of its tasks fail

And, the machines have all the resources they would need: RAM/HD/etc.

And both the dev and the non-dev projects seem to have plenty of jobs to send out, as shown by the server status pages: currently 274 jobs at LHC@Home, of which 86 are CMS, and 352 jobs at lhcathome-dev, of which 50 are CMS. So there shouldn't be any problem getting a job from CERN's job queue, and therefore (as far as I understand) my VMs shouldn't be erroring out for lack of a job the way you describe.

Even if these VMs were catching the CERN job servers empty, or failing to connect to them, why would it ALWAYS happen against lhcathome-dev and almost never against LHC@Home?

Tuna
3) Message boards : CMS Application : *Almost* all of my CMS tasks are failing across all of my 5 hosts (Message 4692)
Posted 23 Feb 2017 by Profile Tuna Ertemalp
Post:
computezrmle wrote:
The dev-project and the non-dev-project send out different versions of the VBox VM.

So, this means that one shouldn't be running lhc@home and lhcathome-dev on the same host. Am I understanding this correctly?

Thanks
Tuna
4) Message boards : CMS Application : *Almost* all of my CMS tasks are failing across all of my 5 hosts (Message 4689)
Posted 23 Feb 2017 by Profile Tuna Ertemalp
Post:
computezrmle wrote:
Most of the errors show the status code "206 (0x000000CE) EXIT_INIT_FAILURE".
This is a side-effect caused by the WMAgent upgrade explained in this thread:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4124

Looking at that thread, it seems there was nothing to do on the client side. Are you saying that people getting CMS jobs from LHC@Home are currently seeing this problem (I have only one such task on one host, still in progress), or that the 50 unsent and 132 in-progress CMS jobs at lhcathome-dev (http://lhcathomedev.cern.ch/lhcathome-dev/server_status.php) have a problem that was already solved on LHC@Home?

Tuna
5) Message boards : CMS Application : *Almost* all of my CMS tasks are failing across all of my 5 hosts (Message 4687)
Posted 23 Feb 2017 by Profile Tuna Ertemalp
Post:
Tuna Ertemalp wrote:
I will try this.

And, guess what? None of those machines had Hyper-V turned on. It seems the feature is not turned on by default, since I don't remember ever turning it off during install. Then again, the OS installs were done many months ago, with Windows Update keeping them current; I simply made sure everything was really up to date a week ago.

Tuna
6) Message boards : CMS Application : *Almost* all of my CMS tasks are failing across all of my 5 hosts (Message 4686)
Posted 23 Feb 2017 by Profile Tuna Ertemalp
Post:
Crystal Pellet wrote:
Tuna Ertemalp wrote:
I took the time to update all my hosts to the latest OS (Microsoft Windows 10 Professional x64 Edition

The above is my first suspicion.
Win10 has Hyper-V on board, and it is enabled by default.
This conflicts with VirtualBox.
Disable Hyper-V in Windows 10.

I will try this. However, then how come:

Tuna Ertemalp wrote:
all other projects seem to have a high success rate using this VirtualBox (ATLAS@Home, LHC@Home, Sourcefinder Duchamp, RNA World)

Tuna
7) Message boards : CMS Application : *Almost* all of my CMS tasks are failing across all of my 5 hosts (Message 4684)
Posted 23 Feb 2017 by Profile Tuna Ertemalp
Post:
Pretty much all tasks of BOINC projects using a VM were failing on my hosts, so right after Valentine's Day I took the time to update all my hosts to the latest OS (Microsoft Windows 10 Professional x64 Edition, 10.00.14393.00), BOINC client (7.6.33 x64), and VirtualBox (5.1.14), and to replace AVG as the antivirus with the built-in Windows Defender. Since then, on all my hosts, almost all ATLAS@Home (140+ successes vs. 10 fails since Feb 18) and LHC@Home (25 successes, a mix of LHCb & SixTrack, vs. 5 LHCb fails today) tasks have succeeded, which had not been the case for a looooooong time. The numbers above are what the project sites give me for recent runs, but I have been watching them since my updates, and they have all been remarkably clean.

Well... Except for LHCathome-dev.

Looking at it right now, filtering out manually aborted & in-progress tasks: among the 396 tasks sent & returned since Feb 14, only 6 were "Completed and validated"; the remaining 390 got "Error while computing". I first tried to see if there was a pattern by application, but no; they are all "CMS Simulation v48.00 (vbox64_mt_mcore_cms) windows_x86_64". Then I looked at the hosts they ran on, to see whether a certain host was failing consistently, or a certain host was consistently NOT failing. No dice; they are all failing. The only interesting thing is that the only host that ever returned successful tasks was http://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=318, an old PC with an old, slow CPU and an old GTX 480 for a GPU. But outside those 6 valid tasks, it also returned 17 errors. The remaining 4 hosts (http://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=920, http://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=1553, http://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=1554, http://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=1725) returned nothing but errors. And the first three of those four are damn near high-end hardware (hyperthreaded 8-core i7-5960X with dual or quad TITAN X or TITAN Z and 32G or 64G RAM).
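(The per-host pattern check I describe above amounts to a simple tally. A sketch, with made-up data standing in for the real task list from the results pages:)

```python
# Sketch: tally task outcomes per host to spot a host-specific pattern.
# The (host_id, outcome) pairs below are illustrative, not real data.
from collections import Counter, defaultdict

def outcomes_by_host(tasks):
    """Map host id -> Counter of task outcomes ('valid' / 'error')."""
    per_host = defaultdict(Counter)
    for host_id, outcome in tasks:
        per_host[host_id][outcome] += 1
    return per_host

tasks = [
    (318, "valid"), (318, "error"),   # the old PC: a few of each
    (920, "error"), (1553, "error"),  # the high-end hosts: errors only
]
for host, counts in sorted(outcomes_by_host(tasks).items()):
    print(host, dict(counts))
```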

I don't have enough expertise to look at the logs and pinpoint the error exactly, but the following sort of thing (marked with ****) caught my eye (this one is from Task 310885, with exit status "206 (0x000000CE) EXIT_INIT_FAILURE"):

. . .
2017-02-20 19:45:34 (9836): Guest Log: VBoxService 4.3.28 r100309 (verbosity: 0) linux.amd64 (May 13 2015 17:11:31) release log
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000054 main     Log opened 2017-02-21T03:45:27.457349000Z
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000195 main     OS Product: Linux
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000227 main     OS Release: 4.1.37-26.cernvm.x86_64
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000253 main     OS Version: #1 SMP Thu Jan 5 15:23:42 CET 2017
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000277 main     OS Service Pack: #1 SMP Thu Jan 5 15:23:42 CET 2017
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000302 main     Executable: /usr/sbin/VBoxService
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000302 main     Process ID: 2927
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000303 main     Package type: LINUX_64BITS_GENERIC
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.001172 main     4.3.28 r100309 started. Verbose level = 0
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.008723 automount VBoxServiceAutoMountWorker: Shared folder "shared" was mounted to "/media/sf_shared"
2017-02-20 19:45:54 (9836): Guest Log: [INFO] Mounting the shared directory
2017-02-20 19:45:54 (9836): Guest Log: [INFO] Shared directory mounted, enabling vboxmonitor
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Testing network connection to cern.ch on port 80
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Connection to cern.ch 80 port [tcp/http] succeeded!
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] 0
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Testing CVMFS connection to lhchomeproxy.cern.ch on port 3125
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Connection to lhchomeproxy.cern.ch 3125 port [tcp/a13-an] succeeded!
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] 0
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Testing VCCS connection to vccs1.cern.ch on port 443
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Connection to vccs1.cern.ch 443 port [tcp/https] succeeded!
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] 0
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Connection to vccondor01.cern.ch 9618 port [tcp/condor] succeeded!
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] 0
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Probing CVMFS ...
2017-02-20 19:45:54 (9836): Guest Log: Probing /cvmfs/grid.cern.ch... OK
2017-02-20 19:45:54 (9836): Guest Log: Probing /cvmfs/cms.cern.ch... OK
2017-02-20 19:45:54 (9836): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2017-02-20 19:45:54 (9836): Guest Log: 2.2.0.0 3633 0 21804 3958 14 1 1651184 10240001 2 65024 0 20 100 15200 2 http://cvmfs.fnal.gov/cvmfs/grid.cern.ch http://131.225.205.134:3125 1
2017-02-20 19:46:14 (9836): Guest Log: [INFO] Reading volunteer information
2017-02-20 19:46:14 (9836): Guest Log: [INFO] Volunteer: Tuna Ertemalp (206) Host: 318
2017-02-20 19:46:14 (9836): Guest Log: [INFO] VMID: e157435d-c4c6-41b0-bda1-b31c0f9afa17
2017-02-20 19:46:14 (9836): Guest Log: [INFO] Requesting an X509 credential from vLHC@home
2017-02-20 19:46:14 (9836): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2017-02-20 19:46:14 (9836): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev
2017-02-20 19:46:14 (9836): Guest Log: [INFO] CMS application starting. Check log files.
**** 2017-02-20 19:46:14 (9836): Guest Log: [DEBUG] HTCondor ping
**** 2017-02-20 19:46:14 (9836): Guest Log: [DEBUG] 0
**** 2017-02-20 19:56:24 (9836): Guest Log: [ERROR] Condor exited after 614s without running a job.
2017-02-20 19:56:24 (9836): Guest Log: [INFO] Shutting Down.
2017-02-20 19:56:24 (9836): VM Completion File Detected.
**** 2017-02-20 19:56:24 (9836): VM Completion Message: Condor exited after 614s without running a job.
. . .


I don't know if all 390 of my failures are the same, but a random spot check of a dozen showed the same error, except for one: "-1073740791 (0xC0000409) STATUS_STACK_BUFFER_OVERRUN" for Task 311043. Of course, there might be more tasks with different errors, but I don't know of an easy way to query for those without access to the database, other than looking at them one by one on the website and taking notes... Blech.
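(If the task outputs were saved locally as text files, the spot check could at least be scripted. A sketch — the directory name is hypothetical, and the regex simply matches exit-status strings of the form quoted above:)

```python
# Sketch: tally exit statuses across saved task stderr/output files.
# Assumes each task's text was saved under ./task_logs/ (hypothetical path);
# the pattern matches strings like "206 (0x000000CE) EXIT_INIT_FAILURE".
import re
from collections import Counter
from pathlib import Path

STATUS_RE = re.compile(r"(-?\d+ \(0x[0-9A-Fa-f]+\) [A-Z_]+)")

def tally_statuses(texts):
    """Count each distinct exit-status string found in the given texts."""
    counts = Counter()
    for text in texts:
        m = STATUS_RE.search(text)
        if m:
            counts[m.group(1)] += 1
    return counts

def tally_dir(log_dir="task_logs"):
    """Tally statuses over every *.txt file in log_dir."""
    return tally_statuses(p.read_text(errors="ignore")
                          for p in Path(log_dir).glob("*.txt"))

# Usage: print(tally_dir())
```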

Can anyone on the project team please look at my failed tasks and tell me if there is anything I can do? Since I have all the latest software, all other projects seem to have a high success rate using this same VirtualBox (ATLAS@Home, LHC@Home, Sourcefinder Duchamp, RNA World), and all these hosts have ridiculously plentiful CPU/HD/RAM, the usual responses along the lines of "get the latest version" or "check your disk space" won't apply.

Thanks!
Tuna
8) Message boards : Number crunching : task postponed 86400.000000 sec: VM Hypervisor failed to enter an online state in a timely fashion. (Message 307)
Posted 30 Apr 2015 by Profile Tuna Ertemalp
Post:
As far as I can see, these happen if the system is shut down for a while.
For some reason, the wrapper fails to restart the VM.


See http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=41&postid=306#306; I'm having the same problem. And, I am not shutting down my machines. They are running pretty much 24/7.
9) Message boards : Number crunching : CMS VBox cannot access vm_image.vdi (Message 306)
Posted 30 Apr 2015 by Profile Tuna Ertemalp
Post:
So, I cancelled that task, since nobody asked me for more info from that machine. But since then I have noticed it happening on a lot of my machines, all because vm_image.vdi became inaccessible, with symptoms similar to what I already linked to earlier via screenshots.

Is no CMS admin/programmer really interested in this??

Tuna
10) Message boards : Number crunching : CMS VBox cannot access vm_image.vdi (Message 302)
Posted 27 Apr 2015 by Profile Tuna Ertemalp
Post:
Have you checked the VB Virtual Media Manager to see if you have any flagged files that need to be deleted? In the BOINC manager, do you have the "leave application in memory when suspended" option checked? I'm asking because it can gobble up a lot of memory, and the time spent swapping virtual memory with physical memory can cause the error you are seeing. Just a thought from a foolish old man who hasn't programmed since the 1980s.


Nobody who was in programming during the 80s was a fool... :)

I believe your question about the VB Virtual Media Manager is answered in the 2nd screenshot I linked to. There is this vm_image.vdi that seems to be in trouble, but I don't know what put it into that state. There don't seem to be any alerts about anything else.

I used to have "leave apps in mem" turned on a few weeks ago, before I had even heard of CMS, and quickly realized how my 12G of RAM was filling up under the weight of suspended tasks from a subset of the 40+ projects running on this machine, with apps set to switch every 15 mins. Within a few days I regretted that decision and turned it off. I guess my next machine will need 256G of RAM!

Any other ideas?

Tuna
11) Message boards : Number crunching : CMS VBox cannot access vm_image.vdi (Message 296)
Posted 26 Apr 2015 by Profile Tuna Ertemalp
Post:
So, I don't know why the first one worked, this one is not (at least for now), and don't know if it will get out of this Postponed state eventually and continue and finish.

The postponed state of the VM (BOINC task waiting to run) often happens when the VirtualBox process VBoxSVC.exe was not able to react in a timely fashion.
Maybe it was too busy, or too many VMs were running.
After 86400 seconds BOINC will try to restart the task.
You could restart BOINC to speed up the resume of the task.


Hmmm... That doesn't seem to be the case. To test the situation, I suspended all projects, including CMS, rebooted the machine, waited until everything settled down after boot, then un-suspended only CMS, which means this was the only task running. Still, after about 15 mins, it entered the same postponed state at 61.480%, with this in the log: "4/26/2015 2:46:48 PM | CMS-dev | task postponed 86400.000000 sec: VM Hypervisor failed to enter an online state in a timely fashion."

Starting Oracle VM VirtualBox, I first see http://1drv.ms/1JtEdJl, then http://1drv.ms/1JtEnjK after clicking Check.

So, somehow the vm_image.vdi seems to be bad. Since nobody other than CMS is using it, there might be a problem with whatever CMS is doing.

If anybody wants any data from my machine, please let me know. I'll keep this task around for a while as-is.

Tuna
12) Message boards : Number crunching : CMS VBox cannot access vm_image.vdi (Message 290)
Posted 25 Apr 2015 by Profile Tuna Ertemalp
Post:
So, I have many projects crunching, including CMS/ATLAS/vLHC. As such, sometimes they run at the same time. That is what is happening right now, and CMS is in the "postponed" state: http://1drv.ms/1QrWc4H

I looked into Oracle VM VBox and saw that both vLHC (http://1drv.ms/1QrWF71 and http://1drv.ms/1QrWGrz) and CMS (http://1drv.ms/1QrWHM5) were trying to mount vm_image.vdi, with vLHC succeeding and CMS failing. Also, ATLAS was running with no problems: http://1drv.ms/1QrWIzS

At first I thought they were mounting the same file, but upon further investigation in Oracle VM VBox, it turns out the vLHC VBoxes were using D:\BOINC_DATA\slots\31\vm_image.vdi and D:\BOINC_DATA\slots\25\vm_image.vdi, while CMS was attached to D:\BOINC_DATA\slots\11\vm_image.vdi. In other words, different files.
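(Checking which slots hold a vm_image.vdi, and whether each one is still readable, can also be scripted rather than hovered over in the GUI. A sketch — D:\BOINC_DATA is my data directory from above, so adjust for your own setup:)

```python
# Sketch: list every vm_image.vdi under the BOINC slots directory and
# check that each is still readable (a crude access probe, nothing more).
from pathlib import Path

def find_vdi_images(data_dir):
    """Return {path: readable?} for each slots/*/vm_image.vdi under data_dir."""
    result = {}
    for vdi in Path(data_dir).glob("slots/*/vm_image.vdi"):
        try:
            with open(vdi, "rb") as f:
                f.read(512)          # try reading the header as a probe
            result[str(vdi)] = True
        except OSError:
            result[str(vdi)] = False
    return result

# Usage: print(find_vdi_images(r"D:\BOINC_DATA"))
```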

Then, when I hovered my mouse pointer over the filename, a descriptive message popped up: http://1drv.ms/1ITChqd

The XML file it is referring to is at http://1drv.ms/1E2j9VH.

This machine has already processed one CMS job and returned it two days ago for 749.55 credit, and this would be the second CMS job it is working on: http://boincai05.cern.ch/CMS-dev/results.php?hostid=315

So, I don't know why the first one worked, this one is not (at least for now), and don't know if it will get out of this Postponed state eventually and continue and finish.

Just thought I'd let folk know in case this means something.

Thanks
Tuna
13) Message boards : Number crunching : Task won't start (Message 282)
Posted 23 Apr 2015 by Profile Tuna Ertemalp
Post:
Well, I came home, suspended all projects, restarted the machine, forced the download of some jobs for ATLAS and vLHC (to test later whether CMS still didn't work), turned on all the log flags mentioned here plus a few more I thought might be useful, and unsuspended CMS, and it started showing progress, with a VBox being created and entering the running state...

So, it works. I don't know what got it in gear. The reboot? Downloading ATLAS/vLHC jobs? Turning on more logging?... I have seen stranger things over the years... :-)

Anyways, it works now. It is worth noting that on all my other machines that are VBox capable, CMS started its first task without the need of a reboot. So, one wonders why this one needed it.

Thanks for the help
Tuna
14) Message boards : Number crunching : Task won't start (Message 276)
Posted 22 Apr 2015 by Profile Tuna Ertemalp
Post:
Yes, the host has ATLAS and vLHC credits. Nothing changed in the BIOS. I will have to get back home to verify whether another VBox app is able to run, but I am pretty sure there is no problem there. I'll double-check.

Nothing specific from CMS in the Event Log. I could turn on all the debug flags and do the same sequence of actions...

The host hasn't been rebooted since yesterday morning, when I signed up to CMS. I'll try that, too.

One more thing I noticed: BOINC Manager considers the task "Active" when I filter with Show Active Tasks as opposed to Show All Tasks. Yet it is stuck at 0%, Waiting to Run. Weird...
15) Message boards : Number crunching : Task won't start (Message 274)
Posted 22 Apr 2015 by Profile Tuna Ertemalp
Post:
So, I signed up to CMS, Suspend'd & NoNewTask'd the other 40+ projects running on my biggest machine (6 cores/12 threads, GTX 480, Win7), added CMS to it, and forced an Update, which received a single task: CMS_30012_1427806601.760414. I would have expected it to start running immediately, since everything else was suspended. Nope; it is stuck at "Waiting to run". Just to test, I Resume'd WUprop@Home, and that task immediately started running, with CMS still at "Waiting to run". When I Resume everything else, all other tasks start running or waiting their turn, except for CMS.

I am running the latest BOINC (7.4.42) & VirtualBox (4.3.12), with plenty of disk & memory available. My VBox has no Machines shown in it.

I have 6 other machines, all of which are also attached to CMS, 4 of which have received a single task each, and I am pretty sure they are in the same state, but I didn't check.

Sooooo... What is going on?

Thanks
Tuna


