Message boards : CMS Application : *Almost* all of my CMS tasks are failing across all of my 5 hosts
Tuna Ertemalp
Joined: 21 Apr 15
Posts: 15
Credit: 106,597
RAC: 0
Message 4684 - Posted: 23 Feb 2017, 4:37:25 UTC

Pretty much all tasks of BOINC projects using a VM were failing on my hosts, so right after Valentine's Day I took the time to update all my hosts to the latest OS (Microsoft Windows 10 Professional x64 Edition, 10.00.14393.00), BOINC client (7.6.33 x64), and VirtualBox (5.1.14), and to replace AVG as the antivirus with the built-in Windows Defender. Since then, on all my hosts, almost all ATLAS@Home (140+ successes vs. 10 fails since Feb 18) and LHC@Home (25 successes (mix of LHCb & SixTrack) vs. 5 LHCb fails today) tasks have succeeded, which had not been the case for a looooooong time. The above numbers are what the project sites give me for recent runs, but I have been watching them since my updates, and they have all been remarkably clean.

Well... Except for LHCathome-dev.

Looking at it right now, filtering out manually aborted & in-progress tasks, among the 396 tasks sent & returned since Feb 14, only 6 "Completed and validated"; the remaining 390 got "Error while computing". I first tried to see if there was a pattern by application, but nope; they are all "CMS Simulation v48.00 (vbox64_mt_mcore_cms) windows_x86_64". Then I looked at the hosts they ran on, to see if a certain host was failing consistently, or a certain host was consistently NOT failing. No dice. They are all failing. The only interesting thing is that the only host that ever returned successful tasks was http://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=318, an old PC with an old slow CPU and an old GTX 480 for a GPU. But outside those 6 valid tasks, it also returned 17 errors. The remaining 4 hosts (http://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=920, http://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=1553, http://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=1554, http://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=1725) only returned errors. And the first three of those four are running damn near high-end hardware (hyperthreaded 8-core i7-5960X with dual or quad TITAN X or TITAN Z and 32 GB or 64 GB RAM).

I don't have enough expertise to look at the logs and pinpoint the error exactly, but the following sort of thing (marked with ****) caught my eye (this one is for Task 310885, with exit status "206 (0x000000CE) EXIT_INIT_FAILURE"):

. . .
2017-02-20 19:45:34 (9836): Guest Log: VBoxService 4.3.28 r100309 (verbosity: 0) linux.amd64 (May 13 2015 17:11:31) release log
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000054 main     Log opened 2017-02-21T03:45:27.457349000Z
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000195 main     OS Product: Linux
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000227 main     OS Release: 4.1.37-26.cernvm.x86_64
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000253 main     OS Version: #1 SMP Thu Jan 5 15:23:42 CET 2017
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000277 main     OS Service Pack: #1 SMP Thu Jan 5 15:23:42 CET 2017
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000302 main     Executable: /usr/sbin/VBoxService
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000302 main     Process ID: 2927
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.000303 main     Package type: LINUX_64BITS_GENERIC
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.001172 main     4.3.28 r100309 started. Verbose level = 0
2017-02-20 19:45:34 (9836): Guest Log: 00:00:00.008723 automount VBoxServiceAutoMountWorker: Shared folder "shared" was mounted to "/media/sf_shared"
2017-02-20 19:45:54 (9836): Guest Log: [INFO] Mounting the shared directory
2017-02-20 19:45:54 (9836): Guest Log: [INFO] Shared directory mounted, enabling vboxmonitor
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Testing network connection to cern.ch on port 80
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Connection to cern.ch 80 port [tcp/http] succeeded!
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] 0
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Testing CVMFS connection to lhchomeproxy.cern.ch on port 3125
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Connection to lhchomeproxy.cern.ch 3125 port [tcp/a13-an] succeeded!
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] 0
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Testing VCCS connection to vccs1.cern.ch on port 443
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Connection to vccs1.cern.ch 443 port [tcp/https] succeeded!
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] 0
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Testing connection to Condor server on port 9618
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Connection to vccondor01.cern.ch 9618 port [tcp/condor] succeeded!
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] 0
2017-02-20 19:45:54 (9836): Guest Log: [DEBUG] Probing CVMFS ...
2017-02-20 19:45:54 (9836): Guest Log: Probing /cvmfs/grid.cern.ch... OK
2017-02-20 19:45:54 (9836): Guest Log: Probing /cvmfs/cms.cern.ch... OK
2017-02-20 19:45:54 (9836): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2017-02-20 19:45:54 (9836): Guest Log: 2.2.0.0 3633 0 21804 3958 14 1 1651184 10240001 2 65024 0 20 100 15200 2 http://cvmfs.fnal.gov/cvmfs/grid.cern.ch http://131.225.205.134:3125 1
2017-02-20 19:46:14 (9836): Guest Log: [INFO] Reading volunteer information
2017-02-20 19:46:14 (9836): Guest Log: [INFO] Volunteer: Tuna Ertemalp (206) Host: 318
2017-02-20 19:46:14 (9836): Guest Log: [INFO] VMID: e157435d-c4c6-41b0-bda1-b31c0f9afa17
2017-02-20 19:46:14 (9836): Guest Log: [INFO] Requesting an X509 credential from vLHC@home
2017-02-20 19:46:14 (9836): Guest Log: [INFO] Requesting an X509 credential from LHC@home
2017-02-20 19:46:14 (9836): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev
2017-02-20 19:46:14 (9836): Guest Log: [INFO] CMS application starting. Check log files.
**** 2017-02-20 19:46:14 (9836): Guest Log: [DEBUG] HTCondor ping
**** 2017-02-20 19:46:14 (9836): Guest Log: [DEBUG] 0
**** 2017-02-20 19:56:24 (9836): Guest Log: [ERROR] Condor exited after 614s without running a job.
2017-02-20 19:56:24 (9836): Guest Log: [INFO] Shutting Down.
2017-02-20 19:56:24 (9836): VM Completion File Detected.
**** 2017-02-20 19:56:24 (9836): VM Completion Message: Condor exited after 614s without running a job.
. . .


I don't know if all 390 of my failures are the same, but a random spot check of a dozen showed the same error, except for one: "-1073740791 (0xC0000409) STATUS_STACK_BUFFER_OVERRUN" for Task 311043. Of course, there might be more tasks with different errors, but I don't know of an easy way to query for those without access to the database, other than looking at them one by one on the website and taking notes... Blech.
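For what it's worth, a small script could tally the exit statuses by scraping the public task pages. Here is a rough Python sketch; it assumes the stock BOINC results.php/result.php page layout and URL parameters, the regexes are guesses, and I haven't tested it against this server:

import re
import urllib.request
from collections import Counter

BASE = "http://lhcathomedev.cern.ch/lhcathome-dev"
HOST_IDS = [318, 920, 1553, 1554, 1725]  # my five hosts

def fetch(url):
    # Plain stdlib fetch; the public task pages need no login.
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", "replace")

statuses = Counter()
for hostid in HOST_IDS:
    # NOTE: results.php paginates (offset=...); this only reads the first page.
    listing = fetch(f"{BASE}/results.php?hostid={hostid}")
    for resultid in sorted(set(re.findall(r"result\.php\?resultid=(\d+)", listing))):
        detail = fetch(f"{BASE}/result.php?resultid={resultid}")
        # Guess at the "Exit status" table row on the result detail page.
        m = re.search(r"Exit status.*?<td[^>]*>([^<]+)", detail, re.S)
        if m:
            statuses[m.group(1).strip()] += 1

for status, count in statuses.most_common():
    print(f"{count:4d}  {status}")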

Can anyone on the project team please look at my failed tasks and tell me if there is anything I can do? Since I have all the latest software, since all other projects seem to have a high success rate using this VirtualBox (ATLAS@Home, LHC@Home, Sourcefinder Duchamp, RNA World), and since all these hosts have plenty of CPU/disk/RAM, the usual responses along the lines of "get the latest version" or "make sure you have enough disk space" won't apply.

Thanks!
Tuna
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 4685 - Posted: 23 Feb 2017, 8:18:07 UTC

Tuna Ertemalp wrote:
I took the time to update all my hosts to the latest OS (Microsoft Windows 10 Professional x64 Edition

The above is my first suspicion.
Win10 has Hyper-V on board, and it is enabled by default.
It conflicts with VirtualBox.
Disable Hyper-V in Windows 10.
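For example, from an elevated PowerShell prompt, followed by a reboot (a sketch from memory; to the best of my knowledge the optional-feature name is Microsoft-Hyper-V-All):

Disable-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All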
Tuna Ertemalp
Joined: 21 Apr 15
Posts: 15
Credit: 106,597
RAC: 0
Message 4686 - Posted: 23 Feb 2017, 8:26:31 UTC - in response to Message 4685.  
Last modified: 23 Feb 2017, 8:33:55 UTC

Crystal Pellet wrote:
Tuna Ertemalp wrote:
I took the time to update all my hosts to the latest OS (Microsoft Windows 10 Professional x64 Edition

The above is my first suspicion.
Win10 has Hyper-V on board, and it is enabled by default.
It conflicts with VirtualBox.
Disable Hyper-V in Windows 10.

I will try this. But then how come:

Tuna Ertemalp wrote:
all other projects seem to have a high success rate using this VirtualBox (ATLAS@Home, LHC@Home, Sourcefinder Duchamp, RNA World)

Tuna
Tuna Ertemalp
Joined: 21 Apr 15
Posts: 15
Credit: 106,597
RAC: 0
Message 4687 - Posted: 23 Feb 2017, 8:33:09 UTC - in response to Message 4686.  
Last modified: 23 Feb 2017, 8:34:16 UTC

Tuna Ertemalp wrote:
I will try this.

And, guess what? None of those machines had Hyper-V turned on. It seems the feature is not turned on by default, since I don't remember ever turning it off during install. Then again, the OS installs were done many months ago, with Windows Update keeping them up to par; I simply made sure everything was really up to date a week ago.
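For anyone else wanting to check: as far as I know, an elevated PowerShell prompt can confirm the state with

Get-WindowsOptionalFeature -Online -FeatureName Microsoft-Hyper-V-All

which should report State : Disabled when the feature is off.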

Tuna
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 4688 - Posted: 23 Feb 2017, 8:41:13 UTC - in response to Message 4684.  

Most of the errors show the status code "206 (0x000000CE) EXIT_INIT_FAILURE".
This is a side effect of the WMAgent upgrade explained in this thread:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4124
Tuna Ertemalp
Joined: 21 Apr 15
Posts: 15
Credit: 106,597
RAC: 0
Message 4689 - Posted: 23 Feb 2017, 8:55:47 UTC - in response to Message 4688.  

computezrmle wrote:
Most of the errors show the status code "206 (0x000000CE) EXIT_INIT_FAILURE".
This is a side effect of the WMAgent upgrade explained in this thread:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4124

Looking at that thread, it seems there was nothing to do on the client side. Are you saying that people getting CMS jobs from LHC@Home are currently seeing this problem (I have only one such task, on one host, still in progress)? Or are you saying that the 50 unsent and 132 in-progress CMS jobs at lhcathome-dev (http://lhcathomedev.cern.ch/lhcathome-dev/server_status.php) have a problem that was already solved on LHC@Home?

Tuna
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 4691 - Posted: 23 Feb 2017, 9:39:07 UTC - in response to Message 4689.  

The dev-project and the non-dev-project send out different versions of the VBox VM.
This has worked stably over the last few weeks, and the number of available WUs can be seen on the server status pages.

Once your local host has started a VM, that VM contacts CERN's job distribution system (WMAgent, Condor, ...).
That system serves both the dev-project and the non-dev-project.

A process inside the VM ensures that the VM aborts if it can't get a job.
Unfortunately, this abort ends up being reported in different ways (see the sketch below):
1. If the VM has already delivered at least one result, it is treated as successful.
2. If there was no result, the whole WU is treated as an error.
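In other words, the logic is roughly like this (a made-up Python sketch of the idea only; the actual in-VM scripts are different, and every name and number here is invented, though the assumed ~10-minute limit matches the "614s" in the log above):

import sys
import time

IDLE_LIMIT_S = 600  # assumed idle limit; invented for this sketch

def poll_condor_for_job():
    # Stub standing in for Condor asking WMAgent for work; returns None
    # to simulate the empty (or unreachable) job queue described above.
    return None

def watchdog():
    results_delivered = 0
    idle_since = time.time()
    while time.time() - idle_since < IDLE_LIMIT_S:
        job = poll_condor_for_job()
        if job is not None:
            # Run the job, upload the result, and reset the idle timer.
            results_delivered += 1
            idle_since = time.time()
        time.sleep(30)
    # Idle limit reached without a new job: the outcome depends on history.
    if results_delivered > 0:
        sys.exit(0)   # case 1: at least one result delivered -> success
    sys.exit(206)     # case 2: nothing delivered -> the whole WU errors out

if __name__ == "__main__":
    watchdog()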
Tuna Ertemalp
Joined: 21 Apr 15
Posts: 15
Credit: 106,597
RAC: 0
Message 4692 - Posted: 23 Feb 2017, 9:47:15 UTC - in response to Message 4691.  
Last modified: 23 Feb 2017, 9:47:26 UTC

computezrmle wrote:
The dev-project and the non-dev-project send out different versions of the VBox VM.

So, this means that one shouldn't be running lhc@home and lhcathome-dev on the same host. Am I understanding this correctly?

Thanks
Tuna
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 4693 - Posted: 23 Feb 2017, 10:07:02 UTC - in response to Message 4692.  

... Am I understanding this correctly? ...

No.
Run both on the same host if it has enough resources.

I just wanted to point out that both projects share the same job source and therefore show the same error once that job source is empty.
The result was those "206 (0x000000CE) EXIT_INIT_FAILURE" errors in both projects.
Tuna Ertemalp
Joined: 21 Apr 15
Posts: 15
Credit: 106,597
RAC: 0
Message 4696 - Posted: 23 Feb 2017, 16:30:18 UTC - in response to Message 4693.  

Sorry, but I am not following. My apologies if I am missing something obvious...

As I explained:

- the non-dev project(s) are working just fine with almost no problem: almost ALL of the tasks are a success
- the dev project is the ONLY one that is returning errors: almost ALL of the tasks are an error

And, the machines have all the resources they would need: RAM/HD/etc.

And, both the dev and the non-dev projects seem to have plenty of jobs to send out, per the server status pages: currently 274 jobs at LHC@Home, of which 86 are CMS, and 352 jobs at lhcathome-dev, of which 50 are CMS. So there shouldn't be any problem getting a job from CERN's job queue, and therefore (as far as I understand) no reason for my VMs to return an error for lack of a job the way you describe.

Even if these VMs were catching the CERN job servers empty, or failing to connect to them, why would that ALWAYS happen against lhcathome-dev and almost never against lhc@home?

Tuna
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 4699 - Posted: 23 Feb 2017, 17:55:23 UTC - in response to Message 4696.  

OK, let's start from scratch and focus on the CMS subproject on lhcathomedev.cern.ch.

You have 6 hosts connected, and they show lots of failed WUs with either "-1073740791 (0xC0000409) STATUS_STACK_BUFFER_OVERRUN" or "206 (0x000000CE) EXIT_INIT_FAILURE".

Unfortunately I can't help with the STATUS_STACK_BUFFER_OVERRUN, but Crystal Pellet already made a comment, and I would recheck the settings on the affected hosts.

Regarding the EXIT_INIT_FAILURE:
It should be solved now, as this error was caused by the project and not by your local settings.
Tuna Ertemalp
Joined: 21 Apr 15
Posts: 15
Credit: 106,597
RAC: 0
Message 4700 - Posted: 23 Feb 2017, 19:10:52 UTC - in response to Message 4699.  

computezrmle wrote:
OK, let's start from scratch and focus on the CMS subproject on lhcathomedev.cern.ch.

Thanks!

computezrmle wrote:
Unfortunately I can't help with the STATUS_STACK_BUFFER_OVERRUN, but Crystal Pellet already made a comment, and I would recheck the settings on the affected hosts.

The recommendation was to turn off Hyper-V, but Hyper-V was already turned off on all of my hosts. Besides, again, no other VM-based project (ATLAS, RNA, Duchamp, LHC@Home) is having this problem on the very same hosts; only CMS from lhcathome-dev is.

computezrmle wrote:
Regarding the EXIT_INIT_FAILURE:
It should be solved now, as this error was caused by the project and not by your local settings.

OK, so there is nothing for me to do on my hosts, then. I guess I'll set my hosts to No New Tasks for lhcathome-dev, wait a few days/weeks until the current CMS jobs in the lhcathome-dev queue have hopefully all gone through to others, and hope that the new tasks use the same VM as the non-dev lhc@home so I won't see these errors again.

That said, I still find it a little curious that I might be the only one with this problem, on all of my hosts. Maybe other users have hosts dedicated only to the dev project, and therefore don't see this issue?

Thanks for the help!
Tuna