Message boards : ATLAS Application : ATLAS vbox v.1.15

David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 154
Credit: 1,352,554
RAC: 84
Message 7668 - Posted: 29 Jul 2022, 9:09:39 UTC

We just released v1.15 which uses a new vboxwrapper version 26205.
computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 7669 - Posted: 29 Jul 2022, 9:41:24 UTC

I got a task and forced a "vbox 4.0" condition to test whether vboxwrapper can resolve it automatically.
It can: the condition was resolved and the task started as usual.
maeax

Joined: 22 Apr 16
Posts: 601
Credit: 1,451,357
RAC: 1,367
Message 7670 - Posted: 29 Jul 2022, 9:47:14 UTC - in response to Message 7668.  

29.07.2022 11:44:29 | lhcathome-dev | No tasks are available for ATLAS Simulation
Windows
computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 7671 - Posted: 29 Jul 2022, 9:48:15 UTC

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3103156
The first task had only 1 event and finished with:
2022-07-29 11:43:05 (59171): Guest Log: No HITS file was produced

Was this intentional?
maeax

Joined: 22 Apr 16
Posts: 601
Credit: 1,451,357
RAC: 1,367
Message 7675 - Posted: 29 Jul 2022, 11:01:22 UTC - in response to Message 7668.  
Last modified: 29 Jul 2022, 11:05:30 UTC

In production, an ATLAS task that hits a confirmed error stops by itself after 7-8 minutes.
But when a task has no input and uses less than 1 minute of CPU time,
it keeps running for hours, and only the volunteer can stop it.
I see a handful of such tasks every day.
Is it possible to fix this in the new version?
Here is an example from last night (5 hours! Two of them started at the same time):
https://lhcathome.cern.ch/lhcathome/result.php?resultid=361685586
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 154
Credit: 1,352,554
RAC: 84
Message 7676 - Posted: 29 Jul 2022, 11:10:07 UTC - in response to Message 7671.  

Not intentional on my part, but a new kind of task with updated ATLAS simulation software was recently added to the set of tasks automatically submitted here. I've asked the experts to look into why these tasks fail.

I have manually submitted a batch of 20 event tasks to keep the queue full.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1147
Credit: 754,546
RAC: 10
Message 7677 - Posted: 29 Jul 2022, 11:17:36 UTC
Last modified: 29 Jul 2022, 11:19:54 UTC

I got an ATLAS task with a 132 MB pool.root input file, so I expected more events to process.
However, it reports a total of 1 event,
and after a short time the job finished without my seeing that single event being processed.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1147
Credit: 754,546
RAC: 10
Message 7679 - Posted: 29 Jul 2022, 11:31:45 UTC - in response to Message 7676.  

I have manually submitted a batch of 20 event tasks to keep the queue full.
Sorry, I fetched 16 of them . . .

BOINC's estimated runtime is 13 minutes 34 seconds, but in fact they will need almost 2 hours each.
computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 7680 - Posted: 29 Jul 2022, 12:30:09 UTC

This task finished with a HITS file:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3103188


Prior to the start of the task I prepared VirtualBox with a dummy vdi having the same UUID as the ATLAS vdi.
The objective was to test whether the new vboxwrapper can deal with those very rare issues.

It can:
2022-07-29 13:14:11 (13060): Disk UUID conflicts with an already existing disk.
Will set a new UUID for 'ATLAS_vbox_1.15_image.vdi'.
The project admin should be informed to do this server side running:
vboxmanage clonemedium <inputfile> <outputfile>
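
A minimal Python sketch of the server-side step suggested in the log above, assuming VirtualBox's VBoxManage is on the PATH. The output filename is hypothetical, and the function only builds the argument list, so nothing runs until the list is passed to something like subprocess.run.

```python
import shlex


def build_clonemedium_command(input_file, output_file, vboxmanage="VBoxManage"):
    """Build the argument list for `VBoxManage clonemedium <inputfile> <outputfile>`.

    Cloning writes a new image file that carries its own, freshly generated UUID.
    """
    return [vboxmanage, "clonemedium", input_file, output_file]


# Hypothetical output name; choose whatever fits the project's naming scheme.
cmd = build_clonemedium_command(
    "ATLAS_vbox_1.15_image.vdi",
    "ATLAS_vbox_1.15_image_clone.vdi",
)
print(shlex.join(cmd))
```

To actually run the clone, the list can be handed to subprocess.run(cmd, check=True) on the machine that holds the image.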



@David
You did it right! The error was intentionally forced!
Richie_unstable

Joined: 31 Aug 21
Posts: 11
Credit: 1,028,758
RAC: 0
Message 7683 - Posted: 29 Jul 2022, 19:17:09 UTC

Not too many tasks available ... Activity is low ... I would like to see more of them.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1147
Credit: 754,546
RAC: 10
Message 7686 - Posted: 30 Jul 2022, 9:15:58 UTC - in response to Message 7668.  

David Cameron wrote:
We just released v1.15 which uses a new vboxwrapper version 26205.
In the file reference of your ATLAS app version description, you still use the tag <open_name> for the vdi-file.

<file_ref>
<file_name>ATLAS_vbox_1.15_image.vdi</file_name>
<open_name>vm_image.vdi</open_name>
</file_ref>


This open_name is not needed for the multi-attach tasks and can be left out.
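
As an illustration of that suggestion, here is a small Python sketch that removes the <open_name> child from a <file_ref> fragment like the one above. The helper name drop_open_name is made up for this example, and how the project actually edits its app version description is not shown in this thread.

```python
import xml.etree.ElementTree as ET


def drop_open_name(file_ref_xml):
    """Return the <file_ref> fragment with any <open_name> child removed."""
    root = ET.fromstring(file_ref_xml)
    for child in root.findall("open_name"):
        root.remove(child)
    return ET.tostring(root, encoding="unicode")


example = """<file_ref>
<file_name>ATLAS_vbox_1.15_image.vdi</file_name>
<open_name>vm_image.vdi</open_name>
</file_ref>"""

print(drop_open_name(example))
```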
maeax

Joined: 22 Apr 16
Posts: 601
Credit: 1,451,357
RAC: 1,367
Message 7687 - Posted: 30 Jul 2022, 9:29:55 UTC - in response to Message 7686.  
Last modified: 30 Jul 2022, 9:31:18 UTC

This is from a production ATLAS task under Windows 11 Pro that ended after 7-8 minutes with a confirmed error. From stderr.txt:
2022-07-30 11:20:42 (17436): Guest Log: *** The last 20 lines of the pilot log: ***
2022-07-30 11:20:42 (17436): Guest Log: ---- Retrieve pilot code ----
2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,676 [wrapper] Using piloturl: local
2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,676 [wrapper] Only supporting pilot3 so pilotbase directory: pilot3
2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,677 [wrapper] piloturl=local so download not needed
2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,678 [wrapper] local tarball pilot3.tar.gz exists OK
2022-07-30 11:20:42 (17436): Guest Log: tar: Skipping to next header
2022-07-30 11:20:42 (17436): Guest Log: gzip: stdin: unexpected end of file
2022-07-30 11:20:42 (17436): Guest Log: tar: Child returned status 1
2022-07-30 11:20:42 (17436): Guest Log: tar: Error is not recoverable: exiting now
2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,688 [wrapper] ERROR: pilot extraction failed for pilot3.tar.gz
2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,689 [wrapper] ERROR: pilot extraction failed for pilot3.tar.gz
2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,690 [wrapper] FATAL: failed to get pilot code
2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,691 [wrapper] FATAL: failed to get pilot code
2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,692 [wrapper] apfmon messages muted
2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,693 [wrapper] ==== wrapper stdout END ====
2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,694 [wrapper] ==== wrapper stderr END ====
2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,695 [wrapper] wrapperfault ec=1, duration=0
2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,696 [wrapper] apfmon messages muted
2022-07-30 11:20:42 (17436): Guest Log: *** Listing of results directory ***
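
For a long stderr.txt it can help to pull out only the failure lines. The following Python sketch assumes the "[wrapper] ERROR:" / "[wrapper] FATAL:" format seen in the log above and drops the repeated copies; the function name and regular expression are illustrative only.

```python
import re

# Matches lines such as "... [wrapper] ERROR: pilot extraction failed for pilot3.tar.gz"
WRAPPER_RE = re.compile(r"\[wrapper\]\s+(ERROR|FATAL):\s+(.*)$")


def wrapper_failures(lines):
    """Return unique (level, message) pairs in order of first appearance."""
    seen = []
    for line in lines:
        match = WRAPPER_RE.search(line)
        if match:
            entry = (match.group(1), match.group(2).strip())
            if entry not in seen:
                seen.append(entry)
    return seen


sample = [
    "2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,688 [wrapper] ERROR: pilot extraction failed for pilot3.tar.gz",
    "2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,689 [wrapper] ERROR: pilot extraction failed for pilot3.tar.gz",
    "2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,690 [wrapper] FATAL: failed to get pilot code",
    "2022-07-30 11:20:42 (17436): Guest Log: 2022-07-30 09:20:42,692 [wrapper] apfmon messages muted",
]

for level, message in wrapper_failures(sample):
    print(level, message)
```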
computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 7689 - Posted: 30 Jul 2022, 10:57:33 UTC - in response to Message 7687.  

This has nothing to do with version 1.15 or vboxwrapper 26205.
The word "wrapper" from the logfile refers to another wrapper deeper in the scripts.
Since it was a task from -prod it should be reported there.
maeax

Joined: 22 Apr 16
Posts: 601
Credit: 1,451,357
RAC: 1,367
Message 7691 - Posted: 30 Jul 2022, 11:04:22 UTC

There are not enough ATLAS tasks in -dev to see such a problem.
Why is there no chance for the ATLAS team to take a deeper look?
boboviz

Joined: 24 Oct 19
Posts: 69
Credit: 182,602
RAC: 52
Message 7693 - Posted: 30 Jul 2022, 13:59:29 UTC - in response to Message 7668.  

We just released v1.15 which uses a new vboxwrapper version 26205.


What are the differences between 26204 (the official version on the BOINC site) and 26205?
computezrmle
Joined: 28 Jul 16
Posts: 404
Credit: 374,791
RAC: 0
Message 7694 - Posted: 30 Jul 2022, 14:29:30 UTC - in response to Message 7693.  

v26205 includes a workaround to avoid errors like this:
VBoxManage.exe: error: Cannot attach medium 'D:\Boinc1\projects\lhcathomedev.cern.ch_lhcathome-dev\ATLAS_vbox_1.14_image.vdi': the media type 'MultiAttach' can only be attached to machines that were created with VirtualBox 4.0 or later
VBoxManage.exe: error: Details: code VBOX_E_INVALID_OBJECT_STATE (0x80bb0007), component SessionMachine, interface IMachine, callee IUnknown
VBoxManage.exe: error: Context: "AttachDevice(Bstr(pszCtl).raw(), port, device, DeviceType_HardDisk, pMedium2Mount)" at line 776 of file VBoxManageStorageController.cpp


See:
https://github.com/BOINC/boinc/pull/4843

We are testing the vboxwrapper pre-release from GitHub with ATLAS/Theory/CMS, and if they remain stable over the weekend we may get new app_versions on -prod shortly.
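
To show what spotting this condition could look like, here is a hedged Python sketch that only checks VBoxManage output for the MultiAttach message quoted above. It is not the logic used in vboxwrapper (see the pull request for that); the function name and the matching rule are assumptions made for this example.

```python
def is_multiattach_creation_error(output):
    """Return True if the text contains the 'created with VirtualBox 4.0 or later' MultiAttach error."""
    text = output.lower()
    return "multiattach" in text and "created with virtualbox 4.0 or later" in text


message = (
    "VBoxManage.exe: error: Cannot attach medium: the media type 'MultiAttach' "
    "can only be attached to machines that were created with VirtualBox 4.0 or later"
)

print(is_multiattach_creation_error(message))
```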
maeax

Joined: 22 Apr 16
Posts: 601
Credit: 1,451,357
RAC: 1,367
Message 7696 - Posted: 30 Jul 2022, 20:47:00 UTC - in response to Message 7675.  

In production, an ATLAS task that hits a confirmed error stops by itself after 7-8 minutes.
But when a task has no input and uses less than 1 minute of CPU time,
it keeps running for hours, and only the volunteer can stop it.
I see a handful of such tasks every day.
Is it possible to fix this in the new version?
Here is an example from last night (5 hours! Two of them started at the same time):
https://lhcathome.cern.ch/lhcathome/result.php?resultid=361685586


This happens when two ATLAS tasks start in the same second on the same PC!
<stderr_txt>
2022-07-30 18:59:49 (16048): Detected: vboxwrapper 26197
2022-07-30 18:59:49 (16048): Detected: BOINC client v7.7
2022-07-30 18:59:49 (16048): Detected: VirtualBox VboxManage Interface (Version: 6.1.36)
2022-07-30 18:59:50 (16048): Successfully copied 'init_data.xml' to the shared directory.
2022-07-30 18:59:51 (16048): Create VM. (boinc_95a7fe58546dd873, slot#5)
2022-07-30 18:59:51 (16048): Setting Memory Size for VM. (10250MB)

2022-07-30 18:59:49 (24000): Detected: vboxwrapper 26197
2022-07-30 18:59:49 (24000): Detected: BOINC client v7.7
2022-07-30 18:59:49 (24000): Detected: VirtualBox VboxManage Interface (Version: 6.1.36)
2022-07-30 18:59:50 (24000): Successfully copied 'init_data.xml' to the shared directory.
2022-07-30 18:59:52 (24000): Create VM. (boinc_e094d3f0813a1289, slot#6)
2022-07-30 18:59:52 (24000): Setting Memory Size for VM. (10250MB)

I discovered this after 2 hours of runtime with less than 1 minute of CPU time for both (using 10 CPUs).
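
The symptom described here (hours of runtime, under a minute of CPU time) can be written as a simple check. This Python sketch is only a heuristic: the 60-second CPU limit comes from the post, while the 30-minute minimum runtime is an assumed threshold chosen to sit well above the 7-8 minutes after which failing tasks stop by themselves.

```python
def looks_stalled(elapsed_seconds, cpu_seconds, min_elapsed=30 * 60, max_cpu=60):
    """Return True if a task has run for a long time while using almost no CPU."""
    return elapsed_seconds >= min_elapsed and cpu_seconds < max_cpu


# Two hours of runtime with 45 seconds of CPU time: matches the reported symptom.
print(looks_stalled(2 * 3600, 45))

# Eight minutes of runtime: too early to call it stalled under the assumed threshold.
print(looks_stalled(8 * 60, 45))
```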
maeax

Joined: 22 Apr 16
Posts: 601
Credit: 1,451,357
RAC: 1,367
Message 7709 - Posted: 2 Aug 2022, 8:01:44 UTC - in response to Message 7691.  

There are not enough ATLAS tasks in -dev to see such a problem.
Why is there no chance for the ATLAS team to take a deeper look?

Is it OK to test this once the new vboxwrapper 26205 is in production?
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1147
Credit: 754,546
RAC: 10
Message 7710 - Posted: 2 Aug 2022, 9:21:42 UTC - in response to Message 7709.  

There are not enough ATLAS tasks in -dev to see such a problem.
Why is there no chance for the ATLAS team to take a deeper look?

Is it OK to test this once the new vboxwrapper 26205 is in production?
It's always OK to test things, especially if something unusual happens.
You have to provide as much information as possible.
What else is running on the machine? Do you use a second (Linux) VM on a (Windows) machine and run BOINC from there?
Of special interest: What do you see in the VM console with ALT-F1?
I very often see "Checking CVMFS ....", but without a response.
Do you start several VMs at once?
maeax

Joined: 22 Apr 16
Posts: 601
Credit: 1,451,357
RAC: 1,367
Message 7711 - Posted: 2 Aug 2022, 9:58:28 UTC - in response to Message 7710.  
Last modified: 2 Aug 2022, 10:00:20 UTC

These two PCs are AMD Ryzen Threadripper PRO 3995WX 64-core machines.
They run only 6 ATLAS tasks from production, with 10 CPUs per task.
One of these two PCs runs two ATLAS tasks from -dev if available (none in the last four days!).
They have a Squid proxy available from a Win10 workstation.
Each PC runs 100 ATLAS tasks per day.
These never-ending tasks are only a handful per day.
You can see this in production for these two PCs.

ALT+F1, ALT+F2 or ALT+F3 is not available in VirtualBox.

https://lhcathome.cern.ch/lhcathome/top_hosts.php


©2023 CERN