Message boards : ATLAS Application : Testing CentOS 7 vbox image
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 36
Message 6663 - Posted: 20 Sep 2019, 17:34:51 UTC

That's it, finally. The last 4 events:

Size ATLAS_hits-file 236120.72 K
Result: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2824181

maeax
Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 819
Message 6664 - Posted: 21 Sep 2019, 6:56:31 UTC - in response to Message 6663.  

Crystal +1,
one of those long-runners is finishing successfully.
The other was suspended for hours tonight and is now running the last 110 collisions.
24:24:49.283821 Changing the VM state from 'RUNNING' to 'SUSPENDING'
24:24:49.415819 PDMR3Suspend: 131 902 163 ns run time
24:24:49.415850 Changing the VM state from 'SUSPENDING' to 'SUSPENDED'
24:24:49.415869 Console: Machine state changed to 'Paused'
28:47:11.407187 Changing the VM state from 'SUSPENDED' to 'RESUMING'
28:47:11.425768 Changing the VM state from 'RESUMING' to 'RUNNING'
28:47:11.425794 Console: Machine state changed to 'Running'
28:47:12.092070 TMR3UtcNow: nsNow=1 569 044 722 483 850 882 nsPrev=1 569 028 975 666 006 203 -> cNsDelta=15 746 817 844 679 (offLag=43 938 155 218 offVirtualSync=49 368 776 404 844 offVirtualSyncGivenUp=49 324 838 249 626, NowAgain=1 569 044 766 422 006 100)
28:47:12.105710 VMMDev: Guest Log: 24:23:24.531842 timesync vgsvcTimeSyncWorker: Radical host time change: 15 746 817 000 000ns (HostNow=1 569 044 722 483 000 000 ns HostLast=1 569 028 975 666 000 000 ns)
28:47:22.107541 VMMDev: Guest Log: 24:23:34.533605 timesync vgsvcTimeSyncWorker: Radical guest time change: 15 731 453 162 000ns (GuestNow=1 569 044 732 497 850 000 ns GuestLast=1 569 029 001 044 688 000 ns fSetTimeLastLoop=true )
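The size of the time jump can be cross-checked against the pause window in the same log. A minimal Python sketch, using only the numbers quoted above; the helper names `parse_grouped_ns` and `uptime_seconds` are made up for illustration:

```python
def parse_grouped_ns(text: str) -> int:
    """Turn a space-grouped number such as '15 746 817 844 679' into an int."""
    return int(text.replace(" ", ""))


def uptime_seconds(stamp: str) -> float:
    """Convert an 'hours:minutes:seconds' VM uptime stamp to seconds."""
    hours, minutes, seconds = stamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Values copied from the TMR3UtcNow line.
ns_now = parse_grouped_ns("1 569 044 722 483 850 882")
ns_prev = parse_grouped_ns("1 569 028 975 666 006 203")
delta_ns = ns_now - ns_prev

# Stamps copied from the 'Paused' and 'Running' console lines.
paused_s = uptime_seconds("28:47:11.425794") - uptime_seconds("24:24:49.415869")

print(delta_ns)                     # host clock jump in nanoseconds
print(round(delta_ns / 3.6e12, 2))  # the same jump in hours
print(round(paused_s, 2))           # time spent paused, in seconds
```

The difference computed here equals the cNsDelta value in the log, and it is within a few seconds of the pause window, so the "radical time change" is just the guest catching up after the suspension.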

maeax
Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 819
Message 6665 - Posted: 21 Sep 2019, 11:38:37 UTC - in response to Message 6664.  

RDP shows the collision times in UTC (Windows 10 Pro).

maeax
Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 819
Message 6667 - Posted: 22 Sep 2019, 22:28:26 UTC

The second finished after more than 2 days:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2824209
2019-09-22 11:40:53 (10716): Guest Log: HITS file was successfully produced
2019-09-22 11:40:53 (10716): Guest Log: -rw-------. 1 atlas atlas 249017334 Sep 22 09:38 /home/atlas/RunAtlas/HITS.19000509._016046.pool.root.1

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 6741 - Posted: 4 Oct 2019, 13:22:36 UTC

I'm planning to finally release the CentOS 7 image into production next week. From what I can see here most tasks finish ok except for the usual badly configured machines and internal ATLAS errors.

Please shout if you have any last-minute problems or requests!

rbpeake
Joined: 15 Apr 15
Posts: 38
Credit: 227,251
RAC: 0
Message 6742 - Posted: 4 Oct 2019, 13:30:27 UTC

It’s great!
In my case, the ALT-F2 takes a couple of tries to make it work.

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 482
Credit: 394,720
RAC: 0
Message 6744 - Posted: 4 Oct 2019, 15:11:04 UTC - in response to Message 6741.  

I'm planning to finally release the CentOS 7 image into production next week. From what I can see here most tasks finish ok except for the usual badly configured machines and internal ATLAS errors.

Please shout if you have any last-minute problems or requests!

Found this in the stderr.txt of a recently downloaded task:
2019-10-04 16:45:25 (86164): Guest Log: Failed to check CVMFS, check output from cvmfs_config probe:
2019-10-04 16:45:25 (86164): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2019-10-04 16:45:36 (86164): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... OK
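For anyone who wants to scan their own stderr.txt for the same symptom, here is a rough Python sketch. It assumes the probe lines always look like the three quoted above; the regular expression and the function name `probe_results` are illustrative, not part of any ATLAS or CVMFS tool:

```python
import re

# Lines copied from the stderr.txt excerpt above.
stderr_excerpt = """\
2019-10-04 16:45:25 (86164): Guest Log: Failed to check CVMFS, check output from cvmfs_config probe:
2019-10-04 16:45:25 (86164): Guest Log: Probing /cvmfs/atlas.cern.ch... Failed!
2019-10-04 16:45:36 (86164): Guest Log: Probing /cvmfs/atlas-condb.cern.ch... OK
"""

# Assumed line shape: "Probing <repository>... OK" or "Probing <repository>... Failed!"
PROBE_RE = re.compile(r"Probing (/cvmfs/\S+?)\.\.\. (OK|Failed!)")


def probe_results(text: str) -> dict:
    """Map each probed repository path to True (OK) or False (Failed!)."""
    return {m.group(1): m.group(2) == "OK" for m in PROBE_RE.finditer(text)}


results = probe_results(stderr_excerpt)
failed = [repo for repo, ok in results.items() if not ok]
print(failed)
```

On the excerpt above this reports only /cvmfs/atlas.cern.ch as failed.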


Despite the error, this appears a few lines later:
2019-10-04 16:48:10 (86164): Guest Log:  *** Starting ATLAS job. (PandaID=4495700049 taskID=19056301) ***


No top output at tty3 (or at any other tty).

The task requested lots of files from lcgft-atlas.gridpp.rl.ac.uk, which usually indicates it had started correctly, but after 25 min there's still no finished event reported at tty2.

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 482
Credit: 394,720
RAC: 0
Message 6746 - Posted: 4 Oct 2019, 20:04:11 UTC - in response to Message 6744.  

... there's still no finished event reported at tty2.

Now the progress log appears at tty2. The 1st event took more than 1400 s to finish.
Still no top output.

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 6748 - Posted: 8 Oct 2019, 13:49:20 UTC

I just made version 0.87 which is the final(!) test before deploying in production. This one takes the bootstrap script from CVMFS instead of my personal web site. Assuming there are no problems I'll use this image in production.

PS The top on console 3 works for me but each refresh slowly scrolls up the screen, which is a bit annoying. I'm not sure exactly how to fix this.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 36
Message 6749 - Posted: 8 Oct 2019, 14:12:49 UTC - in response to Message 6748.  
Last modified: 8 Oct 2019, 14:15:14 UTC

Assuming there are no problems I'll use this image in production.
Please change the <rsc_disk_bound> of 8000000000.000000 to 10000000000.000000 before going into production. See my message

I got my first event after 1932 seconds on tty2

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 6750 - Posted: 8 Oct 2019, 15:11:45 UTC - in response to Message 6749.  
Last modified: 8 Oct 2019, 15:12:04 UTC

Assuming there are no problems I'll use this image in production.
Please change the rsc_disk_bound of 8000000000.000000 to 10000000000.000000 before going into production.


Thanks for the reminder. I've done that right now, because the disk limit is set when WU are submitted, so we need all the unsent WU in the queue to have the new value before the new app version is released.

maeax
Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 819
Message 6751 - Posted: 9 Oct 2019, 4:57:33 UTC
Last modified: 9 Oct 2019, 5:06:53 UTC

The new .vdi (0.87) has a download size of 1.07 GByte on Windows. 2 min download time :-).
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2829147
The vboxwrapper is 26198ab7.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 36
Message 6752 - Posted: 9 Oct 2019, 6:01:38 UTC - in response to Message 6750.  

Assuming there are no problems I'll use this image in production.
Please change the rsc_disk_bound of 8000000000.000000 to 10000000000.000000 before going into production.


Thanks for the reminder. I've done that right now, because the disk limit is set when WU are submitted, so we need all the unsent WU in the queue to have the new value before the new app version is released.

Thanks, that was necessary. I suspended a task with almost 60% done (118 events) and the slot's total space on disk was 8,052,105,216 bytes.
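To make the margin concrete, a small Python sketch with the three byte counts from this thread. The function name `headroom` is made up for illustration, and how the BOINC client actually measures slot usage is not shown here:

```python
OLD_BOUND = 8_000_000_000    # rsc_disk_bound before the change, in bytes
NEW_BOUND = 10_000_000_000   # rsc_disk_bound after the change, in bytes
slot_usage = 8_052_105_216   # slot size reported for the suspended task, in bytes


def headroom(bound: int, used: int) -> int:
    """Bytes left under the bound; a negative value means the bound is exceeded."""
    return bound - used


print(headroom(OLD_BOUND, slot_usage))  # negative: over the old bound
print(headroom(NEW_BOUND, slot_usage))  # positive: fits under the new bound
```

With the old value the suspended task was about 52 MB over the bound; with the new value it has roughly 1.9 GB to spare.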

Taking almost 29 seconds on an idle system:
15:21:11.076331 Console: Machine state changed to 'Saving'
15:21:39.928831 Console: Machine state changed to 'Saved'
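The save time can be read straight off the two stamps. A quick Python sketch, assuming both stamps are below 24 hours (`%H` would reject uptime values such as 28:47:11 seen earlier in the thread):

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"

# Stamps copied from the 'Saving' and 'Saved' console lines above.
saving = datetime.strptime("15:21:11.076331", FMT)
saved = datetime.strptime("15:21:39.928831", FMT)

elapsed = (saved - saving).total_seconds()
print(elapsed)  # seconds between 'Saving' and 'Saved'
```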

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 36
Message 6764 - Posted: 17 Oct 2019, 19:23:11 UTC
Last modified: 17 Oct 2019, 19:24:04 UTC

Something wrong with the server?

lhcathome-dev 17 Oct 21:18:52 Giving up on download of 7OBLDmFwUbvnShfckohDCDFpABFKDmABFKDmMX2WDmABFKDmo2sdXm_EVNT.18605762._000224.pool.root.1: permanent HTTP error
lhcathome-dev 17 Oct 21:20:58 Giving up on download of ZEIODmuL3ZvnShfckohDCDFpABFKDmABFKDmDoiUDmABFKDmZy8cum_EVNT.18605557._000447.pool.root.1: permanent HTTP error
lhcathome-dev 17 Oct 21:23:02 Giving up on download of RFBLDmIiXcvnShfckohDCDFpABFKDmABFKDmxbeYDmABFKDm6X4tgn_EVNT.18605806._000117.pool.root.1: permanent HTTP error
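If the event log gets long, the affected input files can be pulled out with a few lines of Python. This is only a sketch: it assumes every failure line has the same shape as the three above, and splitting at the first underscore to get the EVNT name is a guess about the naming scheme, not something the project documents here:

```python
import re

# Lines copied from the event log excerpt above.
event_log = """\
lhcathome-dev 17 Oct 21:18:52 Giving up on download of 7OBLDmFwUbvnShfckohDCDFpABFKDmABFKDmMX2WDmABFKDmo2sdXm_EVNT.18605762._000224.pool.root.1: permanent HTTP error
lhcathome-dev 17 Oct 21:20:58 Giving up on download of ZEIODmuL3ZvnShfckohDCDFpABFKDmABFKDmDoiUDmABFKDmZy8cum_EVNT.18605557._000447.pool.root.1: permanent HTTP error
lhcathome-dev 17 Oct 21:23:02 Giving up on download of RFBLDmIiXcvnShfckohDCDFpABFKDmABFKDmxbeYDmABFKDm6X4tgn_EVNT.18605806._000117.pool.root.1: permanent HTTP error
"""

FAILED_RE = re.compile(r"Giving up on download of (\S+): permanent HTTP error")

# Full file names as they appear in the log.
failed_files = FAILED_RE.findall(event_log)

# Assumed convention: everything after the first underscore is the EVNT file name.
evnt_names = [name.split("_", 1)[1] for name in failed_files]

for name in evnt_names:
    print(name)
```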

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 36
Message 6765 - Posted: 18 Oct 2019, 9:05:55 UTC
Last modified: 18 Oct 2019, 9:06:32 UTC

14 hours later:

lhcathome-dev 18 Oct 11:04:34 Giving up on download of QcoMDmDrAavnShfckohDCDFpABFKDmABFKDmRbuUDmABFKDmflLBMo_EVNT.18605557._000447.pool.root.1: permanent HTTP error

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 6768 - Posted: 21 Oct 2019, 13:38:25 UTC

I think the tasks remaining here were already cancelled upstream but not cancelled properly in BOINC, so that's why the input files were deleted on the server. I have manually cancelled all these tasks.

Today I re-activated older versions of the vbox apps to debug the new "top" console that caused problems on the production project last week. I also submitted some test WU but beware these WU are likely to hang forever.

maeax
Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 819
Message 6769 - Posted: 21 Oct 2019, 16:22:52 UTC

This task:
https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=1946347
is running in VirtualBox 6.0.12. Over RDP, ALT+F2 shows localhost login: ^[[[B and ALT+F3 shows ^[[[C
The ATLAS version is 0.86.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 36
Message 6770 - Posted: 21 Oct 2019, 16:29:41 UTC

A new task downloaded the v0.84 vdi, but the application running is v0.86.
The consoles don't show anything informative. The task is using ~22% CPU; it should be 50% when 4 athenas are running.

Aborted the task https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2831883

David Cameron
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 6771 - Posted: 22 Oct 2019, 10:21:25 UTC

Now it looks like the problems are fixed so new tasks should work. Please abort any tasks you downloaded up to now since they will never finish.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 859,751
RAC: 36
Message 6772 - Posted: 22 Oct 2019, 13:52:26 UTC - in response to Message 6771.  
Last modified: 22 Oct 2019, 14:28:06 UTC

I got one running - No Unsent left. Console ALT-F3 ('top') is working. I'm able to give input to show tasks e.g. from user 'atlas' only.
4 athenas started after ~10 minutes uptime.
For Console ALT-F2 I have to wait another half an hour. Guessing the return of the task will be tomorrow afternoon.

Edit: The runtime of the events from PandaID=4002876565 taskID=000649-2 seems to be a bit shorter, so return will be earlier.
It looks like this was 'only' a test task with 10 events.
©2024 CERN