Message boards : Theory Application : Theory v.5.21
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 7032 - Posted: 8 May 2020, 16:15:01 UTC

This new version updates the CVMFS configuration for the VM apps to improve the robustness with respect to temporary network issues. Some additional CVMFS information is shown in the output for debugging purposes.
maeax

Joined: 22 Apr 16
Posts: 659
Credit: 1,719,912
RAC: 3,195
Message 7033 - Posted: 9 May 2020, 4:53:13 UTC - in response to Message 7032.  

Thanks Laurence,
this computer is running the new version with Squid; computezrmle can take a closer look at the tasks:
https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=3765
maeax

Joined: 22 Apr 16
Posts: 659
Credit: 1,719,912
RAC: 3,195
Message 7034 - Posted: 9 May 2020, 7:00:16 UTC
Last modified: 9 May 2020, 7:31:47 UTC

Sorry, my fault, this can be deleted.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 7035 - Posted: 9 May 2020, 15:51:03 UTC

Vbox tasks

A couple of results show that the CVMFS changes don't crash the tasks.
In addition the expected log entries from "cvmfs_config stat" appear in stderr.txt.

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900265
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900376
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900267
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900432
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900384

Let's see if CVMFS gets more robust on saturated internet connections with high latencies.



Native tasks

Output from "cvmfs_config stat" should also appear in stderr.txt but is missing.
I'm not sure if this is already implemented.

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900555
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900357
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 7036 - Posted: 9 May 2020, 19:03:02 UTC
Last modified: 9 May 2020, 19:04:03 UTC

I started loading the vdi update last night, and this morning I see it has finished and they are all ready here.
My ISP is running so slowly that I may wait for it to speed up a little before I try more, and then do one PC at a time.
My high-speed month starts on the 13th. Usually I can still start these up one at a time, but it is running really slowly right now.

I accidentally had two start while I was asleep and, as usual, they failed.

I'll try more today and as always I do speed checks first. (400Kbps right now)

cranky: [ERROR] 'cvmfs_config probe grid.cern.ch' failed.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900339
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 7037 - Posted: 9 May 2020, 20:32:23 UTC

No luck so far, which is why I will only try one at a time right now.
I will check the speed all day, but it may be more likely to work after midnight (if I had a better ISP I would make that dish a bird bath).

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 7038 - Posted: 10 May 2020, 4:44:31 UTC
Last modified: 10 May 2020, 4:47:47 UTC

FINALLY, at 9:20pm, I got 4 of them to start running.

I guess my internet Dish could tell I was watching some BBC on the other Dish and figured I was in London

Maybe I will try to get another 24 cores running after Last of the Summer Wine

Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 7039 - Posted: 10 May 2020, 8:58:19 UTC
Last modified: 10 May 2020, 9:35:07 UTC

No real problems with this Theory version.
I had some errors, but they were due to version 6.1.6 of VirtualBox.
Now and then an open VirtualBox Manager crashes when Theory VMs are cleaned up at the end of their runtime, and sometimes it drags other running VMs down with it.
I downgraded to VirtualBox version 6.0.20. So far the Manager seems to run more stably.

Theory tasks: https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=37&offset=0&show_names=0&state=0&appid=4
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 7040 - Posted: 10 May 2020, 9:10:35 UTC
Last modified: 10 May 2020, 9:12:27 UTC

Just finished 2 Valids and two more may be getting close (4.5 hours so far)

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900409
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900390

2am here but if these two finish soon I will start up 24 more
(VB Version: 6.1.4)
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 7041 - Posted: 10 May 2020, 10:39:41 UTC

Well, those 4 tasks I got started about 5.5 hours ago finished Valid.
BUT even at 3:30am I can't get one to start, and I'd rather not stay up trying to get one past this point. I may try again in my morning, or just wait until one second after midnight on the 12th, when the crooks give me high speed until they throttle down this ____internet that even fails speed tests.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 7042 - Posted: 10 May 2020, 13:10:25 UTC - in response to Message 7041.  
Last modified: 10 May 2020, 13:11:13 UTC

The following job had: [INFO] Container 'runc' finished with status code 1.

pp top-mc 7000 - - pythia6 6.427 373 100000 2

Also the first attempt of this job description in revision 2390 failed.

I still have 16 tasks in progress.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 7043 - Posted: 10 May 2020, 18:19:42 UTC

Let's compare some numbers.

Over at the production project I ran a typical Theory Vbox task:
[INFO] ===> [runRivet] Sun May 10 08:11:39 UTC 2020 [boinc pp jets 7000 10 - pythia8 8.301 tune-2m 100000 2]

To allow some estimations I'd like to define the following trigger point:
Console ALT-F2 shows that the task is starting event calculation.


When the trigger point is reached, measure
1. the time since the VM has been started: 3 min 30 s
2. the downloaded data since the VM has been started: 384 MB


On my system (1.) includes (roughly):
- 30 s download time
- 3 min basic VM setup and time to compile rivetvm/pythia


Estimation 1: Download times (best case)

To get 384 MB downloaded (without a local proxy) it will take:
31 s via a 100 Mbit/s connection
62 s via a 50 Mbit/s connection
310 s (5 min 10 s) via a 10 Mbit/s connection
775 s (12 min 55 s) via a 4 Mbit/s connection
1550 s (25 min 50 s) via a 2 Mbit/s connection
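
The list above follows from size x 8 / bandwidth. Here is a minimal Python sketch of that arithmetic, assuming 1 MB = 8 Mbit and no protocol overhead; the function name is only illustrative and not part of any project tool. The exact values come out slightly lower than the listed ones (for example 768 s instead of 775 s at 4 Mbit/s), presumably because the list scales the rounded 31 s figure.

```python
def download_seconds(size_mb: float, bandwidth_mbit_s: float) -> float:
    """Best-case transfer time: megabytes -> megabits, divided by the line rate."""
    return size_mb * 8 / bandwidth_mbit_s


if __name__ == "__main__":
    # 384 MB is the amount measured in the post above.
    for rate in (100, 50, 10, 4, 2):
        secs = download_seconds(384, rate)
        minutes, seconds = divmod(round(secs), 60)
        print(f"{rate:>3} Mbit/s: {secs:7.1f} s ({minutes} min {seconds} s)")
```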



Estimation 2: Running many tasks concurrently

For the time being the (production-)server stats show an average Theory task runtime of 2.94 h (~10600 s).
To ensure each task gets full download speed during its startup phase, divide 10600 by the download time per task.
Then round down to the nearest integer. The result is the number of tasks that can run concurrently at each bandwidth:
100 Mbit/s: 341
50 Mbit/s: 170
10 Mbit/s: 34
4 Mbit/s: 13
2 Mbit/s: 6
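
The division described above can be reproduced with a short Python sketch. It assumes the download times per task exactly as listed in Estimation 1 and the ~10600 s average runtime; the names are illustrative only.

```python
import math

AVG_RUNTIME_S = 10600  # average Theory task runtime quoted above (~2.94 h)

# Download time per task in seconds, keyed by bandwidth in Mbit/s (from Estimation 1)
DOWNLOAD_S = {100: 31, 50: 62, 10: 310, 4: 775, 2: 1550}


def max_concurrent_tasks(avg_runtime_s: float, download_s: float) -> int:
    """How many startup downloads fit back to back into one average task runtime."""
    return math.floor(avg_runtime_s / download_s)


def concurrency_table() -> dict:
    """Map each bandwidth to the rounded-down number of concurrent tasks."""
    return {rate: max_concurrent_tasks(AVG_RUNTIME_S, secs)
            for rate, secs in DOWNLOAD_S.items()}


if __name__ == "__main__":
    for rate, tasks in concurrency_table().items():
        print(f"{rate:>3} Mbit/s: {tasks}")
```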



Conclusion

To get the setup of a single Theory task done within a reasonable time (say, ~15 min), the internet download bandwidth available to the task should not be less than 4 Mbit/s.
To avoid saturating such a connection, no more than 13 tasks should run concurrently.
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 7044 - Posted: 10 May 2020, 19:21:31 UTC

The problem I always have is not the number of tasks that I can run at the same time but just getting ONE to start running.
When I have enough speed to start any of them I can run 40 of these at the same time with no problems with the internet speed.

As long as I have better than 1.5Mbps d/l speed I can start 2 or 3 at a time. After 5 minutes of running I suspend them, then do the same thing again until I have all the cores started and suspended. In fact I can do that with more than 60 tasks. Once that is done I just restart ALL of them and they all run Valid. That way I know they will start running without me wondering whether they will Fail or not.

It always works that way for me so it is not internet speed dependent after they get past [INFO] ===> [runRivet] Time/date XXX [boinc pp jets 7000 10 - event generatorXXX

With a satellite isp the d/l is what slows down and the u/l stays fairly fast all the time.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 7045 - Posted: 11 May 2020, 9:54:02 UTC - in response to Message 7044.  

This tells us that the VM's CVMFS is very robust.

What you define is a new trigger point (TP1):
[INFO] ===> [runRivet] Time/date XXX [boinc pp jets 7000 10 - event generatorXXX appears in the logfile.

Do you also watch ALT-F2 to get the trigger point I mentioned (TP2)?
If you resume just 1 single task and check the runtime between TP1 and TP2 I would expect it to be around 1/2 h (may vary depending on the type of the event generator) if your d/l bandwidth is 1.5 Mbit/s.

The more tasks you resume concurrently the more they share the d/l bandwidth during this phase and I would expect an increasing runtime until TP2 is reached.

After TP2 the tasks usually don't need much internet bandwidth.
They just refresh the CVMFS catalogs every 4 min.
Each refresh downloads no more than a few kB.
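
As a rough illustration of the expectation for the phase between TP1 and TP2, here is a Python sketch. It assumes the 384 MB measured in the earlier example task, an equal split of the d/l bandwidth between resumed tasks, and no overhead; real tasks will differ depending on the event generator, and the function name is only illustrative.

```python
def seconds_to_tp2(n_tasks: int, bandwidth_mbit_s: float, size_mb: float = 384.0) -> float:
    """Estimated download time per task when n_tasks share the line equally."""
    per_task_mbit_s = bandwidth_mbit_s / n_tasks
    return size_mb * 8 / per_task_mbit_s


if __name__ == "__main__":
    # 1.5 Mbit/s is the d/l bandwidth mentioned in this thread.
    for n in (1, 2, 3):
        print(f"{n} task(s) at 1.5 Mbit/s: {seconds_to_tp2(n, 1.5) / 60:.0f} min each")
```

With a single task this gives about 34 min, which matches the "around 1/2 h" estimate above.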
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 7046 - Posted: 11 May 2020, 13:39:07 UTC - in response to Message 7045.  

"Do you also watch ALT-F2 to get the trigger point I mentioned (TP2)?"

What would you define as TP2?
In the running.log I see 3 timestamps.

===> [runRivet] Sun May 10 16:31:05 UTC 2020 [boinc pp ttbar 7000 - - herwig7 7.1.5 default 100000 0]

Setting environment...
.
.
.
===> [rungen] Sun May 10 16:35:21 UTC 2020 [boinc pp ttbar 7000 - - herwig7 7.1.5 default 100000 0 /shared/tmp/tmp.GOtRsRrFCE/generator.hepmc]

Setting environment for herwig7 7.1.5 ...
.
.
.
>>>> Sun May 10 16:41:11 2020                                             <<<<
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

>> Herwig 7.1.5 / ThePEG 2.1.5

Rivet.Analysis.Handler: INFO  Using named weights
0 events processed
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 7047 - Posted: 11 May 2020, 13:43:34 UTC

Not sure whether this is introduced in this version, but there are no longer 'nice figures' shown when using 'Show graphics' in BOINC Manager.
The running.log is still reachable that way.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 7048 - Posted: 11 May 2020, 14:16:45 UTC - in response to Message 7046.  

TP1: the following line appears on ALT-F1:
21:05:26 CEST +02:00 2020-05-06: cranky-0.0.31: [INFO] ===> [runRivet] Wed May  6 19:05:24 UTC 2020 [boinc pp jets 7000 250 - pythia8 8.301 dire-default 100000 0]

Now the task knows what software is required, and downloading and compiling start.


From the same example task, TP2 is the point at which setup and compilation of the apps have finished.
It can be seen at ALT-F2:
0 events processed
dumping histograms...
.
.
.
100 events processed
dumping histograms...



It's an estimation rather than an accurate calculation since:
- there are some small downloads before TP1 and TP2
- the amount of downloads varies depending on the task parameters (pythia6, pythia8, sherpa, ...)

In the end, (TP2 - TP1) should be compared in rough ranges like:
- 3-5 min
- 15-20 min
- 35-40 min
- 60-70 min

The times should then be checked against the nominal bandwidth to see if it makes sense and fits together.
Example:
It's not possible to d/l 200 MB within 5 min if you have just 2 Mbit/s.
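
The same check can be written the other way round: which bandwidth would a given download volume and time imply? A minimal Python sketch, assuming 1 MB = 8 Mbit and no overhead (the function name is illustrative):

```python
def required_mbit_s(size_mb: float, seconds: float) -> float:
    """Minimum sustained bandwidth needed to move size_mb within the given time."""
    return size_mb * 8 / seconds


if __name__ == "__main__":
    needed = required_mbit_s(200, 5 * 60)
    print(f"200 MB in 5 min needs at least {needed:.2f} Mbit/s")
```

For the example above this gives about 5.33 Mbit/s, well above a nominal 2 Mbit/s, so such a measurement would not fit together.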
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 7049 - Posted: 11 May 2020, 14:36:39 UTC
Last modified: 11 May 2020, 14:37:21 UTC

OK, thus in my previous example from a running.log it's the 1st and 3rd timestamp. 10 minutes for that herwig.
For a pythia8 in my environment it was 5m7s.
This doesn't count when it's a sherpa.
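
For anyone who wants to take the difference between the 1st and 3rd timestamps without doing it by hand, here is a small Python sketch. It assumes the two line formats seen in the running.log excerpt earlier in this thread (with and without "UTC") and English day and month names; the helper names are hypothetical.

```python
import re
from datetime import datetime

# Matches e.g. "Sun May 10 16:31:05 UTC 2020" and "Sun May 10 16:41:11 2020"
STAMP = re.compile(
    r"([A-Z][a-z]{2} [A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2})(?: UTC)? (\d{4})"
)


def parse_stamp(line: str) -> datetime:
    """Extract the first timestamp found in a running.log line."""
    match = STAMP.search(line)
    if match is None:
        raise ValueError(f"no timestamp in line: {line!r}")
    return datetime.strptime(f"{match.group(1)} {match.group(2)}",
                             "%a %b %d %H:%M:%S %Y")


def seconds_between(first_line: str, later_line: str) -> float:
    """Elapsed seconds between the timestamps of two log lines."""
    return (parse_stamp(later_line) - parse_stamp(first_line)).total_seconds()


tp1_line = "===> [runRivet] Sun May 10 16:31:05 UTC 2020 [boinc pp ttbar 7000 - - herwig7 7.1.5 default 100000 0]"
tp2_line = ">>>> Sun May 10 16:41:11 2020                                             <<<<"

if __name__ == "__main__":
    elapsed = seconds_between(tp1_line, tp2_line)
    print(f"TP1 -> TP2: {int(elapsed // 60)} min {int(elapsed % 60)} s")
```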
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 7050 - Posted: 11 May 2020, 20:30:10 UTC

I have mentioned this many times since I run them all the time since the beginning but I decided to take a snapshot to save for later when I was talking about this with another member.

I watch each one load in the VM Console until they get to running but also watch the first 30 seconds since they can also Fail there as I have posted several times (with snapshots) and watch the clock where they have to get to the CVMFS start page on the Console.

I have watched ALL of the different event generator versions, and there are more than pythia, herwig, and sherpa.

I also check them in the log on the VB Manager

Here is the snapshot of how I do this and this is just one 8-core pc and I do the same thing on all of the hosts I have running here and then restart them all and they never Fail and when the first 8 are finished it goes right to the next 8 tasks when I can get this many ready to run.

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 7051 - Posted: 13 May 2020, 10:22:44 UTC

Good morning (3:15am PDT)

My monthly high-speed period started 3 hours ago and, as usual, it is a pain in the ___ at first, so I ended up with 10 that failed to make it to runRivet. BUT I have 82 loaded, started up and ready to Resume after 8am. That way I don't waste the 2am to 8am "Bonus" I save up to load these while I have that extra 50MB remaining, since my daytime monthly 10MB will be gone in less than 48 hours once these are all running.

Ok its bedtime for me now