Message boards : Theory Application : Theory v.5.21
Author | Message |
---|---|
Joined: 12 Sep 14 Posts: 1114 Credit: 339,209 RAC: 43 |
This new version updates the CVMFS configuration for the VM apps to improve the robustness with respect to temporary network issues. Some additional CVMFS information is shown in the output for debugging purposes. |
Joined: 22 Apr 16 Posts: 709 Credit: 2,114,314 RAC: 6,943 |
Thanks Laurence. This computer is running the new version with Squid; Computezrmle can take a closer look at the tasks: https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=3765 |
Joined: 22 Apr 16 Posts: 709 Credit: 2,114,314 RAC: 6,943 |
Sorry, my fault, this can be deleted. |
Joined: 28 Jul 16 Posts: 511 Credit: 400,710 RAC: 160 |
Vbox tasks: A couple of results show that the CVMFS changes don't crash the tasks. In addition, the expected log entries from "cvmfs_config stat" appear in stderr.txt. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900265 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900376 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900267 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900432 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900384 Let's see if CVMFS gets more robust on saturated internet connections with high latencies. Native tasks: Output from "cvmfs_config stat" should also appear in stderr.txt but is missing. I'm not sure if this is already implemented. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900555 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900357 |
Joined: 8 Apr 15 Posts: 793 Credit: 13,515,617 RAC: 10,488 |
The vdi update has finished; I started loading it last night and I see they are all ready this morning. My ISP is running so slow that I may just wait for it to speed up a little before I try more, and do one PC at a time. My high-speed month starts on the 13th; usually I can still start these up one at a time, but it is running really slow right now. I accidentally had two start while I was asleep and, as usual, they failed. I'll try more today and, as always, I'll do speed checks first (400 Kbps right now). cranky: [ERROR] 'cvmfs_config probe grid.cern.ch' failed. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900339 |
Joined: 8 Apr 15 Posts: 793 Credit: 13,515,617 RAC: 10,488 |
No luck so far, which is why I will only try one at a time right now. I will check the speed all day, but it may be more likely to work after midnight. (If I had a better ISP, I would make that dish a bird bath.) |
Joined: 8 Apr 15 Posts: 793 Credit: 13,515,617 RAC: 10,488 |
FINALLY, at 9:20pm, I got 4 of them to start running. I guess my internet dish could tell I was watching some BBC on the other dish and figured I was in London. Maybe I will try to get another 24 cores running after Last of the Summer Wine. |
Joined: 13 Feb 15 Posts: 1207 Credit: 889,924 RAC: 545 |
No real problems with this Theory version. I had some errors, but they were due to version 6.1.6 of VirtualBox. Now and then an open VirtualBox Manager crashes when Theory VMs are cleaned up at the end of their runtime, and sometimes it takes other running VMs down with it. I downgraded to VirtualBox version 6.0.20. So far the Manager seems to run more stably. Theory tasks: https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=37&offset=0&show_names=0&state=0&appid=4 |
Joined: 8 Apr 15 Posts: 793 Credit: 13,515,617 RAC: 10,488 |
Just finished 2 Valids, and two more may be getting close (4.5 hours so far): https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900409 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900390 It's 2am here, but if these two finish soon I will start up 24 more (VB version: 6.1.4). |
Joined: 8 Apr 15 Posts: 793 Credit: 13,515,617 RAC: 10,488 |
Well, those 4 tasks I got to start about 5.5 hours ago came back Valid, BUT even at 3:30am I can't get one to start, and I'd rather not stay up trying to get one beyond this. I may try again in my morning, or just wait until one second after midnight on the 12th, when I get the crooks to give me high-speed until they throttle down this ____internet that even fails speed tests. |
Joined: 13 Feb 15 Posts: 1207 Credit: 889,924 RAC: 545 |
The following job ended with: [INFO] Container 'runc' finished with status code 1. Job: pp top-mc 7000 - - pythia6 6.427 373 100000 2 The first attempt of this job description in revision 2390 also failed. I still have 16 tasks in progress. |
Joined: 28 Jul 16 Posts: 511 Credit: 400,710 RAC: 160 |
Let's compare some numbers. Over at the production project I ran a typical Theory Vbox task: [INFO] ===> [runRivet] Sun May 10 08:11:39 UTC 2020 [boinc pp jets 7000 10 - pythia8 8.301 tune-2m 100000 2] To allow some estimations I'd like to define the following trigger point: Console ALT-F2 shows that the task is starting event calculation. When the trigger point is reached, measure (1) the time since the VM was started: 3 min 30 s; (2) the data downloaded since the VM was started: 384 MB. On my system (1) includes (roughly): 30 s download time; 3 min basic VM setup and time to compile rivetvm/pythia. Estimation 1: Download times (best case). To download 384 MB (without a local proxy) it will take: 31 s via a 100 Mbit/s connection; 62 s via a 50 Mbit/s connection; 310 s (5 min 10 s) via a 10 Mbit/s connection; 775 s (12 min 55 s) via a 4 Mbit/s connection; 1550 s (25 min 50 s) via a 2 Mbit/s connection. Estimation 2: Running many tasks concurrently. For the time being the (production) server stats show an average Theory task runtime of 2.94 h (~10600 s). To ensure each task gets full download speed during its startup phase, divide 10600 by the download time per task, then round down to the nearest integer: 100 Mbit/s: 341; 50 Mbit/s: 170; 10 Mbit/s: 34; 4 Mbit/s: 13; 2 Mbit/s: 6. Conclusion: To get the setup of a single Theory task done within a reasonable time (say, ~15 min), the internet download bandwidth available for the task should not be less than 4 Mbit/s. To avoid saturating such a connection, no more than 13 tasks should run concurrently. |
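The two estimations above can be reproduced with a few lines of Python. This is only a sketch under stated assumptions: 1 MB is taken as 8 Mbit (decimal units), protocol overhead is ignored, and the function names `download_seconds` and `max_concurrent_tasks` are made up for illustration.

```python
import math

def download_seconds(data_mb: float, bandwidth_mbit_s: float) -> float:
    """Best-case transfer time, assuming 1 MB = 8 Mbit and no overhead."""
    return data_mb * 8 / bandwidth_mbit_s

def max_concurrent_tasks(avg_runtime_s: float, download_s: float) -> int:
    """How many startup downloads fit back to back into one average runtime."""
    return math.floor(avg_runtime_s / download_s)

DATA_MB = 384          # data downloaded until event calculation starts (from the post)
AVG_RUNTIME_S = 10600  # ~2.94 h average Theory task runtime (from the post)

for mbit in (100, 50, 10, 4, 2):
    t = download_seconds(DATA_MB, mbit)
    n = max_concurrent_tasks(AVG_RUNTIME_S, t)
    print(f"{mbit:>3} Mbit/s: {t:7.1f} s per task, up to {n} concurrent tasks")
```

The unrounded times come out slightly lower than the figures in the post (768 s instead of 775 s at 4 Mbit/s), because the post scales the rounded 31 s value. The task limits for 10, 4 and 2 Mbit/s are the same (34, 13, 6); for 100 Mbit/s the unrounded result is 345 rather than 341.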
Joined: 8 Apr 15 Posts: 793 Credit: 13,515,617 RAC: 10,488 |
The problem I always have is not the number of tasks that I can run at the same time but just getting ONE to start running. When I have enough speed to start any of them, I can run 40 of these at the same time with no problems with the internet speed. As long as I have better than 1.5 Mbps d/l speed I can start 2 or 3 at a time, suspend them after 5 minutes of running, and repeat until all the cores have been started and suspended; in fact I can do that with more than 60 tasks. Once that is done I just restart ALL of them and they all run Valid, and that way I know they will keep running without me wondering whether they will fail. It always works that way for me, so it is not internet-speed dependent once they get past [INFO] ===> [runRivet] Time/date XXX [boinc pp jets 7000 10 - event generatorXXX With a satellite ISP the d/l is what slows down, while the u/l stays fairly fast all the time. |
Joined: 28 Jul 16 Posts: 511 Credit: 400,710 RAC: 160 |
This tells us that the VM's CVMFS is very robust. What you define is a new trigger point (TP1): [INFO] ===> [runRivet] Time/date XXX [boinc pp jets 7000 10 - event generatorXXX appears in the logfile. Do you also watch ALT-F2 to get the trigger point I mentioned (TP2)? If you resume just a single task and check the runtime between TP1 and TP2, I would expect it to be around 1/2 h (it may vary depending on the type of event generator) if your d/l bandwidth is 1.5 Mbit/s. The more tasks you resume concurrently, the more they share the d/l bandwidth during this phase, and I would expect the runtime until TP2 is reached to increase. After TP2 the tasks usually don't need much internet bandwidth. They just refresh the CVMFS catalogs periodically, every 4 min. Each refresh downloads no more than a few kB. |
Joined: 13 Feb 15 Posts: 1207 Credit: 889,924 RAC: 545 |
"Do you also watch ALT-F2 to get the trigger point I mentioned (TP2)?" What would you define as TP2? In the running.log I see 3 timestamps. ===> [runRivet] Sun May 10 16:31:05 UTC 2020 [boinc pp ttbar 7000 - - herwig7 7.1.5 default 100000 0] Setting environment... . . . ===> [rungen] Sun May 10 16:35:21 UTC 2020 [boinc pp ttbar 7000 - - herwig7 7.1.5 default 100000 0 /shared/tmp/tmp.GOtRsRrFCE/generator.hepmc] Setting environment for herwig7 7.1.5 ... . . . >>>> Sun May 10 16:41:11 2020 <<<< <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >> Herwig 7.1.5 / ThePEG 2.1.5 Rivet.Analysis.Handler: INFO Using named weights 0 events processed |
Joined: 13 Feb 15 Posts: 1207 Credit: 889,924 RAC: 545 |
Not sure whether this was introduced in this version, but 'nice figures' are no longer shown when using 'Show graphics' in BOINC Manager. The running.log is still reachable that way. |
Joined: 28 Jul 16 Posts: 511 Credit: 400,710 RAC: 160 |
TP1: the following line appears on ALT-F1: 21:05:26 CEST +02:00 2020-05-06: cranky-0.0.31: [INFO] ===> [runRivet] Wed May 6 19:05:24 UTC 2020 [boinc pp jets 7000 250 - pythia8 8.301 dire-default 100000 0] Now the task knows what software is required, and d/l and compiling start. For the same example task, TP2 is the point at which setup and compilation of the apps have finished. It can be seen at ALT-F2: 0 events processed dumping histograms... . . . 100 events processed dumping histograms... It's an estimation rather than an accurate calculation since: (a) there are some small downloads before TP1 and TP2; (b) the amount of downloads varies depending on the task parameters (pythia6, pythia8, sherpa, ...). In the end, (TP2 - TP1) should be compared in orders of magnitude such as: 3-5 min; 15-20 min; 35-40 min; 60-70 min. The times should then be checked against the nominal bandwidth to see whether they make sense and fit together. Example: It's not possible to d/l 200 MB within 5 min if you have just 2 Mbit/s. |
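The closing example (200 MB in 5 min on a 2 Mbit/s line) can be turned into a small plausibility check. The sketch below assumes 1 MB = 8 Mbit and no protocol overhead; `min_download_seconds` and `is_plausible` are hypothetical helper names, not part of any Theory or BOINC tooling.

```python
def min_download_seconds(data_mb: float, bandwidth_mbit_s: float) -> float:
    """Lower bound on transfer time at the nominal bandwidth (1 MB = 8 Mbit)."""
    return data_mb * 8 / bandwidth_mbit_s

def is_plausible(data_mb: float, bandwidth_mbit_s: float, observed_s: float) -> bool:
    """True if the observed (TP2 - TP1) is at least the best-case download time."""
    return observed_s >= min_download_seconds(data_mb, bandwidth_mbit_s)

# The example from the post: 200 MB within 5 min on a 2 Mbit/s connection.
print(min_download_seconds(200, 2))   # 800.0 s, i.e. more than 13 min
print(is_plausible(200, 2, 5 * 60))   # False
```

A measured (TP2 - TP1) that fails this check suggests that either the assumed download volume or the nominal bandwidth is off, or that much of the data came from a local cache or proxy.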
Joined: 13 Feb 15 Posts: 1207 Credit: 889,924 RAC: 545 |
OK, thus in my previous example from a running.log it's the 1st and 3rd timestamp. 10 minutes for that herwig. For a pythia8 in my environment it was 5m7s. This doesn't count when it's a sherpa. |
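For anyone who wants to read (TP2 - TP1) off a running.log without a stopwatch, here is a minimal sketch. It assumes the timestamps look like the ones quoted earlier in the thread ("Sun May 10 16:31:05 UTC 2020", with "UTC" sometimes missing) and that both are in the same time zone; the regular expression and the helper names are illustrative only.

```python
import re
from datetime import datetime

MONTHS = {name: i for i, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Weekday, month, day, hh:mm:ss, optional "UTC", year.
STAMP = re.compile(
    r"[A-Z][a-z]{2} ([A-Z][a-z]{2})\s+(\d{1,2}) (\d{2}):(\d{2}):(\d{2})(?: UTC)? (\d{4})")

def parse_stamp(line: str):
    """Return the first timestamp found in a log line, or None."""
    m = STAMP.search(line)
    if m is None:
        return None
    month, day, hh, mm, ss, year = m.groups()
    return datetime(int(year), MONTHS[month], int(day), int(hh), int(mm), int(ss))

def elapsed_seconds(tp1_line: str, tp2_line: str) -> int:
    """Seconds between the timestamps of two log lines."""
    return int((parse_stamp(tp2_line) - parse_stamp(tp1_line)).total_seconds())

tp1 = "===> [runRivet] Sun May 10 16:31:05 UTC 2020 [boinc pp ttbar 7000 - - herwig7 7.1.5 default 100000 0]"
tp2 = ">>>> Sun May 10 16:41:11 2020 <<<<"
print(elapsed_seconds(tp1, tp2))  # 606 s, i.e. the roughly 10 minutes mentioned above
```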
Joined: 8 Apr 15 Posts: 793 Credit: 13,515,617 RAC: 10,488 |
I have mentioned this many times, since I have run these all the time from the beginning, but I decided to take a snapshot to save for later when I was talking about this with another member. I watch each one load in the VM Console until it gets to running, but I also watch the first 30 seconds, since they can also fail there, as I have posted several times (with snapshots), and I watch the clock until they get to the CVMFS start page on the Console. I have watched ALL of the different event generator versions, and there are more than pythia, herwig, and sherpa. I also check them in the log in the VB Manager. Here is the snapshot of how I do this. This is just one 8-core PC, and I do the same thing on all of the hosts I have running here and then restart them all. They never fail, and when the first 8 are finished it goes right to the next 8 tasks, when I can get that many ready to run. ![]() |
Joined: 8 Apr 15 Posts: 793 Credit: 13,515,617 RAC: 10,488 |
Good morning (3:15am PDT). My monthly high-speed ISP allowance started 3 hours ago and, as usual, it is a pain in the ___ at first, so I did end up with 10 that failed to make it to runRivet. BUT I have 82 loaded, started up and ready to Resume after 8am, so I don't waste some of that 2am to 8am "Bonus". I save that up to load these while I have the extra 50MB remaining, since my daytime monthly 10MB will be gone in less than 48 hours once these are all running. OK, it's bedtime for me now. |