Thread 'Theory v.5.21'

Author	Message
Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 7032 - Posted: 8 May 2020, 16:15:01 UTC This new version updates the CVMFS configuration for the VM apps to improve the robustness with respect to temporary network issues. Some additional CVMFS information is shown in the output for debugging purposes. ID: 7032 · Rating: 0 · rate: / Reply Quote

maeax Send message Joined: 22 Apr 16 Posts: 806 Credit: 4,294,504 RAC: 1,557	Message 7033 - Posted: 9 May 2020, 4:53:13 UTC - in response to Message 7032. Thanks Laurence, this Computer is running this new Version with squid, Computezrmle can look deeper in the task: https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=3765 ID: 7033 · Rating: 0 · rate: / Reply Quote

maeax Send message Joined: 22 Apr 16 Posts: 806 Credit: 4,294,504 RAC: 1,557	Message 7034 - Posted: 9 May 2020, 7:00:16 UTC Last modified: 9 May 2020, 7:31:47 UTC Sorry, my fault, this can be deleted. ID: 7034 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 7035 - Posted: 9 May 2020, 15:51:03 UTC
Vbox tasks
A couple of results show that the CVMFS changes don't crash the tasks. In addition the expected log entries from "cvmfs_config stat" appear in stderr.txt.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900265
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900376
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900267
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900432
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900384
Let's see if CVMFS gets more robust on saturated internet connections with high latencies.
Native tasks
Output from "cvmfs_config stat" should also appear in stderr.txt but is missing. I'm not sure if this is already implemented.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900555
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900357
ID: 7035 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1020 Credit: 18,602,940 RAC: 20,567	Message 7036 - Posted: 9 May 2020, 19:03:02 UTC Last modified: 9 May 2020, 19:04:03 UTC I have the vdi update finished since I started loading it last night and I see they are all ready this morning here. My ISP is running so slow that I may just watch for it to speed up a little before I try more and do one PC at a time. My high-speed month starts on the 13th and usually I can still start these up one at a time, but it is running really slow right now. I accidentally had two start while I was asleep and as usual they Failed. I'll try more today and as always I do speed checks first. (400Kbps right now) cranky: [ERROR] 'cvmfs_config probe grid.cern.ch' failed. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900339 ID: 7036 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1020 Credit: 18,602,940 RAC: 20,567	Message 7037 - Posted: 9 May 2020, 20:32:23 UTC No luck so far.....and why I will only try one at a time right now. I will check the speed all day but it may be more likely after midnight (if I had a better isp company I would make that Dish a bird bath) ID: 7037 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1020 Credit: 18,602,940 RAC: 20,567	Message 7038 - Posted: 10 May 2020, 4:44:31 UTC Last modified: 10 May 2020, 4:47:47 UTC FINALLY at 9:20pm I get 4 of them to start running I guess my internet Dish could tell I was watching some BBC on the other Dish and figured I was in London Maybe I will try to get another 24 cores running after Last of the Summer Wine ID: 7038 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,477 RAC: 40	Message 7039 - Posted: 10 May 2020, 8:58:19 UTC Last modified: 10 May 2020, 9:35:07 UTC No real problems with this Theory version. I had some errors, but they were due to version 6.1.6 of VirtualBox. Now and then an open VirtualBox Manager is crashing when Theory VM's should be cleaned at the end of their runtime and sometimes it's dragging along other running VM's. I downgraded to VirtualBox version 6.0.20. Until now the Manager seems to run more stable. Theory tasks: https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=37&offset=0&show_names=0&state=0&appid=4 ID: 7039 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1020 Credit: 18,602,940 RAC: 20,567	Message 7040 - Posted: 10 May 2020, 9:10:35 UTC Last modified: 10 May 2020, 9:12:27 UTC Just finished 2 Valids and two more may be getting close (4.5 hours so far) https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900409 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900390 2am here but if these two finish soon I will start up 24 more (VB Version: 6.1.4) ID: 7040 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1020 Credit: 18,602,940 RAC: 20,567	Message 7041 - Posted: 10 May 2020, 10:39:41 UTC Well those 4 tasks I got to start about 5.5 hours ago worked Valid BUT even at 3:30am I can't get one to start and rather not stay up trying for one to get beyond this.......may try in my morning again or just have to wait until one second after midnight on the 12th when I get the crooks to give me high-speed until they throttle down this ____internet that even fails speed tests. ID: 7041 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,477 RAC: 40	Message 7042 - Posted: 10 May 2020, 13:10:25 UTC - in response to Message 7041. Last modified: 10 May 2020, 13:11:13 UTC Following job had: [INFO] Container 'runc' finished with status code 1. pp top-mc 7000 - - pythia6 6.427 373 100000 2 Also the first attempt of this job description in revision 2390 failed. I've still 16 tasks in progress. ID: 7042 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 7043 - Posted: 10 May 2020, 18:19:42 UTC
Let's compare some numbers.
Over at the production project I ran a typical Theory Vbox task:
[INFO] ===> [runRivet] Sun May 10 08:11:39 UTC 2020 [boinc pp jets 7000 10 - pythia8 8.301 tune-2m 100000 2]
To allow some estimations I'd like to define the following trigger point: Console ALT-F2 shows that the task is starting event calculation.
When the trigger point is reached, measure
1. the time since the VM has been started: 3 min 30 s
2. the downloaded data since the VM has been started: 384 MB
On my system (1.) includes (roughly):
- 30 s download time
- 3 min basic VM setup and time to compile rivetvm/pythia
Estimation 1: Download times (best case)
Downloading 384 MB (without a local proxy) will take:
31 s via a 100 Mbit/s connection
62 s via a 50 Mbit/s connection
310 s (5 min 10 s) via a 10 Mbit/s connection
775 s (12 min 55 s) via a 4 Mbit/s connection
1550 s (25 min 50 s) via a 2 Mbit/s connection
Estimation 2: Running many tasks concurrently
For the time being the (production-)server stats show an average Theory task runtime of 2.94 h (~10600 s).
To ensure each task gets full download speed during its startup phase, divide 10600 by the download time per task, then round down to the nearest integer.
100 Mbit/s: 341
50 Mbit/s: 170
10 Mbit/s: 34
4 Mbit/s: 13
2 Mbit/s: 6
Conclusion
To get the setup of a single Theory task done within a reasonable time (say, ~15 min), the internet download bandwidth available to the task should not be less than 4 Mbit/s. To avoid saturating this connection, no more than 13 tasks should run concurrently.
ID: 7043 · Rating: 0 · rate: / Reply Quote
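The two estimations in the post above can be reproduced with a short Python sketch. The function names are made up for illustration. The sketch assumes 1 MB = 8 Mbit and no protocol overhead, so the exact results differ slightly from the rounded figures in the post (30.72 s instead of 31 s at 100 Mbit/s, and 345 instead of 341 tasks), while the 4 Mbit/s result of 13 tasks is unchanged.

```python
import math

PAYLOAD_MB = 384        # data downloaded until the trigger point (from the post)
AVG_RUNTIME_S = 10600   # average Theory task runtime, ~2.94 h (from the post)


def download_seconds(size_mb, mbit_per_s):
    """Best-case download time: megabytes -> megabits, divided by the line rate."""
    return size_mb * 8 / mbit_per_s


def max_concurrent_tasks(avg_runtime_s, download_s):
    """Tasks that can each get the full line rate during startup, rounded down."""
    return math.floor(avg_runtime_s / download_s)


for bandwidth in (100, 50, 10, 4, 2):
    t = download_seconds(PAYLOAD_MB, bandwidth)
    n = max_concurrent_tasks(AVG_RUNTIME_S, t)
    print(f"{bandwidth:>3} Mbit/s: {t:7.1f} s per task, up to {n} tasks")
```

Changing PAYLOAD_MB or AVG_RUNTIME_S to values measured on another host gives the corresponding estimate for that host.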

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1020 Credit: 18,602,940 RAC: 20,567	Message 7044 - Posted: 10 May 2020, 19:21:31 UTC The problem I always have is not the number of tasks that I can run at the same time but just getting ONE to start running. When I have enough speed to start any of them I can run 40 of these at the same time with no problems with the internet speed. As long as I have better than 1.5Mbps d/l speed I can start 2 or 3 at a time, and after 5 minutes running I suspend them and do the same thing until I have all the cores started and then suspended. In fact I can do that with more than 60 tasks, so once that is done I just restart ALL of them and they all will run Valid, and that way I know they will start running without me wondering if they will Fail or not. It always works that way for me, so it is not internet speed dependent after they get past [INFO] ===> [runRivet] Time/date XXX [boinc pp jets 7000 10 - event generatorXXX With a satellite ISP the d/l is what slows down and the u/l stays fairly fast all the time. ID: 7044 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 7045 - Posted: 11 May 2020, 9:54:02 UTC - in response to Message 7044. This tells us that the VM's CVMFS is very robust. What you define is a new trigger point (TP1): [INFO] ===> [runRivet] Time/date XXX [boinc pp jets 7000 10 - event generatorXXX appears in the logfile. Do you also watch ALT-F2 to get the trigger point I mentioned (TP2)? If you resume just 1 single task and check the runtime between TP1 and TP2, I would expect it to be around 1/2 h (it may vary depending on the type of the event generator) if your d/l bandwidth is 1.5 Mbit/s. The more tasks you resume concurrently, the more they share the d/l bandwidth during this phase, and I would expect an increasing runtime until TP2 is reached. After TP2 the tasks usually don't need much internet bandwidth. They just refresh the CVMFS catalogs every 4 min. Each refresh downloads no more than a few kB. ID: 7045 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,477 RAC: 40	Message 7046 - Posted: 11 May 2020, 13:39:07 UTC - in response to Message 7045.
Do you also watch ALT-F2 to get the trigger point I mentioned (TP2)?
What would you define as TP2? In the running.log I see 3 timestamps.
===> [runRivet] Sun May 10 16:31:05 UTC 2020 [boinc pp ttbar 7000 - - herwig7 7.1.5 default 100000 0] Setting environment... . . .
===> [rungen] Sun May 10 16:35:21 UTC 2020 [boinc pp ttbar 7000 - - herwig7 7.1.5 default 100000 0 /shared/tmp/tmp.GOtRsRrFCE/generator.hepmc] Setting environment for herwig7 7.1.5 ... . . .
>>>> Sun May 10 16:41:11 2020 <<<< <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >> Herwig 7.1.5 / ThePEG 2.1.5 Rivet.Analysis.Handler: INFO Using named weights 0 events processed
ID: 7046 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,477 RAC: 40	Message 7047 - Posted: 11 May 2020, 13:43:34 UTC Not sure whether this is introduced in this version, but there are no longer 'nice figures' shown when using 'Show graphics' in BOINC Manager. The running.log is still reachable via that way. ID: 7047 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 7048 - Posted: 11 May 2020, 14:16:45 UTC - in response to Message 7046.
The following line appears on ALT-F1:
[pre]21:05:26 CEST +02:00 2020-05-06: cranky-0.0.31: [INFO] ===> [runRivet] Wed May 6 19:05:24 UTC 2020 [boinc pp jets 7000 250 - pythia8 8.301 dire-default 100000 0][/pre]
Now the task knows what software is required, and downloading and compiling start.
From the same example task, TP2 is the point at which setup and compilation of the apps have finished. It can be seen at ALT-F2:
[pre]0 events processed dumping histograms... . . . 100 events processed dumping histograms...[/pre]
It's an estimation rather than an accurate calculation since:
- there are some small downloads before TP1 and TP2
- the amount of downloads varies depending on the task parameters (pythia6, pythia8, sherpa, ...)
In the end, (TP2 - TP1) should be compared in magnitudes like:
- 3-5 min
- 15-20 min
- 35-40 min
- 60-70 min
The times should then be checked against the nominal bandwidth to see whether they make sense and fit together.
Example: It's not possible to d/l 200 MB within 5 min if you have just 2 Mbit/s.
ID: 7048 · Rating: 0 · rate: / Reply Quote
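The plausibility check described in the post above can be written down as a small Python sketch. The helper names are hypothetical. It assumes 1 MB = 8 Mbit and ignores protocol overhead, so the computed time is a lower bound.

```python
def min_download_seconds(size_mb, mbit_per_s):
    """Lower bound on the download time at the nominal line rate."""
    return size_mb * 8 / mbit_per_s


def is_plausible(size_mb, mbit_per_s, observed_s):
    """True if the observed TP2 - TP1 interval is long enough for the download."""
    return observed_s >= min_download_seconds(size_mb, mbit_per_s)


# Example from the post: 200 MB at 2 Mbit/s needs at least 800 s (13 min 20 s),
# so an observed interval of 5 min cannot be right.
print(min_download_seconds(200, 2))
print(is_plausible(200, 2, 5 * 60))
```

If the check returns False, either the nominal bandwidth or the measured interval is wrong, or part of the data came from a local cache or proxy.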

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,477 RAC: 40	Message 7049 - Posted: 11 May 2020, 14:36:39 UTC Last modified: 11 May 2020, 14:37:21 UTC OK, thus in my previous example from a running.log it's the 1st and 3rd timestamp. 10 minutes for that herwig. For a pythia8 in my environment it was 5m7s. This doesn't count when it's a sherpa. ID: 7049 · Rating: 0 · rate: / Reply Quote
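For reference, the interval between the 1st and 3rd timestamp of the herwig running.log excerpt quoted earlier in the thread can be computed with Python's standard library. This is only a sketch: the function name is made up, the third timestamp carries no time zone label and is assumed to be UTC like the first, and the day and month names are assumed to parse under an English/C locale.

```python
from datetime import datetime


def tp_interval_seconds(tp1_text, tp2_text):
    """Seconds between the runRivet timestamp (TP1) and the generator banner (TP2)."""
    # TP1 format as printed in the '===> [runRivet]' line, with a literal 'UTC'.
    tp1 = datetime.strptime(tp1_text, "%a %b %d %H:%M:%S UTC %Y")
    # TP2 format as printed in the '>>>> ... <<<<' banner (assumed to be UTC too).
    tp2 = datetime.strptime(tp2_text, "%a %b %d %H:%M:%S %Y")
    return int((tp2 - tp1).total_seconds())


seconds = tp_interval_seconds("Sun May 10 16:31:05 UTC 2020",
                              "Sun May 10 16:41:11 2020")
print(f"TP2 - TP1 = {seconds // 60} min {seconds % 60} s")
```

For the quoted log this gives 606 s, which matches the "10 minutes" stated in the post.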

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1020 Credit: 18,602,940 RAC: 20,567	Message 7050 - Posted: 11 May 2020, 20:30:10 UTC I have mentioned this many times since I run them all the time since the beginning but I decided to take a snapshot to save for later when I was talking about this with another member. I watch each one load in the VM Console until they get to running but also watch the first 30 seconds since they can also Fail there as I have posted several times (with snapshots) and watch the clock where they have to get to the CVMFS start page on the Console. And have watched ALL of the different event generator versions and there are more than pythia,herwig,and sherpa I also check them in the log on the VB Manager Here is the snapshot of how I do this and this is just one 8-core pc and I do the same thing on all of the hosts I have running here and then restart them all and they never Fail and when the first 8 are finished it goes right to the next 8 tasks when I can get this many ready to run. ID: 7050 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1020 Credit: 18,602,940 RAC: 20,567	Message 7051 - Posted: 13 May 2020, 10:22:44 UTC Good morning (3:15am PDT). My monthly high-speed ISP period started 3 hours ago and as usual it is a pain in the ___ at first, so I did end up with 10 that Failed to make it to runRivet ....BUT I have 82 loaded and started up and ready to Resume......after 8am, so I don't waste some of that 2am to 8am "Bonus" I save up to load these while I have that extra 50MB remaining, since my daytime monthly 10MB will be gone in less than 48 hours once these are all running. OK, it's bedtime for me now. ID: 7051 · Rating: 0 · rate: / Reply Quote

Development for LHC@home