1) Message boards : Theory Application : Theory v.5.21 (Message 7048)
Posted 11 May 2020 by computezrmle
Post:
TP1: the following line appears on ALT-F1:
21:05:26 CEST +02:00 2020-05-06: cranky-0.0.31: [INFO] ===> [runRivet] Wed May  6 19:05:24 UTC 2020 [boinc pp jets 7000 250 - pythia8 8.301 dire-default 100000 0]

Now the task knows what software is required, and downloading and compiling start.


From the same example task, TP2 is the point when the setup and compilation of the apps have finished.
It can be seen at ALT-F2:
0 events processed
dumping histograms...
.
.
.
100 events processed
dumping histograms...



It's an estimation rather than an accurate calculation since:
- there are some small downloads before TP1 and TP2
- the amount of downloads varies depending on the task parameters (pythia6, pythia8, sherpa, ...)

In the end, (TP2 - TP1) should be compared against magnitudes like:
- 3-5 min
- 15-20 min
- 35-40 min
- 60-70 min

The times should then be checked against the nominal bandwidth to see whether they make sense and fit together.
Example:
It's not possible to d/l 200 MB within 5 min if you have just 2 Mbit/s.
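
A quick sketch of that sanity check (using only the example numbers from above; idealized best case, no protocol overhead):

mb=200     # downloaded data in MB
mbit=2     # nominal bandwidth in Mbit/s
awk -v mb="$mb" -v mbit="$mbit" 'BEGIN {
    sec = mb * 8 / mbit                          # MB -> Mbit, divided by Mbit/s
    printf "%.0f s (%.1f min)\n", sec, sec / 60  # 200 MB at 2 Mbit/s -> 800 s (13.3 min)
}'
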
2) Message boards : Theory Application : Theory v.5.21 (Message 7045)
Posted 11 May 2020 by computezrmle
Post:
This tells us that the VM's CVMFS is very robust.

What you define is a new trigger point (TP1): the moment a line like
[INFO] ===> [runRivet] Time/date XXX [boinc pp jets 7000 10 - event generatorXXX
appears in the logfile.

Do you also watch ALT-F2 to get the trigger point I mentioned (TP2)?
If you resume just a single task and check the runtime between TP1 and TP2, I would expect it to be around half an hour (it may vary with the type of event generator) if your d/l bandwidth is 1.5 Mbit/s.
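
A rough cross-check, assuming a download volume similar to the ~384 MB measured for the task in message 7043:

# 384 MB at 1.5 Mbit/s, best case: ~2048 s, i.e. roughly half an hour
awk 'BEGIN { printf "%.0f s (%.0f min)\n", 384 * 8 / 1.5, 384 * 8 / 1.5 / 60 }'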

The more tasks you resume concurrently, the more they have to share the d/l bandwidth during this phase, and I would expect the runtime until TP2 is reached to increase accordingly.

After TP2 the tasks usually don't need much internet bandwidth.
They just refresh the CVMFS catalogs every 4 min.
Each refresh downloads no more than a few kB.
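
To put a rough number on it (the size per refresh is an assumption based on the "few kB" above):

# one refresh every 4 minutes, assumed ~5 kB each
awk 'BEGIN { r = 60 / 4; printf "%d refreshes/h, ~%d kB/h\n", r, r * 5 }'
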
3) Message boards : Theory Application : Theory v.5.21 (Message 7043)
Posted 10 May 2020 by computezrmle
Post:
Let's compare some numbers.

Over at the production project I ran a typical Theory Vbox task:
[INFO] ===> [runRivet] Sun May 10 08:11:39 UTC 2020 [boinc pp jets 7000 10 - pythia8 8.301 tune-2m 100000 2]

To allow some estimations I'd like to define the following trigger point:
Console ALT-F2 shows that the task is starting event calculation.


When the trigger point is reached, measure
1. the time since the VM has been started: 3 min 30 s
2. the downloaded data since the VM has been started: 384 MB


On my system (1.) includes (roughly):
- 30 s download time
- 3 min basic VM setup and time to compile rivetvm/pythia


Estimation 1: Download times (best case)

To get 384 MB downloaded (without a local proxy) it will take:
31 s via a 100 Mbit/s connection
62 s via a 50 Mbit/s connection
310 s (5 min 10 s) via a 10 Mbit/s connection
775 s (12 min 55 s) via a 4 Mbit/s connection
1550 s (25 min 50 s) via a 2 Mbit/s connection
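
The same figures can be reproduced with a small sketch (idealized best case, no local proxy; the rounding differs slightly from the numbers above):

# time to download 384 MB at various nominal bandwidths
for mbit in 100 50 10 4 2; do
    awk -v mbit="$mbit" 'BEGIN {
        sec = 384 * 8 / mbit
        printf "%3d Mbit/s: %5.0f s (%d min %d s)\n", mbit, sec, int(sec / 60), int(sec % 60)
    }'
done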



Estimation 2: Running many tasks concurrently

For the time being the (production-)server stats show an average Theory task runtime of 2.94 h (~10600 s).
To ensure each task gets full download speed during its startup phase, divide 10600 by the download time per task.
Then round down to the next integer.
100 Mbit/s: 341
50 Mbit/s: 170
10 Mbit/s: 34
4 Mbit/s: 13
2 Mbit/s: 6
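
As a sketch of that calculation (using the best-case download times from Estimation 1):

# average task runtime divided by the per-task download time, rounded down
runtime=10600
for dl in 31 62 310 775 1550; do
    awk -v rt="$runtime" -v dl="$dl" \
        'BEGIN { printf "%4d s download time -> %3d concurrent tasks\n", dl, int(rt / dl) }'
done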



Conclusion

To get the setup of a single Theory task done within a reasonable time (say, ~15 min), the internet download bandwidth available to the task should not be less than 4 Mbit/s.
To avoid saturating this connection, no more than 13 tasks should run concurrently.
4) Message boards : Theory Application : Theory v.5.21 (Message 7035)
Posted 9 May 2020 by computezrmle
Post:
Vbox tasks

A couple of results show that the CVMFS changes don't crash the tasks.
In addition the expected log entries from "cvmfs_config stat" appear in stderr.txt.

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900265
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900376
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900267
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900432
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900384

Let's see if CVMFS gets more robust on saturated internet connections with high latencies.



Native tasks

Output from "cvmfs_config stat" should also appear in stderr.txt but is missing.
I'm not sure if this is already implemented.
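
Until then, the same information can be pulled manually on a native host (a sketch; assumes CVMFS is installed, and sft.cern.ch is just an example repository):

# show mount and cache statistics for a repository (example: sft.cern.ch)
cvmfs_config stat sft.cern.ch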

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900555
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2900357
5) Message boards : Theory Application : Theory v.5.20 (Message 7024)
Posted 27 Apr 2020 by computezrmle
Post:
All Theory tasks make lots of HTTP requests to CVMFS repositories until the scientific app has been set up and starts calculating events (See ALT-F2).
After that, until a task finishes, there are small refresh requests to cernvm-prod.cern.ch/.cvmfspublished and sft.cern.ch/.cvmfspublished, which happen regularly every 4 minutes.

Other repositories, e.g. grid.cern.ch or alice.cern.ch, are disconnected during this phase but may need to be reconnected at the end.
If internet response times are too high, even just temporarily, while a refresh or a reconnect happens, the task may fail.
Typical message: "cvmfs_config probe xyz.cern.ch failed"
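
On a native installation the same probe can be run manually to see whether the repositories respond (a sketch; assumes a local CVMFS client, i.e. not the VM case):

# probe the repositories mentioned above; each one should report OK
for repo in cernvm-prod.cern.ch sft.cern.ch grid.cern.ch alice.cern.ch; do
    cvmfs_config probe "$repo"
done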

In most cases high response times occur on the first segment, between the home router and the ISP's connection node.
They usually happen when that segment is busy with lots of other requests.
6) Message boards : ATLAS Application : ATLAS native 1.01 (Message 6987)
Posted 6 Feb 2020 by computezrmle
Post:
1.03 works fine.
It prints the loglines immediately to stderr.txt.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868929
7) Message boards : ATLAS Application : ATLAS native 1.01 (Message 6985)
Posted 6 Feb 2020 by computezrmle
Post:
... but almost the whole stderr.txt is filled when the job is ready.

This may be because awk uses I/O buffering in non-interactive mode.
Sent a patch to David on github.
If this doesn't fill stderr.txt earlier, the reason is most likely I/O buffering in other scripts, which would mean much more effort.
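
One common way to avoid awk's output buffering (a sketch, not necessarily what the patch does) is to flush explicitly after every line:

# gawk: push each timestamped line out immediately instead of waiting for the buffer to fill
exec &> >(awk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0; fflush() }' 1>&2)
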
8) Message boards : ATLAS Application : ATLAS native 1.01 (Message 6982)
Posted 6 Feb 2020 by computezrmle
Post:
This task succeeded:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868726
9) Message boards : ATLAS Application : ATLAS native 1.01 (Message 6980)
Posted 6 Feb 2020 by computezrmle
Post:
Hmm, it doesn't seem to work for me: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868679

Same here:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868682

Just checked your original code and noticed that you sent the output to stderr.
Let's try to modify line 20 like this:
exec &> >(awk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }' 1>&2)
10) Message boards : ATLAS Application : ATLAS Monitoring v4 (Message 6878)
Posted 4 Dec 2019 by computezrmle
Post:
No errors reported for a couple of days.
Hence I suggest deploying v4.1.0 on the prod server.
11) Message boards : ATLAS Application : ATLAS Monitoring v4 (Message 6867)
Posted 29 Nov 2019 by computezrmle
Post:
V4.1.0 is ready for testing and there are enough tasks available on the dev server.
I already ran a couple of them and didn't notice any obvious issues but this post from CP shows that it might be good if others also take a look.
12) Message boards : ATLAS Application : ATLAS Monitoring v4 (Message 6861)
Posted 28 Nov 2019 by computezrmle
Post:
I just sent a github pull request to David Cameron to update the ATLAS Monitoring to v4.1.0.

The UI did not change, but this version is split into different files to make it independent of the ATLAS wrapper script and more maintenance-friendly.
Besides that, the issue mentioned here should be solved:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=498&postid=6838
13) Message boards : General Discussion : Scheduler Change (Message 6847)
Posted 26 Nov 2019 by computezrmle
Post:
I see what sched_send.cpp does:

base CPU count is set to the processor capabilities
n = g_reply->host.p_ncpus;



reduced if a user has set a CPU percentage limit
if (g_request->global_prefs.max_ncpus_pct && g_request->global_prefs.max_ncpus_pct < 100) {
    n = (int)((n*g_request->global_prefs.max_ncpus_pct)/100.);
}



reduced based on <ncpus> from cc_config.xml (I guess) if it is lower than the % limit
if (n > config.max_ncpus) n = config.max_ncpus;



sanity checks (not sure where MAX_CPUS is set; looks like a global setting)
if (n < 1) n = 1;
if (n > MAX_CPUS) n = MAX_CPUS;



So far this makes sense, as the settings above limit the global number of cores this BOINC client is allowed to use, and the server can now estimate how much work can be sent.




if (project_prefs.max_cpus) {
    if (n > project_prefs.max_cpus) {
        n = project_prefs.max_cpus;
    }
}

This code is here, without any doubt, but why is it here?
Suppose a user wants to run 2-core ATLAS tasks and sets the web preference to "2 CPUs".
This code will make every brand new 256-core amdintel_ripper appear as Fred Flintstone's 2-core machine.
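
A toy reproduction of the effect in shell, with made-up numbers, just to illustrate the quoted logic:

p_ncpus=256          # cores reported by the client
max_ncpus_pct=100    # computing preference "Use at most x% of the CPUs"
project_max_cpus=2   # project web preference "Max # CPUs"

n=$p_ncpus
if [ "$max_ncpus_pct" -lt 100 ]; then
    n=$(( n * max_ncpus_pct / 100 ))
fi
if [ "$n" -gt "$project_max_cpus" ]; then
    n=$project_max_cpus
fi
echo "$n"            # prints 2: the 256-core host is treated as a 2-core machine
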
14) Message boards : ATLAS Application : ATLAS Monitoring v3 (Message 6841)
Posted 18 Nov 2019 by computezrmle
Post:
I'll ignore it in future, but as volunteer tester I had to report it ;)

I agree.

Just tested a quick fix but it had negative impact while the task was starting.
Needs a bit more investigation but shouldn't stop publishing v3.2.0.
15) Message boards : ATLAS Application : ATLAS Monitoring v3 (Message 6839)
Posted 18 Nov 2019 by computezrmle
Post:
Finished a few 3-core and a few 1-core tasks with monitoring v3.2.0.
All of them ran like a Swiss clockwork. ;-)

You have to look quick, but when the HITS-line is deleted (almost) at the end of the task, just before the VM is stopped and RDP closed
'uncertainty' shows -1; must be a leap second and not a real issue ;)

Saw this only once during all the tests.
It happens while the VM is about to shut down all processes and the monitoring can't get valid data any more.
I suggest ignoring it since the VM shuts down a few seconds later anyway.
16) Message boards : ATLAS Application : ATLAS Monitoring v3 (Message 6837)
Posted 18 Nov 2019 by computezrmle
Post:
Finished a few 3-core and a few 1-core tasks with monitoring v3.2.0.
All of them ran like a Swiss clockwork. ;-)
17) Message boards : ATLAS Application : ATLAS Monitoring v3 (Message 6834)
Posted 16 Nov 2019 by computezrmle
Post:
Last minute change request.


v3.1.0 uses the standard deviation to calculate the uncertainty.
This covers 68.27% of the runtime values, assuming a normal distribution (an interval 2 sigma wide, i.e. ±1 sigma).

To cover 99.73% (6 sigma wide, i.e. ±3 sigma) the standard deviation must be multiplied by 3.
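
A minimal sketch of that calculation (not the monitoring code itself), given per-event runtimes on stdin, one value per line:

awk '{ s += $1; sq += $1 * $1; n++ }
     END {
         mean  = s / n
         sigma = sqrt(sq / n - mean * mean)   # population standard deviation
         printf "mean: %.1f s   uncertainty (3 sigma): %.1f s\n", mean, 3 * sigma
     }'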

Did a few local simulations with 6 sigma and it looks like the resulting values make more "sense" when compared to other values on the display, e.g. arithmetic mean or min/max.

Sent a github pull request to David that includes the change to v3.2.0.



Besides that, the recent version runs fine and I suggest deploying it on the production server.
18) Message boards : ATLAS Application : ATLAS Monitoring v3 (Message 6825)
Posted 14 Nov 2019 by computezrmle
Post:
Just created a pull request at github to update to v3.1.0

changes

Output string
- "Number of events to be processed" to "Number of events total"
- "overdue" to "overtime"

While HITS file generation is in progress
- "overtime" will not appear any more
- "Time left total" will remain 0 (zero)
19) Message boards : ATLAS Application : ATLAS Monitoring v3 (Message 6822)
Posted 14 Nov 2019 by computezrmle
Post:
Overtime Message

Forgot to mention:
The same term is shown while the HITS file is in preparation.
Would you prefer to
- leave it blank and just count the remaining time upwards (confusing?)
- display the same term
- display a different term
20) Message boards : ATLAS Application : ATLAS Monitoring v3 (Message 6819)
Posted 14 Nov 2019 by computezrmle
Post:
"to be processed: 200" sounds like this is the number of remaining events, it should be made clearer that this is the total so I would just say "total" here

OK
Will postpone the change until it's decided what to do with the next one.


"overdue" might make people worry that there is something wrong with the task, I'm not sure it's a good idea to have this, or maybe there can be a clearer message what it means

What about "overtime" or "extra time"?
Most people may know that from sports.

Just to mention it, nothing to worry about:
This text field has a fixed width.
If we extend it the whole layout will be a bit wider.

