Message boards :
Theory Application :
Status
Message board moderation
Author | Message |
---|---|
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
The current priority is to get the native Theory app production ready. The recent experience on dev suggests that it should be a separate app from the VM apps, as they have different requirements, at least for memory and disk. The two main improvements needed are:
|
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 3 |
The VM apps have a hard kill after 18 hours of runtime. How do you handle an endlessly looping science application within the native application? You could reduce the value of rsc_fpops_bound. The current settings are:

<rsc_fpops_est>3600000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>6000000000000000000.000000</rsc_fpops_bound>

The bound value is about 1,666,667 times the estimated value; way too high. 100 times would be a better value to kill endlessly looping tasks, although crunchers will whine about the lost CPU time, not to forget the credits. |
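The ratio quoted in the message above can be checked with plain 64-bit shell arithmetic; the two values are taken from the quoted settings (fractional parts dropped), and the 100x bound is the poster's suggestion, not an existing project setting:

```shell
# Values from the quoted work-unit settings
est=3600000000000            # rsc_fpops_est   = 3.6e12
bound=6000000000000000000    # rsc_fpops_bound = 6e18

# How many times the bound exceeds the estimate (integer division)
echo $(( bound / est ))      # 1666666, i.e. about 1.67 million

# The suggested tighter bound: 100x the estimate
echo $(( 100 * est ))        # 360000000000000
```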
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
A new version has just been released that should detect the suspend/resume requests and print a message in the stderr log. |
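A minimal sketch of how a native app could surface such requests, assuming the wrapper delivers suspend/resume to the child as SIGTSTP/SIGCONT (an assumption for illustration; the post does not say which mechanism is used):

```shell
#!/bin/sh
# Sketch: report suspend/resume requests on stderr.
# Assumes SIGTSTP means "suspend" and SIGCONT means "resume";
# trapping TSTP keeps the process from actually stopping.
trap 'echo "suspend request received" >&2' TSTP
trap 'echo "resume request received" >&2' CONT

i=0
while [ "$i" -lt 3 ]; do      # stand-in for the real work loop
    i=$((i + 1))
    sleep 1
done
```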
Send message Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1 |
Wouldn't this require at least an updated wrapper to send the right signal, hence fresh downloads? I didn't get new application files today and the app list still shows version 4.18 (native_theory). A recently started task still ignores the suspend signal. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 3 |
I suppose a new cranky version should do the trick. My last task was sent at 10:18:55 UTC and still version 0.0.24 is running. Nothing new in the product directory. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Sorry the application failed to update. The new version is there now. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 3 |
Cranky 0.0.25 running. I'm confused. I have only 1 task running in BOINC, but with top it looks like at least 2 jobs are running, plus several other processes.

top - 14:40:45 up 5:28, 1 user, load average: 5,35, 3,66, 2,35
Tasks: 230 total, 4 running, 226 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0,6 us, 20,6 sy, 73,2 ni, 5,4 id, 0,1 wa, 0,0 hi, 0,2 si, 0,0 st
MiB Mem : 5960,3 total, 3285,9 free, 929,6 used, 1744,8 buff/cache
MiB Swap: 1186,4 total, 1186,4 free, 0,0 used. 4692,2 avail Mem

  PID USER  PR NI   VIRT   RES   SHR S  %CPU %MEM   TIME+ COMMAND
31712 boinc 39 19  41812 20952 11796 R  92,1  0,3  1:02.62 pythia8.exe
22108 boinc 39 19 289324  4928   936 R  76,8  0,1 50:16.23 rivetvm.exe
22217 boinc 39 19 320516 10924  4664 S  51,0  0,2 36:57.65 pythia8.exe
22109 boinc 39 19  23420  7612  1672 S  33,4  0,1  1:48.04 runRivet.sh
21935 boinc 39 19  18932  3052  1608 R  30,5  0,1  1:50.37 runRivet.sh
20925 boinc 39 19 291264 16244 10500 S  11,9  0,3  0:10.79 rivetvm.exe
 1370 boinc 30 10 240624 16624 13092 S   0,7  0,3  0:42.04 boinc
 7910 boinc 39 19 609124  6872  2288 S   0,0  0,1  0:00.04 runc
 7953 boinc 39 19  17728   200     0 S   0,0  0,0  0:00.02 job
 8144 boinc 39 19  18664  1752   636 S   0,0  0,0  0:00.10 runRivet.sh
10967 boinc 39 19   4132    36     0 S   0,0  0,0  0:00.00 sleep
20924 boinc 39 19  18256   792     0 S   0,0  0,0  0:00.05 rungen.sh
20926 boinc 39 19  18664  1752   632 S   0,0  0,0  0:00.01 runRivet.sh
21908 boinc 39 19 609124  5864  1128 S   0,0  0,1  0:00.05 runc
21918 boinc 39 19  17728   204     0 S   0,0  0,0  0:00.01 job
22107 boinc 39 19  18256   792     4 S   0,0  0,0  0:00.04 rungen.sh
22738 boinc 39 19   4132   184   144 S   0,0  0,0  0:00.00 sleep
27314 boinc 30 10   6408  2932  2576 S   0,0  0,0  0:00.21 wrapper_2019_03
27321 boinc 39 19  20256  3384  2992 S   0,0  0,1  0:00.02 cranky-0.0.25 |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 3 |
Suspending the task in BOINC causes an immediate finish of the job, including a result uploaded and validated OK. I don't think this is what you want.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2757459
and a previous one I didn't expect to be ready:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2757413
and one to prove it:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2757464
The processes keep on running, but the task is reported to the server and validated OK. That explains why I was seeing so many processes. |
Send message Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1 |
If it's not inside a VM you may check the process relationship with pstree as shown here: https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=456&postid=6126 The example there shows a couple of singlecore tasks rather than multicore. |
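For reference, the same parent/child view is available even without pstree installed, assuming a Linux host with procps ps (the filter to the boinc user is shown as a comment, since that user only exists on a machine running the client):

```shell
# Show parent/child relations as an ASCII tree (procps ps).
# To restrict the view to the user running the BOINC client:
#   ps -u boinc -o pid,ppid,ni,comm --forest
ps -e -o pid,ppid,ni,comm --forest | head -20
```

This makes it easy to see which runRivet.sh / pythia8.exe processes hang off which wrapper.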
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 3 |
If it's not inside a VM you may check the process relationship with pstree as shown here: See my follow-up post, just before yours. Suspending finishes the task and uploads it, while the processes keep running. So far only tested with "keep application in memory". |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
No. I have deprecated this version. At least it shows that the suspend signal can be caught. |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 3 |
ATM no Theory tasks are available; the server says more than 100. |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
ATM no Theory tasks are available; the server says more than 100. The server needed a restart to pick up the old version. I have stopped the validator to investigate an issue. |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 3 |
Sorry Laurence, but since 3 UTC last night, no new work is available. The server says 0 new tasks. The old tasks have no points and are waiting for confirmation on the server. Edit: The points are now available. Thank you. |
Send message Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 3 |
Edit: The points are now available. Thank you. Status: All (804) · In progress (2) · Validation pending (42) · Validation inconclusive (0) · Valid (726) · Invalid (1) · Error (33) Application: All (804) · ATLAS Simulation (16) · CMS Simulation (0) · Theory Simulation (788) |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
I am looking into pausing the container and have run into an issue. You can list the running container with the following command:

sudo /cvmfs/grid.cern.ch/vc/containers/runc --root /var/lib/boinc-client/slots/0/cernvm/ list

This should return something like:

ID                                  PID    STATUS   BUNDLE                                CREATED                         OWNER
Theory_859210_1543416190.499432_0   17060  running  /var/lib/boinc-client/slots/0/cernvm  2019-03-06T13:39:09.912409154Z  boinc

It should be possible to pause the container with the following command:

sudo /cvmfs/grid.cern.ch/vc/containers/runc --root /var/lib/boinc-client/slots/0/cernvm/ pause Theory_859210_1543416190.499432_0

But I am getting the following error:

no such directory for freezer.state

If anyone has any ideas, please let me know. |
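That error message usually points at the cgroup-v1 freezer controller, which runc pause relies on, not being mounted on the host. A quick way to check (a diagnostic sketch, not verified against this particular runc build):

```shell
# Is the freezer controller known to the kernel?
grep freezer /proc/cgroups || echo "no freezer controller in kernel"

# Is it mounted as a cgroup-v1 hierarchy?
grep freezer /proc/self/mounts || echo "freezer not mounted"

# If supported but unmounted, root could mount it manually:
#   mkdir -p /sys/fs/cgroup/freezer
#   mount -t cgroup -o freezer freezer /sys/fs/cgroup/freezer
```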
Send message Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1 |
If anyone has any ideas, please let me know. A quick search gave me this:
https://www.kernel.org/doc/Documentation/cgroup-v1/freezer-subsystem.txt
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-freezer
Comments: 1. Didn't test it yet 2. It's for cgroups v1 <edit> typo ;-( </edit> |
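From the first of those documents, the freezer is driven by writing FROZEN/THAWED into freezer.state. A guarded sketch of manual use, assuming root on a host with the v1 freezer mounted (the cgroup name below is illustrative):

```shell
# cgroup-v1 freezer walkthrough; needs root and a mounted v1 freezer.
CG=/sys/fs/cgroup/freezer/theory_test    # illustrative cgroup name
if mkdir -p "$CG" 2>/dev/null; then
    sleep 300 &                          # a process to freeze
    echo $! > "$CG/tasks"                # move it into the cgroup
    echo FROZEN > "$CG/freezer.state"    # freeze every task in the cgroup
    cat "$CG/freezer.state"              # FROZEN (or FREEZING briefly)
    echo THAWED > "$CG/freezer.state"    # resume
    kill $! 2>/dev/null
else
    echo "freezer cgroup not available (need root and cgroup v1)" >&2
fi
```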
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 3 |
If anyone has any ideas, please let me know. Just guessing: I suppose you are using Docker containers? If so, one could install Docker and use docker (un)pause "container", or is that too simple. . . |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
If anyone has any ideas, please let me know. No. Containers are essentially a feature of the kernel implemented using cgroups. Docker uses this and does other things as well, but is implemented as a service. We are trying a simpler approach that doesn't require any dependencies, by using runc. It supports pause and resume, but at the moment it is not working as expected. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 3 |
It should be possible to pause the container with the following command: Did you give the above command as user Laurence or as user boinc? runc pause may fail if you don't have full access to cgroups. |
©2024 CERN