Message boards : Theory Application : Status

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 327,239
RAC: 129
Message 6083 - Posted: 24 Feb 2019, 21:46:34 UTC

The current priority is to get the native Theory app production ready. Recent experience on dev suggests that it should be a separate app from the VM apps, as they have different requirements, at least for memory and disk. The two main improvements needed are:

  1. Fix suspend/resume
  2. Detect bad hosts and restrict the number of jobs sent.


If there is anything else, please let me know. We can revisit the VM apps once the native app is solid.


Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 431
Message 6087 - Posted: 25 Feb 2019, 7:39:17 UTC

The VM apps are hard-killed after 18 hours of runtime.
How will an endlessly looping science application be handled within the native application?
You could reduce the value of rsc_fpops_bound.

The current settings are
<rsc_fpops_est>3600000000000.000000</rsc_fpops_est>
<rsc_fpops_bound>6000000000000000000.000000</rsc_fpops_bound>

The bound is about 1,666,667 times the estimate, which is way too high.
A factor of 100 would be a better value for killing endlessly looping tasks, although crunchers will complain about the lost CPU time, not to mention the credits.
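
As a quick check of the arithmetic above, the ratio between the two settings and the bound implied by a factor of 100 can be computed as sketched below. The two constants are copied from the settings quoted in this post; the factor of 100 is only the suggestion made here, not an agreed project setting, and the function names are made up for illustration.

```python
# Values copied from the workunit settings quoted in this post.
RSC_FPOPS_EST = 3_600_000_000_000.0
RSC_FPOPS_BOUND = 6_000_000_000_000_000_000.0


def bound_ratio(est: float, bound: float) -> float:
    """Return how many times larger the bound is than the estimate."""
    return bound / est


def suggested_bound(est: float, factor: float = 100.0) -> float:
    """Return a bound that is `factor` times the estimate.

    The default of 100 is only the value suggested in this post.
    """
    return est * factor


if __name__ == "__main__":
    ratio = bound_ratio(RSC_FPOPS_EST, RSC_FPOPS_BOUND)
    print(f"current bound / estimate: {ratio:,.0f}")
    print(f"bound at 100 x estimate:  {suggested_bound(RSC_FPOPS_EST):.6f}")
```

The second printed value is in the same fixed-point form as the rsc_fpops_bound element, so it can be compared directly with the setting quoted above.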

Laurence
Message 6134 - Posted: 5 Mar 2019, 8:56:20 UTC - in response to Message 6083.  

A new version has just been released that should detect the suspend/resume requests and print a message in the stderr log.

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 62
Message 6135 - Posted: 5 Mar 2019, 9:58:44 UTC - in response to Message 6134.  

Wouldn't this require at least an updated wrapper to send the right signal, hence fresh downloads?
I didn't get new application files today and the app list still shows version 4.18 (native_theory).

A recently started task still ignores the suspend signal.

Crystal Pellet
Message 6136 - Posted: 5 Mar 2019, 10:55:08 UTC

I suppose a new cranky version should do the trick.
My last task was sent at 10:18:55 UTC and version 0.0.24 is still running.
Nothing new in the project directory.

Laurence
Message 6138 - Posted: 5 Mar 2019, 12:08:03 UTC - in response to Message 6136.  

Sorry, the application failed to update. The new version is there now.

Crystal Pellet
Message 6142 - Posted: 5 Mar 2019, 13:48:01 UTC

Cranky 0.0.25 is running. I'm confused.
I have only 1 task running in BOINC, but top shows what looks like at least 2 jobs running, plus several other processes.

top - 14:40:45 up  5:28,  1 user,  load average: 5,35, 3,66, 2,35
Tasks: 230 total,   4 running, 226 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0,6 us, 20,6 sy, 73,2 ni,  5,4 id,  0,1 wa,  0,0 hi,  0,2 si,  0,0 st
MiB Mem :   5960,3 total,   3285,9 free,    929,6 used,   1744,8 buff/cache
MiB Swap:   1186,4 total,   1186,4 free,      0,0 used.   4692,2 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND                 
31712 boinc     39  19   41812  20952  11796 R  92,1   0,3   1:02.62 pythia8.exe             
22108 boinc     39  19  289324   4928    936 R  76,8   0,1  50:16.23 rivetvm.exe             
22217 boinc     39  19  320516  10924   4664 S  51,0   0,2  36:57.65 pythia8.exe             
22109 boinc     39  19   23420   7612   1672 S  33,4   0,1   1:48.04 runRivet.sh             
21935 boinc     39  19   18932   3052   1608 R  30,5   0,1   1:50.37 runRivet.sh             
20925 boinc     39  19  291264  16244  10500 S  11,9   0,3   0:10.79 rivetvm.exe             
 1370 boinc     30  10  240624  16624  13092 S   0,7   0,3   0:42.04 boinc                   
 7910 boinc     39  19  609124   6872   2288 S   0,0   0,1   0:00.04 runc                    
 7953 boinc     39  19   17728    200      0 S   0,0   0,0   0:00.02 job                     
 8144 boinc     39  19   18664   1752    636 S   0,0   0,0   0:00.10 runRivet.sh             
10967 boinc     39  19    4132     36      0 S   0,0   0,0   0:00.00 sleep                   
20924 boinc     39  19   18256    792      0 S   0,0   0,0   0:00.05 rungen.sh               
20926 boinc     39  19   18664   1752    632 S   0,0   0,0   0:00.01 runRivet.sh             
21908 boinc     39  19  609124   5864   1128 S   0,0   0,1   0:00.05 runc                    
21918 boinc     39  19   17728    204      0 S   0,0   0,0   0:00.01 job                     
22107 boinc     39  19   18256    792      4 S   0,0   0,0   0:00.04 rungen.sh               
22738 boinc     39  19    4132    184    144 S   0,0   0,0   0:00.00 sleep                   
27314 boinc     30  10    6408   2932   2576 S   0,0   0,0   0:00.21 wrapper_2019_03         
27321 boinc     39  19   20256   3384   2992 S   0,0   0,1   0:00.02 cranky-0.0.25

Crystal Pellet
Message 6143 - Posted: 5 Mar 2019, 14:00:31 UTC
Last modified: 5 Mar 2019, 14:07:23 UTC

Suspending the task in BOINC finishes the job immediately; the result is uploaded and validated OK.

I don't think this is what you want.

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2757459 and a previous one, which I didn't expect to be finished:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2757413

and one to prove it:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2757464

The processes keep on running, but the task is reported to the server and validated OK.
That explains why I was seeing so many processes.

computezrmle
Message 6144 - Posted: 5 Mar 2019, 14:06:05 UTC - in response to Message 6142.  

If it's not inside a VM, you can check the process relationships with pstree, as shown here:
https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=456&postid=6126
The example there shows a couple of single-core tasks rather than multi-core ones.
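
For readers without pstree at hand, the same parent/child view can be rebuilt from (pid, ppid, name) rows, for example those printed by `ps -eo pid,ppid,comm`. The sketch below is only an illustration: the sample rows, including every PID and parent link, are invented and do not describe how the native app actually nests its processes.

```python
from collections import defaultdict


def build_tree(procs):
    """Map each parent PID to a sorted list of (pid, name) children."""
    children = defaultdict(list)
    for pid, ppid, name in procs:
        children[ppid].append((pid, name))
    for kids in children.values():
        kids.sort()
    return children


def render(children, root_pid, depth=0):
    """Return indented lines for every process below root_pid."""
    lines = []
    for pid, name in children.get(root_pid, []):
        lines.append("  " * depth + f"{name}({pid})")
        lines.extend(render(children, pid, depth + 1))
    return lines


# Hypothetical (pid, ppid, name) rows; PIDs and parent links are invented.
SAMPLE = [
    (100, 1, "boinc"),
    (110, 100, "wrapper"),
    (120, 110, "cranky"),
    (130, 120, "runc"),
    (140, 130, "job"),
    (150, 140, "runRivet.sh"),
]

if __name__ == "__main__":
    print("\n".join(render(build_tree(SAMPLE), 1)))
```

Two unrelated chains hanging off the same client process would show up as two separate branches, which is the situation worth looking for when top lists more jobs than BOINC reports.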

Crystal Pellet
Message 6145 - Posted: 5 Mar 2019, 14:10:49 UTC - in response to Message 6144.  
Last modified: 5 Mar 2019, 14:12:32 UTC

> If it's not inside a VM, you can check the process relationships with pstree, as shown here:
> https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=456&postid=6126
> The example there shows a couple of single-core tasks rather than multi-core ones.

See my follow-up post, just before yours.
Suspending finishes the task and uploads the result, but the processes keep running.

So far only tested with "keep application in memory".

Laurence
Message 6147 - Posted: 5 Mar 2019, 15:05:13 UTC - in response to Message 6143.  


> I don't think this is what you want.

No. I have deprecated this version. At least it shows that the suspend signal can be caught.

maeax
Joined: 22 Apr 16
Posts: 664
Credit: 1,791,620
RAC: 3,116
Message 6148 - Posted: 5 Mar 2019, 16:49:23 UTC

ATM no Theory tasks are available, although the server says there are more than 100.

Laurence
Message 6150 - Posted: 5 Mar 2019, 18:21:43 UTC - in response to Message 6148.  

> ATM no Theory tasks are available, although the server says there are more than 100.


The server needed a restart to pick up the old version. I have stopped the validator to investigate an issue.

maeax
Message 6151 - Posted: 6 Mar 2019, 6:05:18 UTC
Last modified: 6 Mar 2019, 6:14:51 UTC

Sorry Laurence,
but since 03:00 UTC last night no new work has been available. The server says 0 new tasks.
The old tasks have no points yet and are waiting for validation on the server.

Edit: The points are now available. Thank you.

maeax
Message 6152 - Posted: 6 Mar 2019, 7:48:02 UTC - in response to Message 6151.  

> Edit: The points are now available. Thank you.

Status: All (804) · In progress (2) · Validation pending (42) · Validation inconclusive (0) · Valid (726) · Invalid (1) · Error (33)
Application: All (804) · ATLAS Simulation (16) · CMS Simulation (0) · Theory Simulation (788)

Laurence
Message 6153 - Posted: 6 Mar 2019, 14:39:06 UTC - in response to Message 6147.  


>> I don't think this is what you want.
>
> No. I have deprecated this version. At least it shows that the suspend signal can be caught.

I am looking into pausing the container and have run into an issue. You can list the running container with the following command:
sudo /cvmfs/grid.cern.ch/vc/containers/runc --root /var/lib/boinc-client/slots/0/cernvm/ list

This should return something like:
ID                                  PID         STATUS      BUNDLE                          CREATED                          OWNER
Theory_859210_1543416190.499432_0   17060       running     /var/lib/boinc-client/slots/0/cernvm   2019-03-06T13:39:09.912409154Z   boinc

It should be possible to pause the container with the following command:
sudo /cvmfs/grid.cern.ch/vc/containers/runc --root /var/lib/boinc-client/slots/0/cernvm/ pause Theory_859210_1543416190.499432_0

But I am getting the following error:
no such directory for freezer.state


If anyone has any ideas, please let me know.
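
One thing that may be worth ruling out, since runc's pause relies on the cgroup freezer, is whether a cgroup v1 freezer hierarchy is mounted on the host at all. This is only a guess at a possible cause, not a confirmed diagnosis. The sketch below scans /proc/mounts for such a mount; the function name is made up for illustration.

```python
import os

MOUNTS_FILE = "/proc/mounts"


def find_freezer_mount(mounts_text):
    """Return the mount point of a cgroup v1 hierarchy that carries the
    freezer controller, or None if no such mount is listed."""
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue
        _device, mount_point, fs_type, options = fields[:4]
        # cgroup v1 mounts name their controllers in the mount options.
        if fs_type == "cgroup" and "freezer" in options.split(","):
            return mount_point
    return None


if __name__ == "__main__" and os.path.exists(MOUNTS_FILE):
    with open(MOUNTS_FILE) as handle:
        mount = find_freezer_mount(handle.read())
    if mount is None:
        print("no cgroup v1 freezer hierarchy is mounted")
    else:
        print(f"freezer hierarchy mounted at {mount}")
```

If no mount is reported, the missing freezer.state would at least be explained; if one is reported, the problem lies elsewhere, for example in permissions or in the container's cgroup path.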

computezrmle
Message 6154 - Posted: 6 Mar 2019, 15:01:03 UTC - in response to Message 6153.  
Last modified: 6 Mar 2019, 15:02:29 UTC

> If anyone has any ideas, please let me know.

A quick search gave me this:
https://www.kernel.org/doc/Documentation/cgroup-v1/freezer-subsystem.txt
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/6/html/resource_management_guide/sec-freezer

Comments:
1. Didn't test it yet
2. It's for cgroups v1

<edit>
typo
;-(
</edit>

Crystal Pellet
Message 6156 - Posted: 6 Mar 2019, 19:00:04 UTC - in response to Message 6153.  

> If anyone has any ideas, please let me know.

Just guessing: I suppose you are using Docker containers?

If so, one could install Docker and use: docker (un)pause "container", or is that too simple?

Laurence
Message 6160 - Posted: 7 Mar 2019, 9:10:51 UTC - in response to Message 6156.  

>> If anyone has any ideas, please let me know.
>
> Just guessing: I suppose you are using Docker containers?
>
> If so, one could install Docker and use: docker (un)pause "container", or is that too simple?

No. Containers are essentially a feature of the kernel implemented using cgroups. Docker uses this and does other things as well, but it is implemented as a service. We are trying a simpler approach that doesn't require any dependencies, by using runc. It supports pause and resume, but at the moment this is not working as expected.

Crystal Pellet
Message 6161 - Posted: 7 Mar 2019, 9:48:50 UTC - in response to Message 6153.  
Last modified: 7 Mar 2019, 10:05:49 UTC

> It should be possible to pause the container with the following command:
> sudo /cvmfs/grid.cern.ch/vc/containers/runc --root /var/lib/boinc-client/slots/0/cernvm/ pause Theory_859210_1543416190.499432_0
>
> But I am getting the following error:
> no such directory for freezer.state
>
> If anyone has any ideas, please let me know.

Did you run the above command as user Laurence or as user boinc?
runc pause may fail if you don't have full access to cgroups.
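
A rough way to test the access question is to check, as each user in turn, whether freezer.state exists in the container's cgroup directory and is writable. The sketch below assumes that directory is already known; where runc places it on a given host is not stated in this thread, so the path has to be supplied by hand, and the function name is made up for illustration.

```python
import os
import sys


def freezer_state_status(cgroup_dir):
    """Report whether freezer.state in cgroup_dir is missing, present but
    not writable, or writable for the user running this script."""
    state_file = os.path.join(cgroup_dir, "freezer.state")
    if not os.path.isfile(state_file):
        return "missing"
    # os.access checks against the real user ID of the calling process.
    if not os.access(state_file, os.W_OK):
        return "not writable"
    return "writable"


if __name__ == "__main__" and len(sys.argv) > 1:
    # The cgroup directory is passed on the command line; it is an assumption
    # that the caller knows where the container's freezer cgroup lives.
    print(freezer_state_status(sys.argv[1]))
```

Running it once as the normal login user and once as boinc would show whether the two accounts see the file differently.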