Thread 'Suspend/Resume'

Author	Message
Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1132 Credit: 339,231 RAC: 0	Message 6186 - Posted: 11 Mar 2019, 9:57:59 UTC Last modified: 11 Mar 2019, 13:29:54 UTC
A new version has just been released which I hope will make suspend/resume work. For it to work you will need to download two files with wget:
    sudo wget http://lhcathomedev.cern.ch/lhcathome-dev/download/create-boinc-cgroup -O /sbin/create-boinc-cgroup
    sudo wget http://lhcathomedev.cern.ch/lhcathome-dev/download/boinc-client.service -O /etc/systemd/system/boinc-client.service
Then run the following commands to pick up the changes:
    sudo systemctl daemon-reload
    sudo systemctl restart boinc-client
Without doing this, everything should still work apart from suspend/resume.
Edit: Note that this is so far untested.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1132 Credit: 339,231 RAC: 0	Message 6187 - Posted: 11 Mar 2019, 10:04:36 UTC - in response to Message 6186.
Tested and it doesn't work. Cranky exits and the task finishes. Will revert to the previous version.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1132 Credit: 339,231 RAC: 0	Message 6189 - Posted: 11 Mar 2019, 12:18:00 UTC - in response to Message 6187.
Just pushed out a new version and have tested that suspend/resume works. I still need to verify that the task terminates correctly.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1226 Credit: 942,877 RAC: 214	Message 6190 - Posted: 11 Mar 2019, 13:14:39 UTC Last modified: 11 Mar 2019, 13:20:14 UTC
My first result: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2759485
LAIM = 'Leave applications in memory'
Suspend/Resume LAIM on: OK
Suspend/Resume LAIM off: OK, but processes stayed in memory, so not freeing RAM
BOINC stop / restart: Processes gone, but after restart the same job starts from the beginning.
First try, before stopping BOINC, with the aforementioned suspends/resumes:
    ===> [runRivet] Mon Mar 11 12:48:13 UTC 2019 [boinc ppbar mb-inelastic 500 - - pythia8 8.186 default 100000 28]
Second try, after restarting BOINC:
    ===> [runRivet] Mon Mar 11 12:57:48 UTC 2019 [boinc ppbar mb-inelastic 500 - - pythia8 8.186 default 100000 28]

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1132 Credit: 339,231 RAC: 0	Message 6191 - Posted: 11 Mar 2019, 13:29:06 UTC - in response to Message 6190.
Thanks for testing.
> My first result: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2759485
> LAIM = 'Leave applications in memory'
> Suspend/Resume LAIM on: OK
> Suspend/Resume LAIM off: OK, but processes stayed in memory, so not freeing RAM
It looks like the signal sent is the same, so there is no way to differentiate.
> BOINC stop / restart: Processes gone, but after restart the same job starts from the beginning.
This I expect to happen. It is the same issue for a reboot test. We would have to checkpoint rather than pause, but it looks like this feature is not yet supported in rootless containers.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1226 Credit: 942,877 RAC: 214	Message 6192 - Posted: 11 Mar 2019, 14:57:50 UTC Last modified: 11 Mar 2019, 15:38:57 UTC
I did some other tests, now also saving my Linux VM to the host's disk with a running task. With this one I corrupted the stderr.txt (stayed in memory), but after restoring the VM the task ran on and validated OK.
Here I suspended the task and saved the VM to disk: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2759508
After VM-restore and task-resume the task continues. Minor detail: the resume container time was off, because the VM clock was updated after the write to stderr.txt.
Remarks: The line "wrapper (7.15.26016): starting" is written twice, and "Container 'runc' finished with status code 0" is written with [ERROR] in front.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1226 Credit: 942,877 RAC: 214	Message 6193 - Posted: 11 Mar 2019, 15:54:04 UTC Last modified: 11 Mar 2019, 16:43:25 UTC
When suspending one task while more tasks are running, all tasks/containers are paused, but this is not always written to every slot's stderr.txt.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1132 Credit: 339,231 RAC: 0	Message 6194 - Posted: 12 Mar 2019, 8:54:32 UTC - in response to Message 6193.
> When suspending 1 task with more tasks running, all tasks/containers are paused, but it's not always written in every slot stderr.txt
I have just tested this scenario too and can confirm your observation. My guess is that the issue is in the BOINC wrapper script rather than what we are doing. Will investigate.
On a different topic, it looks like at the moment it is not possible to checkpoint a rootless container. This means that, until this feature is supported, we can only pause a container in memory.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1132 Credit: 339,231 RAC: 0	Message 6195 - Posted: 12 Mar 2019, 9:01:37 UTC - in response to Message 6194.
> My guess is that the issue is in the BOINC wrapper script rather than what we are doing. Will investigate.
I stand corrected. It is runc that has the issue. Very strange.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1226 Credit: 942,877 RAC: 214	Message 6197 - Posted: 12 Mar 2019, 9:40:19 UTC - in response to Message 6195.
>> My guess is that the issue is in the BOINC wrapper script rather than what we are doing. Will investigate.
> I stand corrected. It is runc that has the issue. Very strange.
I knew it wasn't the wrapper. Each task has its own wrapper running.
When resuming the single task while the core is meanwhile occupied by another project's task (WCG), the task turns from suspended to waiting to run. All other paused Theory tasks stay paused until the one waiting gets a chance to run again (which I achieved by suspending the WCG task).

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1132 Credit: 339,231 RAC: 0	Message 6198 - Posted: 12 Mar 2019, 10:18:59 UTC - in response to Message 6197.
>>> My guess is that the issue is in the BOINC wrapper script rather than what we are doing. Will investigate.
>> I stand corrected. It is runc that has the issue. Very strange.
> I knew it wasn't the wrapper. Each task has its own wrapper running. When resuming the single task while the core is meanwhile occupied by another project's task (WCG), the task turns from suspended to waiting to run. All other paused Theory tasks stay paused until the one waiting gets a chance to run again (which I achieved by suspending the WCG task).
A new version is available that fixes this problem. The issue was that both containers were using the same cgroup, and the freeze operation is done at the cgroup level. A cgroup is now used per container/slot.
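
To illustrate why one cgroup per container/slot isolates the freeze, here is a minimal sketch, assuming the cgroup v1 freezer controller, whose freezer.state file accepts FROZEN and THAWED. The "boinc/<slot>" directory layout and the function names are hypothetical and are not taken from cranky itself.

```shell
# Sketch only: per-slot freezing with the cgroup v1 freezer controller.
# CGROUP_MOUNT and the "boinc/<slot>" layout are assumptions for illustration.
CGROUP_MOUNT="${CGROUP_MOUNT:-/sys/fs/cgroup}"

# Print the freezer directory that belongs to one BOINC slot (hypothetical layout).
slot_freezer_dir() {
    printf '%s/freezer/boinc/%s\n' "$CGROUP_MOUNT" "$1"
}

# Freeze every process in the given slot's cgroup; other slots are untouched.
freeze_slot() {
    echo FROZEN > "$(slot_freezer_dir "$1")/freezer.state"
}

# Thaw the given slot's cgroup again.
thaw_slot() {
    echo THAWED > "$(slot_freezer_dir "$1")/freezer.state"
}
```

With a shared cgroup, a single write of FROZEN would stop every container in it, which matches the behaviour reported earlier in the thread. With one directory per slot, `freeze_slot 0` leaves slot 1 running.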

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1132 Credit: 339,231 RAC: 0	Message 6199 - Posted: 12 Mar 2019, 10:20:23 UTC - in response to Message 6198.
Note that I tested the native ATLAS application and suspending the task results in it disappearing from the process table. Resuming the task starts it from the beginning.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1226 Credit: 942,877 RAC: 214	Message 6200 - Posted: 12 Mar 2019, 10:53:46 UTC - in response to Message 6198.
> A new version is available that fixes this problem.
Confirmation that with cranky-0.0.28 suspend/resume works at task level. Congratulations, Laurence.

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 519 Credit: 400,710 RAC: 1	Message 6201 - Posted: 12 Mar 2019, 11:09:39 UTC
I'm already using my own cgroup tree to control CPU shares for my various BOINC clients. This results in something like this:
    /boinc.slice    # main slice to configure the overall share
    /boinc.slice/boinc-atlas.slice
    /boinc.slice/boinc-atlas.slice/boinc3_atlas.service    # this slice represents user "boinc3" running ATLAS
    /boinc.slice/boinc-atlas.slice/boinc4_atlas.service    # this slice represents user "boinc4" running ATLAS
    /boinc.slice/boinc-cpu.slice
    /boinc.slice/boinc-cpu.slice/boinc1_cpu.service    # this slice represents user "boinc1" running non LHC projects
    /boinc.slice/boinc-cpu.slice/boinc2_cpu.service    # this slice represents user "boinc2" running non LHC projects
    /boinc.slice/boinc-theory.slice
    /boinc.slice/boinc-theory.slice/boinc3_theory.service    # this slice represents user "boinc3" running Theory/CMS
    /boinc.slice/boinc-theory.slice/boinc4_theory.service    # this slice represents user "boinc4" running Theory/CMS
    /boinc.slice/boinc-test.slice
    /boinc.slice/boinc-test.slice/boinc9_test.service    # this slice represents user "boinc9" running LHC-dev
Different slices are required to keep the balance among different projects.
Different users are required to keep the balance among clients using VBox. Otherwise the first client that starts a VBox task would set up a slice and all following clients would run their VM in that slice instead of their "home slice".
It would be perfect if Theory native could be configured to be part of this tree, e.g.:
    /boinc.slice/boinc-theory_native.slice
    /boinc.slice/boinc-theory_native.slice/boincx_theory_native.service    # this slice represents user "boincx" running Theory native
    /boinc.slice/boinc-theory_native.slice/boincx_theory_native.service/slot1.slice    # generated by Theory native
    /boinc.slice/boinc-theory_native.slice/boincx_theory_native.service/slot2.slice    # generated by Theory native
    /boinc.slice/boinc-theory_native.slice/boincx_theory_native.service/slotn.slice    # generated by Theory native

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1132 Credit: 339,231 RAC: 0	Message 6203 - Posted: 12 Mar 2019, 12:00:13 UTC - in response to Message 6201.
> It would be perfect if Theory native could be configured to be part of this tree, e.g.:
I think that the use of cgroups in the BOINC client is a general issue that needs discussing.

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 519 Credit: 400,710 RAC: 1	Message 6263 - Posted: 26 Mar 2019, 15:02:19 UTC
On my openSUSE system net_cls and net_prio are links to "net_cls,net_prio", just as cpu and cpuacct are links to "cpu,cpuacct".
Hence, shouldn't the CGROUPS array in the script http://lhcathomedev.cern.ch/lhcathome-dev/download/create-boinc-cgroup look like this?
    CGROUPS=( freezer cpuset devices memory "cpu,cpuacct" pids blkio hugetlb "net_cls,net_prio" perf_event freezer )
The same script may be used to prepare the cranky slots for a real single-core setting. Add the following lines at the end of the script:
    chown root:boinc "$CGROUP_MOUNT/cpu,cpuacct/$CGROUP_PATH/cpu.cfs_quota_us"
    chmod g+rw "$CGROUP_MOUNT/cpu,cpuacct/$CGROUP_PATH/cpu.cfs_quota_us"
The following modifications have to be made in cranky's "function create_cgroup()".
As above:
    CGROUPS=( freezer cpuset devices memory "cpu,cpuacct" pids blkio hugetlb "net_cls,net_prio" perf_event freezer )
At the end of the function add:
    cat "$CGROUP_MOUNT/cpu,cpuacct/$CGROUP_PATH/cpu.cfs_period_us" >"$CGROUP_MOUNT/cpu,cpuacct/$CGROUP_PATH/cpu.cfs_quota_us"
This should result in no more "cycle stealing".
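
The single-core cap proposed above can be sketched as follows, assuming cgroup v1 with CFS bandwidth control, where a quota equal to the period allows at most one full CPU per period. The function names `controller_dir` and `cap_to_one_core` are hypothetical helpers for illustration and are not part of cranky or create-boinc-cgroup.

```shell
# Sketch only: cap one cgroup at a single core (cgroup v1 CFS bandwidth control).

# Resolve a controller name such as "cpu" to its real directory, so that the
# symlink "cpu" -> "cpu,cpuacct" and the combined name both work.
# Uses GNU readlink -f.
controller_dir() {
    readlink -f "$1/$2"
}

# Set the quota equal to the period for the cgroup directory passed as $1,
# e.g. "$CGROUP_MOUNT/cpu,cpuacct/$CGROUP_PATH".
cap_to_one_core() {
    local dir="$1" period
    period="$(cat "$dir/cpu.cfs_period_us")" || return 1
    echo "$period" > "$dir/cpu.cfs_quota_us"
}
```

A call such as `cap_to_one_core "$(controller_dir "$CGROUP_MOUNT" cpu)/$CGROUP_PATH"` would then have the same effect as the `cat ... >` line in the post, while tolerating either naming of the controller directory.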

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1226 Credit: 942,877 RAC: 214	Message 6266 - Posted: 26 Mar 2019, 19:09:56 UTC - in response to Message 6263.
> ...The same script may be used to prepare the cranky slots for a real single core setting. .. .. Should result in no more "cycle stealing".
Good digging, computezrmle.
Do we want to avoid cycle stealing? I just like it. In my opinion the job may use idle cycles at BOINC nice level even when the total CPU usage ends up higher than the elapsed run time. It just means that the CPU is not used optimally.
When rivetvm.exe and plotter.exe can use the spare cycles, it's perfect. When you have as many tasks as threads, this stealing will average out between the jobs.

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 519 Credit: 400,710 RAC: 1	Message 6267 - Posted: 26 Mar 2019, 19:31:26 UTC - in response to Message 6266.
>> ...The same script may be used to prepare the cranky slots for a real single core setting. .. .. Should result in no more "cycle stealing".
> Good digging, computezrmle. Do we want to avoid cycle stealing? I just like it. In my opinion the job may use idle cycles at BOINC nice level even when the total CPU usage ends up higher than the elapsed run time. It just means that the CPU is not used optimally. When rivetvm.exe and plotter.exe can use the spare cycles, it's perfect. When you have as many tasks as threads, this stealing will average out between the jobs.
I expected that comment. Yes, I see pros and cons.
Pro: Let the computer do the work as fast as possible. If there are free cycles, use them.
Con: The BOINC client, the wrapper (or both?) can't really deal with apps that are booked in as 1-core apps but use more resources than calculated.
Example: A user knows that his 8-core computer is able to run 5 tasks without overheating. Now, run 5 Theory tasks each claiming an average of 1.4 cores => 7 cores. This may result in more heat, but the BOINC client and the user are not even aware of that.
BTW: Just saw that "freezer" appears twice in the CGROUPS array.
    CGROUPS=( freezer cpuset devices memory "cpu,cpuacct" pids blkio hugetlb "net_cls,net_prio" perf_event freezer )
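
One way to guard against the duplicated "freezer" entry is to de-duplicate the controller list before looping over it. The following is a minimal sketch; `dedupe` is a hypothetical helper, not something that exists in cranky or create-boinc-cgroup.

```shell
# Sketch only: print each argument once, in first-seen order, one per line.
# awk keeps a counter per distinct line and passes a line through only the
# first time it is seen.
dedupe() {
    printf '%s\n' "$@" | awk '!seen[$0]++'
}
```

In a bash script the array could then be rebuilt with `mapfile -t CGROUPS < <(dedupe "${CGROUPS[@]}")`. Names containing a comma such as "cpu,cpuacct" pass through unchanged because each entry is treated as a whole line.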

rilian Joined: 28 Mar 24 Posts: 7 Credit: 12,604 RAC: 0	Message 8390 - Posted: 5 Apr 2024, 20:43:05 UTC - in response to Message 6267.
On Ubuntu 22.04 my computer also runs MilkyWay@home 8-CPU tasks, and I noticed that when LHC@home-dev tasks become suspended, they still use lots of CPU.
On my 8-CPU machine the load average goes to 15+.
grep from boinccmd --get_tasks:
    project URL: https://lhcathomedev.cern.ch/lhcathome-dev/
    active_task_state: SUSPENDED
    current CPU time: 23275.890000
    fraction done: 0.837751
    project URL: https://milkyway.cs.rpi.edu/milkyway/
    active_task_state: EXECUTING
    current CPU time: 1141.166000
    fraction done: 0.437930
grep from htop (2nd column is CPU usage):
    S 0.0 0.3 4:20.04 ├─ boinc --daemon
    R 256. 0.1 24:30.12 │ ├─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h histogr
    R 35.2 0.1 3:12.10 │ │ ├─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h hist
    R 31.9 0.1 3:11.36 │ │ ├─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h hist
    R 35.9 0.1 3:10.74 │ │ ├─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h hist
    R 35.2 0.1 3:09.77 │ │ ├─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h hist
    R 32.5 0.1 3:09.30 │ │ ├─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h hist
    R 35.9 0.1 3:09.25 │ │ ├─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h hist
    S 0.0 0.1 0:00.08 │ │ └─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h hist
    S 0.0 0.0 0:20.49 │ ├─ ../../projects/lhcathomedev.cern.ch_lhcathome-dev/wrapper_2019_03_02_x86_64-linux
    S 0.0 0.0 0:03.09 │ │ ├─ ../../projects/lhcathomedev.cern.ch_lhcathome-dev/wrapper_2019_03_02_x86_64-linux
    S 0.0 0.0 0:00.05 │ │ └─ /bin/bash ../../projects/lhcathomedev.cern.ch_lhcathome-dev/cranky-0.1.4
    S 0.0 0.0 0:00.02 │ │ └─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
    S 0.0 0.0 0:00.01 │ │ ├─ /bin/bash ./job
    S 0.0 0.0 0:10.25 │ │ │ └─ /bin/bash ./runRivet.sh boinc pp z1j 8000 - - sherpa 2.2.9 default 5000 43
    S 0.0 0.0 0:00.00 │ │ │ ├─ /bin/bash ./runRivet.sh boinc pp z1j 8000 - - sherpa 2.2.9 default 5000 43
    S 0.0 0.0 0:00.01 │ │ │ │ └─ /bin/bash ./rungen.sh boinc pp z1j 8000 - - sherpa 2.2.9 default 5000 43 /shared/tmp/tmp.ArGknGQnH2/gene
    R 98.9 0.2 6h29:54 │ │ │ │ └─ /cvmfs/sft.cern.ch/lcg/releases/LCG_96/MCGenerators/sherpa/2.2.9/x86_64-centos7-gcc8-opt/bin/Sherpa -
    S 0.0 0.0 0:00.00 │ │ │ ├─ /bin/bash ./runRivet.sh boinc pp z1j 8000 - - sherpa 2.2.9 default 5000 43
    S 0.0 0.0 0:00.58 │ │ │ │ └─ /shared/rivetvm/rivetvm.exe -a ATLAS_2019_I1744201 -i /shared/tmp/tmp.ArGknGQnH2/generator.hepmc -o /sha
    S 0.0 0.0 0:00.00 │ │ │ └─ sleep 3
    S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
    S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
    S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
    S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
    S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
    S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
    S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
    S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
    S 0.0 0.0 0:00.00 │ │ └─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
I'm not sure whether this somehow influences the quality of the results, but I just wanted to mention it.
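
To spot this situation without reading the whole task list, the suspended tasks can be filtered out of the `boinccmd --get_tasks` output. This is a minimal sketch that relies only on the "project URL:" and "active_task_state:" labels visible in the excerpt above; `suspended_projects` is a hypothetical helper name, and it assumes that each task's "project URL:" line precedes its "active_task_state:" line, as in that excerpt.

```shell
# Sketch only: read `boinccmd --get_tasks` output on stdin and print the
# project URL of every task whose active_task_state is SUSPENDED.
suspended_projects() {
    awk '
        { sub(/[ \t\r]+$/, "") }                  # drop trailing whitespace
        /project URL:/ {
            sub(/^.*project URL:[ \t]*/, "")      # keep only the URL
            url = $0
            next
        }
        /active_task_state:/ {
            sub(/^.*active_task_state:[ \t]*/, "")  # keep only the state
            if ($0 == "SUSPENDED") print url
        }
    '
}
```

Running `boinccmd --get_tasks | suspended_projects` lists the projects that BOINC considers suspended. Any process of such a project that still shows high CPU usage in htop, like the Sherpa process at 98.9 above, is then worth a closer look.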

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 519 Credit: 400,710 RAC: 1	Message 8391 - Posted: 5 Apr 2024, 21:27:53 UTC - in response to Message 8390.
Please make your computers visible to other volunteers here: https://lhcathomedev.cern.ch/lhcathome-dev/prefs.php?subset=project

Development for LHC@home