Message boards : Theory Application : Suspend/Resume
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
A new version has just been released which I hope will make suspend/resume work. For it to work you will need to download two files with wget:

sudo wget http://lhcathomedev.cern.ch/lhcathome-dev/download/create-boinc-cgroup -O /sbin/create-boinc-cgroup
sudo wget http://lhcathomedev.cern.ch/lhcathome-dev/download/boinc-client.service -O /etc/systemd/system/boinc-client.service

Then run the following commands to pick up the changes:

sudo systemctl daemon-reload
sudo systemctl restart boinc-client

Without doing this, everything should still work apart from suspend/resume.

Edit: Note that this is so far untested.
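A quick sanity check after installing the two files (a sketch only; it assumes the client binary is named boinc and the service is boinc-client as above):

# Confirm systemd loaded the downloaded unit file
systemctl status boinc-client

# Show which cgroups the running client was placed in by create-boinc-cgroup
# (assumes the client binary is named "boinc")
cat /proc/$(pidof boinc)/cgroup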
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Tested and it doesn't work. Cranky exits and the task finishes. Will revert to the previous version.
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Just pushed out a new version and have tested that suspend/resume works. I still need to verify that the tasks terminate correctly.
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61
My first result: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2759485

LAIM = 'Leave applications in memory'

Suspend/Resume LAIM on: OK
Suspend/Resume LAIM off: OK, but processes stayed in memory, so not freeing RAM
BOINC stop/restart: processes gone, but after restart the same job starts from the beginning.

First try, before stopping BOINC, with the before-mentioned suspends/resumes:
===> [runRivet] Mon Mar 11 12:48:13 UTC 2019 [boinc ppbar mb-inelastic 500 - - pythia8 8.186 default 100000 28]

Second try, after restarting BOINC:
===> [runRivet] Mon Mar 11 12:57:48 UTC 2019 [boinc ppbar mb-inelastic 500 - - pythia8 8.186 default 100000 28]
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Thanks for testing.

"My first result: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2759485"

It looks like the signal sent is the same, so there is no way to differentiate between suspend with and without 'Leave applications in memory'.

This I expect to happen. It is the same issue for a reboot test. We would have to checkpoint rather than pause, but it looks like this feature is not yet supported in rootless containers.
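For illustration, the difference being described maps roughly onto runc's own subcommands (a sketch only; the container name theory_slot1 and the checkpoint directory are invented placeholders):

# Pause/resume: the container's processes are frozen via the cgroup freezer
# but stay in memory, so nothing survives a client restart or a reboot
runc pause theory_slot1
runc resume theory_slot1

# Checkpoint/restore: process state is dumped to disk (via CRIU) and can be
# restored later; this is the part not yet supported for rootless containers
runc checkpoint --image-path ./checkpoint theory_slot1
runc restore --image-path ./checkpoint theory_slot1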
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61
I did some other tests, now also saving my Linux VM to the host's disk with a running task. With this one I corrupted the stderr.txt (it stayed in memory), but after restoring the VM the task ran on and validated OK.

Here I suspended the task and saved the VM to disk: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2759508
After VM restore and task resume the task continues. Minor detail: the resume container time was off, because the VM clock was updated after the write to stderr.txt.

Remarks: the line "wrapper (7.15.26016): starting" is written twice, and "Container 'runc' finished with status code 0" is written with [ERROR] in front.
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61
When suspending 1 task while more tasks are running, all tasks/containers are paused, but it's not always written in every slot's stderr.txt.
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
"When suspending 1 task while more tasks are running, all tasks/containers are paused"

I have just tested this scenario too and can confirm your observation. My guess is that the issue is in the BOINC wrapper script rather than in what we are doing. Will investigate.

On a different topic: it looks like at the moment it is not possible to checkpoint a rootless container. This means that for now we can only pause a container in memory, until this feature is supported.
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
"My guess is that the issue is in the BOINC wrapper script rather than in what we are doing. Will investigate."

I stand corrected. It is runc that has the issue. Very strange.
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61
"My guess is that the issue is in the BOINC wrapper script rather than in what we are doing. Will investigate."

I knew it wasn't the wrapper. Each task has its own wrapper running.

When resuming the single task while the core is meanwhile occupied by another project's task (WCG), the task turns from suspended to 'waiting to run'. All other paused Theory tasks stay paused until the one waiting gets a chance to run again (I suspended the WCG task).
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
"I knew it wasn't the wrapper. Each task has its own wrapper running."

A new version is available which fixes this problem. The issue was that both containers were using the same cgroup, and the freeze operation is done at the cgroup level. A cgroup is now used per container/slot.
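For illustration, with the cgroup v1 freezer controller a per-slot freeze looks roughly like this (a sketch only; the path /sys/fs/cgroup/freezer/boinc/slot_1 and the PID are assumed examples, not necessarily what cranky creates):

# One freezer cgroup per container/slot, so tasks can be frozen independently
sudo mkdir -p /sys/fs/cgroup/freezer/boinc/slot_1

# Put the container's processes into that cgroup (12345 is an example PID)
echo 12345 | sudo tee /sys/fs/cgroup/freezer/boinc/slot_1/cgroup.procs

# Suspend: freeze only this slot
echo FROZEN | sudo tee /sys/fs/cgroup/freezer/boinc/slot_1/freezer.state

# Resume: thaw it again
echo THAWED | sudo tee /sys/fs/cgroup/freezer/boinc/slot_1/freezer.state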
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Note that I tested the native ATLAS application and suspending the task results in it disappearing from the process table. Resuming the task starts it from the beginning.
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61
"A new version is available which fixes this problem."

Confirmation that with cranky-0.0.28 suspend/resume works at task level. Congratulations, Laurence.
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0
I'm already using my own cgroup tree to control CPU shares for my various BOINC clients. This results in something like this:

/boinc.slice                                          # main slice to configure the overall share
/boinc.slice/boinc-atlas.slice
/boinc.slice/boinc-atlas.slice/boinc3_atlas.service   # this slice represents user "boinc3" running ATLAS
/boinc.slice/boinc-atlas.slice/boinc4_atlas.service   # this slice represents user "boinc4" running ATLAS
/boinc.slice/boinc-cpu.slice
/boinc.slice/boinc-cpu.slice/boinc1_cpu.service       # this slice represents user "boinc1" running non-LHC projects
/boinc.slice/boinc-cpu.slice/boinc2_cpu.service       # this slice represents user "boinc2" running non-LHC projects
/boinc.slice/boinc-theory.slice
/boinc.slice/boinc-theory.slice/boinc3_theory.service # this slice represents user "boinc3" running Theory/CMS
/boinc.slice/boinc-theory.slice/boinc4_theory.service # this slice represents user "boinc4" running Theory/CMS
/boinc.slice/boinc-test.slice
/boinc.slice/boinc-test.slice/boinc9_test.service     # this slice represents user "boinc9" running LHC-dev

Different slices are required to keep the balance among different projects. Different users are required to keep the balance among clients using VBox. Otherwise the first client that starts a VBox task would set up a slice and all following clients would run their VM in that slice instead of their "home slice".

It would be perfect if Theory native could be configured to be part of this tree, e.g.:

/boinc.slice/boinc-theory_native.slice
/boinc.slice/boinc-theory_native.slice/boincx_theory_native.service              # this slice represents user "boincx" running Theory native
/boinc.slice/boinc-theory_native.slice/boincx_theory_native.service/slot1.slice  # generated by Theory native
/boinc.slice/boinc-theory_native.slice/boincx_theory_native.service/slot2.slice  # generated by Theory native
/boinc.slice/boinc-theory_native.slice/boincx_theory_native.service/slotn.slice  # generated by Theory native
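A hedged sketch of how such a branch could be expressed with systemd units (unit names, file paths and the CPUShares value below are illustrative assumptions, not the poster's actual configuration):

# Illustrative only: a child slice of boinc.slice; systemd nests
# "boinc-theory_native.slice" under "boinc.slice" because of the dash
sudo tee /etc/systemd/system/boinc-theory_native.slice <<'EOF'
[Unit]
Description=Native Theory BOINC clients

[Slice]
CPUShares=512
EOF

# Illustrative only: pin one client's service into that slice via a drop-in
sudo mkdir -p /etc/systemd/system/boincx_theory_native.service.d
sudo tee /etc/systemd/system/boincx_theory_native.service.d/slice.conf <<'EOF'
[Service]
Slice=boinc-theory_native.slice
EOF

sudo systemctl daemon-reload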
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
"It would be perfect if Theory native could be configured to be part of this tree"

I think that the use of cgroups in the BOINC client is a general issue that needs discussing.
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0
On my openSUSE system net_cls and net_prio are links to "net_cls,net_prio", like cpu and cpuacct are links to "cpu,cpuacct". Hence, shouldn't the CGROUPS array in the script http://lhcathomedev.cern.ch/lhcathome-dev/download/create-boinc-cgroup look like this?

CGROUPS=( freezer cpuset devices memory "cpu,cpuacct" pids blkio hugetlb "net_cls,net_prio" perf_event freezer )

The same script may be used to prepare the cranky slots for a real single-core setting. Add the following lines at the end of the script:

chown root:boinc "$CGROUP_MOUNT/cpu,cpuacct/$CGROUP_PATH/cpu.cfs_quota_us"
chmod g+rw "$CGROUP_MOUNT/cpu,cpuacct/$CGROUP_PATH/cpu.cfs_quota_us"

The following modifications have to be done in cranky's "function create_cgroup()".

Like above:
CGROUPS=( freezer cpuset devices memory "cpu,cpuacct" pids blkio hugetlb "net_cls,net_prio" perf_event freezer )

At the end of the function add:
cat "$CGROUP_MOUNT/cpu,cpuacct/$CGROUP_PATH/cpu.cfs_period_us" >"$CGROUP_MOUNT/cpu,cpuacct/$CGROUP_PATH/cpu.cfs_quota_us"

Should result in no more "cycle stealing".
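The effect of that last line, as far as I understand it: writing the value of cpu.cfs_period_us into cpu.cfs_quota_us caps the cgroup at one full core per scheduling period instead of the default "unlimited". A minimal sketch (the slot path below is an assumed example, not the actual cranky path):

SLOT="/sys/fs/cgroup/cpu,cpuacct/boinc/slot_1"

# Default quota is -1 ("no limit"), so a slot may steal spare cycles
cat "$SLOT/cpu.cfs_quota_us"

# Cap the slot to one core: quota equal to the period
# (e.g. 100000 us of CPU time per 100000 us period)
cat "$SLOT/cpu.cfs_period_us" > "$SLOT/cpu.cfs_quota_us"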
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 61
"...The same script may be used to prepare the cranky slots for a real single-core setting."

Good digging, computezrmle. Do we want to avoid cycle stealing? I just like it. In my opinion the job may use idle cycles at BOINC nice level, even when the total CPU usage ends up being more than the elapsed run time; it just means that the CPU was not being used optimally. When rivetvm.exe and plotter.exe can use the spare cycles, it's perfect. When you have as many tasks as threads, this stealing will average out between the jobs.
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0
"...The same script may be used to prepare the cranky slots for a real single-core setting."

I expected that comment. Yes, I see pros and cons.

Pro: Let the computer do the work as fast as possible. If there are free cycles, use them.

Con: The BOINC client, the wrapper (or both?) can't really deal with apps that are booked as 1-core apps but use more resources than calculated. Example: a user knows that his 8-core computer is able to run 5 tasks without overheating. Now run 5 Theory tasks, each claiming an average of 1.4 cores => 7 cores. This may result in more heat, but the BOINC client and the user are not even aware of that.

BTW: Just saw that "freezer" appears twice in the CGROUPS array:
CGROUPS=( freezer cpuset devices memory "cpu,cpuacct" pids blkio hugetlb "net_cls,net_prio" perf_event freezer )
Joined: 28 Mar 24 Posts: 7 Credit: 12,604 RAC: 0
On Ubuntu 22.04 my computer also works on MilkyWay@home 8-CPU tasks, and I noticed that when LHC@home-dev tasks become suspended, they still use lots of CPU. On my 8-CPU machine the load average goes to 15+.

grep from boinccmd --get_tasks:

project URL: https://lhcathomedev.cern.ch/lhcathome-dev/
active_task_state: SUSPENDED
current CPU time: 23275.890000
fraction done: 0.837751

project URL: https://milkyway.cs.rpi.edu/milkyway/
active_task_state: EXECUTING
current CPU time: 1141.166000
fraction done: 0.437930

grep from htop (2nd column is CPU usage):

S 0.0 0.3 4:20.04 ├─ boinc --daemon
R 256. 0.1 24:30.12 │ ├─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h histogr
R 35.2 0.1 3:12.10 │ │ ├─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h hist
R 31.9 0.1 3:11.36 │ │ ├─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h hist
R 35.9 0.1 3:10.74 │ │ ├─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h hist
R 35.2 0.1 3:09.77 │ │ ├─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h hist
R 32.5 0.1 3:09.30 │ │ ├─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h hist
R 35.9 0.1 3:09.25 │ │ ├─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h hist
S 0.0 0.1 0:00.08 │ │ └─ ../../projects/milkyway.cs.rpi.edu_milkyway/milkyway_nbody_1.83_x86_64-pc-linux-gnu__mt -f nbody_parameters.lua -h hist
S 0.0 0.0 0:20.49 │ ├─ ../../projects/lhcathomedev.cern.ch_lhcathome-dev/wrapper_2019_03_02_x86_64-linux
S 0.0 0.0 0:03.09 │ │ ├─ ../../projects/lhcathomedev.cern.ch_lhcathome-dev/wrapper_2019_03_02_x86_64-linux
S 0.0 0.0 0:00.05 │ │ └─ /bin/bash ../../projects/lhcathomedev.cern.ch_lhcathome-dev/cranky-0.1.4
S 0.0 0.0 0:00.02 │ │ └─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
S 0.0 0.0 0:00.01 │ │ ├─ /bin/bash ./job
S 0.0 0.0 0:10.25 │ │ │ └─ /bin/bash ./runRivet.sh boinc pp z1j 8000 - - sherpa 2.2.9 default 5000 43
S 0.0 0.0 0:00.00 │ │ │ ├─ /bin/bash ./runRivet.sh boinc pp z1j 8000 - - sherpa 2.2.9 default 5000 43
S 0.0 0.0 0:00.01 │ │ │ │ └─ /bin/bash ./rungen.sh boinc pp z1j 8000 - - sherpa 2.2.9 default 5000 43 /shared/tmp/tmp.ArGknGQnH2/gene
R 98.9 0.2 6h29:54 │ │ │ │ └─ /cvmfs/sft.cern.ch/lcg/releases/LCG_96/MCGenerators/sherpa/2.2.9/x86_64-centos7-gcc8-opt/bin/Sherpa -
S 0.0 0.0 0:00.00 │ │ │ ├─ /bin/bash ./runRivet.sh boinc pp z1j 8000 - - sherpa 2.2.9 default 5000 43
S 0.0 0.0 0:00.58 │ │ │ │ └─ /shared/rivetvm/rivetvm.exe -a ATLAS_2019_I1744201 -i /shared/tmp/tmp.ArGknGQnH2/generator.hepmc -o /sha
S 0.0 0.0 0:00.00 │ │ │ └─ sleep 3
S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
S 0.0 0.0 0:00.00 │ │ ├─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2
S 0.0 0.0 0:00.00 │ │ └─ /cvmfs/grid.cern.ch/vc/containers/runc.new --root state run -b cernvm Theory_2743-2857775-43_2

I'm not sure whether this somehow influences the quality of the results, but I just wanted to mention it.
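One hedged way to dig into such a case is to check whether the slot's freezer cgroup really reports FROZEN and which child process keeps running (the cgroup paths below are assumed examples for a cgroup v1 layout; the real paths depend on how create-boinc-cgroup/cranky set things up):

# Show the freezer state of every BOINC slot cgroup
grep -H . /sys/fs/cgroup/freezer/boinc/*/freezer.state 2>/dev/null

# List the processes of one slot cgroup together with their CPU usage
# (slot_1 is a placeholder)
ps -o pid,stat,pcpu,etime,comm -p "$(paste -sd, /sys/fs/cgroup/freezer/boinc/slot_1/cgroup.procs)"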
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 0
Please make your computers visible to other volunteers here: https://lhcathomedev.cern.ch/lhcathome-dev/prefs.php?subset=project