ATLAS native 1.00
Author | Message |
---|---|
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
I suspect the SIGBUS problems come from something missing when running in a standard CentOS 7 host without singularity. So in 1.00 I've added singularity again for all operating systems. @maeax please let me know if this works for you. |
Joined: 22 Apr 16 Posts: 622 Credit: 1,517,646 RAC: 2 |
This task is now running version 1.00: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2861262 Edit: The task has now been running for more than 25 minutes with 2 CPUs. I will see tomorrow how it finished. |
Joined: 22 Apr 16 Posts: 622 Credit: 1,517,646 RAC: 2 |
This is the second computer running ATLAS native: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2861272 Both computers have Docker installed. The first task finished successfully after one hour of CPU time. The second task was also successful. |
Joined: 22 Apr 16 Posts: 622 Credit: 1,517,646 RAC: 2 |
Both computers finished tasks overnight: https://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=4064 https://lhcathomedev.cern.ch/lhcathome-dev/show_host_detail.php?hostid=3848 |
Joined: 13 Feb 15 Posts: 1156 Credit: 757,368 RAC: 6 |
I ran 2 tasks, not on CentOS but on Ubuntu 18.10, and both seem to be valid. I suspended the first one https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2861499 for a longer time in BOINC. After resuming, it finished almost immediately. It looks like it kept running in the background while suspended; compare Run time with CPU time. I ran the second one https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2861513 in one go without pausing. |
Joined: 13 Feb 15 Posts: 1156 Credit: 757,368 RAC: 6 |
I tested the (no)suspend problem again. This should be solved before going into production. Task: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2861285 |
Joined: 22 Apr 16 Posts: 622 Credit: 1,517,646 RAC: 2 |
On a computer with 3 CPUs I sometimes get one task with two CPUs, one with one CPU, or one with three CPUs. They all finish correctly. Suspending is not a problem here, even with Theory native and ATLAS native running in parallel on one computer. Why is this a problem for you, Crystal? |
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
"I tested the (no)suspend problem again. This should be solved before going into production." I tested this and I see the same problem if I have checked "Leave non-GPU tasks in memory while suspended" in the computing preferences. If I don't check this then the processes are killed on suspend. Do you see the same behaviour with production WUs? I am not sure how the bootstrap script changes would make a difference to this behaviour. |
Joined: 13 Feb 15 Posts: 1156 Credit: 757,368 RAC: 6 |
"I tested this and I see the same problem if I have checked 'Leave non-GPU tasks in memory while suspended' in the computing preferences. If I don't check this then the processes are killed on suspend." It was working well here and at LHC-production when LAIM (leave in memory) was on. The container went into a paused state. IIRC, when the task is not left in memory, the job restarts from scratch after resume, as it does now with this version. |
Joined: 13 Feb 15 Posts: 1156 Credit: 757,368 RAC: 6 |
I tested with a LHC-production task and suspending with LAIM on doesn't pause the athena.py processes either. I also tested with LAIM off. I should not have done that: 79 events were processed for nothing, and event processing started from the beginning after resume. Theory behavior: LAIM on or off doesn't matter. In both cases the job is suspended, with the processes waiting in memory and the container paused. https://lhcathome.cern.ch/lhcathome/result.php?resultid=259766289 After a BOINC restart the job starts from scratch. |
Joined: 22 Apr 16 Posts: 622 Credit: 1,517,646 RAC: 2 |
This task finished successfully: https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=1973470 Was this run without using Docker? |
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
"I tested with a LHC-production task and suspending with LAIM on doesn't pause the athena.py processes either." So the new version is not worse than the old one? :) I agree that it would be good to fix this problem. However, any suspension of native tasks should be avoided, because they will always restart from the beginning when resumed: the ATLAS software does not support any checkpointing. |
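
For anyone who wants to repeat the suspend test discussed above, here is a minimal Python sketch that lists processes whose command line contains `athena.py` and reports their kernel state from `/proc`. It is not part of the ATLAS or BOINC software. It assumes a Linux host, and it assumes that a properly paused process would show up in the stopped state (`T`), which this thread does not confirm. The function names `parse_stat_state` and `find_processes` are made up for this example.

```python
import os


def parse_stat_state(stat_line):
    """Return the one-letter state field from a /proc/<pid>/stat line.

    The command name sits in parentheses and may itself contain spaces
    or parentheses, so split on the LAST ')' before reading the state.
    """
    rest = stat_line.rsplit(")", 1)[1]
    return rest.split()[0]


def find_processes(pattern, proc_root="/proc"):
    """Yield (pid, state, cmdline) for processes whose cmdline contains pattern."""
    if not os.path.isdir(proc_root):
        return  # not a Linux-style /proc, nothing to report
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        base = os.path.join(proc_root, entry)
        try:
            with open(os.path.join(base, "cmdline"), "rb") as f:
                raw = f.read()
            cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
            if pattern not in cmdline:
                continue
            with open(os.path.join(base, "stat"), errors="replace") as f:
                state = parse_stat_state(f.read())
        except OSError:
            continue  # process exited meanwhile or is not readable
        yield int(entry), state, cmdline


if __name__ == "__main__":
    # Assumption: 'T' (stopped) or 't' (tracing stop) means the process is paused.
    for pid, state, cmdline in find_processes("athena.py"):
        status = "stopped" if state in ("T", "t") else "NOT stopped"
        print(pid, state, status, cmdline[:80])
```

Run it once while the task is suspended in BOINC. If the athena.py processes are listed as `R` or `S` rather than `T`, they were probably not paused, which would match the Run time versus CPU time mismatch reported earlier in the thread.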
©2023 CERN