Message boards : ATLAS Application : ATLAS native 1.00

David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 6957 - Posted: 20 Jan 2020, 18:14:52 UTC

I suspect the SIGBUS problems come from something missing when running on a standard CentOS 7 host without Singularity. So in 1.00 I've added Singularity again for all operating systems. @maeax, please let me know if this works for you.
ID: 6957 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 659
Credit: 1,719,912
RAC: 3,195
Message 6958 - Posted: 20 Jan 2020, 18:28:39 UTC - in response to Message 6957.  
Last modified: 20 Jan 2020, 18:52:07 UTC

This task is now running version 1.00:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2861262
Edit: The task has now been running for more than 25 minutes with 2 CPUs.
I will see tomorrow how it finished.
ID: 6958 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 659
Credit: 1,719,912
RAC: 3,195
Message 6959 - Posted: 20 Jan 2020, 21:36:43 UTC - in response to Message 6958.  
Last modified: 20 Jan 2020, 22:24:20 UTC

This is the second computer running ATLAS native:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2861272
Both computers have Docker installed.
The first task finished successfully after one hour of CPU time.
The second task was also successful.
ID: 6959 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 659
Credit: 1,719,912
RAC: 3,195
Message 6960 - Posted: 21 Jan 2020, 5:15:09 UTC

ID: 6960 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 6961 - Posted: 21 Jan 2020, 12:58:06 UTC

I ran 2 tasks, not on CentOS but on Ubuntu 18.10, and both seem to be valid.
I suspended the first one, https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2861499, for a longer time in BOINC.
After I resumed it, it finished almost immediately. It looks like it keeps running in the background while suspended; compare Run time with CPU time.
The second one, https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2861513, I ran in one go without pausing.
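The Run time versus CPU time comparison can be sketched as a small check. This is only an illustration: it assumes that BOINC's reported Run time does not advance while a task is suspended, whereas CPU time keeps accumulating if the processes keep running. The function name, the tolerance, and the numbers in the usage lines are hypothetical and are not taken from the linked results.

```python
def ran_while_suspended(run_time_s, cpu_time_s, n_cpus, tolerance=1.05):
    """Return True if the CPU time is more than n_cpus cores could have
    delivered during the reported run time (with a small tolerance)."""
    if run_time_s <= 0 or n_cpus <= 0:
        raise ValueError("run_time_s and n_cpus must be positive")
    # The most CPU time the task could accumulate while BOINC counted it as running.
    max_cpu_time_s = run_time_s * n_cpus * tolerance
    return cpu_time_s > max_cpu_time_s


if __name__ == "__main__":
    # Hypothetical values: 10 min run time, 2 h CPU time, 2 cores.
    print(ran_while_suspended(600, 7200, 2))
    # Hypothetical values: 1 h run time, about 1.8 h CPU time, 2 cores.
    print(ran_while_suspended(3600, 6500, 2))
```

A CPU time far above run time multiplied by the core count suggests the job kept computing while BOINC showed it as suspended.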
ID: 6961 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 6962 - Posted: 22 Jan 2020, 11:57:13 UTC

I tested the (no)suspend problem again. This should be solved before going into production.



Task: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2861285
ID: 6962 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 659
Credit: 1,719,912
RAC: 3,195
Message 6963 - Posted: 22 Jan 2020, 15:50:14 UTC - in response to Message 6962.  

On a computer with 3 CPUs I sometimes get one task using two CPUs, sometimes one using one CPU, and sometimes one using three CPUs.
They all finish correctly. Suspending is not a problem, even with Theory native and ATLAS native running in parallel on one computer.
Why is this a problem for you, Crystal?
ID: 6963 · Rating: 0
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 6965 - Posted: 23 Jan 2020, 11:18:17 UTC - in response to Message 6962.  

I tested the (no)suspend problem again. This should be solved before going into production.


I tested this and I see the same problem if I have checked "Leave non-GPU tasks in memory while suspended" in the computing preferences. If I don't check this then the processes are killed on suspend.

Do you see the same behaviour with production WU? I am not sure how the bootstrap script changes would make a difference to this behaviour.
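For anyone who wants to check which setting their client is using, here is a minimal sketch that reads the "leave in memory" flag from a BOINC preferences document. It assumes the preference is stored as a `<leave_apps_in_memory>` element holding 0 or 1, as in the client's `global_prefs.xml` / `global_prefs_override.xml`, and it treats a missing element as off. The function name and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def laim_enabled(prefs_xml: str) -> bool:
    """Return True if <leave_apps_in_memory> is present and non-zero."""
    root = ET.fromstring(prefs_xml)
    node = root.find(".//leave_apps_in_memory")
    if node is None or node.text is None or not node.text.strip():
        # Assumption: an absent or empty element means the option is off.
        return False
    return float(node.text.strip()) != 0


# Hypothetical sample, not copied from a real host.
SAMPLE_PREFS = """<global_preferences>
  <leave_apps_in_memory>1</leave_apps_in_memory>
</global_preferences>"""

if __name__ == "__main__":
    print(laim_enabled(SAMPLE_PREFS))
```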
ID: 6965 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 6966 - Posted: 23 Jan 2020, 12:02:34 UTC - in response to Message 6965.  

I tested this and I see the same problem if I have checked "Leave non-GPU tasks in memory while suspended" in the computing preferences. If I don't check this then the processes are killed on suspend.
Do you see the same behaviour with production WU? I am not sure how the bootstrap script changes would make a difference to this behaviour.

It worked well here and on LHC production when LAIM ("leave tasks in memory") was on: the container went into a paused state.
IIRC, when tasks are not left in memory, the job restarts from scratch after resume, as it does now with this version.
ID: 6966 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 6967 - Posted: 23 Jan 2020, 14:41:42 UTC
Last modified: 23 Jan 2020, 15:39:29 UTC

I tested with an LHC-production task, and suspending with LAIM on doesn't pause the athena.py processes either.
I also tested with LAIM off. I should not have done that: 79 events were processed for nothing, and processing started from the beginning after resume.

Theory behavior:
LAIM on or off doesn't matter: in both cases the job is suspended, the processes wait in memory, and the container is paused.
https://lhcathome.cern.ch/lhcathome/result.php?resultid=259766289
After a BOINC restart the job starts from scratch.
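To make "paused" concrete at the process level, here is a minimal sketch of pausing and resuming a child process with SIGSTOP and SIGCONT on a POSIX system. It is not taken from the BOINC, Theory, or ATLAS code; whether those wrappers use these signals is an assumption, since the thread only reports that the container ends up paused. The function name and the sleeping child process are hypothetical.

```python
import os
import signal
import subprocess
import sys


def pause_and_resume(cmd):
    """Start cmd, pause it with SIGSTOP, confirm it stopped, then resume it.

    Returns True if the child was observed in the stopped state.
    """
    proc = subprocess.Popen(cmd)
    try:
        os.kill(proc.pid, signal.SIGSTOP)
        # WUNTRACED makes waitpid report a child that stopped but did not exit.
        _, status = os.waitpid(proc.pid, os.WUNTRACED)
        was_stopped = os.WIFSTOPPED(status)
        os.kill(proc.pid, signal.SIGCONT)
    finally:
        # Clean up the demo child whether or not the checks succeeded.
        proc.kill()
        proc.wait()
    return was_stopped


if __name__ == "__main__":
    child = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(pause_and_resume(child))
```

A stopped process keeps its memory but uses no CPU, which matches the "processes waiting in memory" description. A process that never receives the stop keeps computing, as reported for athena.py.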
ID: 6967 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 659
Credit: 1,719,912
RAC: 3,195
Message 6968 - Posted: 24 Jan 2020, 9:56:15 UTC

This task finished successfully:
https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=1973470
Was this run without using Docker?
ID: 6968 · Rating: 0
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 6969 - Posted: 24 Jan 2020, 12:48:58 UTC - in response to Message 6967.  

I tested with an LHC-production task, and suspending with LAIM on doesn't pause the athena.py processes either.


So the new version is not worse than the old one? :)

I agree that it would be good to fix this problem. However, any suspension of native tasks should be avoided, because they will always restart from the beginning when resumed: ATLAS software does not support any checkpointing.
ID: 6969 · Rating: 0



©2024 CERN