Message boards : ATLAS Application : ATLAS native 1.01

David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 6976 - Posted: 6 Feb 2020, 10:37:30 UTC

computezrmle has provided a better method of stderr logging for the new native script, which is now in version 1.01. I have injected a few test tasks here to make sure everything is ok before moving to the production server.
ID: 6976 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 3
Message 6977 - Posted: 6 Feb 2020, 10:50:20 UTC - in response to Message 6976.  
Last modified: 6 Feb 2020, 11:47:26 UTC

This task is now running version 1.01 with Docker:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868669

Edit: I see the same in this task after one hour of runtime.
I had run
docker pull davidgcameron/boinc-atlas:latest
beforehand.
Is the change in this image?
ID: 6977 · Rating: 0
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 6978 - Posted: 6 Feb 2020, 11:36:49 UTC

ID: 6978 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 3
Message 6979 - Posted: 6 Feb 2020, 11:57:03 UTC - in response to Message 6978.  

I have a new one:
pilotlog.txt is shown.
Last line: monitor pilot.control.monitor... 366s have passed since pilot start
ID: 6979 · Rating: 0
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 1
Message 6980 - Posted: 6 Feb 2020, 12:28:55 UTC - in response to Message 6978.  

Hmm, it doesn't seem to work for me: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868679

Same here:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868682

Just checked your original code and noticed that you sent the output to stderr.
Let's try to modify line 20 like this:
exec &> >(awk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }' 1>&2)
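
As a rough illustration of what this redirect does: every line the script writes to stdout or stderr passes through a filter that prefixes a timestamp and forwards the result to stderr. The sketch below is a portable variant and not the project's script. It swaps awk's strftime for the date command and a plain pipe for the process substitution, and timestamp_lines is a hypothetical helper name.

```shell
# Hypothetical helper: prefix each line read from stdin with a local timestamp.
# date runs once per line, so this is slower than awk but needs no awk extension.
timestamp_lines() {
    while IFS= read -r line; do
        printf '%s %s\n' "$(date '+[%Y-%m-%d %H:%M:%S]')" "$line"
    done
}

# Merge stdout and stderr of a command group, stamp every line, write to stderr.
{
    echo "normal output"
    echo "error output" >&2
} 2>&1 | timestamp_lines >&2
```

Both lines end up on stderr with a timestamp in front, which is the same effect the exec line aims for across the whole script.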
ID: 6980 · Rating: 0
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 6981 - Posted: 6 Feb 2020, 15:47:26 UTC - in response to Message 6980.  

Ok, I just released 1.02 with this change.
ID: 6981 · Rating: 0
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 1
Message 6982 - Posted: 6 Feb 2020, 16:51:21 UTC

ID: 6982 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 861,475
RAC: 3
Message 6983 - Posted: 6 Feb 2020, 18:06:04 UTC - in response to Message 6982.  

This task succeeded:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868726
Yes, the task succeeded, but almost the whole stderr.txt is filled when the job is ready.
Only the first 2 lines are written at the beginning. The rest, starting with the line "[2020-02-06 17:49:03] Arguments: --nthreads 2" in this result,
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868721
is only added towards the end of the job.
ID: 6983 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 3
Message 6984 - Posted: 6 Feb 2020, 19:12:41 UTC - in response to Message 6981.  

This task also finished, but with Singularity instead of Docker.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868727
ID: 6984 · Rating: 0
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 1
Message 6985 - Posted: 6 Feb 2020, 19:25:39 UTC - in response to Message 6983.  

... but almost the whole stderr.txt is filled when the job is ready.

This may be because awk buffers its output in non-interactive mode.
I sent a patch to David on GitHub.
If that doesn't fill stderr.txt earlier, the reason is most likely I/O buffering in other scripts, which would take much more effort to fix.
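
A minimal sketch of the buffering effect, under stated assumptions: when awk's output goes to a pipe or file rather than a terminal, it is typically collected in a buffer and written in large chunks, and calling fflush() after each print is one common way to push every line out immediately. The thread does not say what the patch actually changes, so this is not necessarily it. number_lines_unbuffered is a hypothetical name, and the line counter NR stands in for the timestamp so the sketch does not depend on strftime.

```shell
# Hypothetical helper: prefix each line with its line number (NR stands in for
# the timestamp) and flush after every line instead of waiting for a full buffer.
number_lines_unbuffered() {
    awk '{ print NR ": " $0; fflush() }'
}

# The second line appears one second after the first, not both at the end.
{ echo "first"; sleep 1; echo "second"; } | number_lines_unbuffered | cat
```

Dropping the fflush() call from the awk program is an easy way to compare: with most awk implementations both lines then show up together once the input ends.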
ID: 6985 · Rating: 0
David Cameron
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Apr 16
Posts: 180
Credit: 1,355,327
RAC: 0
Message 6986 - Posted: 6 Feb 2020, 20:56:41 UTC - in response to Message 6985.  

Thanks, the new patch is in 1.03.
ID: 6986 · Rating: 0
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 1
Message 6987 - Posted: 6 Feb 2020, 22:30:49 UTC - in response to Message 6986.  

1.03 works fine.
It prints the log lines immediately to stderr.txt.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868929
ID: 6987 · Rating: 0
maeax

Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 3
Message 6988 - Posted: 7 Feb 2020, 9:10:50 UTC - in response to Message 6986.  

Is this info useful?
[2020-02-07 09:26:18] 2020-02-07 08:25:56,340 | INFO | retrieve | pilot.info.jobdata | collect_zombies | --- collectZombieJob: --- 10, [11141]
[2020-02-07 09:26:18] 2020-02-07 08:25:56,340 | INFO | retrieve | pilot.info.jobdata | collect_zombies | zombie collector trying to kill pid 11141
[2020-02-07 09:26:18] 2020-02-07 08:25:56,340 | INFO | retrieve | pilot.info.jobdata | collect_zombies | harmless exception when collecting zombies: [Errno 10] No child processes
[2020-02-07 09:26:18] 2020-02-07 08:25:57,347 | INFO | retrieve | pilot.util.processes | cleanup | collected zombie processes
[2020-02-07 09:26:18] 2020-02-07 08:25:57,348 | INFO | retrieve | pilot.util.processes | cleanup | will now attempt to kill all subprocesses of pid=11141
[2020-02-07 09:26:18] 2020-02-07 08:25:57,620 | INFO | retrieve | pilot.util.processes | kill_processes | process IDs to be killed: [11141] (in reverse order)
[2020-02-07 09:26:18] 2020-02-07 08:25:57,894 | WARNING | retrieve | pilot.util.processes | kill_processes | found no corresponding commands to process id(s)
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868918
ID: 6988 · Rating: 0

©2024 CERN