ATLAS native 1.01
Author | Message |
---|---|
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
computezrmle has provided a better method of stderr logging for the new native script, which is now in version 1.01. I have injected a few test tasks here to make sure everything is ok before moving to the production server. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 3 |
This task is now running version 1.01 with Docker: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868669 Edit: I see the same in this task after one hour of runtime. I had run docker pull davidgcameron/boinc-atlas:latest beforehand. Is it included in this image? |
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
Hmm, it doesn't seem to work for me: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868679 |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 3 |
I have a new one: pilotlog.txt is shown. The last line is: monitor pilot.control.monitor... 366s have passed since pilot start |
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1 |
"Hmm, it doesn't seem to work for me: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868679" Same here: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868682 I just checked your original code and noticed that you sent the output to stderr. Let's try to modify line 20 like this: exec &> >(awk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }' 1>&2) |
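For anyone who wants to try the proposed line in isolation, here is a minimal sketch. It wraps the same `exec` redirect in a subshell function so that it does not take over the calling shell. The function name `log_both_streams_demo` and the two echoed messages are made up for illustration, and the sketch assumes bash (for process substitution) and an awk that provides `strftime()`, such as gawk.

```shell
#!/usr/bin/env bash
# Illustrative only: the function name and messages are not from the real script.
# Assumes bash and an awk with strftime() (a gawk extension, also found in
# recent mawk and busybox awk).

log_both_streams_demo() (
    # Send stdout AND stderr of everything that follows through awk.
    # awk prefixes each line with a timestamp, and its own stdout is
    # redirected to stderr (1>&2), so all lines end up on stderr.
    exec &> >(awk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }' 1>&2)

    echo "written to stdout"
    echo "written to stderr" >&2
)

# Usage: nothing should appear on stdout, both lines on stderr.
#   log_both_streams_demo >/dev/null
```

Because the body runs in a subshell, the redirect ends when the function returns. In a wrapper script the same `exec` line would normally sit near the top and apply to the rest of the script.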
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
Ok, I just released 1.02 with this change. |
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1 |
This task succeeded: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868726 |
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 3 |
"This task succeeded:" Yes, the task succeeded, but almost the whole stderr.txt is only filled once the job has finished. Only the first 2 lines are written at the beginning. The rest, starting with the line [2020-02-06 17:49:03] Arguments: --nthreads 2 from this result https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868721, is added towards the end of the job. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 3 |
This task also finished, but with Singularity instead of Docker. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868727 |
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1 |
"... but almost the whole stderr.txt is filled when the job is ready." This may be because awk buffers its output in non-interactive mode. I sent a patch to David on GitHub. If this doesn't fill stderr.txt earlier, the reason is most likely I/O buffering in other scripts, which would mean much more effort. |
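The patch itself is not shown in this thread, so the following is only a sketch of one common way to avoid this kind of buffering: calling awk's `fflush()` after every `print`, so each line is written out immediately even when the output is a file or pipe rather than a terminal. The function names `timestamp_unbuffered` and `slow_producer` are hypothetical, and `strftime()` is again assumed to be available (gawk, recent mawk, busybox awk).

```shell
#!/usr/bin/env bash
# Sketch under assumptions: not necessarily what the actual patch does.

# Prefix each input line with a timestamp and flush after every line.
# Without fflush(), awk may hold lines in its output buffer until the
# buffer fills or awk exits when it is not writing to a terminal.
timestamp_unbuffered() {
    awk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0; fflush() }'
}

# Hypothetical slow job that emits one line per second.
slow_producer() {
    for i in 1 2 3; do
        echo "step $i"
        sleep 1
    done
}

# Usage: watch the lines arrive one by one instead of all at the end.
#   slow_producer | timestamp_unbuffered >> demo_log.txt &
#   tail -f demo_log.txt
```

If lines still arrive late with a change like this, the delay would have to come from buffering in whatever produces the input, which matches the caveat in the post above.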
Joined: 20 Apr 16 Posts: 180 Credit: 1,355,327 RAC: 0 |
Thanks, the new patch is in 1.03. |
Joined: 28 Jul 16 Posts: 484 Credit: 394,839 RAC: 1 |
1.03 works fine. It prints the log lines to stderr.txt immediately. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868929 |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 3 |
Is this info useful: [2020-02-07 09:26:18] 2020-02-07 08:25:56,340 | INFO | retrieve | pilot.info.jobdata | collect_zombies | --- collectZombieJob: --- 10, [11141] [2020-02-07 09:26:18] 2020-02-07 08:25:56,340 | INFO | retrieve | pilot.info.jobdata | collect_zombies | zombie collector trying to kill pid 11141 [2020-02-07 09:26:18] 2020-02-07 08:25:56,340 | INFO | retrieve | pilot.info.jobdata | collect_zombies | harmless exception when collecting zombies: [Errno 10] No child processes [2020-02-07 09:26:18] 2020-02-07 08:25:57,347 | INFO | retrieve | pilot.util.processes | cleanup | collected zombie processes [2020-02-07 09:26:18] 2020-02-07 08:25:57,348 | INFO | retrieve | pilot.util.processes | cleanup | will now attempt to kill all subprocesses of pid=11141 [2020-02-07 09:26:18] 2020-02-07 08:25:57,620 | INFO | retrieve | pilot.util.processes | kill_processes | process IDs to be killed: [11141] (in reverse order) [2020-02-07 09:26:18] 2020-02-07 08:25:57,894 | WARNING | retrieve | pilot.util.processes | kill_processes | found no corresponding commands to process id(s) https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2868918 |