Message boards : CMS Application : New Version 60.66
Author | Message |
---|---|
Joined: 13 Feb 15 Posts: 1188 Credit: 877,474 RAC: 56 |
The 2nd, 3rd and 4th jobs of this task did not have these connection issues. |
Joined: 28 Jul 16 Posts: 485 Credit: 394,839 RAC: 0 |
Even with a local Squid there are a couple of Frontier fail-overs. These are my numbers from the last week (requests / data transferred). Default server: cms-frontier.openhtc.io 2,877,375 / 10.35 GB; fail-overs: cms1-frontier.openhtc.io 4,064 / 18.12 MB, cms2-frontier.openhtc.io 7 / 33.13 KB, cms3-frontier.openhtc.io 6 / 31.03 KB, cms4-frontier.openhtc.io 6 / 30.91 KB. Default server: atlascern-frontier.openhtc.io 1,688,154 / 9.07 GB; fail-overs: atlascern1-frontier.openhtc.io 3,643 / 20.55 MB, atlascern4-frontier.openhtc.io 498 / 2.72 MB, atlascern2-frontier.openhtc.io 285 / 751.86 KB, atlascern3-frontier.openhtc.io 196 / 520.31 KB. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 0 |
At the moment I am running one CMS task from -dev without Squid, to see whether there are differences with or without Squid (4.15 and 5.5). |
Joined: 13 Feb 15 Posts: 1188 Credit: 877,474 RAC: 56 |
The task I started yesterday evening is still in a running state, but it is not doing a CMS job. 4 jobs have finished. The VM has not (yet) shut itself down gracefully, although the log shows: 09/16/22 12:09:14 (pid:15847) The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown 09/16/22 12:09:14 (pid:15847) Got SIGTERM. Performing graceful shutdown. 09/16/22 12:09:14 (pid:15847) About to tell the ProcD to exit 09/16/22 12:09:14 (pid:15847) All daemons are gone. Exiting. 09/16/22 12:09:14 (pid:15847) **** condor_master (condor_MASTER) pid 15847 EXITING WITH STATUS 99 Run time is over 15 hours and CPU time over 14 hours. There is only 1 boinc process active inside the VM: bash. I'll wait another hour, or until the hard shutdown by vboxwrapper after 18 hours. "pid 15847 EXITING WITH STATUS 99": what does that mean? https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3112985 |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 0 |
No Squid, started at 1:27 UTC last night: eight tasks finished inside within 10 hours. https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2208865 |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 0 |
For the last 2 hours (since 13 UTC) the task has been doing nothing. The 9th task has 1.4 MByte of data, but did not finish. Runtime is now 13 hours 45 min. |
Joined: 13 Feb 15 Posts: 1188 Credit: 877,474 RAC: 56 |
"For the last 2 hours (since 13 UTC) the task has been doing nothing. The 9th task has 1.4 MByte of data, but did not finish." Your task also ended after the hard-coded job duration of 64800 seconds (18 hours). So did Ivan's v60.66 tasks. Runtime (s) / CPU time (s): 64,967.01 / 39,813.72; 64,966.98 / 39,689.98; 64,885.72 / 5,572.52; 64,886.27 / 5,588.20; 64,907.93 / 5,049.78. The shutdown after 12 hours of runtime and a finished job is not working. On LHC@home there is an even more sophisticated method to calculate whether it is worth requesting a new job, even before the first 12 hours are over. |
Joined: 22 Apr 16 Posts: 677 Credit: 2,002,766 RAC: 0 |
I thought it was an interruption at 13 UTC on the CMS servers (WM-Agent upgrade). |
Joined: 13 Feb 15 Posts: 1188 Credit: 877,474 RAC: 56 |
"I thought it was an interruption at 13 UTC on the CMS servers (WM-Agent upgrade)." No, there were enough jobs. We have been out of CMS jobs since 05:30 UTC this morning. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 |
"I thought it was an interruption at 13 UTC on the CMS servers (WM-Agent upgrade)." "No, there were enough jobs." From the error messages, it seems a proxy certificate expired! Looking at the graphs for "All" production sites, it may have affected them as well, but most have recovered. However, our Agent (and at least one other which is also affected) is on a different system (cmsweb-testbed) to the others (cmsweb); perhaps they fixed the others but forgot about us? |
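
The runtime and CPU figures quoted in the thread for the v60.66 tasks can be turned into a rough CPU-efficiency number per task. This is only a minimal Python sketch, assuming single-core VMs so that CPU time divided by runtime is a meaningful ratio. The names `TASKS`, `HARD_LIMIT_S`, `cpu_efficiency` and `hit_hard_limit` are made up for this illustration and are not part of BOINC, vboxwrapper or HTCondor.

```python
# (runtime in seconds, CPU time in seconds), copied from the figures in the thread.
TASKS = [
    (64967.01, 39813.72),
    (64966.98, 39689.98),
    (64885.72, 5572.52),
    (64886.27, 5588.20),
    (64907.93, 5049.78),
]

# 18 hours: the hard-coded job duration mentioned in the thread.
HARD_LIMIT_S = 64800


def cpu_efficiency(runtime_s, cpu_s):
    """Return CPU time as a fraction of wall-clock runtime (single core assumed)."""
    if runtime_s <= 0:
        raise ValueError("runtime must be positive")
    return cpu_s / runtime_s


def hit_hard_limit(runtime_s, limit_s=HARD_LIMIT_S):
    """Return True if the task ran at least as long as the hard limit."""
    return runtime_s >= limit_s


if __name__ == "__main__":
    for runtime_s, cpu_s in TASKS:
        eff = cpu_efficiency(runtime_s, cpu_s)
        print(f"{runtime_s:10.2f} s  efficiency {eff:6.1%}  "
              f"hard limit reached: {hit_hard_limit(runtime_s)}")
```

On these figures, two tasks spent roughly 61% of their runtime on the CPU and the other three stayed below 9%, and all five ran past the 64800-second limit, which fits the observation that the shutdown after 12 hours did not trigger.
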
©2025 CERN