New Version 60.66

Author	Message
Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1182 Credit: 816,328 RAC: 288	Message 7805 - Posted: 16 Sep 2022, 7:09:39 UTC - in response to Message 7804.
The 2nd, 3rd and 4th job of this task did not have these connection issues.

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 475 Credit: 389,411 RAC: 28	Message 7806 - Posted: 16 Sep 2022, 7:43:19 UTC - in response to Message 7805.
Even with a local Squid there are a couple of Frontier fail-overs. These are my numbers from the last week (number of requests / data transferred):

Server                            Requests    Data transferred
default server
cms-frontier.openhtc.io           2.877.375   10.35 GB
fail-overs
cms1-frontier.openhtc.io          4.064       18.12 MB
cms2-frontier.openhtc.io          7           33.13 KB
cms3-frontier.openhtc.io          6           31.03 KB
cms4-frontier.openhtc.io          6           30.91 KB

default server
atlascern-frontier.openhtc.io     1.688.154   9.07 GB
fail-overs
atlascern1-frontier.openhtc.io    3.643       20.55 MB
atlascern4-frontier.openhtc.io    498         2.72 MB
atlascern2-frontier.openhtc.io    285         751.86 KB
atlascern3-frontier.openhtc.io    196         520.31 KB
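
The request counts in the post above can be turned into a fail-over share per Frontier service. This is a minimal Python sketch, assuming the dots in the posted counts are thousands separators; the dictionary layout and the function name `failover_share` are illustrative only and not part of any project tool.

```python
# Request counts copied from the post; "2.877.375" is read as 2,877,375
# (assumption: the dot is a thousands separator).
REQUESTS = {
    "cms": {
        "default": 2_877_375,
        "failovers": [4_064, 7, 6, 6],
    },
    "atlascern": {
        "default": 1_688_154,
        "failovers": [3_643, 498, 285, 196],
    },
}


def failover_share(default: int, failovers: list[int]) -> float:
    """Return the fraction of all requests that went to fail-over servers."""
    total = default + sum(failovers)
    return sum(failovers) / total


for name, counts in REQUESTS.items():
    share = failover_share(counts["default"], counts["failovers"])
    print(f"{name}: {share:.2%} of requests used a fail-over server")
```

Under that reading, roughly 0.14% of the CMS requests and 0.27% of the ATLAS requests went to a fail-over server.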

maeax Joined: 22 Apr 16 Posts: 667 Credit: 1,810,226 RAC: 2,276	Message 7807 - Posted: 16 Sep 2022, 8:55:31 UTC - in response to Message 7806.
At the moment I am running one CMS task from -dev without Squid, to see the difference with and without Squid (4.15 and 5.5).

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1182 Credit: 816,328 RAC: 288	Message 7808 - Posted: 16 Sep 2022, 10:58:58 UTC Last modified: 16 Sep 2022, 11:01:14 UTC
The task I started yesterday evening is still in a running state, but it is not doing a CMS job. 4 jobs have finished. The VM itself has not (yet) ended the task gracefully, although the log shows:

09/16/22 12:09:14 (pid:15847) The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown
09/16/22 12:09:14 (pid:15847) Got SIGTERM. Performing graceful shutdown.
09/16/22 12:09:14 (pid:15847) About to tell the ProcD to exit
09/16/22 12:09:14 (pid:15847) All daemons are gone. Exiting.
09/16/22 12:09:14 (pid:15847) **** condor_master (condor_MASTER) pid 15847 EXITING WITH STATUS 99

Run time is over 15 hours and CPU time over 14 hours. There is only 1 boinc process active inside the VM: bash. I'll wait another hour, or until the hard shutdown by vboxwrapper after 18 hours.
"pid 15847 EXITING WITH STATUS 99": what does that mean?
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3112985

maeax Joined: 22 Apr 16 Posts: 667 Credit: 1,810,226 RAC: 2,276	Message 7809 - Posted: 16 Sep 2022, 11:40:51 UTC
No Squid, started at 1:27 UTC last night: eight tasks finished inside in 10 hours.
https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2208865

maeax Joined: 22 Apr 16 Posts: 667 Credit: 1,810,226 RAC: 2,276	Message 7810 - Posted: 16 Sep 2022, 15:12:27 UTC - in response to Message 7809.
For the last 2 hours (since 13 UTC) the task has been doing nothing. The 9th task has 1.4 MByte of data but did not finish. Runtime is now 13 hours 45 min.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1182 Credit: 816,328 RAC: 288	Message 7811 - Posted: 17 Sep 2022, 7:20:19 UTC - in response to Message 7810. Last modified: 17 Sep 2022, 7:33:33 UTC
"For the last 2 hours (since 13 UTC) the task has been doing nothing. The 9th task has 1.4 MByte of data but did not finish. Runtime is now 13 hours 45 min."

Your task also ended after the hard-coded job duration of 64800 seconds (18 hours). So did Ivan's v60.66 tasks:

Runtime (s)   CPU time (s)
64,967.01     39,813.72
64,966.98     39,689.98
64,885.72     5,572.52
64,886.27     5,588.20
64,907.93     5,049.78

The shutdown after 12 hours of runtime and a finished job is not working. On LHC@home there is an even more sophisticated method to calculate whether it is worth requesting a new job, even before the first 12 hours are over.
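
The runtime and CPU figures in the post above make the cost of the missing shutdown easier to see when expressed as CPU efficiency (CPU time divided by runtime). This is a minimal Python sketch using only the posted numbers; the names `TASKS`, `HARD_LIMIT_S` and `cpu_efficiency` are illustrative and do not come from vboxwrapper or any project script.

```python
# (runtime_s, cpu_s) pairs copied from the post.
TASKS = [
    (64_967.01, 39_813.72),
    (64_966.98, 39_689.98),
    (64_885.72, 5_572.52),
    (64_886.27, 5_588.20),
    (64_907.93, 5_049.78),
]

# 18 hours = 64800 s, the hard-coded job duration named in the post.
HARD_LIMIT_S = 18 * 3600


def cpu_efficiency(runtime_s: float, cpu_s: float) -> float:
    """Return CPU time as a fraction of wall-clock runtime."""
    return cpu_s / runtime_s


for runtime_s, cpu_s in TASKS:
    overshoot_s = runtime_s - HARD_LIMIT_S
    print(
        f"runtime {runtime_s:10.2f} s "
        f"({overshoot_s:6.2f} s past the 18 h limit), "
        f"CPU efficiency {cpu_efficiency(runtime_s, cpu_s):.1%}"
    )
```

All five tasks ran slightly past the 18-hour limit. The first two used the CPU for roughly 61% of that time; the other three stayed below 9%, so most of their runtime was idle.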

maeax Joined: 22 Apr 16 Posts: 667 Credit: 1,810,226 RAC: 2,276	Message 7812 - Posted: 17 Sep 2022, 9:32:08 UTC - in response to Message 7811.
I thought it was an interruption at 13 UTC on the CMS servers (WM-Agent upgrade).

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1182 Credit: 816,328 RAC: 288	Message 7813 - Posted: 17 Sep 2022, 10:08:11 UTC - in response to Message 7812.
"I thought it was an interruption at 13 UTC on the CMS servers (WM-Agent upgrade)."

No, there were enough jobs. We have been out of CMS jobs since 05:30 UTC this morning.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1129 Credit: 7,879,942 RAC: 562	Message 7814 - Posted: 17 Sep 2022, 12:19:08 UTC - in response to Message 7813.
"I thought it was an interruption at 13 UTC on the CMS servers (WM-Agent upgrade)."
"No, there were enough jobs. We have been out of CMS jobs since 05:30 UTC this morning."

From the error messages, it seems a proxy certificate expired! Looking at the graphs for "All" production sites, it may have affected them as well, but most have recovered. However, our Agent (and at least one other that is also affected) is on a different system (cmsweb-testbed) from the others (cmsweb) -- perhaps they fixed the others but forgot about us?

Development for LHC@home