Message boards : CMS Application : New Version 60.66
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 854,677
RAC: 28
Message 7805 - Posted: 16 Sep 2022, 7:09:39 UTC - in response to Message 7804.  

The 2nd, 3rd and 4th job of this task did not have these connection issues.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 481
Credit: 394,720
RAC: 0
Message 7806 - Posted: 16 Sep 2022, 7:43:19 UTC - in response to Message 7805.  

Even with a local Squid there are a couple of Frontier fail-overs.
These are my numbers from the last week (#requests/data transferred)

default server
cms-frontier.openhtc.io    2.877.375   10.35 GB

fail-overs
cms1-frontier.openhtc.io       4.064   18.12 MB
cms2-frontier.openhtc.io           7   33.13 KB
cms3-frontier.openhtc.io           6   31.03 KB
cms4-frontier.openhtc.io           6   30.91 KB


default server
atlascern-frontier.openhtc.io   1.688.154     9.07 GB

fail-overs
atlascern1-frontier.openhtc.io      3.643    20.55 MB
atlascern4-frontier.openhtc.io        498     2.72 MB
atlascern2-frontier.openhtc.io        285   751.86 KB
atlascern3-frontier.openhtc.io        196   520.31 KB
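
To put those numbers in proportion, here is a minimal sketch that computes what share of all requests ended up on a fail-over server. The request counts are copied from the table above; the dictionary layout and the function name `failover_share` are only illustrative assumptions, not part of any Frontier or Squid tooling.

```python
# Weekly request counts copied from the table above
# (thousands separators removed).
REQUESTS = {
    "cms": {
        "default": {"cms-frontier.openhtc.io": 2_877_375},
        "failover": {
            "cms1-frontier.openhtc.io": 4_064,
            "cms2-frontier.openhtc.io": 7,
            "cms3-frontier.openhtc.io": 6,
            "cms4-frontier.openhtc.io": 6,
        },
    },
    "atlascern": {
        "default": {"atlascern-frontier.openhtc.io": 1_688_154},
        "failover": {
            "atlascern1-frontier.openhtc.io": 3_643,
            "atlascern4-frontier.openhtc.io": 498,
            "atlascern2-frontier.openhtc.io": 285,
            "atlascern3-frontier.openhtc.io": 196,
        },
    },
}


def failover_share(group):
    """Return the fraction of requests that went to fail-over servers."""
    default = sum(group["default"].values())
    failover = sum(group["failover"].values())
    return failover / (default + failover)


for name, group in REQUESTS.items():
    print(f"{name}: {failover_share(group):.2%} of requests used a fail-over server")
```

With these counts the share comes out at roughly 0.14% for CMS and 0.27% for ATLAS, so fail-overs happen but are rare compared with the default server.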
maeax

Joined: 22 Apr 16
Posts: 674
Credit: 1,952,785
RAC: 1,018
Message 7807 - Posted: 16 Sep 2022, 8:55:31 UTC - in response to Message 7806.  

At the moment I am running one CMS task from -dev without Squid,
to see whether there are any differences with and without Squid (4.15 and 5.5).
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 854,677
RAC: 28
Message 7808 - Posted: 16 Sep 2022, 10:58:58 UTC
Last modified: 16 Sep 2022, 11:01:14 UTC

The task I started yesterday evening is still in a running state, but it is not running a CMS job. 4 jobs have finished.
It has not (yet) been shut down gracefully by the VM itself, although the log shows:

09/16/22 12:09:14 (pid:15847) The DaemonShutdown expression "(STARTD_StartTime =?= 0)" evaluated to TRUE: starting graceful shutdown
09/16/22 12:09:14 (pid:15847) Got SIGTERM. Performing graceful shutdown.
09/16/22 12:09:14 (pid:15847) About to tell the ProcD to exit
09/16/22 12:09:14 (pid:15847) All daemons are gone.  Exiting.
09/16/22 12:09:14 (pid:15847) **** condor_master (condor_MASTER) pid 15847 EXITING WITH STATUS 99

Run time over 15 hours and CPU-time over 14 hours.
There is only 1 boinc process active inside the VM: bash.
I'll wait another hour or until the hard shutdown by vboxwrapper after 18 hours.

pid 15847 EXITING WITH STATUS 99: What does that mean?

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3112985
maeax

Joined: 22 Apr 16
Posts: 674
Credit: 1,952,785
RAC: 1,018
Message 7809 - Posted: 16 Sep 2022, 11:40:51 UTC

No Squid: started at 1:27 UTC last night, and eight tasks finished inside the VM in 10 hours.
https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=2208865
maeax

Joined: 22 Apr 16
Posts: 674
Credit: 1,952,785
RAC: 1,018
Message 7810 - Posted: 16 Sep 2022, 15:12:27 UTC - in response to Message 7809.  

For the last 2 hours (since 13 UTC) the task has been doing nothing. The 9th task has 1.4 MByte of data, but did not finish.
The runtime is now 13 hours 45 min.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 854,677
RAC: 28
Message 7811 - Posted: 17 Sep 2022, 7:20:19 UTC - in response to Message 7810.  
Last modified: 17 Sep 2022, 7:33:33 UTC

For the last 2 hours (since 13 UTC) the task has been doing nothing. The 9th task has 1.4 MByte of data, but did not finish.
The runtime is now 13 hours 45 min.

Your task also ended after the hard-coded job duration of 64800 seconds (18 hours).
So did Ivan's v60.66 tasks:
Runtime		CPU seconds
64,967.01	39,813.72
64,966.98	39,689.98
64,885.72	5,572.52
64,886.27	5,588.20
64,907.93	5,049.78
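
As a rough check of the table above, the following sketch compares each runtime with the 64800-second limit and computes CPU seconds divided by runtime. The figures are copied from the table; the name `cpu_efficiency` is just an illustrative helper, and the ratio is only meaningful as an efficiency under the assumption that each task uses a single core.

```python
# (runtime seconds, CPU seconds) pairs copied from the table above.
TASKS = [
    (64_967.01, 39_813.72),
    (64_966.98, 39_689.98),
    (64_885.72, 5_572.52),
    (64_886.27, 5_588.20),
    (64_907.93, 5_049.78),
]

HARD_LIMIT_S = 64_800  # 18 hours, the hard-coded duration mentioned above


def cpu_efficiency(runtime_s, cpu_s):
    """CPU time as a fraction of wall-clock runtime (assumes one core)."""
    return cpu_s / runtime_s


for runtime_s, cpu_s in TASKS:
    over_limit = runtime_s - HARD_LIMIT_S
    print(
        f"runtime {runtime_s:>10,.2f} s, "
        f"{over_limit:6.2f} s past the limit, "
        f"CPU/runtime {cpu_efficiency(runtime_s, cpu_s):.1%}"
    )
```

All five runtimes land one to three minutes past 64800 seconds. The first two tasks used the CPU for about 61% of the runtime, the last three for less than 9%.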

The shutdown after 12 hours of runtime and a finished job is not working.
On LHC@home there is an even more sophisticated method that calculates whether it is worth requesting a new job before the first 12 hours are over.
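
To make the intended behaviour concrete, here is a hypothetical sketch of such a decision. Only the two time limits (12 hours and 64800 seconds) come from this thread. The function name, the `expected_job_s` parameter and the rule "skip a job that would not fit before the hard limit" are assumptions for illustration, not the actual logic used in the VM or on LHC@home.

```python
GRACEFUL_AFTER_S = 12 * 3600  # from the thread: no new job after 12 hours
HARD_LIMIT_S = 64_800         # from the thread: hard shutdown after 18 hours


def should_request_new_job(elapsed_s, expected_job_s=None):
    """Decide whether the VM should fetch another job (hypothetical logic).

    elapsed_s      -- task runtime so far, in seconds
    expected_job_s -- optional estimate of the next job's duration; this
                      estimate is an assumption, the thread does not say
                      how the production method works
    """
    # Basic rule: once 12 hours have passed, finish instead of fetching more.
    if elapsed_s >= GRACEFUL_AFTER_S:
        return False
    # Assumed refinement: skip a job that would be cut off by the hard limit.
    if expected_job_s is not None and elapsed_s + expected_job_s > HARD_LIMIT_S:
        return False
    return True


# Checked after each finished job; False means start a graceful shutdown.
print(should_request_new_job(3 * 3600))             # early in the task
print(should_request_new_job(13 * 3600))            # past the 12-hour mark
print(should_request_new_job(11 * 3600, 8 * 3600))  # next job would not fit
```

With a check like this, a task that finishes a job after the 12-hour mark would end there instead of idling until the 18-hour hard shutdown.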
maeax

Joined: 22 Apr 16
Posts: 674
Credit: 1,952,785
RAC: 1,018
Message 7812 - Posted: 17 Sep 2022, 9:32:08 UTC - in response to Message 7811.  

I thought it was an interruption at 13 UTC on the CMS servers (WM-Agent upgrade).
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1188
Credit: 854,677
RAC: 28
Message 7813 - Posted: 17 Sep 2022, 10:08:11 UTC - in response to Message 7812.  

I thought it was an interruption at 13 UTC on the CMS servers (WM-Agent upgrade).
No, there were enough jobs.

We have been out of CMS jobs since 05:30 UTC this morning.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1137
Credit: 8,032,007
RAC: 2,768
Message 7814 - Posted: 17 Sep 2022, 12:19:08 UTC - in response to Message 7813.  

I thought it was an interruption at 13 UTC on the CMS servers (WM-Agent upgrade).
No, there were enough jobs.

We have been out of CMS jobs since 05:30 UTC this morning.

From the error messages, it seems a proxy certificate expired! Looking at the graphs for "All" production sites, it may have affected them as well, but most have recovered. However, our Agent (and at least one other that is also affected) is on a different system (cmsweb-testbed) from the others (cmsweb) -- perhaps they fixed the others but forgot about us?