Message boards : CMS Application : New Version 60.66

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1069
Credit: 334,882
RAC: 0
Message 7782 - Posted: 12 Sep 2022, 9:10:32 UTC

New version with a new image containing a few small tweaks.

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 781
Credit: 12,422,653
RAC: 3,337
Message 7783 - Posted: 12 Sep 2022, 9:22:26 UTC

Laurence did you know it was 2am PST or was that a lucky guess?

Mad Scientist For Life

maeax
Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 0
Message 7784 - Posted: 12 Sep 2022, 10:30:58 UTC - in response to Message 7783.  

Magic,
when you get up in the morning, these 1.5 GByte will have been transferred ;-)).

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 781
Credit: 12,422,653
RAC: 3,337
Message 7785 - Posted: 12 Sep 2022, 10:45:55 UTC - in response to Message 7784.  

Magic,
when you get up in the morning, these 1.5 GByte will have been transferred ;-)).


2am is always my favorite time of day with Cern
I was planning on using the rest of my monthly high-speed internet tonight since my new month starts seconds after midnight in about 20 hours.

And tonight I had just enough to d/l the new vdi on 2 of the hosts that I run here, and the 3rd host had started running version 60.65 just seconds BEFORE Laurence released this new one, so I will be ready to have all 3 hosts running this new version.

Now I hope we don't have any more new vdi's until after October 13th and btw I start my 19th year at Cern on October 24th

AND I had perfect timing and got all the rest loaded up with Sixtracks ( about 1,500 so far tonight)

Goodnight Axel

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 781
Credit: 12,422,653
RAC: 3,337
Message 7786 - Posted: 13 Sep 2022, 10:46:02 UTC

It looks like no work with this version again.
https://lhcathomedev.cern.ch/lhcathome-dev/results.php?userid=192

I just started 3 more just to see if it does this again (almost 4am)
If it has nothing to run again I will suspend and wait for an update (Laurence)

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 781
Credit: 12,422,653
RAC: 3,337
Message 7787 - Posted: 13 Sep 2022, 12:07:30 UTC

Well it looks like we have one actually running some subtask finally.
(I have been watching the log run for an hour)

I am just going to let this one run by itself and restart a new batch later today.

goodnight 5am

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 123
Message 7788 - Posted: 14 Sep 2022, 15:16:59 UTC
Last modified: 14 Sep 2022, 15:47:10 UTC

I have a problem on my Win10 box -- a task ran all night without actually running a job. Started a new VM just now and it seemed to try to run the glidein script in the wrong directory.

Note that we will have no jobs for a while tonight while the WMAgent is upgraded, but keep an eye on your VMs' top consoles to see if they are actually running cmsRun jobs.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 123
Message 7789 - Posted: 14 Sep 2022, 20:24:26 UTC - in response to Message 7788.  
Last modified: 14 Sep 2022, 20:24:46 UTC

I have a problem on my Win10 box -- a task ran all night without actually running a job. Started a new VM just now and it seemed to try to run the glidein script in the wrong directory.

That glitch has been patched. I need to wait until tomorrow, when I can submit new jobs and check by running BOINC on my work PC.

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 781
Credit: 12,422,653
RAC: 3,337
Message 7790 - Posted: 15 Sep 2022, 7:21:06 UTC
Last modified: 15 Sep 2022, 8:01:31 UTC

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3112789

Well, as I mentioned, I watched the log until it actually started running, and it did, even though it only used the CPU for 2 hours 49 min while running for 18 hours.

I always check the log to see that it is running the task before I trust it and that is usually one to two hours after first starting these VB tasks.

Since that one worked, I assumed the next ones would run too, so I started 3 more tasks last night. I couldn't stay up as late as usual to watch them because I had a meeting tonight. All three ran the usual 18 hours but used only about one hour of CPU time.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3112868

All 3 were basically the same as this one.
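
For anyone who wants to put a number on "ran 18 hours but barely used the CPU", here is a minimal sketch. It only divides the CPU time by the elapsed time quoted in this post. The function name is made up for illustration, and it assumes a single-core task (a multi-core VM would need the CPU time divided by the core count as well).

```python
from datetime import timedelta


def cpu_efficiency(cpu_time: timedelta, run_time: timedelta) -> float:
    """Return CPU time as a fraction of elapsed (wall-clock) run time."""
    if run_time.total_seconds() <= 0:
        raise ValueError("run_time must be positive")
    return cpu_time.total_seconds() / run_time.total_seconds()


# Figures quoted in this post (rounded; single-core task assumed).
first_task = cpu_efficiency(timedelta(hours=2, minutes=49), timedelta(hours=18))
later_task = cpu_efficiency(timedelta(hours=1), timedelta(hours=18))

print(f"first task:  {first_task:.1%}")
print(f"later tasks: {later_task:.1%}")
```

A value this far below 100 % suggests the VM spent most of its time waiting rather than running cmsRun, which matches what the logs showed.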

I will try another 3 tasks tonight (PST) and check whether the logs say they are running. I pretty much do that every time... and I mean thousands of times over the years.

Which is the reason I always use remote to see "events processed" for each running task, and ALL of the log files (all hosts are Windows 10).

maeax
Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 0
Message 7791 - Posted: 15 Sep 2022, 8:04:49 UTC - in response to Message 7790.  

The CMS team is upgrading WMAgent today, as Ivan wrote in Production.

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 781
Credit: 12,422,653
RAC: 3,337
Message 7792 - Posted: 15 Sep 2022, 10:21:23 UTC - in response to Message 7791.  
Last modified: 15 Sep 2022, 10:23:19 UTC

Yes he said that here too.
But that was 14 hours ago so I figured he would have done that by now on his *work pc*

Am I the only one that works 24/7?

I almost finished all 7,551 of my Sixtracks over there in 2 days!
(btw that is the most Sixtracks I ever got in the same batch in all 18 years doing this)
And we run a different version of CMS here too.

Ah well it is 3:20am here right now so I might stay up another hour.

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 0
Message 7793 - Posted: 15 Sep 2022, 11:03:41 UTC - in response to Message 7792.  

... And we run a different version of CMS here too.

Not really.
The vdi is different, but this is just the 'envelope'.
The CMS payload comes from the same backend systems (WMAgent/HTCondor).
If their queue runs dry -dev and -prod are both affected.

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 781
Credit: 12,422,653
RAC: 3,337
Message 7795 - Posted: 15 Sep 2022, 11:18:16 UTC - in response to Message 7793.  
Last modified: 15 Sep 2022, 11:20:14 UTC

Yes I know all of that Stefan

goodnight

maeax
Joined: 22 Apr 16
Posts: 677
Credit: 2,002,766
RAC: 0
Message 7797 - Posted: 15 Sep 2022, 15:15:02 UTC - in response to Message 7795.  

The CMS team is fast, but so fast... Thursday is a good day for an update.
Friday is for cleaning up the errors ;-))

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 123
Message 7798 - Posted: 15 Sep 2022, 15:43:13 UTC

There has been a delay, I'm afraid. Still, tomorrow is for clean-up!

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 123
Message 7799 - Posted: 15 Sep 2022, 17:55:38 UTC - in response to Message 7798.  

There has been a delay, I'm afraid. Still, tomorrow is for clean-up!

Jobs now available again. I believe the little glitch here in -dev has been repaired but I won't be able to confirm it myself until mid-morning tomorrow.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 25
Message 7800 - Posted: 15 Sep 2022, 19:53:10 UTC - in response to Message 7799.  

Jobs now available again. I believe the little glitch here in -dev has been repaired but I won't be able to confirm it myself until mid-morning tomorrow.

Not looking good so far:
cmsRun  -j FrameworkJobReport.xml PSet.py
warn  [frontier.c:1014]: Request 507 on chan 1 failed at Thu Sep 15 21:48:52 2022: -6 [fn-socket.c:239]: read from 172.64.206.32 timed out after 10 seconds
warn  [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[172.64.207.32]
warn  [frontier.c:1014]: Request 701 on chan 1 failed at Thu Sep 15 21:49:07 2022: -6 [fn-socket.c:239]: read from 172.64.207.32 timed out after 10 seconds
warn  [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[2606:4700:e6::ac40:cf20]
warn  [frontier.c:1014]: Request 702 on chan 1 failed at Thu Sep 15 21:49:07 2022: -9 [fn-socket.c:85]: network error on connect to 2606:4700:e6::ac40:cf20: Network is unreachable
warn  [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[2606:4700:e6::ac40:ce20]
warn  [frontier.c:1014]: Request 703 on chan 1 failed at Thu Sep 15 21:49:07 2022: -9 [fn-socket.c:85]: network error on connect to 2606:4700:e6::ac40:ce20: Network is unreachable
warn  [frontier.c:1136]: Trying next server cms1-frontier.openhtc.io
warn  [frontier.c:1014]: Request 704 on chan 1 failed at Thu Sep 15 21:49:27 2022: -6 [fn-urlparse.c:178]: host name cms1-frontier.openhtc.io problem: Name or service not known
warn  [frontier.c:1136]: Trying next server cms2-frontier.openhtc.io

... but finally:
Begin processing the 1st record. Run 1, Event 1920001, LumiSection 3841 on stream 0 at 15-Sep-2022 21:51:36.256 CEST

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1139
Credit: 8,310,612
RAC: 123
Message 7801 - Posted: 15 Sep 2022, 20:14:35 UTC - in response to Message 7800.  

Jobs now available again. I believe the little glitch here in -dev has been repaired but I won't be able to confirm it myself until mid-morning tomorrow.

Not looking good so far:
cmsRun  -j FrameworkJobReport.xml PSet.py
warn  [frontier.c:1014]: Request 507 on chan 1 failed at Thu Sep 15 21:48:52 2022: -6 [fn-socket.c:239]: read from 172.64.206.32 timed out after 10 seconds
warn  [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[172.64.207.32]
warn  [frontier.c:1014]: Request 701 on chan 1 failed at Thu Sep 15 21:49:07 2022: -6 [fn-socket.c:239]: read from 172.64.207.32 timed out after 10 seconds
warn  [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[2606:4700:e6::ac40:cf20]
warn  [frontier.c:1014]: Request 702 on chan 1 failed at Thu Sep 15 21:49:07 2022: -9 [fn-socket.c:85]: network error on connect to 2606:4700:e6::ac40:cf20: Network is unreachable
warn  [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[2606:4700:e6::ac40:ce20]
warn  [frontier.c:1014]: Request 703 on chan 1 failed at Thu Sep 15 21:49:07 2022: -9 [fn-socket.c:85]: network error on connect to 2606:4700:e6::ac40:ce20: Network is unreachable
warn  [frontier.c:1136]: Trying next server cms1-frontier.openhtc.io
warn  [frontier.c:1014]: Request 704 on chan 1 failed at Thu Sep 15 21:49:27 2022: -6 [fn-urlparse.c:178]: host name cms1-frontier.openhtc.io problem: Name or service not known
warn  [frontier.c:1136]: Trying next server cms2-frontier.openhtc.io

Hmm, that's not the glitch I was looking at. That's more reminiscent of IPv6 problems I've seen in the past. IPv4 fails (172.64.207.32) and then IPv6 is not available (2606:4700:e6::ac40:ce20).

... but finally:
Begin processing the 1st record. Run 1, Event 1920001, LumiSection 3841 on stream 0 at 15-Sep-2022 21:51:36.256 CEST

That certainly looks better. Transient network problems earlier?

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1188
Credit: 862,257
RAC: 25
Message 7802 - Posted: 15 Sep 2022, 20:44:08 UTC - in response to Message 7801.  
Last modified: 15 Sep 2022, 20:45:09 UTC

That certainly looks better. Transient network problems earlier?
I had and have no network problems. I'm not using any proxy.

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 484
Credit: 394,839
RAC: 0
Message 7804 - Posted: 16 Sep 2022, 6:11:54 UTC - in response to Message 7802.  

The warnings are reported by the Frontier client inside the VM.
Since the jobs do run in the end, the fail-over obviously works.
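
To see at a glance which kind of failure dominates, the warnings can be tallied with a small script. This is only a sketch: the regular expression and the three failure categories are inferred from the few lines quoted in this thread, not taken from the Frontier client source, and the function names are made up for illustration.

```python
import ipaddress
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this thread (an assumption,
# not a documented Frontier log format).
FAILED = re.compile(
    r"Request \d+ on chan \d+ failed at .*?: (-\d+) \[[^\]]+\]: (.*)$"
)


def classify(detail: str) -> str:
    """Map the free-text part of a warning to a coarse failure kind."""
    if "timed out" in detail:
        return "timeout"
    if "Network is unreachable" in detail:
        return "unreachable"
    if "Name or service not known" in detail:
        return "dns"
    return "other"


def address_family(detail: str) -> str:
    """Return 'IPv4' or 'IPv6' if the warning names an address, else 'hostname'."""
    for token in detail.split():
        for candidate in (token, token.rstrip(":")):
            try:
                return f"IPv{ipaddress.ip_address(candidate).version}"
            except ValueError:
                continue
    return "hostname"


def summarize(log_text: str) -> Counter:
    """Count failed requests per (address family, failure kind)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = FAILED.search(line)
        if match:
            detail = match.group(2)
            counts[(address_family(detail), classify(detail))] += 1
    return counts


# Lines copied from the log posted earlier in this thread.
SAMPLE = """\
warn  [frontier.c:1014]: Request 507 on chan 1 failed at Thu Sep 15 21:48:52 2022: -6 [fn-socket.c:239]: read from 172.64.206.32 timed out after 10 seconds
warn  [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[172.64.207.32]
warn  [frontier.c:1014]: Request 701 on chan 1 failed at Thu Sep 15 21:49:07 2022: -6 [fn-socket.c:239]: read from 172.64.207.32 timed out after 10 seconds
warn  [frontier.c:1014]: Request 702 on chan 1 failed at Thu Sep 15 21:49:07 2022: -9 [fn-socket.c:85]: network error on connect to 2606:4700:e6::ac40:cf20: Network is unreachable
warn  [frontier.c:1014]: Request 703 on chan 1 failed at Thu Sep 15 21:49:07 2022: -9 [fn-socket.c:85]: network error on connect to 2606:4700:e6::ac40:ce20: Network is unreachable
warn  [frontier.c:1014]: Request 704 on chan 1 failed at Thu Sep 15 21:49:27 2022: -6 [fn-urlparse.c:178]: host name cms1-frontier.openhtc.io problem: Name or service not known
"""

for (family, kind), count in sorted(summarize(SAMPLE).items()):
    print(f"{family:8} {kind:12} {count}")
```

Separating IPv4 timeouts from IPv6 "unreachable" errors makes it easier to tell a host without IPv6 connectivity apart from servers that are genuinely slow to respond.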

The cause can be, e.g.:
- one (or more) Frontier servers temporarily not responding => Frontier contacts the next one
- a local network overload (mostly the router); too many open connections

If it's the latter it is caused by a high peak load from Frontier.
Frontier sends data in chunks (16 kB each IIRC) which concurrently opens lots of TCP connections.
This works fine until the router's resources are fully used.
New connections are not accepted until older (but idle) connections time out.
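
To get a feel for the numbers, here is a rough sketch of how many chunks a response of a given size breaks into. The 16 kB figure is the "IIRC" value from above (treated as 16 KiB), the payload sizes are hypothetical, and whether each chunk really ends up on its own TCP connection is the description given here, not something this snippet verifies.

```python
import math

CHUNK_BYTES = 16 * 1024  # "16 kB each IIRC" from the post, taken as 16 KiB


def chunks_needed(payload_bytes: int, chunk_bytes: int = CHUNK_BYTES) -> int:
    """Number of chunks needed to carry a payload of the given size."""
    if payload_bytes < 0 or chunk_bytes <= 0:
        raise ValueError("payload must be >= 0 and chunk size must be > 0")
    return math.ceil(payload_bytes / chunk_bytes)


# Hypothetical payload sizes, chosen only for illustration.
for size_mib in (1, 10, 100):
    count = chunks_needed(size_mib * 1024 * 1024)
    print(f"{size_mib:>4} MiB -> {count} chunks")
```

Even modest payloads turn into hundreds of chunks, and several VMs starting at once multiply that, which is how a home router's connection table can fill up.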

A local Squid solves this since
- in case of CMS up to 98 % of the Frontier requests can be returned by Squid
- a local Squid usually doesn't use chunks
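
As a back-of-the-envelope illustration of the first point, this sketch estimates how many requests still have to pass the router when a cache answers a given share of them. The 98 % is the "up to" figure quoted above, so real hit rates may be lower; the request count and the function name are hypothetical.

```python
def upstream_requests(total_requests: int, hit_rate: float) -> int:
    """Requests that still leave the LAN after the cache has answered its share."""
    if total_requests < 0:
        raise ValueError("total_requests must be >= 0")
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    hits = round(total_requests * hit_rate)
    return total_requests - hits


# 10,000 requests is a made-up number; 0.98 is the best case quoted above.
without_squid = upstream_requests(10_000, 0.0)
with_squid = upstream_requests(10_000, 0.98)
print(f"without Squid: {without_squid} requests pass the router")
print(f"with Squid:    {with_squid} requests pass the router")
```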