Thread 'New Version 60.66'

Author	Message
Laurence CERN Project administrator Project developer Project tester Send message Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0	Message 7782 - Posted: 12 Sep 2022, 9:10:32 UTC New version with a new image containing a few small tweaks. ID: 7782 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1019 Credit: 18,567,025 RAC: 20,613	Message 7783 - Posted: 12 Sep 2022, 9:22:26 UTC Laurence did you know it was 2am PST or was that a lucky guess? Mad Scientist For Life ID: 7783 · Rating: 0 · rate: / Reply Quote

maeax Send message Joined: 22 Apr 16 Posts: 806 Credit: 4,294,466 RAC: 1,957	Message 7784 - Posted: 12 Sep 2022, 10:30:58 UTC - in response to Message 7783. Magic, when you stand up in the morning, these 1.5 GByte are transferred ;-)). ID: 7784 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1019 Credit: 18,567,025 RAC: 20,613	Message 7785 - Posted: 12 Sep 2022, 10:45:55 UTC - in response to Message 7784. Magic, when you stand up in the morning, these 1.5 GByte are transferred ;-)). 2am is always my favorite time of day with CERN. I was planning on using the rest of my monthly high-speed internet tonight, since my new month starts seconds after midnight in about 20 hours. Tonight I had just enough to d/l the new vdi on 2 of the hosts that I run here, and the 3rd host had started running 60.65 just seconds BEFORE Laurence released this new one, so I will be ready to have all 3 hosts running this new version. Now I hope we don't have any more new vdi's until after October 13th, and btw I start my 19th year at CERN on October 24th. AND I had perfect timing and got all the rest loaded up with Sixtracks (about 1,500 so far tonight). Goodnight Axel ID: 7785 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1019 Credit: 18,567,025 RAC: 20,613	Message 7786 - Posted: 13 Sep 2022, 10:46:02 UTC It looks like there is no work with this version again. https://lhcathomedev.cern.ch/lhcathome-dev/results.php?userid=192 I just started 3 more tasks to see if it does this again (almost 4am). If they have nothing to run again I will suspend and wait for an update (Laurence). ID: 7786 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1019 Credit: 18,567,025 RAC: 20,613	Message 7787 - Posted: 13 Sep 2022, 12:07:30 UTC Well it looks like we have one actually running some subtask finally. (I have been watching the log run for an hour) I am just going to let this one run by itself and restart a new batch later today. goodnight 5am ID: 7787 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 5	Message 7788 - Posted: 14 Sep 2022, 15:16:59 UTC Last modified: 14 Sep 2022, 15:47:10 UTC I have a problem on my Win10 box -- a task ran all night without actually running a job. Started a new VM just now and it seemed to try to run the glidein script in the wrong directory. Note that we will have no jobs for a while tonight while the WMAgent is upgraded, but keep an eye on your VMs' top consoles to see if they are actually running cmsRun jobs. ID: 7788 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 5	Message 7789 - Posted: 14 Sep 2022, 20:24:26 UTC - in response to Message 7788. Last modified: 14 Sep 2022, 20:24:46 UTC I have a problem on my Win10 box -- a task ran all night without actually running a job. Started a new VM just now and it seemed to try to run the glidein script in the wrong directory. That glitch has been patched. I need to wait until tomorrow when I can submit new jobs, and check running BOINC on my work PC. ID: 7789 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1019 Credit: 18,567,025 RAC: 20,613	Message 7790 - Posted: 15 Sep 2022, 7:21:06 UTC Last modified: 15 Sep 2022, 8:01:31 UTC https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3112789 Well, as I mentioned, I watched the log until it actually started running, and it did, even though it only used the CPU for 2 hours 49 min but ran for 18 hours. I always check the log to see that it is running the task before I trust it, and that is usually one to two hours after first starting these VB tasks. Since that one worked I figured it would run last night, so I ran 3 more tasks. I couldn't stay up as late as usual since I had a meeting tonight, and all three did run the usual 18 hours but with only one hour of CPU time. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3112868 All 3 were basically the same as this one. I will try another 3 tasks tonight (PST) and check whether the logs say they are running, which I pretty much do every time......and I mean thousands of times over the years. That is the reason I always use remote to see "events processed" of each running task and ALL of the log files (and all are Windows 10). ID: 7790 · Rating: 0 · rate: / Reply Quote

maeax Send message Joined: 22 Apr 16 Posts: 806 Credit: 4,294,466 RAC: 1,957	Message 7791 - Posted: 15 Sep 2022, 8:04:49 UTC - in response to Message 7790. CMS-Team is upgrading WM-Agent today, as Ivan wrote in Production. ID: 7791 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1019 Credit: 18,567,025 RAC: 20,613	Message 7792 - Posted: 15 Sep 2022, 10:21:23 UTC - in response to Message 7791. Last modified: 15 Sep 2022, 10:23:19 UTC Yes, he said that here too. But that was 14 hours ago, so I figured he would have done that by now on his work PC. Am I the only one that works 24/7? I almost finished all 7,551 of my Sixtracks over there in 2 days! (btw that is the most Sixtracks I ever got in the same batch in all 18 years doing this) And we run a different version of CMS here too. Ah well, it is 3:20am here right now so I might stay up another hour. ID: 7792 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 7793 - Posted: 15 Sep 2022, 11:03:41 UTC - in response to Message 7792. ... And we run a different version of CMS here too. Not really. The vdi is different, but this is just the 'envelope'. The CMS payload comes from the same backend systems (WMAgent/HTCondor). If their queue runs dry -dev and -prod are both affected. ID: 7793 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1019 Credit: 18,567,025 RAC: 20,613	Message 7795 - Posted: 15 Sep 2022, 11:18:16 UTC - in response to Message 7793. Last modified: 15 Sep 2022, 11:20:14 UTC Yes I know all of that Stefan goodnight ID: 7795 · Rating: 0 · rate: / Reply Quote

maeax Send message Joined: 22 Apr 16 Posts: 806 Credit: 4,294,466 RAC: 1,957	Message 7797 - Posted: 15 Sep 2022, 15:15:02 UTC - in response to Message 7795. CMS-Team is fast, but so fast... Thursday is a good day for an update. Friday is for cleaning up the errors ;-)) ID: 7797 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 5	Message 7798 - Posted: 15 Sep 2022, 15:43:13 UTC There has been a delay, I'm afraid. Still, tomorrow is for clean-up! ID: 7798 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 5	Message 7799 - Posted: 15 Sep 2022, 17:55:38 UTC - in response to Message 7798. There has been a delay, I'm afraid. Still, tomorrow is for clean-up! Jobs now available again. I believe the little glitch here in -dev has been repaired but I won't be able to confirm it myself until mid-morning tomorrow. ID: 7799 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,477 RAC: 49	Message 7800 - Posted: 15 Sep 2022, 19:53:10 UTC - in response to Message 7799. Jobs now available again. I believe the little glitch here in -dev has been repaired but I won't be able to confirm it myself until mid-morning tomorrow.
Not looking good so far:
cmsRun -j FrameworkJobReport.xml PSet.py
warn [frontier.c:1014]: Request 507 on chan 1 failed at Thu Sep 15 21:48:52 2022: -6 [fn-socket.c:239]: read from 172.64.206.32 timed out after 10 seconds
warn [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[172.64.207.32]
warn [frontier.c:1014]: Request 701 on chan 1 failed at Thu Sep 15 21:49:07 2022: -6 [fn-socket.c:239]: read from 172.64.207.32 timed out after 10 seconds
warn [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[2606:4700:e6::ac40:cf20]
warn [frontier.c:1014]: Request 702 on chan 1 failed at Thu Sep 15 21:49:07 2022: -9 [fn-socket.c:85]: network error on connect to 2606:4700:e6::ac40:cf20: Network is unreachable
warn [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[2606:4700:e6::ac40:ce20]
warn [frontier.c:1014]: Request 703 on chan 1 failed at Thu Sep 15 21:49:07 2022: -9 [fn-socket.c:85]: network error on connect to 2606:4700:e6::ac40:ce20: Network is unreachable
warn [frontier.c:1136]: Trying next server cms1-frontier.openhtc.io
warn [frontier.c:1014]: Request 704 on chan 1 failed at Thu Sep 15 21:49:27 2022: -6 [fn-urlparse.c:178]: host name cms1-frontier.openhtc.io problem: Name or service not known
warn [frontier.c:1136]: Trying next server cms2-frontier.openhtc.io
... but finally:
Begin processing the 1st record. Run 1, Event 1920001, LumiSection 3841 on stream 0 at 15-Sep-2022 21:51:36.256 CEST
ID: 7800 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 5	Message 7801 - Posted: 15 Sep 2022, 20:14:35 UTC - in response to Message 7800. Jobs now available again. I believe the little glitch here in -dev has been repaired but I won't be able to confirm it myself until mid-morning tomorrow.
Not looking good so far:
cmsRun -j FrameworkJobReport.xml PSet.py
warn [frontier.c:1014]: Request 507 on chan 1 failed at Thu Sep 15 21:48:52 2022: -6 [fn-socket.c:239]: read from 172.64.206.32 timed out after 10 seconds
warn [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[172.64.207.32]
warn [frontier.c:1014]: Request 701 on chan 1 failed at Thu Sep 15 21:49:07 2022: -6 [fn-socket.c:239]: read from 172.64.207.32 timed out after 10 seconds
warn [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[2606:4700:e6::ac40:cf20]
warn [frontier.c:1014]: Request 702 on chan 1 failed at Thu Sep 15 21:49:07 2022: -9 [fn-socket.c:85]: network error on connect to 2606:4700:e6::ac40:cf20: Network is unreachable
warn [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[2606:4700:e6::ac40:ce20]
warn [frontier.c:1014]: Request 703 on chan 1 failed at Thu Sep 15 21:49:07 2022: -9 [fn-socket.c:85]: network error on connect to 2606:4700:e6::ac40:ce20: Network is unreachable
warn [frontier.c:1136]: Trying next server cms1-frontier.openhtc.io
warn [frontier.c:1014]: Request 704 on chan 1 failed at Thu Sep 15 21:49:27 2022: -6 [fn-urlparse.c:178]: host name cms1-frontier.openhtc.io problem: Name or service not known
warn [frontier.c:1136]: Trying next server cms2-frontier.openhtc.io
Hmm, that's not the glitch I was looking at. That's more reminiscent of IPv6 problems I've seen in the past. IPv4 fails (172.64.207.32) and then IPv6 is not available (2606:4700:e6::ac40:ce20).
... but finally:
Begin processing the 1st record. Run 1, Event 1920001, LumiSection 3841 on stream 0 at 15-Sep-2022 21:51:36.256 CEST
That certainly looks better. Transient network problems earlier? ID: 7801 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1281 Credit: 1,048,477 RAC: 49	Message 7802 - Posted: 15 Sep 2022, 20:44:08 UTC - in response to Message 7801. Last modified: 15 Sep 2022, 20:45:09 UTC That certainly looks better. Transient network problems earlier? I had and have no network problems. I'm not using any proxy. ID: 7802 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 544 Credit: 400,710 RAC: 0	Message 7804 - Posted: 16 Sep 2022, 6:11:54 UTC - in response to Message 7802. The warnings are reported by the Frontier client inside the VM. Since at the end the jobs are running the fail-over obviously works.
The issues can be, e.g.
- 1 (or more) frontier server temporarily not responding => Frontier contacts the next one
- a local network overload (mostly the router); too many open connections
If it's the latter it is caused by a high peak load from Frontier. Frontier sends data in chunks (16 kB each IIRC) which concurrently opens lots of TCP connections. This works fine until the router's resources are fully used. New connections are not accepted until older (but idle) connections time out.
A local Squid solves this since
- in case of CMS up to 98 % of the Frontier requests can be returned by Squid
- a local Squid usually doesn't use chunks
ID: 7804 · Rating: 0 · rate: / Reply Quote
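
Editor's note: the fail-over described in Message 7804 (one server does not respond, so the client moves on to the next one) can be illustrated with a minimal Python sketch. This is not the Frontier client's actual code. The names `fetch_with_failover` and `AllServersFailed`, and the `fetch` callable passed in, are hypothetical and exist only for illustration. The sketch only shows the general pattern visible in the log: each failure produces a warning, and the job proceeds as soon as one server answers.

```python
from typing import Callable, Iterable, List, Tuple


class AllServersFailed(Exception):
    """Raised when every server in the list was tried without success."""


def fetch_with_failover(
    servers: Iterable[str],
    fetch: Callable[[str], bytes],
) -> Tuple[str, bytes, List[str]]:
    """Try each server in order and return (server, payload, warnings).

    `fetch` is a caller-supplied function (hypothetical here) that either
    returns the payload or raises OSError, which covers timeouts,
    unreachable networks and name-resolution failures.
    """
    warnings: List[str] = []
    for server in servers:
        try:
            payload = fetch(server)
        except OSError as exc:
            # Record the failure and fall through to the next server.
            warnings.append(f"{server} failed: {exc}; trying next server")
            continue
        return server, payload, warnings
    raise AllServersFailed("; ".join(warnings))


if __name__ == "__main__":
    # Fake fetcher standing in for a real network request (assumption).
    def fake_fetch(server: str) -> bytes:
        if server == "server-c.example":
            return b"payload"
        raise TimeoutError("read timed out after 10 seconds")

    used, data, warns = fetch_with_failover(
        ["server-a.example", "server-b.example", "server-c.example"],
        fake_fetch,
    )
    for line in warns:
        print("warn:", line)
    print("served by", used)
```

As in the posted log, the warnings alone do not mean the task failed. Only when every server in the list has been exhausted does the sketch raise an error.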

Development for LHC@home