Message boards : CMS Application : New Version 60.66
Joined: 12 Sep 14, Posts: 1069, Credit: 334,882, RAC: 0
New version with a new image containing a few small tweaks.
Joined: 8 Apr 15, Posts: 781, Credit: 12,422,653, RAC: 3,337
Laurence, did you know it was 2am PST, or was that a lucky guess?
Mad Scientist For Life
Joined: 22 Apr 16, Posts: 677, Credit: 2,002,766, RAC: 0
Magic, by the time you get up in the morning, the 1.5 GByte will have been transferred ;-)).
Joined: 8 Apr 15, Posts: 781, Credit: 12,422,653, RAC: 3,337
Magic, 2am is always my favorite time of day with CERN. I was planning on using the rest of my monthly high-speed internet tonight, since my new month starts seconds after midnight, in about 20 hours. Tonight I had just enough quota left to download the new vdi on 2 of the hosts I run here, and the 3rd host had started running 60.65 seconds BEFORE Laurence started this new one, so I will be ready to have all 3 hosts running this new version. Now I hope we don't have any more new vdi's until after October 13th. Btw, I start my 19th year at CERN on October 24th. AND I had perfect timing and got all the rest loaded up with Sixtracks (about 1,500 so far tonight). Goodnight, Axel.
Joined: 8 Apr 15, Posts: 781, Credit: 12,422,653, RAC: 3,337
It looks like no work with this version again: https://lhcathomedev.cern.ch/lhcathome-dev/results.php?userid=192
I just started 3 more, just to see if it does this again (almost 4am). If they have nothing to run again I will suspend and wait for an update (Laurence).
Joined: 8 Apr 15, Posts: 781, Credit: 12,422,653, RAC: 3,337
Well, it looks like we finally have one actually running a subtask (I have been watching the log run for an hour). I am just going to let this one run by itself and restart a new batch later today. Goodnight, 5am.
Joined: 20 Jan 15, Posts: 1139, Credit: 8,310,612, RAC: 123
I have a problem on my Win10 box -- a task ran all night without actually running a job. I started a new VM just now and it seemed to try to run the glidein script in the wrong directory. Note that we will have no jobs for a while tonight while the WMAgent is upgraded, but keep an eye on your VMs' top consoles to see if they are actually running cmsRun jobs.
Joined: 20 Jan 15, Posts: 1139, Credit: 8,310,612, RAC: 123
> I have a problem on my Win10 box -- a task ran all night without actually running a job. Started a new VM just now and it seemed to try to run the glidein script in the wrong directory.
That glitch has been patched. I need to wait until tomorrow, when I can submit new jobs and check BOINC running on my work PC.
Joined: 8 Apr 15, Posts: 781, Credit: 12,422,653, RAC: 3,337
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3112789
Well, as I mentioned, I watched the log until it actually started running, and it did, even though it only used the cpu for 2 hours 49 min while running for 18 hours. I always check the log to see that it is running the task before I trust it, usually one to two hours after first starting these VirtualBox tasks.
Since that one worked, I decided to run 3 more tasks last night. I couldn't stay up as late as usual since I had a meeting tonight, and all three did run the usual 18 hours but with only one hour of cpu time:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=3112868
All 3 were basically the same as this one. I will try another 3 tasks tonight (PST) and check whether the logs say they are running, which I pretty much do every time...and I mean thousands of times over the years. That is the reason I always use remote access to watch the "events processed" of each running task, and ALL of the log files (and all hosts are Windows 10).
Joined: 22 Apr 16, Posts: 677, Credit: 2,002,766, RAC: 0
The CMS team is upgrading the WMAgent today, as Ivan wrote in Production.
Joined: 8 Apr 15, Posts: 781, Credit: 12,422,653, RAC: 3,337
Yes, he said that here too, but that was 14 hours ago, so I figured he would have done that by now on his *work pc*. Am I the only one that works 24/7? I almost finished all 7,551 of my Sixtracks over there in 2 days! (Btw, that is the most Sixtracks I ever got in the same batch in all 18 years doing this.) And we run a different version of CMS here too. Ah well, it is 3:20am here right now, so I might stay up another hour.
Joined: 28 Jul 16, Posts: 484, Credit: 394,839, RAC: 0
> ... And we run a different version of CMS here too.
Not really. The vdi is different, but that is just the 'envelope'. The CMS payload comes from the same backend systems (WMAgent/HTCondor). If their queue runs dry, -dev and -prod are both affected.
Joined: 8 Apr 15, Posts: 781, Credit: 12,422,653, RAC: 3,337
Yes, I know all of that, Stefan. Goodnight.
Joined: 22 Apr 16, Posts: 677, Credit: 2,002,766, RAC: 0
The CMS team is fast, but not that fast... Thursday is a good day for an update; Friday is for cleaning up the errors ;-))
Joined: 20 Jan 15, Posts: 1139, Credit: 8,310,612, RAC: 123
There has been a delay, I'm afraid. Still, tomorrow is for clean-up!
Joined: 20 Jan 15, Posts: 1139, Credit: 8,310,612, RAC: 123
> There has been a delay, I'm afraid. Still, tomorrow is for clean-up!
Jobs now available again. I believe the little glitch here in -dev has been repaired, but I won't be able to confirm it myself until mid-morning tomorrow.
Joined: 13 Feb 15, Posts: 1188, Credit: 862,257, RAC: 25
> Jobs now available again. I believe the little glitch here in -dev has been repaired but I won't be able to confirm it myself until mid-morning tomorrow.
Not looking good so far:

cmsRun -j FrameworkJobReport.xml PSet.py
warn [frontier.c:1014]: Request 507 on chan 1 failed at Thu Sep 15 21:48:52 2022: -6 [fn-socket.c:239]: read from 172.64.206.32 timed out after 10 seconds
warn [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[172.64.207.32]
warn [frontier.c:1014]: Request 701 on chan 1 failed at Thu Sep 15 21:49:07 2022: -6 [fn-socket.c:239]: read from 172.64.207.32 timed out after 10 seconds
warn [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[2606:4700:e6::ac40:cf20]
warn [frontier.c:1014]: Request 702 on chan 1 failed at Thu Sep 15 21:49:07 2022: -9 [fn-socket.c:85]: network error on connect to 2606:4700:e6::ac40:cf20: Network is unreachable
warn [frontier.c:1136]: Trying next server cms-frontier.openhtc.io[2606:4700:e6::ac40:ce20]
warn [frontier.c:1014]: Request 703 on chan 1 failed at Thu Sep 15 21:49:07 2022: -9 [fn-socket.c:85]: network error on connect to 2606:4700:e6::ac40:ce20: Network is unreachable
warn [frontier.c:1136]: Trying next server cms1-frontier.openhtc.io
warn [frontier.c:1014]: Request 704 on chan 1 failed at Thu Sep 15 21:49:27 2022: -6 [fn-urlparse.c:178]: host name cms1-frontier.openhtc.io problem: Name or service not known
warn [frontier.c:1136]: Trying next server cms2-frontier.openhtc.io

... but finally:

Begin processing the 1st record. Run 1, Event 1920001, LumiSection 3841 on stream 0 at 15-Sep-2022 21:51:36.256 CEST
Joined: 20 Jan 15, Posts: 1139, Credit: 8,310,612, RAC: 123
> Not looking good so far: ...
Hmm, that's not the glitch I was looking at. That's more reminiscent of IPv6 problems I've seen in the past: IPv4 fails (172.64.207.32) and then IPv6 is not available (2606:4700:e6::ac40:ce20).
> ... but finally: Begin processing the 1st record. Run 1, Event 1920001, LumiSection 3841 on stream 0 at 15-Sep-2022 21:51:36.256 CEST
That certainly looks better. Transient network problems earlier?
Joined: 13 Feb 15, Posts: 1188, Credit: 862,257, RAC: 25
> That certainly looks better. Transient network problems earlier?
I had and have no network problems. I'm not using any proxy.
Joined: 28 Jul 16, Posts: 484, Credit: 394,839, RAC: 0
The warnings are reported by the Frontier client inside the VM. Since the jobs run in the end, the fail-over obviously works. The issues can be, e.g.:
- 1 (or more) Frontier servers temporarily not responding => the Frontier client contacts the next one
- a local network overload (mostly at the router): too many open connections

If it's the latter, it is caused by a high peak load from Frontier. Frontier sends data in chunks (16 kB each, IIRC), which concurrently opens lots of TCP connections. This works fine until the router's resources are fully used; new connections are then not accepted until older (but idle) connections time out. A local Squid solves this, since:
- in the case of CMS, up to 98% of the Frontier requests can be answered by Squid
- a local Squid usually doesn't use chunks
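To make the local-Squid idea concrete, a minimal squid.conf sketch -- the values here are illustrative placeholders (network range, cache path and sizes must be adapted), and where the project publishes its own recommended Squid configuration, that should take precedence:

```
# /etc/squid/squid.conf -- minimal local forward cache (sketch)
http_port 3128

# Only let machines on the local network use the proxy (adjust the range)
acl localnet src 192.168.0.0/16
http_access allow localnet
http_access deny all

# On-disk cache so repeated Frontier requests are answered locally
cache_dir ufs /var/spool/squid 1024 16 256
maximum_object_size 64 MB
```

The BOINC client (or the VM's proxy settings) is then pointed at port 3128. With the cache in place, most repeated Frontier requests never leave the LAN, so the router no longer has to juggle the many short chunked connections described above.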
©2024 CERN