CMS@Home: Disruption to our condor server next Monday
Author | Message |
---|---|
Joined: 20 Jan 15 Posts: 1139 Credit: 8,181,211 RAC: 2,023 |
|
Joined: 28 Jul 16 Posts: 482 Credit: 394,720 RAC: 0 |
CMS may be affected earlier: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5087&postid=39380 |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,181,211 RAC: 2,023 |
I have jobs coming in from -dev now, but none from the main project yet -- it still has a lot of services, including the feeder, showing as not running. Hmm, but some of the jobs are dying: VirtualBox is reporting inaccessible VDIs. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,181,211 RAC: 2,023 |
OK, we eventually found the problem and jobs are flowing again. |
Joined: 8 Apr 15 Posts: 780 Credit: 12,151,079 RAC: 2,173 |
I will get back to work on these later tonight, since I still have some late-night high-speed left. All I had running was mainly SixTrack and Einstein GPU tasks (and some Theory over at LHC), and they don't use any of that. I hadn't had any of the typical problems with internet speed and starting the VB tasks all of July, but today I once again got several of those VB errors with the LHC Theory tasks: ERR_NETOPEN and EXIT_INIT_FAILURE (X7). I'm just letting them run. |
Joined: 8 Apr 15 Posts: 780 Credit: 12,151,079 RAC: 2,173 |
I only get to start 6 two-core tasks, since the goofy server will only give me one task per PC. It is blaming me for all the other ones that were past the due date, and we know that is because of the server problems. AND I started up 12 tasks and, here at 4:30am, decided to look and make sure they were running, BUT my account said I had NO tasks running. So I came upstairs and saw they had ALL been running for over 2.5 hours, but the due date was July 30th. Not a great idea to have this happen, especially over at LHC, since that is all it takes to get members to leave or just not run ANY CMS tasks. So after I aborted all of those running tasks, I can look in my account and see the server says I didn't actually just do that... 5am now. These better run, because I am not going to be awake until around noon after this. |
Joined: 8 Apr 15 Posts: 780 Credit: 12,151,079 RAC: 2,173 |
Oy vey. Well, after all of that I get 5 Valids and 5 Errors. Same old thing (I didn't even go up and check them, since I figured they had to be running after all of that): https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2794831 Same old: 2019-08-02 16:45:16 (6328): Guest Log: [DEBUG] HTCondor ping 2019-08-02 16:45:25 (6328): Guest Log: [DEBUG] 0 2019-08-02 17:23:41 (6328): Guest Log: [ERROR] Condor ended after 2305 seconds. 2019-08-02 17:23:41 (6328): Guest Log: [INFO] Shutting Down. 2019-08-02 17:23:41 (6328): VM Completion File Detected. 2019-08-02 17:23:41 (6328): VM Completion Message: Condor ended after 2305 seconds. I'll try 14 more tonight. (I paused some that had started running but that I can tell would error out, so I am not sure if I should even try them again; I may just abort them and start new ones. I'll decide tonight.) 7:20pm: I only have one more late-night high speed left for the month, and then 10 days of maybe late-night sort-of-higher speed. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,181,211 RAC: 2,023 |
We've run down the CMS job queue to make some changes to the submission environment. Please set No New Tasks so that you don't have excessive churning waiting for jobs that are not available. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,181,211 RAC: 2,023 |
There's been a slight change in plans. "Given that we do not need to redeploy the agent, but only kill jobs in condor and let them get recreated with the JobSubmitter/schedd changes, I think you can go ahead and submit another workflow to [keep] volunteers happy." So I'll continue to submit smaller batches, and you can resume new tasks. |
©2024 CERN