Message boards : News : CMS@Home: Disruption to our condor server next Monday

Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 6442 - Posted: 17 Jul 2019, 13:16:37 UTC
Last modified: 17 Jul 2019, 13:18:32 UTC

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 6444 - Posted: 17 Jul 2019, 20:35:29 UTC

Profile ivan
Message 6489 - Posted: 22 Jul 2019, 15:29:59 UTC - in response to Message 6442.  
Last modified: 22 Jul 2019, 15:37:14 UTC

I have jobs coming in from -dev now, but none from the main project yet -- it still has a lot of services, including the feeder, showing as not running.

Hmm, but some of them are dying; VirtualBox is reporting inaccessible VDIs.

Profile ivan
Message 6509 - Posted: 1 Aug 2019, 16:14:22 UTC

OK, we eventually found the problem and jobs are flowing again.

Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,798
RAC: 1,847
Message 6510 - Posted: 1 Aug 2019, 19:19:06 UTC

I will get back to work on these later tonight, since I still have some late-night high-speed left: all I had running was mainly SixTrack and Einstein GPU tasks, and they don't use any of that (plus some Theory over at LHC).

I hadn't had any of the typical problems with internet speed and starting the VB tasks all of July, but once again today I got several of those VB errors with the LHC Theory tasks. I'm just letting them run.

ERR_NETOPEN and EXIT_INIT_FAILURE (X7)

Profile Magic Quantum Mechanic
Message 6511 - Posted: 2 Aug 2019, 11:58:29 UTC - in response to Message 6510.  
Last modified: 2 Aug 2019, 11:59:30 UTC

I only get to start 6 two-core tasks, since the goofy server will only give me one task per PC.

It is blaming me for all the other ones that were past the due date, and we know that is because of the server problems.

AND I started up 12 tasks, and here at 4:30am I decided to look and make sure they were running, BUT my account said I had NO tasks running. So I came upstairs and saw they had ALL been running for over 2.5 hours, but the due date was July 30th.

Not a great idea to have this happen, especially over at LHC, since that is all it takes to get members to leave or just not run ANY CMS tasks.

So after I aborted all of those running tasks, I look in my account and see the server says I didn't actually just do that... 5am now.

These better run because I am not going to be awake until around noon after this.

Profile Magic Quantum Mechanic
Message 6512 - Posted: 3 Aug 2019, 2:20:21 UTC

Oy Vey

Well after all of that I get 5 Valids and 5 Errors

Same old thing (I didn't even go up and check them since I figured they had to be running after all of that)

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2794831

Same old:

2019-08-02 16:45:16 (6328): Guest Log: [DEBUG] HTCondor ping
2019-08-02 16:45:25 (6328): Guest Log: [DEBUG] 0
2019-08-02 17:23:41 (6328): Guest Log: [ERROR] Condor ended after 2305 seconds.
2019-08-02 17:23:41 (6328): Guest Log: [INFO] Shutting Down.
2019-08-02 17:23:41 (6328): VM Completion File Detected.
2019-08-02 17:23:41 (6328): VM Completion Message: Condor ended after 2305 seconds.
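
For anyone comparing their own logs, here is a minimal sketch that checks the "2305 seconds" figure against the timestamps quoted above. It only does arithmetic on those log lines; the helper name `timestamp` is made up for illustration, and reading the match as "the Condor timer started at about the ping" is an assumption, not something the log states.

```python
from datetime import datetime

# Log lines copied from the post above.
LOG = """\
2019-08-02 16:45:16 (6328): Guest Log: [DEBUG] HTCondor ping
2019-08-02 16:45:25 (6328): Guest Log: [DEBUG] 0
2019-08-02 17:23:41 (6328): Guest Log: [ERROR] Condor ended after 2305 seconds.
"""


def timestamp(line):
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' of a log line (hypothetical helper)."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


lines = LOG.splitlines()
ping = next(l for l in lines if "HTCondor ping" in l)
ended = next(l for l in lines if "Condor ended" in l)

# Seconds between the ping line and the "Condor ended" line.
elapsed = int((timestamp(ended) - timestamp(ping)).total_seconds())
print(elapsed)
```

The gap between the two lines comes out equal to the figure Condor reports, so in this log the whole reported runtime falls between the ping and the shutdown.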

I'll try 14 more tonight. (I paused some that had started running but that I could tell would error out, so I am not sure if I should even try them again; I may just abort them and start new ones... I'll decide tonight.) 7:20pm

I only have one more late-night high-speed left for the month, and then 10 days of maybe late-night, sort-of-higher speed.

Profile ivan
Message 6518 - Posted: 7 Aug 2019, 8:21:17 UTC

We've run down the CMS job queue to make some changes to the submission environment. Please set No New Tasks so that you don't have excessive churning waiting for jobs that are not available.

Profile ivan
Message 6519 - Posted: 7 Aug 2019, 11:01:22 UTC

There's been a slight change in plans.
"Given that we do not need to redeploy the agent, but only kill jobs in condor and let them get recreated with the JobSubmitter/schedd changes, I think you can go ahead and submit another workflow to [keep] volunteers happy."
So, I'll continue to submit smaller batches and you can resume new tasks.
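
For volunteers who toggle No New Tasks from a script rather than the BOINC Manager, here is a rough sketch that builds the matching `boinccmd` command lines. The `nomorework` and `allowmorework` project operations are standard `boinccmd` options; the project URL is an assumption taken from the result link earlier in this thread, so use whatever URL your own client lists for the project. The function name `boinccmd_args` is hypothetical, and the script only prints the commands; it does not run them.

```python
import shlex

# Assumption: project URL as it appears in the result link earlier in the thread.
# Your client may list the project under a slightly different URL.
PROJECT_URL = "https://lhcathomedev.cern.ch/lhcathome-dev/"


def boinccmd_args(allow_new_tasks, project_url=PROJECT_URL):
    """Return the boinccmd argument list for setting or clearing No New Tasks."""
    op = "allowmorework" if allow_new_tasks else "nomorework"
    return ["boinccmd", "--project", project_url, op]


if __name__ == "__main__":
    # Print both commands so they can be copied into a shell.
    for allow in (False, True):
        print(shlex.join(boinccmd_args(allow)))
```

Printing rather than executing keeps the sketch harmless on machines where `boinccmd` is not on the path or needs an RPC password.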



©2024 CERN