Message boards : CMS Application : Dip?

Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 326,570
RAC: 95
Message 4496 - Posted: 15 Dec 2016, 11:02:01 UTC - in response to Message 4495.  

Don't worry, Test4Theory got a bonus last night :)
ID: 4496
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4497 - Posted: 15 Dec 2016, 11:30:52 UTC

The interesting thing is that the running jobs graph dropped from 600 to 200, but the number of failed/successful jobs dropped to 0.

Which one was wrong?
ID: 4497
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 172
Message 4498 - Posted: 15 Dec 2016, 13:36:14 UTC - in response to Message 4497.  

The interesting thing is that the running jobs graph dropped from 600 to 200, but the number of failed/successful jobs dropped to 0.

Which one was wrong?

Interesting question. It must be at least partly down to how Dashboard calculates numbers from the messages it receives from Condor. I guess that, in the absence of a message that a job has finished, Dashboard considers it still running (until it deems it lost after 24 hours), so if Condor is unable to report, Dashboard will just say, "No jobs finished in the last hour, 200 are still running."
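
The guess above can be sketched in a few lines of Python. This is only an illustration of the reasoning, not Dashboard's actual code: the function names are hypothetical, and it assumes the 24-hour "lost" timer runs from the job's start message.

```python
from datetime import datetime, timedelta
from typing import Optional

# From the post: a job with no finish message is deemed lost after 24 hours.
# Assumption: the clock starts at the job's start message.
LOST_AFTER = timedelta(hours=24)


def classify_job(started: datetime, finished: Optional[datetime], now: datetime) -> str:
    """Return the state a monitor would show for one job (hypothetical logic)."""
    if finished is not None:
        return "finished"
    if now - started >= LOST_AFTER:
        return "lost"
    # No finish message yet, so the job is still counted as running.
    return "running"


def summarize(jobs, now, window=timedelta(hours=1)):
    """Count jobs finished within `window` and jobs still shown as running.

    `jobs` is an iterable of (started, finished) pairs; `finished` is None
    when no finish message has been received.
    """
    finished_recently = 0
    running = 0
    for started, finished in jobs:
        state = classify_job(started, finished, now)
        if state == "running":
            running += 1
        elif state == "finished" and now - finished <= window:
            finished_recently += 1
    return {"finished_last_hour": finished_recently, "running": running}
```

With 200 jobs that started two hours ago and never reported back, this toy model shows zero finished jobs and 200 running ones, which matches the two graphs disagreeing.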
ID: 4498
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 68
Message 4499 - Posted: 16 Dec 2016, 11:36:52 UTC

Look at this peak:
http://lhcathomedev.cern.ch/vLHCathome-dev/cms_job.php
I didn't expect that my old machine could be so fast.
:-)
ID: 4499
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 326,570
RAC: 95
Message 4500 - Posted: 16 Dec 2016, 12:32:14 UTC - in response to Message 4499.  
Last modified: 16 Dec 2016, 19:10:24 UTC

Sorry, that is me :) We are testing an alternative job injection method. This will hopefully allow CMS operations to take ownership and free Ivan from babysitting duties. Jobs from the WMAgent can be seen in purple. I am using some spare capacity from our internal OpenStack cloud. The VMs are created there directly rather than through BOINC, so don't worry, I will not get any BOINC credit for this.
ID: 4500
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 68
Message 4564 - Posted: 21 Dec 2016, 5:14:06 UTC - in response to Message 4563.  

2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] HTCondor ping
2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] 1
2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] 12/20/16 22:59:37 recognized DC_NOP as command name, using command 60011.
2016-12-20 22:59:37 (29181): Guest Log: 12/20/16 22:59:37 attempt to connect to <130.246.180.120:9623> failed: Connection refused (connect errno = 111).
2016-12-20 22:59:37 (29181): Guest Log: ERROR: failed to make connection to <130.246.180.120:9623>
2016-12-20 22:59:37 (29181): Guest Log: [ERROR] Could not ping HTCondor.
2016-12-20 22:59:37 (29181): Guest Log: [INFO] Shutting Down.
2016-12-20 22:59:37 (29181): VM Completion File Detected.
2016-12-20 22:59:37 (29181): VM Completion Message: Could not ping HTCondor.

Same here.
ID: 4564
Profile Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 326,570
RAC: 95
Message 4565 - Posted: 21 Dec 2016, 8:54:19 UTC - in response to Message 4564.  

Looks like it. Message sent to the admin.
ID: 4565
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 172
Message 4566 - Posted: 21 Dec 2016, 9:34:46 UTC - in response to Message 4565.  

Looks like it. Message sent to the admin.

/var has filled up again. I cleaned off a couple of large old log backups.
ID: 4566
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 473
Credit: 389,411
RAC: 68
Message 4567 - Posted: 21 Dec 2016, 9:43:50 UTC - in response to Message 4566.  

Not enough?

telnet lcggwms02.gridpp.rl.ac.uk 9623
Trying 130.246.180.120...
telnet: connect to address 130.246.180.120: Connection refused
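
For anyone without telnet installed, the same reachability test can be scripted. This is a minimal sketch using only Python's standard `socket` module; the function name `port_is_open` is made up for illustration, and the check only shows whether a TCP connection can be opened, not whether HTCondor itself is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        # create_connection resolves the host name and attempts a TCP connect.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused" (errno 111 on Linux), timeouts
        # and name-resolution failures alike.
        return False
```

Calling `port_is_open("lcggwms02.gridpp.rl.ac.uk", 9623)` with the host and port from the telnet test above would be expected to return `False` for as long as the connection is refused.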
ID: 4567
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 172
Message 4568 - Posted: 21 Dec 2016, 10:34:01 UTC - in response to Message 4567.  

Not enough?

telnet lcggwms02.gridpp.rl.ac.uk 9623
Trying 130.246.180.120...
telnet: connect to address 130.246.180.120: Connection refused

Andrew must be working on it. I can log in to the server but Condor appears to have been stopped.
ID: 4568
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 172
Message 4584 - Posted: 24 Dec 2016, 20:19:40 UTC
Last modified: 24 Dec 2016, 23:22:28 UTC

I've just submitted my second batch of WMAgent tasks, to be run by -dev volunteers and Laurence's cluster. The first task started to run down a bit past the numbers I expected; I'll have to check what the actual criterion is. There's a bit of a delay with WMAgent, as far as I can tell, so I'm not sure if the new tasks will make it into the queue before it runs dry -- there's a chance of an hour or three without jobs, but I hope I caught it in time.

[Edit] Yep, caught it! [/Edit]

[Edit^2] Phew, thought for a second there with the Jobs graphs that the new jobs weren't actually running, but on closer inspection it now seems that Dashboard is classifying the new batch as "wmagent" rather than "unknown" as it had for my first batch. Don't need scares like that at 2320 Christmas Eve... [/Edit^2]
ID: 4584
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4631 - Posted: 29 Jan 2017, 7:34:02 UTC

No jobs running at all!
Cannot get new tasks
ID: 4631
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 172
Message 4632 - Posted: 29 Jan 2017, 11:09:43 UTC - in response to Message 4631.  

No jobs running at all!
Cannot get new tasks

Yes, something's happened to the queue. There are jobs ready to run but they don't seem to be getting to the Condor server. Investigating...
ID: 4632
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4633 - Posted: 30 Jan 2017, 9:33:53 UTC

Running jobs are increasing.

What was wrong and what can be done to stop it from happening again?
ID: 4633
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 172
Message 4635 - Posted: 30 Jan 2017, 16:28:32 UTC - in response to Message 4633.  

Running jobs are increasing.

What was wrong and what can be done to stop it from happening again?

Apparently the error handler in the WMAgent system failed, so it wasn't clearing the slots of errored jobs and the slots filled up so no new jobs could be sent. This is a central facility, beyond our control (we don't even have login access) so the only thing we can do is suggest better monitoring. Seems I might have been able to see it if I'd known which buttons to press on the WMStatus display, and then I could have tried to raise a trouble ticket.
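
The failure mode described above (errored jobs holding on to their slots until nothing new can start) can be shown with a toy model. The `SlotPool` class below is purely hypothetical and says nothing about how WMAgent is actually implemented; it only mirrors the behaviour as described in this post.

```python
class SlotPool:
    """Toy model of a fixed pool of job slots (not WMAgent's real data model)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.running = 0  # slots held by healthy jobs
        self.errored = 0  # slots held by failed jobs awaiting cleanup

    def free_slots(self) -> int:
        return self.capacity - self.running - self.errored

    def submit(self, n: int) -> int:
        """Start up to n jobs; return how many actually got a slot."""
        accepted = min(n, self.free_slots())
        self.running += accepted
        return accepted

    def fail(self, n: int) -> None:
        """Mark up to n running jobs as errored; they keep their slots."""
        n = min(n, self.running)
        self.running -= n
        self.errored += n

    def handle_errors(self) -> int:
        """What a working error handler would do: release errored slots."""
        released, self.errored = self.errored, 0
        return released
```

If `handle_errors()` is never called, every failed job permanently shrinks `free_slots()`, and once it reaches zero `submit()` accepts nothing, even though jobs are waiting upstream.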
ID: 4635
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 172
Message 4636 - Posted: 31 Jan 2017, 10:21:51 UTC - in response to Message 4635.  

Running jobs are increasing.

What was wrong and what can be done to stop it from happening again?

Apparently the error handler in the WMAgent system failed, so it wasn't clearing the slots of errored jobs and the slots filled up so no new jobs could be sent. This is a central facility, beyond our control (we don't even have login access) so the only thing we can do is suggest better monitoring. Seems I might have been able to see it if I'd known which buttons to press on the WMStatus display, and then I could have tried to raise a trouble ticket.

It failed again last night, and I raised a ticket. The person responsible has now implemented a watchdog that restarts ErrorHandler if it's dead for more than 20 minutes.
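
The details of that watchdog were not posted, so the following is only a sketch of the general idea: poll a liveness check, and trigger a restart once the component has been dead for longer than a grace period. The class, its parameters and the injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable, Optional


class Watchdog:
    """Restart a component once it has been dead longer than a grace period."""

    def __init__(
        self,
        is_alive: Callable[[], bool],
        restart: Callable[[], None],
        max_dead_seconds: float = 20 * 60,  # the 20 minutes mentioned in the post
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._is_alive = is_alive
        self._restart = restart
        self._max_dead = max_dead_seconds
        self._clock = clock
        self._dead_since: Optional[float] = None

    def check(self) -> bool:
        """Poll once; return True if a restart was triggered."""
        if self._is_alive():
            self._dead_since = None
            return False
        now = self._clock()
        if self._dead_since is None:
            # First time we see it dead: start the grace period.
            self._dead_since = now
            return False
        if now - self._dead_since > self._max_dead:
            self._restart()
            self._dead_since = None
            return True
        return False
```

In practice `check()` would be called periodically, for example from a cron job or a simple loop with a sleep, with `is_alive` and `restart` wired to whatever the real service offers.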
ID: 4636
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4639 - Posted: 31 Jan 2017, 17:17:57 UTC - in response to Message 4636.  

Very good.

What was wrong and what can be done to stop it from happening again?


I mentioned it, because the point of this project is exactly that.
ID: 4639
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 172
Message 4641 - Posted: 31 Jan 2017, 21:53:17 UTC - in response to Message 4639.  
Last modified: 31 Jan 2017, 21:58:33 UTC

Very good.

What was wrong and what can be done to stop it from happening again?


I mentioned it, because the point of this project is exactly that.

Exactly!
I can see, on a monitor that's only available to people with CERN credentials[1][2] (not necessarily CMS credentials), small notches in the idle jobs graphs that I think are the ErrorHandler falling over and getting back to its feet, but there is a larger variation, with an uneven 7-to-8 hour period in the running jobs graph which is mirrored by the spiky graph in our CMS job activity graph. I haven't yet identified the cause of this, but it's only been present in the last several days.

[1] https://batch-carbon.cern.ch/grafana/dashboard/db/cluster-batch-jobs?var-cluster=vcpool&from=now-24h&to=now-5m
[2] CMS@Home is user cmst1
ID: 4641
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 172
Message 4757 - Posted: 2 Mar 2017, 13:00:26 UTC

I've just noticed that something has gone wrong with job allocation. Perhaps best to set No New Tasks until it's sorted.
ID: 4757
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 172
Message 4763 - Posted: 2 Mar 2017, 22:56:06 UTC - in response to Message 4757.  

I've just noticed that something has gone wrong with job allocation. Perhaps best to set No New Tasks until it's sorted.

Some jobs are flowing now, so you can try (cautiously) allowing new jobs again.
ID: 4763


©2024 CERN