Message boards : CMS Application : Dip?
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Don't worry, Test4Theory got a bonus last night :)
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
The interesting thing is that the running jobs graph dropped from 600 to 200, but the number of failed/successful jobs dropped to 0. Which one was wrong?
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> The interesting thing is that the running jobs graph dropped from 600 to 200, but the number of failed/successful jobs dropped to 0.
Interesting question. It must at least be partly in how Dashboard calculates numbers from the messages it receives from Condor. I guess in the absence of a message that a job has finished, Dashboard considers it still running (until it deems it lost after 24 hours), so if Condor is unable to report, Dashboard will just say, "No jobs finished in the last hour, 200 are still running."
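The heuristic described above (no completion message means still running, lost after a 24-hour timeout) can be sketched roughly as follows. This is only an illustration of the guessed behaviour; the function and field names are hypothetical, not Dashboard's actual code:

```python
from datetime import datetime, timedelta

# Hypothetical sketch: a job with no completion message is counted as still
# running until a 24-hour timeout, after which it is deemed lost.
LOST_AFTER = timedelta(hours=24)

def infer_state(last_heartbeat, completion_msg, now):
    """Classify a job from the messages the monitor has received so far."""
    if completion_msg is not None:
        return completion_msg          # e.g. "success" or "failed"
    if now - last_heartbeat > LOST_AFTER:
        return "lost"                  # silent for too long
    return "running"                   # absence of news counts as running
```

On this model, a Condor outage produces exactly the symptom seen: finished/failed counts drop to zero while "running" stays high until the timeout expires.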
Joined: 28 Jul 16 Posts: 485 Credit: 394,839 RAC: 0
Look at this peak: http://lhcathomedev.cern.ch/vLHCathome-dev/cms_job.php
I didn't expect that my old machine could be so fast. :-)
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Sorry, that is me :) We are testing an alternative job injection method. This will hopefully allow CMS operations to take ownership and free Ivan from babysitting duties. Jobs from the WMAgent can be seen in purple. I am using some spare capacity from our internal OpenStack cloud. The VMs are created there directly rather than through BOINC, so don't worry, I will not get any BOINC credit for this.
Joined: 28 Jul 16 Posts: 485 Credit: 394,839 RAC: 0
> 2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] HTCondor ping
Same here.
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Looks like it. Message sent to the admin.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> Looks like it. Message sent to the admin.
/var has filled up again. I cleaned off a couple of large old log backups.
Joined: 28 Jul 16 Posts: 485 Credit: 394,839 RAC: 0
Not enough?
telnet lcggwms02.gridpp.rl.ac.uk 9623
Trying 130.246.180.120...
telnet: connect to address 130.246.180.120: Connection refused
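The same reachability test the telnet above performs can be scripted; this is just a generic TCP connect check (the helper name is made up), not anything Condor-specific:

```python
import socket

def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds,
    False on refusal or timeout -- equivalent to the telnet test above."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError and socket.timeout
        return False

# e.g. port_open("lcggwms02.gridpp.rl.ac.uk", 9623)
```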
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> Not enough?
Andrew must be working on it. I can log in to the server, but Condor appears to have been stopped.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
I've just submitted my second batch of WMAgent tasks, to be run by -dev volunteers and Laurence's cluster. The first task started to run down a bit past the numbers I expected; I'll have to check what the actual criterion is. There's a bit of a delay with WMAgent, as far as I can tell, so I'm not sure if the new tasks will make it into the queue before it runs dry -- there's a chance of an hour or three without jobs, but I hope I caught it in time.
[Edit] Yep, caught it! [/Edit]
[Edit^2] Phew, thought for a second there, looking at the Jobs graphs, that the new jobs weren't actually running, but on closer inspection it now seems that Dashboard is classifying the new batch as "wmagent" rather than "unknown" as it had for my first batch. Don't need scares like that at 23:20 on Christmas Eve... [/Edit^2]
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
No jobs running at all! Cannot get new tasks.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> No jobs running at all!
Yes, something's happened to the queue. There are jobs ready to run, but they don't seem to be getting to the Condor server. Investigating...
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Running jobs are increasing. What was wrong, and what can be done to stop it from happening again?
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> Running jobs are increasing.
Apparently the error handler in the WMAgent system failed, so it wasn't clearing the slots of errored jobs; the slots filled up and no new jobs could be sent. This is a central facility, beyond our control (we don't even have login access), so the only thing we can do is suggest better monitoring. It seems I might have been able to see it if I'd known which buttons to press on the WMStatus display, and then I could have tried to raise a trouble ticket.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> Running jobs are increasing.
It failed again last night, and I raised a ticket. The person responsible has now implemented a watchdog that restarts the ErrorHandler if it's dead for more than 20 minutes.
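The fix described (restart the ErrorHandler once it has been dead for more than 20 minutes) amounts to a watchdog loop. The sketch below only illustrates the decision logic; the names and the probe/restart helpers in the comment are hypothetical, not the actual WMAgent implementation:

```python
from datetime import datetime, timedelta

DEAD_AFTER = timedelta(minutes=20)  # threshold quoted in the post

def needs_restart(last_alive, now, threshold=DEAD_AFTER):
    """True when the monitored service has shown no sign of life for
    longer than the threshold."""
    return now - last_alive > threshold

# A real watchdog would loop, probing the service and invoking its restart
# command when the check trips, e.g. (hypothetical helpers):
#
#   while True:
#       if needs_restart(probe_error_handler(), datetime.utcnow()):
#           restart_error_handler()
#       time.sleep(60)
```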
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
> What was wrong, and what can be done to stop it from happening again?
Very good. I mentioned it because that is exactly the point of this project.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> Very good.
Exactly! On a monitor that's only available to people with CERN credentials[1][2] (not necessarily CMS credentials), I can see small notches in the idle-jobs graphs that I think are the ErrorHandler falling over and getting back to its feet. But there is a larger variation, with an uneven 7-to-8-hour period, in the running-jobs graph, which is mirrored by the spikes in our CMS job activity graph. I haven't yet identified the cause of this; it has only appeared in the last several days.
[1] https://batch-carbon.cern.ch/grafana/dashboard/db/cluster-batch-jobs?var-cluster=vcpool&from=now-24h&to=now-5m
[2] CMS@Home is user cmst1
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
I've just noticed that something has gone wrong with job allocation. Perhaps best to set No New Tasks until it's sorted.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> I've just noticed that something has gone wrong with job allocation. Perhaps best to set No New Tasks until it's sorted.
Some jobs are flowing now, so you can try (cautiously) allowing new jobs again.
©2025 CERN