Message boards : CMS Application : Dip?
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Don't worry, Test4Theory got a bonus last night :)
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
The interesting thing is that the running jobs graph dropped from 600 to 200, but the number of failed/successful jobs dropped to 0. Which one was wrong?
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> The interesting thing is that the running jobs graph dropped from 600 to 200, but the number of failed/successful jobs dropped to 0.
Interesting question. It must at least be partly in how Dashboard calculates numbers from the messages it receives from Condor. I guess in the absence of a message that a job has finished, Dashboard considers it still running (until it deems it lost after 24 hours), so if Condor is unable to report, Dashboard will just say, "No jobs finished in the last hour, 200 are still running."
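The heuristic described above (no completion message means still running, lost after a 24-hour timeout) can be sketched roughly as follows. This is only an illustration of the guessed behaviour; the function and field names are hypothetical, not Dashboard's actual code:

```python
from datetime import datetime, timedelta

# Hypothetical sketch: a job with no completion message is counted as still
# running until a 24-hour timeout, after which it is deemed lost.
LOST_AFTER = timedelta(hours=24)

def infer_state(last_heartbeat, completion_msg, now):
    """Classify a job from the messages the monitor has received so far."""
    if completion_msg is not None:
        return completion_msg          # e.g. "success" or "failed"
    if now - last_heartbeat > LOST_AFTER:
        return "lost"                  # silent for too long
    return "running"                   # absence of news counts as running
```

On this model, a Condor outage produces exactly the symptom seen: finished/failed counts drop to zero while "running" stays high until the timeout expires.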
Joined: 28 Jul 16 Posts: 485 Credit: 394,839 RAC: 0
Look at this peak: http://lhcathomedev.cern.ch/vLHCathome-dev/cms_job.php
I didn't expect that my old machine could be so fast. :-)
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Sorry, that is me :) We are testing an alternative job injection method. This will hopefully allow CMS operations to take ownership and free Ivan from babysitting duties. Jobs from the WMAgent can be seen in purple. I am using some spare capacity from our internal OpenStack cloud. The VMs are created there directly rather than through BOINC, so don't worry, I will not get any BOINC credit for this.
Joined: 28 Jul 16 Posts: 485 Credit: 394,839 RAC: 0
> 2016-12-20 22:59:37 (29181): Guest Log: [DEBUG] HTCondor ping
Same here.
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0
Looks like it. Message sent to the admin.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> Looks like it. Message sent to the admin.
/var has filled up again. I cleaned off a couple of large old log backups.
Joined: 28 Jul 16 Posts: 485 Credit: 394,839 RAC: 0
Not enough?
telnet lcggwms02.gridpp.rl.ac.uk 9623
Trying 130.246.180.120...
telnet: connect to address 130.246.180.120: Connection refused
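The same reachability test the telnet above performs can be scripted; this is just a generic TCP connect check (the helper name is made up), not anything Condor-specific:

```python
import socket

def port_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds,
    False on refusal or timeout -- equivalent to the telnet test above."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError and socket.timeout
        return False

# e.g. port_open("lcggwms02.gridpp.rl.ac.uk", 9623)
```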
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> Not enough?
Andrew must be working on it. I can log in to the server, but Condor appears to have been stopped.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
I've just submitted my second batch of WMAgent tasks, to be run by -dev volunteers and Laurence's cluster. The first task started to run down a bit past the numbers I expected; I'll have to check what the actual criterion is. There's a bit of a delay with WMAgent, as far as I can tell, so I'm not sure if the new tasks will make it into the queue before it runs dry -- there's a chance of an hour or three without jobs, but I hope I caught it in time.
[Edit] Yep, caught it! [/Edit]
[Edit^2] Phew, thought for a second there, looking at the Jobs graphs, that the new jobs weren't actually running, but on closer inspection it now seems that Dashboard is classifying the new batch as "wmagent" rather than "unknown" as it had for my first batch. Don't need scares like that at 23:20 on Christmas Eve... [/Edit^2]
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
No jobs running at all! Cannot get new tasks.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> No jobs running at all!
Yes, something's happened to the queue. There are jobs ready to run, but they don't seem to be getting to the Condor server. Investigating...
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
Running jobs are increasing. What was wrong, and what can be done to stop it from happening again?
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> Running jobs are increasing.
Apparently the error handler in the WMAgent system failed, so it wasn't clearing the slots of errored jobs; the slots filled up and no new jobs could be sent. This is a central facility, beyond our control (we don't even have login access), so the only thing we can do is suggest better monitoring. It seems I might have been able to see it if I'd known which buttons to press on the WMStatus display, and then I could have tried to raise a trouble ticket.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> Running jobs are increasing.
It failed again last night, and I raised a ticket. The person responsible has now implemented a watchdog that restarts the ErrorHandler if it's dead for more than 20 minutes.
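The fix described (restart the ErrorHandler once it has been dead for more than 20 minutes) amounts to a watchdog loop. The sketch below only illustrates the decision logic; the names and the probe/restart helpers in the comment are hypothetical, not the actual WMAgent implementation:

```python
from datetime import datetime, timedelta

DEAD_AFTER = timedelta(minutes=20)  # threshold quoted in the post

def needs_restart(last_alive, now, threshold=DEAD_AFTER):
    """True when the monitored service has shown no sign of life for
    longer than the threshold."""
    return now - last_alive > threshold

# A real watchdog would loop, probing the service and invoking its restart
# command when the check trips, e.g. (hypothetical helpers):
#
#   while True:
#       if needs_restart(probe_error_handler(), datetime.utcnow()):
#           restart_error_handler()
#       time.sleep(60)
```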
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0
> What was wrong, and what can be done to stop it from happening again?
Very good. I mentioned it because that is exactly the point of this project.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> Very good.
Exactly! On a monitor that's only available to people with CERN credentials[1][2] (not necessarily CMS credentials), I can see small notches in the idle-jobs graphs that I think are the ErrorHandler falling over and getting back to its feet. But there is a larger variation, with an uneven 7-to-8-hour period, in the running-jobs graph, which is mirrored by the spikes in our CMS job activity graph. I haven't yet identified the cause of this; it has only appeared in the last several days.
[1] https://batch-carbon.cern.ch/grafana/dashboard/db/cluster-batch-jobs?var-cluster=vcpool&from=now-24h&to=now-5m
[2] CMS@Home is user cmst1
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
I've just noticed that something has gone wrong with job allocation. Perhaps best to set No New Tasks until it's sorted.
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 13
> I've just noticed that something has gone wrong with job allocation. Perhaps best to set No New Tasks until it's sorted.
Some jobs are flowing now, so you can try (cautiously) allowing new jobs again.
©2025 CERN