Error rate going up
Author | Message |
---|---|
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
In the past 2 hours we had more errors than in the past 2 days. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
Tja, but we're still less than 0.5%, which is stellar performance compared to some runs in the past. I'll keep an eye out to see if there's any trend; the pie chart in the job activity page is a good place to see any upsurge in any particular failure mode. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
However, there seem to be "bursts" of errors occurring. Within a very short period of time (minutes, maybe an hour), the errors are densely packed, which suggests a server or internet issue. Is there a CERN server load diagram that shows server loads over time? Or something like it for the internet? (Well, I can look that up myself.) |
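One way to check whether failures really do arrive in bursts is to group their timestamps by the gap between consecutive errors. The following is a minimal sketch only: the function name `find_bursts`, the 5-minute gap and the minimum of 3 failures are all illustrative assumptions, and the timestamps would have to be collected by hand from the job listing.

```python
from datetime import datetime, timedelta


def find_bursts(times, max_gap=timedelta(minutes=5), min_count=3):
    """Group failure timestamps into bursts.

    Consecutive failures closer together than max_gap belong to the same
    group; only groups with at least min_count failures are reported.
    Returns a list of (first_failure, last_failure, count) tuples.
    Both thresholds are assumptions chosen for illustration.
    """
    bursts = []
    group = []
    for t in sorted(times):
        # A gap larger than max_gap closes the current group.
        if group and t - group[-1] > max_gap:
            if len(group) >= min_count:
                bursts.append((group[0], group[-1], len(group)))
            group = []
        group.append(t)
    # Do not forget the final group.
    if len(group) >= min_count:
        bursts.append((group[0], group[-1], len(group)))
    return bursts


# Hypothetical failure times: two tight clusters and one isolated error.
base = datetime(2016, 4, 15, 12, 0)
failures = [base + timedelta(minutes=m) for m in (0, 1, 2, 180, 360, 361, 362, 363)]
for start, end, count in find_bursts(failures):
    print(start.isoformat(), end.isoformat(), count)
```

If most failures fall inside a few short bursts, that points towards a server or network cause rather than individual hosts.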
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
"However, there seem to be 'bursts' of errors occurring." Are you seeing the IP addresses for the failing machines? If so, are they the same host, or scattered around? |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
They are scattered. Maybe I should just shut up and enjoy the low rate. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Large numbers of jobs in WNpostproc and unknown state. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
"Large numbers of jobs in WNpostproc and unknown state." That may be related to some config changes I was asked to make last night, as the PhEDEx server wasn't restarted until today. I'll keep an eye on it. [Edit] I looked at several of the WNPostProc jobs and they'd all finished successfully and prepared the report for Dashboard. Looks like something's preventing Dashboard from getting the reports. [/Edit] |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks Ivan. Why did the number of running tasks increase so much so quickly? |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
"Thanks Ivan." I think Hassen is running more WMAgent jobs, but I haven't been able to verify it -- I actually just sent him an e-mail to ask exactly that! Understand that the "headline" figures on the Activities page (NNN queued, MMM running) are essentially derived from a "condor_q" cron job that runs every minute on the Condor server at RAL, and stuffs the result into a Web-accessible location. So that relates to just "my" CRAB3 submissions. The graphs, on the other hand, (as far as I can decipher their URLs) come from all the jobs that Dashboard knows about that are/have been running on the T3_CH_Volunteer Grid pseudo-site, which can include the WMAgent submissions as well. |
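To make the "headline figures" mechanism concrete, here is a rough sketch of what such a once-a-minute job could do. It is not the actual script used at RAL, which is not shown in this thread. It assumes the summary line that condor_q prints at the end of its output has the form "N jobs; ... I idle, R running, H held, ...", that "queued" on the Activities page corresponds to idle jobs, and that the result is published as a small JSON file; the function names and the file format are hypothetical.

```python
import json
import os
import re
import tempfile

# Matches pairs such as "5 idle" or "7 running" in a condor_q summary line.
STATE_RE = re.compile(r"(\d+)\s+(idle|running|held)")


def parse_totals(summary):
    """Extract idle/running/held counts from a condor_q summary line."""
    counts = {"idle": 0, "running": 0, "held": 0}
    for number, state in STATE_RE.findall(summary):
        counts[state] = int(number)
    return counts


def publish(counts, path):
    """Write the counts as JSON, replacing any previous file atomically
    so a web server never serves a half-written result."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(counts, handle)
    os.replace(tmp_path, path)


# Assumed example of a summary line; real output may differ between versions.
line = "12 jobs; 0 completed, 0 removed, 5 idle, 7 running, 0 held, 0 suspended"
print(parse_totals(line))
```

Because a job like this only sees the queue of one Condor server, its totals would cover only the submissions that go through that server, which is consistent with the headline figures and the Dashboard graphs disagreeing.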
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
The tasks in progress on the T4T server are also high. Why would so many more than before suddenly join the CMS tasks? I was watching Hassen's jobs, and there was no activity. Edit: there are also about 16 running tasks that have a finish (?) time before they started? |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
OK, you are asking questions outside my knowledge zone now. Hopefully someone from CERN IT will pick up later today. G'night! |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks for trying. The previous batch went very well and I just wanted to know why, since yesterday morning, it has all gone pear-shaped again. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
"Thanks for trying." You and me both. Oh, well; a new day, new challenges. |
Joined: 13 Feb 15 Posts: 1188 Credit: 861,475 RAC: 2 |
"I was watching Hassen's jobs, and there was no activity." A lot of Hassen's tasks from early April were resubmitted on the 14th of April and most of them failed: 5429 aborted. One of the batches resubmitted: wmagent_riahi_TEST_Volunteer_RAL_0911-T3_CH_VolunteerBackfill_160404_230929_2784 |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
Can you guys reach this web-page? I'm not sure if it's publicly accessible or not, but it does give some separation between CRAB and WMAgent jobs (i.e., me and Hassen ATM). There are other views to be seen with the "Plotting Category" buttons on the left, if you get in. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Yes, it works. There are currently at least 3 different WMAgent batches running. |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
If you go here http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=133 and at step 3, click "Activities/Any activity" or whatever seems appropriate and under "Submission Tools" you can select wmagent or crab3 to see separate results. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
All 5 of the current failures are from the same host. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
"All 5 of the current failures are from the same host." Familiar IP... PM sent. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 728 |
We seem to have lost contact with the Condor server which doles out CRAB3 jobs: lcggwms02.gridpp.rl.ac.uk. I can't ping it, log in to it or get its statistics on the "CMS Jobs" page. I've e-mailed RAL but don't expect a response at this time of (Friday) night. Ignore that; must have been a transient communications problem rather than a server problem. Contact re-established. |
©2024 CERN