Message boards : CMS Application : Error rate going up

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2615 - Posted: 8 Apr 2016, 20:22:23 UTC

In the past 2 hours we had more errors than in the past 2 days.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 2616 - Posted: 9 Apr 2016, 9:28:15 UTC - in response to Message 2615.  

Well, we're still below 0.5%, which is stellar performance compared to some runs in the past. I'll keep an eye out for any trend; the pie chart on the job activity page is a good place to spot an upsurge in any particular failure mode.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2619 - Posted: 9 Apr 2016, 12:46:26 UTC - in response to Message 2616.  

However, there seem to be "bursts" of errors occurring.
Within a very short period of time (minutes, maybe an hour), the errors are densely packed, which indicates a server/internet issue.
Is there a CERN server load diagram that shows server loads over time?
Or something like it for the internet? (Well, I can look that up myself.)
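
One way to check the "bursts" impression against actual numbers is a sliding-window count over the error timestamps. The sketch below is only an illustration: the function name `burst_windows`, the 10-minute window and the threshold of 5 errors are all assumptions, not anything the project's monitoring is known to use.

```python
from datetime import datetime, timedelta


def burst_windows(timestamps, window, threshold):
    """Return (first_error, last_error, count) for every point in time
    where at least `threshold` errors fall within `window`.

    All names and parameters here are hypothetical; tune window and
    threshold to the job volume you actually see.
    """
    times = sorted(timestamps)
    bursts = []
    start = 0  # index of the oldest error still inside the window
    for end, current in enumerate(times):
        # Drop errors that are older than `window` relative to `current`.
        while current - times[start] > window:
            start += 1
        count = end - start + 1
        if count >= threshold:
            bursts.append((times[start], current, count))
    return bursts


if __name__ == "__main__":
    # Made-up example data: five errors within four minutes, one straggler.
    base = datetime(2016, 4, 8, 18, 0)
    errors = [base + timedelta(minutes=m) for m in range(5)]
    errors.append(base + timedelta(hours=6))
    for first, last, count in burst_windows(errors, timedelta(minutes=10), 5):
        print(f"{count} errors between {first} and {last}")
```

If the reported windows line up across many unrelated hosts, that would point towards the server or network side rather than individual volunteers.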
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 2621 - Posted: 9 Apr 2016, 14:55:34 UTC - in response to Message 2619.  
Last modified: 9 Apr 2016, 14:56:41 UTC

However, there seem to be "bursts" of errors occurring.
Within a very short period of time (minutes, maybe an hour), the errors are densely packed, which indicates a server/internet issue.
Is there a CERN server load diagram that shows server loads over time?
Or something like it for the internet? (Well, I can look that up myself.)

Are you seeing the IP addresses for the failing machines? If so, are they the same host, or scattered around?
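
The same-host-or-scattered question can be answered by tallying failures per host. This is a minimal sketch assuming the failures are available as (host, job id) pairs; the helper names and the 80% cut-off are hypothetical.

```python
from collections import Counter


def failures_by_host(failed_jobs):
    """Count failures per host from (host, job_id) pairs (assumed input shape)."""
    return Counter(host for host, _job_id in failed_jobs)


def looks_concentrated(counts, share=0.8):
    """True if a single host accounts for at least `share` of all failures.

    The 0.8 cut-off is an arbitrary illustration, not a project rule.
    """
    total = sum(counts.values())
    if total == 0:
        return False
    _host, top = counts.most_common(1)[0]
    return top / total >= share


if __name__ == "__main__":
    # Made-up data: placeholder host names, not real volunteer machines.
    scattered = [(f"host-{i}", i) for i in range(5)]
    single = [("host-a", i) for i in range(5)]
    print(looks_concentrated(failures_by_host(scattered)))
    print(looks_concentrated(failures_by_host(single)))
```

A concentrated result suggests one misbehaving machine; a scattered one is more consistent with a server-side or network cause.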
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2622 - Posted: 9 Apr 2016, 15:51:28 UTC - in response to Message 2621.  

They are scattered.
Maybe I should just shut up and enjoy the low rate.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2755 - Posted: 14 Apr 2016, 14:12:31 UTC

Large numbers of jobs are in the WNPostProc and unknown states.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 2757 - Posted: 14 Apr 2016, 15:38:58 UTC - in response to Message 2755.  
Last modified: 14 Apr 2016, 15:51:25 UTC

Large numbers of jobs are in the WNPostProc and unknown states.

That may be related to some config changes I was asked to make last night, as the PhEDEx server wasn't restarted until today. I'll keep an eye on it.

[Edit] I looked at several of the WNPostProc jobs and they'd all finished successfully and prepared the report for Dashboard. Looks like something's preventing Dashboard from getting the reports. [/Edit]
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2763 - Posted: 14 Apr 2016, 19:15:06 UTC - in response to Message 2757.  

Thanks Ivan.
Why did the number of running tasks increase so much so quickly?
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 2764 - Posted: 14 Apr 2016, 22:52:26 UTC - in response to Message 2763.  

Thanks Ivan.
Why did the number of running tasks increase so much so quickly?

I think Hassen is running more WMAgent jobs, but I haven't been able to verify it -- I actually just sent him an e-mail to ask exactly that!
Understand that the "headline" figures on the Activities page (NNN queued, MMM running) are essentially derived from a "condor_q" cron job that runs every minute on the Condor server at RAL, and stuffs the result into a Web-accessible location. So that relates to just "my" CRAB3 submissions. The graphs, on the other hand, (as far as I can decipher their URLs) come from all the jobs that Dashboard knows about that are/have been running on the T3_CH_Volunteer Grid pseudo-site, which can include the WMAgent submissions as well.
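
To make the "headline" figures concrete, here is a rough sketch of how a cron job could turn queue output into queued/running counts. How the real job at RAL parses `condor_q` is not shown in this thread; the sketch assumes the input is one integer JobStatus code per line (for instance from condor_q's autoformat option), and the function name `headline_counts` is hypothetical. The status codes themselves are HTCondor's: 1 is Idle, 2 is Running.

```python
from collections import Counter

# HTCondor JobStatus codes used here: 1 = Idle, 2 = Running.
# Other codes exist (e.g. 5 = Held) and are ignored for the headline.
IDLE, RUNNING = 1, 2


def headline_counts(job_status_lines):
    """Summarise one-JobStatus-per-line text into queued/running totals.

    The input format is an assumption for illustration only.
    """
    statuses = Counter(
        int(line) for line in job_status_lines if line.strip()
    )
    return {"queued": statuses[IDLE], "running": statuses[RUNNING]}


if __name__ == "__main__":
    # Made-up sample: one idle job, two running, one held, one blank line.
    sample = ["1", "2", "2", "5", ""]
    print(headline_counts(sample))
```

Because such a summary only sees the jobs in that one Condor queue, it cannot include submissions that reach the volunteer site by another route, which is consistent with the headline and the graphs disagreeing.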
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2765 - Posted: 14 Apr 2016, 22:56:43 UTC - in response to Message 2764.  
Last modified: 14 Apr 2016, 23:19:18 UTC

The tasks in progress on the T4T server are also high.
Why would so many more than before suddenly join the CMS tasks?

I was watching Hassen's jobs, and there was no activity.

Edit: there are also about 16 running tasks that show a finish(?) time earlier than their start time.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 2766 - Posted: 14 Apr 2016, 23:28:39 UTC - in response to Message 2765.  
Last modified: 14 Apr 2016, 23:29:14 UTC

OK, you are asking questions outside my knowledge zone now. Hopefully someone from CERN IT will pick up later today. G'night!
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2768 - Posted: 15 Apr 2016, 7:36:02 UTC - in response to Message 2766.  
Last modified: 15 Apr 2016, 7:37:52 UTC

Thanks for trying.
The previous batch went very well, and I just wanted to know why, since yesterday morning, it has all gone pear-shaped again.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 2769 - Posted: 15 Apr 2016, 7:41:41 UTC - in response to Message 2768.  
Last modified: 15 Apr 2016, 7:41:54 UTC

Thanks for trying.
The previous batch went very well, and I just wanted to know why, since yesterday morning, it has all gone pear-shaped again.

You and me both. Oh, well; a new day, new challenges.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1178
Credit: 810,202
RAC: 2,083
Message 2770 - Posted: 15 Apr 2016, 8:00:17 UTC - in response to Message 2765.  
Last modified: 15 Apr 2016, 8:00:46 UTC

I was watching Hassen's jobs, and there was no activity.


A lot of Hassen's tasks from early April were resubmitted on the 14th of April, and most of them failed: 5429 aborted.

One of the batches resubmitted: wmagent_riahi_TEST_Volunteer_RAL_0911-T3_CH_VolunteerBackfill_160404_230929_2784
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 2780 - Posted: 15 Apr 2016, 18:24:24 UTC

Can you guys reach this web-page? I'm not sure if it's publicly accessible or not, but it does give some separation between CRAB and WMAgent jobs (i.e., me and Hassen ATM). There are other views to be seen with the "Plotting Category" buttons on the left, if you get in.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2781 - Posted: 15 Apr 2016, 18:32:32 UTC - in response to Message 2780.  

Yes, it works.
There are currently at least 3 different WMAgent batches running.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 874,518
RAC: 460
Message 2782 - Posted: 15 Apr 2016, 21:26:49 UTC - in response to Message 2780.  
Last modified: 15 Apr 2016, 21:52:12 UTC

If you go to http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=133 and, at step 3, click "Activities/Any activity" (or whatever seems appropriate), you can then select wmagent or crab3 under "Submission Tools" to see separate results.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2783 - Posted: 15 Apr 2016, 21:27:54 UTC

All 5 of the current failures are from the same host.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 2785 - Posted: 15 Apr 2016, 21:46:56 UTC - in response to Message 2783.  

All 5 of the current failures are from the same host.

Familiar IP... PM sent.
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 2790 - Posted: 15 Apr 2016, 23:42:30 UTC
Last modified: 15 Apr 2016, 23:49:45 UTC

Probably the best place to post this:

We seem to have lost contact with the Condor server which doles out CRAB3 jobs:
lcggwms02.gridpp.rl.ac.uk
I can't ping it, log in to it or get its statistics on the "CMS Jobs" page. I've e-mailed RAL but don't expect a response at this time of (Friday) night.

Ignore that; must have been a transient communications problem rather than a server problem. Contact re-established.

©2024 CERN