Thread 'Error rate going up'

Author	Message
Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2615 - Posted: 8 Apr 2016, 20:22:23 UTC In the past 2 hours we had more errors than in the past 2 days.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2616 - Posted: 9 Apr 2016, 9:28:15 UTC - in response to Message 2615. Tja, but we're still less than 0.5%, which is stellar performance compared to some runs in the past. I'll keep an eye out to see if there's any trend; the pie chart in the job activity page is a good place to see any upsurge in any particular failure mode.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2619 - Posted: 9 Apr 2016, 12:46:26 UTC - in response to Message 2616. However, there seem to be "bursts" of errors occurring. Within a very short period of time (minutes, maybe an hour), the errors are densely packed, which suggests a server/internet issue. Is there a CERN server load diagram that shows server loads over time? Or something like it for the internet? (Well, I can look that up myself.)

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2621 - Posted: 9 Apr 2016, 14:55:34 UTC - in response to Message 2619. Last modified: 9 Apr 2016, 14:56:41 UTC [quote]However, there seem to be "bursts" of errors occurring. Within a very short period of time (minutes, maybe an hour), the errors are densely packed, which suggests a server/internet issue. Is there a CERN server load diagram that shows server loads over time? Or something like it for the internet? (Well, I can look that up myself.)[/quote] Are you seeing the IP addresses for the failing machines? If so, are they the same host, or scattered around?

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2622 - Posted: 9 Apr 2016, 15:51:28 UTC - in response to Message 2621. They are scattered. Maybe I should just shut up and enjoy the low rate.
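
One way to check both observations in the exchange above (errors packed into short windows, and failing hosts scattered rather than concentrated) is sketched here in Python. It assumes a hypothetical list of (timestamp, host) failure records; the function names, the 10-minute gap and the minimum burst size are illustrative choices, not anything taken from Dashboard.

```python
from collections import Counter
from datetime import datetime, timedelta


def find_bursts(times, max_gap=timedelta(minutes=10), min_count=5):
    """Group failure times into clusters separated by more than max_gap.

    Returns a list of (first_time, last_time, count) for every cluster
    that contains at least min_count failures.
    """
    bursts = []
    cluster = []
    for t in sorted(times):
        # A gap larger than max_gap closes the current cluster.
        if cluster and t - cluster[-1] > max_gap:
            if len(cluster) >= min_count:
                bursts.append((cluster[0], cluster[-1], len(cluster)))
            cluster = []
        cluster.append(t)
    if len(cluster) >= min_count:
        bursts.append((cluster[0], cluster[-1], len(cluster)))
    return bursts


def failures_per_host(records):
    """Count failures per host from (timestamp, host) records."""
    return Counter(host for _, host in records)


# Made-up sample data: six failures one minute apart, plus two isolated ones.
base = datetime(2016, 4, 8, 18, 0)
records = [(base + timedelta(minutes=i), f"host-{i}") for i in range(6)]
records.append((base - timedelta(hours=5), "host-a"))
records.append((base + timedelta(hours=5), "host-b"))

bursts = find_bursts([t for t, _ in records])
hosts = failures_per_host(records)
```

If a burst shows up while the per-host counts stay flat, that points away from a single misbehaving volunteer machine and towards something shared, such as a server or network problem.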

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2755 - Posted: 14 Apr 2016, 14:12:31 UTC Large numbers of jobs in WNPostProc and unknown state.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2757 - Posted: 14 Apr 2016, 15:38:58 UTC - in response to Message 2755. Last modified: 14 Apr 2016, 15:51:25 UTC [quote]Large numbers of jobs in WNPostProc and unknown state.[/quote] That may be related to some config changes I was asked to make last night, as the PhEDEx server wasn't restarted until today. I'll keep an eye on it. [Edit] I looked at several of the WNPostProc jobs and they'd all finished successfully and prepared the report for Dashboard. Looks like something's preventing Dashboard from getting the reports. [/Edit]

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2763 - Posted: 14 Apr 2016, 19:15:06 UTC - in response to Message 2757. Thanks Ivan. Why did the number of running tasks increase so much so quickly?

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2764 - Posted: 14 Apr 2016, 22:52:26 UTC - in response to Message 2763. [quote]Thanks Ivan. Why did the number of running tasks increase so much so quickly?[/quote] I think Hassen is running more WMAgent jobs, but I haven't been able to verify it -- I actually just sent him an e-mail to ask exactly that! Understand that the "headline" figures on the Activities page (NNN queued, MMM running) are essentially derived from a "condor_q" cron job that runs every minute on the Condor server at RAL and stuffs the result into a Web-accessible location. So that relates to just "my" CRAB3 submissions. The graphs, on the other hand (as far as I can decipher their URLs), come from all the jobs that Dashboard knows about that are/have been running on the T3_CH_Volunteer Grid pseudo-site, which can include the WMAgent submissions as well.
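
As a rough illustration of the kind of per-minute tally described above (a sketch only, not the actual cron job at RAL): `condor_q -af JobStatus` prints one HTCondor JobStatus code per job, where 1 means idle, 2 means running and 5 means held. The function names `tally_job_status` and `headline`, and the output wording, are hypothetical.

```python
from collections import Counter

# HTCondor JobStatus codes used here: 1 = idle (queued), 2 = running, 5 = held.
STATUS_NAMES = {1: "queued", 2: "running", 5: "held"}


def tally_job_status(condor_q_output: str) -> dict:
    """Count jobs per state from the text printed by `condor_q -af JobStatus`."""
    counts = Counter()
    for line in condor_q_output.splitlines():
        line = line.strip()
        if not line.isdigit():
            continue  # skip blank or unexpected lines
        counts[STATUS_NAMES.get(int(line), "other")] += 1
    return {name: counts.get(name, 0)
            for name in ("queued", "running", "held", "other")}


def headline(counts: dict) -> str:
    """Format the tally in the 'NNN queued, MMM running' style."""
    return f"{counts['queued']} queued, {counts['running']} running"


if __name__ == "__main__":
    # Made-up sample standing in for real condor_q output.
    sample = "1\n1\n2\n2\n2\n5\n"
    print(headline(tally_job_status(sample)))
```

Because a tally like this only sees the queue of the one Condor server it runs on, it would naturally miss jobs submitted through another route, which matches the gap between the headline figures and the Dashboard graphs.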

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2765 - Posted: 14 Apr 2016, 22:56:43 UTC - in response to Message 2764. Last modified: 14 Apr 2016, 23:19:18 UTC The tasks in progress on the T4T server are also high. Why would so many more than before suddenly join the CMS tasks? I was watching Hassen's jobs, and there was no activity. Edit: there are also about 16 running tasks that show a finish(?) time before they started.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2766 - Posted: 14 Apr 2016, 23:28:39 UTC - in response to Message 2765. Last modified: 14 Apr 2016, 23:29:14 UTC OK, you are asking questions outside my knowledge zone now. Hopefully someone from CERN IT will pick up later today. G'night!

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2768 - Posted: 15 Apr 2016, 7:36:02 UTC - in response to Message 2766. Last modified: 15 Apr 2016, 7:37:52 UTC Thanks for trying. The previous batch went very well and I just wanted to know why, since yesterday morning, it has all gone pear-shaped again.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2769 - Posted: 15 Apr 2016, 7:41:41 UTC - in response to Message 2768. Last modified: 15 Apr 2016, 7:41:54 UTC [quote]Thanks for trying. The previous batch went very well and I just wanted to know why, since yesterday morning, it has all gone pear-shaped again.[/quote] You and me both. Oh, well; a new day, new challenges.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,863 RAC: 78	Message 2770 - Posted: 15 Apr 2016, 8:00:17 UTC - in response to Message 2765. Last modified: 15 Apr 2016, 8:00:46 UTC [quote]I was watching Hassen's jobs, and there was no activity.[/quote] A lot of Hassen's tasks from early April were resubmitted on the 14th of April and most of them failed (5429 aborted). One of the batches resubmitted: wmagent_riahi_TEST_Volunteer_RAL_0911-T3_CH_VolunteerBackfill_160404_230929_2784

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2780 - Posted: 15 Apr 2016, 18:24:24 UTC Can you guys reach this web-page? I'm not sure if it's publicly accessible or not, but it does give some separation between CRAB and WMAgent jobs (i.e., me and Hassen ATM). There are other views to be seen with the "Plotting Category" buttons on the left, if you get in.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2781 - Posted: 15 Apr 2016, 18:32:32 UTC - in response to Message 2780. Yes, it works. There are currently at least 3 different WMAgent batches running.

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 2782 - Posted: 15 Apr 2016, 21:26:49 UTC - in response to Message 2780. Last modified: 15 Apr 2016, 21:52:12 UTC If you go here http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=133 and, at step 3, click "Activities/Any activity" or whatever seems appropriate, then under "Submission Tools" you can select wmagent or crab3 to see separate results.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2783 - Posted: 15 Apr 2016, 21:27:54 UTC All 5 of the current failures are from the same host.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2785 - Posted: 15 Apr 2016, 21:46:56 UTC - in response to Message 2783. [quote]All 5 of the current failures are from the same host.[/quote] Familiar IP... PM sent.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2790 - Posted: 15 Apr 2016, 23:42:30 UTC Last modified: 15 Apr 2016, 23:49:45 UTC Probably the best place to post this: We seem to have lost contact with the Condor server which doles out CRAB3 jobs: lcggwms02.gridpp.rl.ac.uk. I can't ping it, log in to it, or get its statistics on the "CMS Jobs" page. I've e-mailed RAL but don't expect a response at this time of (Friday) night. Ignore that; must have been a transient communications problem rather than a server problem. Contact re-established.
