Thread 'Error rate going up'

Author	Message
ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2795 - Posted: 16 Apr 2016, 9:58:51 UTC - in response to Message 2785. All of the currently 5 fails are from the same host. Familiar IP... PM sent. Machine taken off-line. ID: 2795 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2796 - Posted: 16 Apr 2016, 10:02:59 UTC - in response to Message 2790. Last modified: 16 Apr 2016, 10:10:00 UTC Probably the best place to post this: We seem to have lost contact with the Condor server which doles out CRAB3 jobs: lcggwms02.gridpp.rl.ac.uk I can't ping it, log in to it or get its statistics on the "CMS Jobs" page. I've e-mailed RAL but don't expect a response at this time of (Friday) night. Ignore that; must have been a transient communications problem rather than a server problem. Contact re-established. Reply from RAL: There was a network "glitch" last night affecting everything in the RAL Tier-1. I don't know yet what caused it. ...and the official announcement: Date: Sat, 16 Apr 2016 09:53:44 +0100 From: EGI BROADCAST Subject: [ EGI BROADCAST ] Network problems at RAL Tier1 We have experienced a series of network outages overnight, affecting the Tier 1 and other RAL based services. A member of the network team is on-site. There is no time to fix yet. link to this broadcast : https://operations-portal.egi.eu/broadcast/archive/id/1357 ...so that may explain some of our ongoing problems (both CRAB and WMAgent submissions are via servers at RAL). ID: 2796 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2797 - Posted: 16 Apr 2016, 10:33:21 UTC Last modified: 16 Apr 2016, 10:34:17 UTC Thanks, Ivan. https://operations-portal.egi.eu/broadcast/archive/id/1357 does not work (permissions). Any news on the large number of WNPostproc and unknown state jobs of the previous batch? It is not improving. ID: 2797 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2798 - Posted: 16 Apr 2016, 10:56:47 UTC Last modified: 16 Apr 2016, 11:30:24 UTC CMS tasks on the T4T site are shutting down after 5 min. They do 3 runs and quit. EDIT: NO CMS TASK STARTS UP. ID: 2798 · Rating: 0 · rate: / Reply Quote

m Volunteer tester Send message Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 2799 - Posted: 16 Apr 2016, 13:21:02 UTC - in response to Message 2797. Thanks, Ivan. https://operations-portal.egi.eu/broadcast/archive/id/1357 does not work (permissions). Go to the home page and look under Latest News. ID: 2799 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2800 - Posted: 16 Apr 2016, 13:34:26 UTC - in response to Message 2799. Thanks, m. I always thought we were Tier 3. ID: 2800 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2803 - Posted: 16 Apr 2016, 15:18:29 UTC - in response to Message 2797. Thanks, Ivan. https://operations-portal.egi.eu/broadcast/archive/id/1357 does not work (permissions). I wasn't sure if it would or not; I included it in case it did. Any news on the large number of WNPostproc and unknown state jobs of the previous batch? It is not improving. I don't think it will, overmuch. I just renewed the proxy for another 8 days, as it was coming up to the seven-day lifetime, so the black sheep have that long to come home. :-) ID: 2803 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2804 - Posted: 16 Apr 2016, 15:23:29 UTC - in response to Message 2800. Thanks, m. I always thought we were Tier 3. We are (T3_CH_Volunteer), but the Condor/WMAgent server VMs are hosted within the RAL Tier-1 site. To make it more confusing, there's a Tier-2 site at RAL too! (T0 is CERN, T1 is national, T2 is regional, T3 is normally institutional.) ID: 2804 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2805 - Posted: 16 Apr 2016, 17:33:07 UTC - in response to Message 2803. Last modified: 16 Apr 2016, 17:35:33 UTC Any news on the large number of WNPostproc and unknown state jobs of the previous batch? It is not improving. I don't think it will, overmuch. I just renewed the proxy for another 8 days, as it was coming up to the seven-day lifetime, so the black sheep have that long to come home. :-) [Edit] We've definitely had communications issues over the life of the last batch; Dashboard says there were 9193 successes to date, but I only count 8843 apparently-good result files on the Data Bridge. [/Edit] ID: 2805 · Rating: 0 · rate: / Reply Quote
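For scale, the gap between the two counts quoted in the post above can be worked out directly. This is only a sketch of the arithmetic using the figures from the post; the variable names are illustrative, and it says nothing about why the results went missing.

```python
# Figures quoted in the post above.
dashboard_successes = 9193   # successes reported by Dashboard
data_bridge_files = 8843     # apparently-good result files counted on the Data Bridge

# Results that Dashboard counts as successful but that have no file on the Data Bridge.
missing = dashboard_successes - data_bridge_files
fraction = missing / dashboard_successes

print(f"{missing} results unaccounted for ({fraction:.1%} of reported successes)")
```

That is 350 results, or roughly 3.8% of the reported successes.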

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2806 - Posted: 16 Apr 2016, 19:08:41 UTC - in response to Message 2805. Any news on the large number of WNPostproc and unknown state jobs of the previous batch? It is not improving. I don't think it will, overmuch. I just renewed the proxy for another 8 days, as it was coming up to the seven-day lifetime, so the black sheep have that long to come home. :-) [Edit] We've definitely had communications issues over the life of the last batch; Dashboard says there were 9193 successes to date, but I only count 8843 apparently-good result files on the Data Bridge. [/Edit] # But did they ever return? # No they never returned, # And their fate is still unlearnt. # They may glide forever # 'Round the Internet fibres. # They're the jobs that never returned! https://www.youtube.com/watch?v=Dh994JcEfkI ID: 2806 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2807 - Posted: 16 Apr 2016, 19:20:47 UTC - in response to Message 2806. My guess is, they went where lost socks go. http://uncyclopedia.wikia.com/wiki/Missing_socks ID: 2807 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1000 Credit: 17,863,376 RAC: 19,316	Message 2808 - Posted: 16 Apr 2016, 20:15:43 UTC - in response to Message 2806. # But did they ever return? # No they never returned, # And their fate is still unlearnt. # They may glide forever # 'Round the Internet fibres. # They're the jobs that never returned! https://www.youtube.com/watch?v=Dh994JcEfkI Kingston Trio About the same age as NASA........and me........watching NASA on black and white TV ID: 2808 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2809 - Posted: 16 Apr 2016, 22:08:31 UTC - in response to Message 2808. Last modified: 16 Apr 2016, 22:11:20 UTC You must be the same age as me -- conscripted to fight in Vietnam, reprieved by a change in government. Another famous Kingston Trio song. Coincidentally, that song earwormed me for a day or two recently, with no trigger whatsoever that I recall! ID: 2809 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2810 - Posted: 16 Apr 2016, 22:15:13 UTC Last modified: 16 Apr 2016, 22:21:31 UTC BTW, you might have noticed that job rates have fallen in the past few hours. I've not found a reason for that, and I'm about to slope off for a good night's sleep. Investigation will continue about 1000Z tomorrow! [Edit] Ah, I had to look one more time before bed... WMAgent jobs are rising, pre-empting CRAB3 jobs. (I'll go to sleep now, promise!) [/Edit] ID: 2810 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 1000 Credit: 17,863,376 RAC: 19,316	Message 2811 - Posted: 17 Apr 2016, 5:34:45 UTC - in response to Message 2809. You must be the same age as me -- conscripted to fight in Vietnam, reprieved by a change in government. Another famous Kingston Trio song. Coincidentally, that song earwormed me for a day or two recently, with no trigger whatsoever that I recall! THAT is a Kingston Trio classic......and I know if I play that song I will still have it in my head when I wake up tomorrow I think that is one of those things that happen to us oldtimers Ivan These days I can get them stuck in my head for days and some are ones I would rather not have in my head at all! I even get those from old tv shows I watch these days since I have a channel that shows the old classics from the 50's,60's,and 70's The only way I can get rid of them is play another one from the past that I don't mind.....even after a day or two. Ok now as far as our tasks here.....as usual if I don't watch my hosts 24/7 those "Error while computing" will run that streak over and over every 10 minutes. So when I checked I had a Valid task and then 14 Error while computing in a row so I stopped them and did a reboot and got a new task running and set them to NOT get new ones just in case that started again. So far one host got another Valid and once again it gave me a ERROR on the CMS in 10 minutes so since I was watching I took a few minutes to do a Windows 10 update and reboot and tried again and this time got a LHCb ERROR. Since it is set to not get new tasks it won't start another streak (it just happened on that desktop upstairs and I am back in the livingroom on the laptop) and I tried one on here just to see if it would run one since I suspended my Atlas tasks......no luck so it is back off my host list here. My other 2 hosts still have tasks running that I started yesterday so maybe they will finish Valid. 
They all have looked ok when I watched them start on the VM Console BUT what I have been getting on those ERRORS when I look at the stderr is VM Heartbeat file specified, but missing. VM Heartbeat file specified, but missing file system status. (errno = '2') Not sure if it is just on the server end having trouble connecting the opposite side of the planet or my DSL snail. But then ALL 7 of my hosts run GPU's for Einstein 24/7 and never have a problem and the same with my vLHC tasks and Atlas tasks. I will watch the two I have running since I tend to stay up until 3am and they must be getting close and see if I can get the other two hosts to get back to work without the errors after 10 minutes. ID: 2811 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,863 RAC: 78	Message 2812 - Posted: 17 Apr 2016, 9:50:56 UTC - in response to Message 2811. BUT what I have been getting on those ERRORS when I look at the stderr is VM Heartbeat file specified, but missing. VM Heartbeat file specified, but missing file system status. (errno = '2') Not sure if it is just on the server end having trouble connecting the opposite side of the planet or my DSL snail. The bolded kind of message is a local problem from your VM and host, not from a connection going outside your home. ID: 2812 · Rating: 0 · rate: / Reply Quote
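A note on the "(errno = '2')" in the quoted message: on common platforms errno 2 is ENOENT, "no such file or directory", which is a result of looking up a path on the local file system. That is consistent with the reply above that the problem is local rather than a network one. A minimal Python sketch of how such an error arises; the helper name and the file name "heartbeat" are hypothetical and are not taken from the BOINC wrapper's code.

```python
import errno
import os
import tempfile


def describe_stat_failure(path):
    """Return (errno, symbolic name, message) if stat() fails, else None."""
    try:
        os.stat(path)
    except OSError as exc:
        return exc.errno, errno.errorcode.get(exc.errno), os.strerror(exc.errno)
    return None


# Look for a file that does not exist in a fresh, empty temporary directory.
# "heartbeat" is a stand-in name, not the real heartbeat file path.
with tempfile.TemporaryDirectory() as tmp:
    result = describe_stat_failure(os.path.join(tmp, "heartbeat"))

print(result)
```

The first two elements of the result are the numeric code and its symbolic name (ENOENT); the message text comes from the operating system.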

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2854 - Posted: 19 Apr 2016, 14:04:56 UTC I guess the previous batch has been abandoned? There has been next to no change in the past few days and there are still 100+ pending. ID: 2854 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2855 - Posted: 19 Apr 2016, 14:11:12 UTC - in response to Message 2854. I guess the previous batch has been abandoned? There has been next to no change in the past few days and there are still 100+ pending. Like I said, they are welcome to report in, but my guess is that they have been lost in the communications problems at RAL. ID: 2855 · Rating: 0 · rate: / Reply Quote

Rasputin42 Volunteer tester Send message Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 2926 - Posted: 22 Apr 2016, 8:13:52 UTC "unknown" and "WNPostproc" state jobs are rising rapidly. ID: 2926 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 165	Message 2929 - Posted: 22 Apr 2016, 9:48:58 UTC - in response to Message 2926. "unknown" and "WNPostproc" state jobs are rising rapidly. I think there's a communications problem somewhere, but I haven't been able to pinpoint it yet. ID: 2929 · Rating: 0 · rate: / Reply Quote

Development for LHC@home