Message boards : CMS Application : Error rate going up
Author | Message |
---|---|
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 1 |
All 5 of the current failures are from the same host. Machine taken offline. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 1 |
Reply from RAL: There was a network "glitch" last night affecting everything in the RAL Tier-1. I don't know yet what caused it. ...and the official announcement: Date: Sat, 16 Apr 2016 09:53:44 +0100 From: EGI BROADCAST Subject: [ EGI BROADCAST ] Network problems at RAL Tier1 We have experienced a series of network outages overnight, affecting the Tier 1 and other RAL based services. A member of the network team is on-site. There is no time to fix yet. link to this broadcast : https://operations-portal.egi.eu/broadcast/archive/id/1357 ...so that may explain some of our ongoing problems (both CRAB and WMAgent submissions are via servers at RAL). |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Ivan. https://operations-portal.egi.eu/broadcast/archive/id/1357 does not work (permissions). Any news on the large number of WNPostproc and unknown state jobs of the previous batch? It is not improving. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
CMS tasks on the t4t site are shutting down after 5 min. They do 3 runs and quit. EDIT: NO CMS TASK STARTS UP. |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
Thanks, Ivan. Go to the home page and look under Latest News. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, m. I always thought we were Tier 3. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 1 |
"Thanks, Ivan." I wasn't sure if the link would work or not; I included it in case it did. "Any news on the large number of WNPostproc and unknown state jobs of the previous batch? It is not improving." I don't think it will, overmuch. I just renewed the proxy for another 8 days, as it was coming up to the seven-day lifetime, so the black sheep have that long to come home. :-) |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 1 |
Thanks, m. We are (T3_CH_Volunteer), but the Condor/WMAgent server VMs are hosted within the RAL Tier-1 site. To make it more confusing, there's a Tier-2 site at RAL too! (T0 is CERN, T1 is national, T2 is regional, T3 is normally institutional.) |
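For anyone who wants the tier naming spelled out, here is a rough sketch in Python. It assumes, based only on the single example T3_CH_Volunteer above, that site names look like T<tier>_<two-letter country>_<name>; the function name `parse_site` and the lookup table are made up for illustration, and the descriptions are just the ones given in the post.

```python
import re

# Tier descriptions as given in the post above (not an official list).
TIER_ROLES = {
    0: "CERN",
    1: "national",
    2: "regional",
    3: "institutional (normally)",
}

def parse_site(name):
    """Split a site name such as 'T3_CH_Volunteer' into its parts.

    Assumes the pattern T<digit>_<two-letter country>_<site name>,
    inferred from the one example in the thread.
    """
    match = re.fullmatch(r"T(\d)_([A-Z]{2})_(\w+)", name)
    if match is None:
        raise ValueError(f"unrecognised site name: {name!r}")
    tier = int(match.group(1))
    return tier, match.group(2), match.group(3), TIER_ROLES.get(tier, "unknown")

print(parse_site("T3_CH_Volunteer"))
```

So the volunteer site is Tier 3 by name, even though the servers feeding it sit inside a Tier-1 site.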
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 1 |
Any news on the large number of WNPostproc and unknown state jobs of the previous batch? It is not improving. [Edit] We've definitely had communications issues over the life of the last batch; Dashboard says there were 9193 successes to date, but I only count 8843 apparently-good result files on the Data Bridge. [/Edit] |
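To put a number on that gap, here is the arithmetic as a quick Python sketch. The two counts are the ones quoted in the post; the variable names are only illustrative.

```python
# Counts quoted in the post above.
dashboard_successes = 9193   # successes reported by Dashboard
bridge_result_files = 8843   # apparently-good result files on the Data Bridge

# Results that Dashboard counts as successful but that have no file.
missing = dashboard_successes - bridge_result_files
fraction = missing / dashboard_successes

print(f"{missing} results unaccounted for ({fraction:.1%} of reported successes)")
```

That works out to 350 results, a little under 4% of what Dashboard reports as successful.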
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 1 |
Any news on the large number of WNPostproc and unknown state jobs of the previous batch? It is not improving. # But did they ever return? # No they never returned, # And their fate is still unlearnt. # They may glide forever # 'Round the Internet fibres. # They're the jobs that never returned! https://www.youtube.com/watch?v=Dh994JcEfkI |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
|
Send message Joined: 8 Apr 15 Posts: 783 Credit: 12,682,498 RAC: 9,185 |
Kingston Trio About the same age as NASA........and me........watching NASA on black and white TV |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 1 |
You must be the same age as me -- conscripted to fight in Vietnam, reprieved by a change in government. Another famous Kingston Trio song. Coincidentally, that song earwormed me for a day or two recently, with no trigger whatsoever that I recall! |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 1 |
BTW, you might have noticed that job rates have fallen in the past few hours. I've not found a reason for that, and I'm about to slope off for a good night's sleep. Investigation will continue about 1000Z tomorrow! [Edit] Ah, I had to look one more time before bed... WMAgent jobs are rising, pre-empting CRAB3 jobs. (I'll go to sleep now, promise!) [/Edit] |
Send message Joined: 8 Apr 15 Posts: 783 Credit: 12,682,498 RAC: 9,185 |
"You must be the same age as me -- conscripted to fight in Vietnam, reprieved by a change in government." THAT is a Kingston Trio classic... and I know if I play that song I will still have it in my head when I wake up tomorrow. I think that is one of those things that happen to us *oldtimers*, Ivan. These days I can get them stuck in my head for days, and some are ones I would rather not have in my head at all! I even get those from old TV shows I watch these days, since I have a channel that shows the old classics from the 50's, 60's, and 70's. The only way I can get rid of them is to play another one from the past that I don't mind... even after a day or two. OK, now as far as our tasks here: as usual, if I don't watch my hosts 24/7, those "Error while computing" results will run that streak over and over every 10 minutes. So when I checked, I had a Valid task and then 14 *Error while computing* in a row, so I stopped them, did a reboot, got a new task running, and set the hosts to NOT get new ones just in case that started again. So far one host got another Valid, and once again it gave me an ERROR on the CMS task in 10 minutes, so since I was watching I took a few minutes to do a Windows 10 update and reboot and tried again, and this time got an LHCb ERROR. Since it is set to not get new tasks it won't start another streak (it just happened on that desktop upstairs and I am back in the living room on the laptop). I tried one on here just to see if it would run, since I suspended my Atlas tasks... no luck, so it is back off my host list here. My other 2 hosts still have tasks running that I started yesterday, so maybe they will finish Valid. They all have looked OK when I watched them start on the VM Console, BUT what I have been getting on those ERRORS when I look at the stderr is: VM Heartbeat file specified, but missing. VM Heartbeat file specified, but missing file system status. (errno = '2') Not sure if it is just the server end having trouble connecting to the opposite side of the planet, or my DSL snail. But then ALL 7 of my hosts run GPUs for Einstein 24/7 and never have a problem, and the same with my vLHC and Atlas tasks. I will watch the two I have running, since I tend to stay up until 3am and they must be getting close, and see if I can get the other two hosts to get back to work without the errors after 10 minutes. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 877,474 RAC: 243 |
"BUT what I have been getting on those ERRORS when I look at the stderr is [...]" The bolded kind of message is a local problem on your VM and host, not one caused by a connection going outside your home. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I guess the previous batch has been abandoned? There has been next to no change in the past few days and there are still 100+ pending. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 1 |
"I guess the previous batch has been abandoned?" Like I said, they are welcome to report in, but my guess is that they have been lost in the communications problems at RAL. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
"unknown" and "WNPostproc" state jobs are rising rapidly. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 1 |
"unknown" and "WNPostproc" state jobs are rising rapidly. I think there's a communications problem somewhere, but I haven't been able to pinpoint it yet. |