Message boards : CMS Application : Error rate going up

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2795 - Posted: 16 Apr 2016, 9:58:51 UTC - in response to Message 2785.  

All of the currently 5 fails are from the same host.

Familiar IP... PM sent.

Machine taken off-line.
ID: 2795 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2796 - Posted: 16 Apr 2016, 10:02:59 UTC - in response to Message 2790.  
Last modified: 16 Apr 2016, 10:10:00 UTC

Probably the best place to post this:

We seem to have lost contact with the Condor server which doles out CRAB3 jobs:
lcggwms02.gridpp.rl.ac.uk
I can't ping it, log in to it or get its statistics on the "CMS Jobs" page. I've e-mailed RAL but don't expect a response at this time of (Friday) night.

Ignore that; must have been a transient communications problem rather than a server problem. Contact re-established.

Reply from RAL:
There was a network "glitch" last night affecting everything in the RAL Tier-1. I don't know yet what caused it.

...and the official announcement:
Date: Sat, 16 Apr 2016 09:53:44 +0100
From: EGI BROADCAST
Subject: [ EGI BROADCAST ] Network problems at RAL Tier1

We have experienced a series of network outages overnight, affecting the
Tier 1 and other RAL based services. A member of the network team is
on-site. There is no time to fix yet.

link to this broadcast :
https://operations-portal.egi.eu/broadcast/archive/id/1357


...so that may explain some of our ongoing problems (both CRAB and WMAgent submissions are via servers at RAL).
ID: 2796 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2797 - Posted: 16 Apr 2016, 10:33:21 UTC
Last modified: 16 Apr 2016, 10:34:17 UTC

Thanks, Ivan.

https://operations-portal.egi.eu/broadcast/archive/id/1357


Does not work (permissions).

Any news on the large number of WNPostproc and unknown state jobs of the previous batch? It is not improving.
ID: 2797 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2798 - Posted: 16 Apr 2016, 10:56:47 UTC
Last modified: 16 Apr 2016, 11:30:24 UTC

CMS tasks on the t4t site are shutting down after 5 min.
They do 3 runs and quit.

EDIT: NO CMS TASK STARTS UP.
ID: 2798 · Rating: 0
m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 268
Message 2799 - Posted: 16 Apr 2016, 13:21:02 UTC - in response to Message 2797.  

Thanks, Ivan.

https://operations-portal.egi.eu/broadcast/archive/id/1357


Does not work (permissions).


Go to the home page and look under Latest News.
ID: 2799 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2800 - Posted: 16 Apr 2016, 13:34:26 UTC - in response to Message 2799.  

Thanks, m.

I always thought we were Tier 3.
ID: 2800 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2803 - Posted: 16 Apr 2016, 15:18:29 UTC - in response to Message 2797.  

Thanks, Ivan.

https://operations-portal.egi.eu/broadcast/archive/id/1357

Does not work (permissions).
I wasn't sure if it would or not; I included it in case it did.

Any news on the large number of WNPostproc and unknown state jobs of the previous batch? It is not improving.

I don't think it will, overmuch. I just renewed the proxy for another 8 days, as it was coming up to the seven-day lifetime, so the black sheep have that long to come home. :-)
ID: 2803 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2804 - Posted: 16 Apr 2016, 15:23:29 UTC - in response to Message 2800.  

Thanks, m.

I always thought we were Tier 3.

We are (T3_CH_Volunteer), but the Condor/WMAgent server VMs are hosted within the RAL Tier-1 site. To make it more confusing, there's a Tier-2 site at RAL too! (T0 is CERN, T1 is national, T2 is regional, T3 is normally institutional.)
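
As a rough illustration of the naming scheme, here is a minimal Python sketch that splits a site name such as T3_CH_Volunteer into its tier, country code, and label. The T<tier>_<country>_<label> pattern is an assumption inferred from the single name quoted above, and `parse_site` and `TIER_ROLES` are hypothetical names, not part of any CMS tool.

```python
import re
from typing import NamedTuple

# Tier roles as described in the post above.
TIER_ROLES = {
    0: "CERN",
    1: "national",
    2: "regional",
    3: "normally institutional",
}


class Site(NamedTuple):
    tier: int
    country: str
    label: str


def parse_site(name: str) -> Site:
    """Split a site name like 'T3_CH_Volunteer' into its parts.

    Assumes the pattern T<digit>_<two-letter country>_<label>.
    """
    match = re.fullmatch(r"T(\d)_([A-Z]{2})_(\w+)", name)
    if match is None:
        raise ValueError(f"unrecognised site name: {name!r}")
    return Site(int(match.group(1)), match.group(2), match.group(3))


site = parse_site("T3_CH_Volunteer")
print(site, "->", TIER_ROLES[site.tier])
```

The tier digit only describes the site the jobs run on; as noted above, the submission servers can still live inside a different tier's infrastructure.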
ID: 2804 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2805 - Posted: 16 Apr 2016, 17:33:07 UTC - in response to Message 2803.  
Last modified: 16 Apr 2016, 17:35:33 UTC

Any news on the large number of WNPostproc and unknown state jobs of the previous batch? It is not improving.

I don't think it will, overmuch. I just renewed the proxy for another 8 days, as it was coming up to the seven-day lifetime, so the black sheep have that long to come home. :-)

[Edit] We've definitely had communications issues over the life of the last batch; Dashboard says there were 9193 successes to date, but I only count 8843 apparently-good result files on the Data Bridge. [/Edit]
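
For scale, the gap between the two counts can be worked out directly. This is a small sketch using only the two figures quoted above; the function name is made up for illustration.

```python
from typing import Tuple


def result_gap(dashboard_successes: int, files_found: int) -> Tuple[int, float]:
    """Return (missing count, missing share in percent) between two tallies."""
    missing = dashboard_successes - files_found
    percent = round(100.0 * missing / dashboard_successes, 1)
    return missing, percent


missing, percent = result_gap(9193, 8843)
print(f"{missing} results unaccounted for ({percent}% of reported successes)")
# prints "350 results unaccounted for (3.8% of reported successes)"
```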
ID: 2805 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2806 - Posted: 16 Apr 2016, 19:08:41 UTC - in response to Message 2805.  

Any news on the large number of WNPostproc and unknown state jobs of the previous batch? It is not improving.

I don't think it will, overmuch. I just renewed the proxy for another 8 days, as it was coming up to the seven-day lifetime, so the black sheep have that long to come home. :-)

[Edit] We've definitely had communications issues over the life of the last batch; Dashboard says there were 9193 successes to date, but I only count 8843 apparently-good result files on the Data Bridge. [/Edit]

# But did they ever return?
# No they never returned,
# And their fate is still unlearnt.
# They may glide forever
# 'Round the Internet fibres.
# They're the jobs that never returned!


https://www.youtube.com/watch?v=Dh994JcEfkI
ID: 2806 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2807 - Posted: 16 Apr 2016, 19:20:47 UTC - in response to Message 2806.  

My guess is, they went where lost socks go.

http://uncyclopedia.wikia.com/wiki/Missing_socks
ID: 2807 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 751
Credit: 11,609,314
RAC: 1,490
Message 2808 - Posted: 16 Apr 2016, 20:15:43 UTC - in response to Message 2806.  


# But did they ever return?
# No they never returned,
# And their fate is still unlearnt.
# They may glide forever
# 'Round the Internet fibres.
# They're the jobs that never returned!


https://www.youtube.com/watch?v=Dh994JcEfkI


Kingston Trio

About the same age as NASA........and me........watching NASA on black and white TV
ID: 2808 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2809 - Posted: 16 Apr 2016, 22:08:31 UTC - in response to Message 2808.  
Last modified: 16 Apr 2016, 22:11:20 UTC

You must be the same age as me -- conscripted to fight in Vietnam, reprieved by a change in government.
Another famous Kingston Trio song. Coincidentally, that song earwormed me for a day or two recently, with no trigger whatsoever that I recall!
ID: 2809 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2810 - Posted: 16 Apr 2016, 22:15:13 UTC
Last modified: 16 Apr 2016, 22:21:31 UTC

BTW, you might have noticed that job rates have fallen in the past few hours. I've not found a reason for that, and I'm about to slope off for a good night's sleep. Investigation will continue about 1000Z tomorrow!
[Edit] Ah, I had to look one more time before bed... WMAgent jobs are rising, pre-empting CRAB3 jobs. (I'll go to sleep now, promise!) [/Edit]
ID: 2810 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 751
Credit: 11,609,314
RAC: 1,490
Message 2811 - Posted: 17 Apr 2016, 5:34:45 UTC - in response to Message 2809.  

You must be the same age as me -- conscripted to fight in Vietnam, reprieved by a change in government.
Another famous Kingston Trio song. Coincidentally, that song earwormed me for a day or two recently, with no trigger whatsoever that I recall!


THAT is a Kingston Trio classic......and I know if I play that song I will still have it in my head when I wake up tomorrow

I think that is one of those things that happen to us *oldtimers* Ivan

These days I can get them stuck in my head for days and some are ones I would rather not have in my head at all!

I even get those from old TV shows I watch these days, since I have a channel that shows the old classics from the 50s, 60s, and 70s.

The only way I can get rid of them is play another one from the past that I don't mind.....even after a day or two.

OK, now as far as our tasks here..... as usual, if I don't watch my hosts 24/7, those "Error while computing" tasks will run that streak over and over every 10 minutes.

So when I checked, I had a Valid task and then 14 *Error while computing* in a row, so I stopped them, did a reboot, got a new task running, and set them to NOT get new ones just in case that started again.

So far one host got another Valid, and once again it gave me an ERROR on the CMS task in 10 minutes. Since I was watching, I took a few minutes to do a Windows 10 update and reboot, then tried again, and this time got an LHCb ERROR.

Since it is set to not get new tasks, it won't start another streak (it just happened on that desktop upstairs and I am back in the living room on the laptop). I tried one on here just to see if it would run one, since I suspended my Atlas tasks...... no luck, so it is back off my host list here.

My other 2 hosts still have tasks running that I started yesterday so maybe they will finish Valid.

They all have looked ok when I watched them start on the VM Console

BUT what I have been getting on those ERRORS when I look at the stderr is

VM Heartbeat file specified, but missing.
VM Heartbeat file specified, but missing file system status. (errno = '2')


Not sure if it is just the server end having trouble connecting to the opposite side of the planet, or my DSL snail.

But then ALL 7 of my hosts run GPUs for Einstein 24/7 and never have a problem, and the same goes for my vLHC and Atlas tasks.

I will watch the two I have running since I tend to stay up until 3am and they must be getting close and see if I can get the other two hosts to get back to work without the errors after 10 minutes.
ID: 2811 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1180
Credit: 815,336
RAC: 238
Message 2812 - Posted: 17 Apr 2016, 9:50:56 UTC - in response to Message 2811.  

BUT what I have been getting on those ERRORS when I look at the stderr is

VM Heartbeat file specified, but missing.
VM Heartbeat file specified, but missing file system status. (errno = '2')


Not sure if it is just the server end having trouble connecting to the opposite side of the planet, or my DSL snail.

The bolded kind of message is a local problem from your VM and host, not from a connection going outside your home.
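
For reference, errno 2 is ENOENT ("No such file or directory") on common platforms, which is consistent with a missing local file rather than a network fault. Below is a minimal sketch, assuming the number in the message is the standard C errno from a failed file status check (the thread does not confirm how the wrapper obtains it); the file name is hypothetical.

```python
import errno
import os
import tempfile


def stat_errno(path: str) -> int:
    """Return 0 if the file can be stat'ed, else the errno of the failure."""
    try:
        os.stat(path)
    except OSError as exc:
        return exc.errno
    return 0


# Use an empty temporary directory so the file is guaranteed to be absent.
with tempfile.TemporaryDirectory() as shared_dir:
    missing = os.path.join(shared_dir, "heartbeat")  # hypothetical file name
    code = stat_errno(missing)
    print(code, errno.errorcode.get(code), os.strerror(code))
```

Nothing in this check touches the network, which is why an errno 2 points at the host or VM side.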
ID: 2812 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2854 - Posted: 19 Apr 2016, 14:04:56 UTC

I guess the previous batch has been abandoned?
There has been next to no change in the past few days and there are still 100+ pending.
ID: 2854 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2855 - Posted: 19 Apr 2016, 14:11:12 UTC - in response to Message 2854.  

I guess the previous batch has been abandoned?
There has been next to no change in the past few days and there are still 100+ pending.

Like I said, they are welcome to report in, but my guess is that they have been lost in the communications problems at RAL.
ID: 2855 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2926 - Posted: 22 Apr 2016, 8:13:52 UTC

"unknown" and "WNPostproc" state jobs are rising rapidly.
ID: 2926 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 2929 - Posted: 22 Apr 2016, 9:48:58 UTC - in response to Message 2926.  

"unknown" and "WNPostproc" state jobs are rising rapidly.

I think there's a communications problem somewhere, but I haven't been able to pinpoint it yet.
ID: 2929 · Rating: 0

©2024 CERN