Message boards : CMS Application : Error rate going up
rbpeake

Joined: 15 Apr 15
Posts: 38
Credit: 227,251
RAC: 0
Message 3325 - Posted: 12 May 2016, 20:37:49 UTC - in response to Message 3324.  

Just curious if this app will require this much hand-holding in the future, or will it be much more resilient and reliable when it gets out of beta? Just seems it has been very sensitive to running off the rails for a long time.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3326 - Posted: 12 May 2016, 20:44:46 UTC - in response to Message 3324.  

Yes, there are a number of jobs like that. People seem to be having trouble establishing contact with the cvmfs servers so various files come up as "not found" with bad results. There doesn't seem to be anything I can do at the moment, except hope it's a transient problem with the proxy server that goes away.
Hassen is having even more problems with his WMAgent jobs, which is why there's a failure spike in the displays.
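As a rough illustration of the client-side check involved (this is not the real wrapper code; only the two repository names, which match the /cvmfs paths mentioned later in this thread, come from the project), a minimal Python sketch that probes whether the CVMFS repositories can be mounted and listed:

#!/usr/bin/env python3
# Minimal sketch, not the real wrapper: probe whether the CVMFS repositories
# the jobs rely on can be mounted and listed on this host.
import os
import sys

REPOS = ["/cvmfs/cms.cern.ch", "/cvmfs/grid.cern.ch"]

def repo_reachable(mount_point):
    # Listing the directory typically triggers the autofs mount; an OSError
    # usually means the client could not reach a server or the squid proxy.
    try:
        return os.path.isdir(mount_point) and len(os.listdir(mount_point)) > 0
    except OSError:
        return False

unreachable = [r for r in REPOS if not repo_reachable(r)]
if unreachable:
    print("CVMFS repositories unreachable:", ", ".join(unreachable))
    sys.exit(1)
print("All CVMFS repositories reachable.")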
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3327 - Posted: 12 May 2016, 20:48:31 UTC - in response to Message 3325.  

Just curious if this app will require this much hand-holding in the future, or will it be much more resilient and reliable when it gets out of beta? Just seems it has been very sensitive to running off the rails for a long time.

Well, that's why we have a beta. :-0! Unfortunately, it's a very complicated chain of processes, all of which have to work right. We hope we're eliminating the points of failure, but obviously there are still some out there to bite us. Some of the failures are out of our control, which makes it even more frustrating.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3328 - Posted: 12 May 2016, 20:55:35 UTC
Last modified: 12 May 2016, 20:57:06 UTC

hope it's a transient problem with the proxy server that goes away.


That is very bad: not being able to identify the problem, and just crossing your fingers for it to go away.

That does not inspire any confidence.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3329 - Posted: 12 May 2016, 22:31:45 UTC - in response to Message 3328.  

hope it's a transient problem with the proxy server that goes away.


That is very bad: not being able to identify the problem, and just crossing your fingers for it to go away.

That does not inspire any confidence.


Remember, I don't run the infrastructure. I have to rely on others, notably CERN and RAL IT. In the event, it may have been transient: we haven't had a logged failure in 80 minutes; the storm started a bit after 1900 GMT and finished just after 2100. That's for CRAB; WMAgent may still have problems. I've been monitoring the CMS mailing lists without seeing others report the same problem, but there isn't a list dedicated to cvmfs (I thought there was, since I'm sure I see occasional questions about configuration; maybe it's subsumed under another title, or it's actually a CERN list rather than a CMS one).
I'll keep an eye out for a little while longer before hitting the hay, but I have to be up early to video-monitor several hours of presentations from my "other" job at CERN tomorrow.
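For anyone who wants to spot a burst like that in their own logs, an illustrative sketch (the one-ISO-timestamp-per-line input format is an assumption, not the real monitoring feed) that buckets failure times by the hour:

# Illustrative sketch: count failures per hour to make a burst like the
# 1900-2100 GMT one obvious. Assumes a file with one ISO-8601 UTC
# timestamp per line, e.g. 2016-05-12T19:23:41 -- not the real feed.
from collections import Counter
from datetime import datetime

def failures_per_hour(path):
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:
                ts = datetime.fromisoformat(line)
                counts[ts.strftime("%Y-%m-%d %H:00")] += 1
    return counts

for hour, n in sorted(failures_per_hour("failures.log").items()):
    print(f"{hour}  {n:4d}  {'#' * min(n, 60)}")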
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3330 - Posted: 12 May 2016, 22:56:47 UTC

Remember, I don't run the infrastructure


I am aware of that, and I am not blaming you.

However, if one has to expect random bursts of errors, this raises questions about the reliability of the system altogether.

I think the system is too big already.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 268
Message 3331 - Posted: 13 May 2016, 0:16:21 UTC - in response to Message 3326.  

.... People seem to be having trouble establishing contact with the cvmfs servers so various files come up as "not found" with bad results. ...


Like this, maybe, in the wrapper stderr (F5), although it isn't clear whether the file should be on the local host or the cvmfs end:

Error messages may appear here.
grep: /var/lib/condor/execute/dir_4145/jobdata: No such file or directory

but it doesn't seem to affect the job, which started and is running OK.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3332 - Posted: 13 May 2016, 9:47:08 UTC

Maybe you should investigate why new tasks are ending after a few minutes without any reason given in the stderr.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=174184

At least there should be some indication in the log as to why.
Jobs are available, so that is not the reason.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3339 - Posted: 13 May 2016, 12:47:56 UTC - in response to Message 3331.  

.... People seem to be having trouble establishing contact with the cvmfs servers so various files come up as "not found" with bad results. ...


Like this, maybe, in the wrapper stderr (F5), although it isn't clear whether the file should be on the local host or the cvmfs end:

Error messages may appear here.
grep: /var/lib/condor/execute/dir_4145/jobdata: No such file or directory

but it doesn't seem to affect the job, which started and is running OK.

No, if it's cvmfs the path is /cvmfs/cms.cern.ch/... or possibly /cvmfs/grid.cern.ch/... for some services. I've seen that message before; as you noted, it's more informative than show-stopping.
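A tiny, purely illustrative sketch of that distinction (the cvmfs example path is made up; the other is the one quoted above):

# Illustration only: errors mentioning /cvmfs/... point at the software
# repository, while paths under the local Condor scratch directory (like the
# grep message quoted above) are host-local and usually harmless.
CVMFS_PREFIXES = ("/cvmfs/cms.cern.ch", "/cvmfs/grid.cern.ch")

def classify(path):
    if path.startswith(CVMFS_PREFIXES):
        return "cvmfs repository"
    if path.startswith("/var/lib/condor/execute/"):
        return "local Condor scratch"
    return "other"

print(classify("/var/lib/condor/execute/dir_4145/jobdata"))  # local Condor scratch
print(classify("/cvmfs/cms.cern.ch/some/illustrative/file"))  # cvmfs repository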
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3340 - Posted: 13 May 2016, 12:51:17 UTC - in response to Message 3327.  
Last modified: 13 May 2016, 14:09:02 UTC

Just curious if this app will require this much hand-holding in the future, or will it be much more resilient and reliable when it gets out of beta? Just seems it has been very sensitive to running off the rails for a long time.

Well, that's why we have a beta. :-0! Unfortunately, it's a very complicated chain of processes, all of which have to work right. We hope we're eliminating the points of failure, but obviously there are still some out there to bite us. Some of the failures are out of our control, which makes it even more frustrating.

We seem to have found the smoking gun for last night's problems -- some squid services were being updated. Here's the traffic graph for our proxy. Yet another graph to check when things seem to be going wrong...

...and now the Dashboard summary graphs are missing the last couple of hours... Jobs are still running, though.
[Edit] Aaand, it's back! [/Edit]
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3342 - Posted: 13 May 2016, 12:57:14 UTC - in response to Message 3332.  

Maybe you should investigate why new tasks are ending after a few minutes without any reason given in the stderr.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=174184

At least there should be some indication in the log as to why.
Jobs are available, so that is not the reason.

I think that's the same problem they see on the HLT farm when they try to use idle machines (e.g. between LHC fills) to do processing. Sometimes Condor just shuts down straight after starting. It's being looked into.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3343 - Posted: 13 May 2016, 13:02:11 UTC - in response to Message 3342.  

Thanks, Ivan.
Could you tell us the status of jobs 2411 and 2424?
They have been listed as "running" for nearly 24h, but they have finished.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3344 - Posted: 13 May 2016, 14:06:14 UTC - in response to Message 3343.  
Last modified: 13 May 2016, 14:13:29 UTC

Thanks, Ivan.
Could you tell us the status of jobs 2411 and 2424?
They have been listed as "running" for nearly 24h, but they have finished.

All indications are that they finished OK. Job log and post-proc both report exit status 0, and the output files are on the data-bridge. Looks like Dashboard missed the finish message for a number of jobs yesterday.

[Penny drops] Ah! That'd be why there's a jump in the "jobs running" graphs since yesterday afternoon. All these glitches are probably related. [/Pd]
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3346 - Posted: 13 May 2016, 15:12:41 UTC
Last modified: 13 May 2016, 15:13:19 UTC

Looks like these jobs are changing state to "unknown" after 24h.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3349 - Posted: 13 May 2016, 17:15:44 UTC - in response to Message 3346.  
Last modified: 13 May 2016, 17:17:16 UTC

Looks like these jobs are changing state to "unknown" after 24h.

Yes, I guess there's a timeout buried in Dashboard; we've seen this before, and such timed-out jobs spike up the cumulative wall-time for "failed and aborted" jobs.
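Purely to illustrate the suspected behaviour (Dashboard's actual logic isn't visible from here), a monitor that flips a job to "unknown" once its last status report is more than 24 hours old would look something like this:

# Illustration only, not Dashboard's real code: flag a job as "unknown"
# once its last status report is older than a 24-hour threshold.
from datetime import datetime, timedelta

STALE_AFTER = timedelta(hours=24)

def effective_state(reported_state, last_report, now):
    if reported_state == "running" and now - last_report > STALE_AFTER:
        return "unknown"
    return reported_state

now = datetime(2016, 5, 13, 15, 0)
print(effective_state("running", datetime(2016, 5, 12, 14, 0), now))  # unknown
print(effective_state("running", datetime(2016, 5, 13, 10, 0), now))  # running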
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3351 - Posted: 13 May 2016, 17:30:34 UTC - in response to Message 3349.  
Last modified: 13 May 2016, 17:31:46 UTC

Will they remain there?
If so, are they not counted as "finished" when the batch ends (even though they are most likely valid)?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3352 - Posted: 13 May 2016, 21:21:54 UTC - in response to Message 3351.  

Will they remain there?
If so, are they not counted as "finished" when the batch ends (even though they are most likely valid)?

As far as Dashboard is concerned, yes, but as far as CMS is concerned they should be considered valid result files. Our next step is to automatically get result files off the data bridge, probably with a validation step, and onto a CMS GRID site. I don't think that can be done with my CRAB jobs, as I don't (yet) subscribe them to the central database, but I have moved them manually in the past. Hassen's WMAgent batches are registered, I believe, and it was an action from our latest Dynamic Resources meeting on Wednesday to try to move one of his datasets with a PhEDEx[1] "subscription" to one of our Tier-2 analysis sites.

At this point, we're not actually looking for a central validation (as, e.g., S@H does by comparing results) because, by the nature of the jobs we are envisaging, this will be best done by the scientist requesting what will be non-mainstream simulations ("exotica", which I guess includes long-lived neutral particles, mini black holes, etc., rather than the "normal" searches for Supersymmetry, Dark Matter, and so on).

[1] https://indico.fnal.gov/getFile.py/access?contribId=29&sessionId=36&resId=0&materialId=slides&confId=3586 or gwgl if that's restricted
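As a purely hypothetical skeleton of that fetch-validate-stage step (every directory name and the checksum manifest are invented placeholders; the real transfer would go through CMS tooling such as PhEDEx rather than a plain copy):

# Hypothetical skeleton only: keep result files that are non-empty and match
# an expected checksum, then stage them for transfer. All names are placeholders.
import hashlib
import shutil
from pathlib import Path

def sha256(path):
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def validate_and_stage(downloaded, expected_checksums, staging):
    downloaded, staging = Path(downloaded), Path(staging)
    staging.mkdir(parents=True, exist_ok=True)
    staged = []
    for f in sorted(downloaded.glob("*.root")):
        if f.stat().st_size == 0:
            continue                       # empty output: fails validation
        expected = expected_checksums.get(f.name)
        if expected is not None and expected != sha256(f):
            continue                       # checksum mismatch: fails validation
        shutil.copy2(f, staging / f.name)  # stand-in for the real Grid transfer
        staged.append(f.name)
    return staged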
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3353 - Posted: 13 May 2016, 21:35:35 UTC
Last modified: 13 May 2016, 21:36:45 UTC

So, in very simple terms:
You make a lot of simulations. Then you compare them to real-world events.
The closer the match, the better the model.

Then, if you have a theory about a new mechanism or particle, you simulate it.
Then you compare this result to the real world and see if it shows something similar, or whether it is happening at all.

Is this, very, very simplified, what you are doing?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3358 - Posted: 14 May 2016, 9:36:52 UTC - in response to Message 3353.  

So, in very simple terms:
You make a lot of simulations. Then you compare them to real-world events.
The closer the match, the better the model.

Then, if you have a theory about a new mechanism or particle, you simulate it.
Then you compare this result to the real world and see if it shows something similar, or whether it is happening at all.

Is this, very, very simplified, what you are doing?

That's pretty much it, yes. But we also look for deviations from the model, which can tell us if there is something we've not included in it, like the hints of a new phenomenon at around 750 GeV that both main experiments saw in the diphoton* channel last year. Getting better experimental statistics on that is one of our main goals this year, while the theorists struggle to find out which of several hundred competing theories might explain it.

*diphoton -- literally two photons. If we see two correlated photons we can add up their energy and get the mass of the parent particle they may have decayed from; if we start seeing an excess of these at one particular mass, then we might have found a new particle.
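For the curious, the arithmetic behind that footnote is the standard invariant-mass formula: for (massless) photons, m^2 = (E1 + E2)^2 - |p1 + p2|^2. A small worked sketch with made-up numbers:

# Worked illustration of the footnote: invariant mass of a photon pair,
# m^2 = (E1+E2)^2 - |p1+p2|^2. The numbers below are purely illustrative.
import math

def diphoton_mass(e1, p1, e2, p2):
    """e: photon energy in GeV; p: (px, py, pz) momentum in GeV."""
    e = e1 + e2
    px, py, pz = (a + b for a, b in zip(p1, p2))
    m2 = e * e - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding

# Two back-to-back 375 GeV photons reconstruct to a 750 GeV parent:
print(diphoton_mass(375.0, (375.0, 0.0, 0.0), 375.0, (-375.0, 0.0, 0.0)))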
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3527 - Posted: 1 Jun 2016, 14:49:53 UTC
Last modified: 1 Jun 2016, 14:53:30 UTC

Just to note here that the high error rate between 0200 and 0400 UTC this morning was due to problems accessing the site-local-info file. I've just had it confirmed that this was a global problem, due to changes in the Git repository at CERN used to store the SITECONF info, so we weren't the only ones affected.
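For illustration, a quick parse check on the site-local configuration of the sort that would catch this kind of problem early; the path below is assumed to be the conventional CVMFS location and is illustrative rather than a documented interface:

# Illustration only: check that the site-local configuration can be read and
# parsed. The path is an assumed/conventional location, not a documented API.
import sys
import xml.etree.ElementTree as ET

SITE_LOCAL_CONFIG = "/cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml"

try:
    tree = ET.parse(SITE_LOCAL_CONFIG)
except (OSError, ET.ParseError) as err:
    print("site-local-config problem:", err)
    sys.exit(1)
print("site-local-config OK, root element:", tree.getroot().tag)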