Message boards : CMS Application : Error rate going up
Author | Message |
---|---|
Send message Joined: 15 Apr 15 Posts: 38 Credit: 227,251 RAC: 0 |
Just curious if this app will require this much hand-holding in the future, or will it be much more resilient and reliable when it gets out of beta? Just seems it has been very sensitive to running off the rails for a long time. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7 |
Yes, there are a number of jobs like that. People seem to be having trouble establishing contact with the cvmfs servers so various files come up as "not found" with bad results. There doesn't seem to be anything I can do at the moment, except hope it's a transient problem with the proxy server that goes away. Hassen is having even more problems with his WMAgent jobs, which is why there's a failure spike in the displays. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7 |
Just curious if this app will require this much hand-holding in the future, or will it be much more resilient and reliable when it gets out of beta? Just seems it has been very sensitive to running off the rails for a long time. Well, that's why we have a beta. :-0! Unfortunately, it's a very complicated chain of processes, all of which have to work right. We hope we're eliminating the points of failure, but obviously there are still some out there to bite us. Some of the failures are out of our control, which makes it even more frustrating. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
hope it's a transient problem with the proxy server that goes away. That is very bad. Not being able to identify the problem and just crossing your fingers for it to go away does not inspire any confidence. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7 |
hope it's a transient problem with the proxy server that goes away. Remember, I don't run the infrastructure; I have to rely on others, notably CERN and RAL IT. In the event, it may have been transient: we haven't had a logged failure in 80 minutes; the storm started a bit after 1900 GMT and finished just after 2100. That's for CRAB; WMAgent may still have problems. I've been monitoring the CMS mailing lists without seeing others having the same problem, but there isn't one dedicated to cvmfs (I thought there was; I'm sure I see occasional questions about configuration. Maybe it's subsumed under another title, or it's actually a CERN list rather than a CMS one.) I'll keep an eye out for a little while longer before hitting the hay, but I have to be up early to video-monitor several hours of presentations from my "other" job at CERN tomorrow. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Remember, I don't run the infrastructure I am aware of that, and I am not blaming you. However, if one has to expect random bursts of errors, that raises questions about the reliability of the system altogether. I think the system is too big already. |
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 |
.... People seem to be having trouble establishing contact with the cvmfs servers so various files come up as "not found" with bad results. ... Like this, maybe, in the wrapper stderr (F5), although it isn't clear whether the file should be on the local host or the cvmfs end:
Error messages may appear here.
grep: /var/lib/condor/execute/dir_4145/jobdata: No such file or directory
but it doesn't seem to affect the job, which started and is running OK. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Maybe you should investigate why new tasks are ending after a few minutes, without any reason given in the stderr. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=174184 At least there should be some indication in the log as to why. Jobs are available, so that is not the reason. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7 |
.... People seem to be having trouble establishing contact with the cvmfs servers so various files come up as "not found" with bad results. ... No, if it's cvmfs the path is /cvmfs/cms.cern.ch/... or possibly /cvmfs/grid.cern.ch/... for some services. I've seen that message before; as you noted, it's more informative than show-stopping. |
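A minimal sketch, assuming the repositories are mounted at the two paths above, of how one could check from inside the VM that cvmfs actually answers. It is only a quick sanity probe, not an official CMS diagnostic:

    import os

    REPOS = ["/cvmfs/cms.cern.ch", "/cvmfs/grid.cern.ch"]

    def probe(repo):
        # Listing the top level forces cvmfs to fetch the catalogue through
        # the proxy chain if it is not already cached locally.
        try:
            return len(os.listdir(repo)) > 0
        except OSError as err:
            print(f"{repo}: not reachable ({err})")
            return False

    for repo in REPOS:
        print(f"{repo}: {'OK' if probe(repo) else 'FAILED'}")

If both repositories list correctly but individual files still come up "not found", the problem is more likely in the proxy or on the server side than in the local mount.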
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7 |
Just curious if this app will require this much hand-holding in the future, or will it be much more resilient and reliable when it gets out of beta? Just seems it has been very sensitive to running off the rails for a long time. We seem to have found the smoking gun for last night's problems -- some squid services were being updated. Here's the traffic graph for our proxy. Yet another graph to check when things seem to be going wrong... ...and now the Dashboard summary graphs are missing the last couple of hours... Jobs are still running, though. [Edit] Aaand, it's back! [/Edit] |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7 |
Maybe you should investigate why new tasks are ending after a few minutes, without any reason given in the stderr. I think that's the same problem they see on the HLT farm when they try to use idle machines (e.g. between LHC fills) to do processing. Sometimes Condor just shuts down straight after starting. It's being looked into. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Thanks, Ivan. Could you tell me the status of jobs 2411 and 2424? They have been listed as "running" for nearly 24h, but they have finished. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7 |
Thanks, Ivan. All indications are that they finished OK. Job log and post-proc both report exit status 0, and the output files are on the data-bridge. Looks like Dashboard missed the finish message for a number of jobs yesterday. [Penny drops] Ah! That'd be why there's a jump in the "jobs running" graphs since yesterday afternoon. All these glitches are probably related. [/Pd] |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Looks like these jobs are changing state to "unknown" after 24h. |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7 |
Looks like these jobs are changing state to "unknown" after 24h. Yes, I guess there's a timeout buried in Dashboard; we've seen this before, and such jobs spike up the cumulative wall-time shown for "failed and aborted" jobs. |
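Purely for illustration, and only a guess at how such a timeout might behave (none of this is actual Dashboard code; the 24-hour figure is simply taken from the behaviour seen above): a job still reported as "running" but silent for longer than the cutoff gets flagged as unknown.

    from datetime import datetime, timedelta, timezone

    STALE_AFTER = timedelta(hours=24)

    def effective_state(reported_state, last_report, now=None):
        # If a "running" job has not reported for longer than the cutoff,
        # treat it as unknown (its finish message was presumably lost).
        now = now or datetime.now(timezone.utc)
        if reported_state == "running" and now - last_report > STALE_AFTER:
            return "unknown"
        return reported_state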
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Will they remain there? If so, they are not counted as "finished" when the batch ends (even though they are most likely valid)? |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7 |
Will they remain there? As far as Dashboard is concerned, yes, but as far as CMS is concerned they should be considered valid result files. Our next step is to automatically get result files off the data bridge, probably with a validation step, and onto a CMS GRID site. I don't think that can be done with my CRAB jobs, as I don't (yet) subscribe them to the central database, but I have moved them manually in the past. Hassen's WMAgent batches are registered, I believe, and it was an action from our latest Dynamic Resources meeting on Wednesday to try to move one of his datasets to one of our Tier-2 analysis sites with a PhEDEx[1] "subscription". At this point we're not actually looking for a central validation (as, e.g., S@H does by comparing results), because given the nature of the jobs we are envisaging, that will be best done by the scientist requesting what will be non-mainstream simulations ("exotica", which I guess includes long-lived neutral particles, mini black holes, etc., rather than the "normal" searches for Supersymmetry, Dark Matter, and so on). [1] https://indico.fnal.gov/getFile.py/access?contribId=29&sessionId=36&resId=0&materialId=slides&confId=3586 or Google it if that's restricted |
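As a rough sketch of what that validation step could look like, under the assumption that each result file on the data bridge has a checksum recorded at upload (the URL, file name and checksum source here are hypothetical, not the actual data-bridge tooling):

    import hashlib
    import urllib.request

    def sha256sum(path):
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def validate_result(bridge_url, local_path, expected_sha256):
        # Fetch the result file from the data bridge, then compare checksums
        # before it would be copied on to a GRID site.
        urllib.request.urlretrieve(bridge_url, local_path)
        return sha256sum(local_path) == expected_sha256

Anything failing the comparison would be left on the bridge for inspection rather than registered at the Tier-2 site.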
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
So, in very simple terms: You make a lot of simulations. Then you compare them to real-world events. The closer the match, the better the model. Then, if you have a theory about a new mechanism or particle, you simulate it. Then you compare this result to the real world and see if it shows something similar, or if it is happening at all. Is this, very very simplified, what you are doing? |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7 |
So, in very simple terms: That's pretty much it, yes. But we also look for deviations from the model, which can tell us if there is something we've not included in it, like the hints of a new phenomenon at around 750 GeV that both main experiments saw in the diphoton* channel last year. Getting better experimental statistics on that is one of our main goals this year, while the theorists struggle to work out which of several hundred competing theories might explain it. *diphoton -- literally two photons. If we see two correlated photons we can combine their energies and directions to reconstruct the mass of the parent particle they may have decayed from; if we start seeing an excess of these at one particular mass, then we might have found a new particle. |
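To make the footnote concrete: for two (effectively massless) photons with energies E1 and E2 and opening angle theta between them, the parent's invariant mass is m = sqrt(2 * E1 * E2 * (1 - cos theta)). A tiny worked example, with made-up numbers chosen to land near 750 GeV:

    import math

    def diphoton_mass(e1, e2, theta):
        # Invariant mass (GeV) of two massless photons with energies e1, e2 (GeV)
        # separated by an opening angle theta (radians).
        return math.sqrt(2.0 * e1 * e2 * (1.0 - math.cos(theta)))

    print(diphoton_mass(600.0, 450.0, 1.613))  # about 750 GeV for this made-up pair

An excess of photon pairs piling up at one value of m, beyond what the simulated background predicts, is the signature being hunted.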
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 7 |
Just to note here that the high error rate between 0200 and 0400 UTC this morning was due to problems accessing the site-local-info file. I've just had it confirmed that this was a global problem, caused by changes in the GIT repository at CERN used to store the SITECONF info, so we weren't the only ones affected. |
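For anyone curious how to check that file from inside a job, a hedged sketch assuming the conventional CMS layout (SITECONF/local/JobConfig/site-local-config.xml under /cvmfs/cms.cern.ch; the exact path may differ in this setup):

    import os
    import xml.etree.ElementTree as ET

    SLC = "/cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml"

    def check_site_local_config(path=SLC):
        if not os.path.isfile(path):
            print("site-local-config not found:", path)
            return False
        try:
            ET.parse(path)  # a truncated or garbled file would fail to parse
            print("site-local-config parses OK:", path)
            return True
        except ET.ParseError as err:
            print("site-local-config unreadable:", err)
            return False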