Message boards : CMS Application : Error rate going up
rbpeake

Joined: 15 Apr 15
Posts: 38
Credit: 227,251
RAC: 0
Message 3325 - Posted: 12 May 2016, 20:37:49 UTC - in response to Message 3324.  

Just curious if this app will require this much hand-holding in the future, or will it be much more resilient and reliable when it gets out of beta? Just seems it has been very sensitive to running off the rails for a long time.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3326 - Posted: 12 May 2016, 20:44:46 UTC - in response to Message 3324.  

Yes, there are a number of jobs like that. People seem to be having trouble establishing contact with the cvmfs servers so various files come up as "not found" with bad results. There doesn't seem to be anything I can do at the moment, except hope it's a transient problem with the proxy server that goes away.
Hassen is having even more problems with his WMAgent jobs, which is why there's a failure spike in the displays.
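As a rough illustration of the client-side check involved (this is not the real wrapper code; only the two repository names, which match the /cvmfs paths mentioned later in this thread, come from the project), a minimal Python sketch that probes whether the CVMFS repositories can be mounted and listed:

#!/usr/bin/env python3
# Minimal sketch, not the real wrapper: probe whether the CVMFS repositories
# the jobs rely on can be mounted and listed on this host.
import os
import sys

REPOS = ["/cvmfs/cms.cern.ch", "/cvmfs/grid.cern.ch"]

def repo_reachable(mount_point):
    # Listing the directory typically triggers the autofs mount; an OSError
    # usually means the client could not reach a server or the squid proxy.
    try:
        return os.path.isdir(mount_point) and len(os.listdir(mount_point)) > 0
    except OSError:
        return False

unreachable = [r for r in REPOS if not repo_reachable(r)]
if unreachable:
    print("CVMFS repositories unreachable:", ", ".join(unreachable))
    sys.exit(1)
print("All CVMFS repositories reachable.")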
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3327 - Posted: 12 May 2016, 20:48:31 UTC - in response to Message 3325.  

Just curious if this app will require this much hand-holding in the future, or will it be much more resilient and reliable when it gets out of beta? Just seems it has been very sensitive to running off the rails for a long time.

Well, that's why we have a beta. :-0! Unfortunately, it's a very complicated chain of processes, all of which have to work right. We hope we're eliminating the points of failure, but obviously there are still some out there to bite us. Some of the failures are out of our control, which makes it even more frustrating.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3328 - Posted: 12 May 2016, 20:55:35 UTC
Last modified: 12 May 2016, 20:57:06 UTC

hope it's a transient problem with the proxy server that goes away.


That is very bad: not being able to identify the problem, and just crossing your fingers for it to go away.

That does not inspire any confidence.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3329 - Posted: 12 May 2016, 22:31:45 UTC - in response to Message 3328.  

hope it's a transient problem with the proxy server that goes away.


That is very bad: not being able to identify the problem, and just crossing your fingers for it to go away.

That does not inspire any confidence.


Remember, I don't run the infrastructure. I have to rely on others, notably CERN and RAL IT. In the event, it may have been transient: we haven't had a logged failure in 80 minutes; the storm started a bit after 1900 GMT and finished just after 2100. That's for CRAB; WMAgent may still have problems. I've been monitoring the CMS mailing lists without seeing others report the same problem, but there isn't a list dedicated to cvmfs (I thought there was, since I'm sure I see occasional questions about configuration; maybe it's subsumed under another title, or it's actually a CERN list rather than a CMS one).
I'll keep an eye out for a little while longer before hitting the hay, but I have to be up early to video-monitor several hours of presentations from my "other" job at CERN tomorrow.
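For anyone who wants to spot a burst like that in their own logs, an illustrative sketch (the one-ISO-timestamp-per-line input format is an assumption, not the real monitoring feed) that buckets failure times by the hour:

# Illustrative sketch: count failures per hour to make a burst like the
# 1900-2100 GMT one obvious. Assumes a file with one ISO-8601 UTC
# timestamp per line, e.g. 2016-05-12T19:23:41 -- not the real feed.
from collections import Counter
from datetime import datetime

def failures_per_hour(path):
    counts = Counter()
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line:
                ts = datetime.fromisoformat(line)
                counts[ts.strftime("%Y-%m-%d %H:00")] += 1
    return counts

for hour, n in sorted(failures_per_hour("failures.log").items()):
    print(f"{hour}  {n:4d}  {'#' * min(n, 60)}")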
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3330 - Posted: 12 May 2016, 22:56:47 UTC

Remember, I don't run the infrastructure


I am aware of that, and I am not blaming you.

However, if one has to expect random bursts of errors, this raises questions about the reliability of the system altogether.

I think the system is too big already.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 268
Message 3331 - Posted: 13 May 2016, 0:16:21 UTC - in response to Message 3326.  

.... People seem to be having trouble establishing contact with the cvmfs servers so various files come up as "not found" with bad results. ...


Like this, maybe, in the wrapper stderr (F5), although it isn't clear whether the file should be on the local host or the cvmfs end:

Error messages may appear here.
grep: /var/lib/condor/execute/dir_4145/jobdata: No such file or directory

but it doesn't seem to affect the job, which started and is running OK.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3332 - Posted: 13 May 2016, 9:47:08 UTC

Maybe you should investigate why new tasks are ending after a few minutes without any reason given in the stderr.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=174184

At least there should be some indication in the log as to why.
Jobs are available, so that is not the reason.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3339 - Posted: 13 May 2016, 12:47:56 UTC - in response to Message 3331.  

.... People seem to be having trouble establishing contact with the cvmfs servers so various files come up as "not found" with bad results. ...


Like this, maybe, in the wrapper stderr (F5), although it isn't clear whether the file should be on the local host or the cvmfs end:

Error messages may appear here.
grep: /var/lib/condor/execute/dir_4145/jobdata: No such file or directory

but it doesn't seem to affect the job, which started and is running OK.

No, if it's cvmfs the path is /cvmfs/cms.cern.ch/... or possibly /cvmfs/grid.cern.ch/... for some services. I've seen that message before; as you noted, it's more informative than show-stopping.
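A tiny, purely illustrative sketch of that distinction (the cvmfs example path is made up; the other is the one quoted above):

# Illustration only: errors mentioning /cvmfs/... point at the software
# repository, while paths under the local Condor scratch directory (like the
# grep message quoted above) are host-local and usually harmless.
CVMFS_PREFIXES = ("/cvmfs/cms.cern.ch", "/cvmfs/grid.cern.ch")

def classify(path):
    if path.startswith(CVMFS_PREFIXES):
        return "cvmfs repository"
    if path.startswith("/var/lib/condor/execute/"):
        return "local Condor scratch"
    return "other"

print(classify("/var/lib/condor/execute/dir_4145/jobdata"))  # local Condor scratch
print(classify("/cvmfs/cms.cern.ch/some/illustrative/file"))  # cvmfs repository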
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3340 - Posted: 13 May 2016, 12:51:17 UTC - in response to Message 3327.  
Last modified: 13 May 2016, 14:09:02 UTC

Just curious if this app will require this much hand-holding in the future, or will it be much more resilient and reliable when it gets out of beta? Just seems it has been very sensitive to running off the rails for a long time.

Well, that's why we have a beta. :-0! Unfortunately, it's a very complicated chain of processes, all of which have to work right. We hope we're eliminating the points of failure, but obviously there are still some out there to bite us. Some of the failures are out of our control, which makes it even more frustrating.

We seem to have found the smoking gun for last night's problems -- some squid services were being updated. Here's the traffic graph for our proxy. Yet another graph to check when things seem to be going wrong...

...and now the Dashboard summary graphs are missing the last couple of hours... Jobs are still running, though.
[Edit] Aaand, it's back! [/Edit]
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3342 - Posted: 13 May 2016, 12:57:14 UTC - in response to Message 3332.  

Maybe you should investigate why new tasks are ending after a few minutes without any reason given in the stderr.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=174184

At least there should be some indication in the log as to why.
Jobs are available, so that is not the reason.

I think that's the same problem they see on the HLT farm when they try to use idle machines (e.g. between LHC fills) to do processing. Sometimes Condor just shuts down straight after starting. It's being looked into.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3343 - Posted: 13 May 2016, 13:02:11 UTC - in response to Message 3342.  

Thanks, Ivan.
Could you tell us the status of jobs 2411 and 2424?
They have been listed as "running" for nearly 24h, but they have finished.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3344 - Posted: 13 May 2016, 14:06:14 UTC - in response to Message 3343.  
Last modified: 13 May 2016, 14:13:29 UTC

Thanks, Ivan.
Could you tell us the status of jobs 2411 and 2424?
They have been listed as "running" for nearly 24h, but they have finished.

All indications are that they finished OK. Job log and post-proc both report exit status 0, and the output files are on the data-bridge. Looks like Dashboard missed the finish message for a number of jobs yesterday.

[Penny drops] Ah! That'd be why there's a jump in the "jobs running" graphs since yesterday afternoon. All these glitches are probably related. [/Pd]
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3346 - Posted: 13 May 2016, 15:12:41 UTC
Last modified: 13 May 2016, 15:13:19 UTC

Looks like these jobs are changing state to "unknown" after 24h.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3349 - Posted: 13 May 2016, 17:15:44 UTC - in response to Message 3346.  
Last modified: 13 May 2016, 17:17:16 UTC

Looks like these jobs are changing state to "unknown" after 24h.

Yes, I guess there's a timeout buried in Dashboard; we've seen this before, and such timed-out jobs spike up the cumulative wall-time for "failed and aborted" jobs.
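Purely to illustrate the suspected behaviour (Dashboard's actual logic isn't visible from here), a monitor that flips a job to "unknown" once its last status report is more than 24 hours old would look something like this:

# Illustration only, not Dashboard's real code: flag a job as "unknown"
# once its last status report is older than a 24-hour threshold.
from datetime import datetime, timedelta

STALE_AFTER = timedelta(hours=24)

def effective_state(reported_state, last_report, now):
    if reported_state == "running" and now - last_report > STALE_AFTER:
        return "unknown"
    return reported_state

now = datetime(2016, 5, 13, 15, 0)
print(effective_state("running", datetime(2016, 5, 12, 14, 0), now))  # unknown
print(effective_state("running", datetime(2016, 5, 13, 10, 0), now))  # running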
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3351 - Posted: 13 May 2016, 17:30:34 UTC - in response to Message 3349.  
Last modified: 13 May 2016, 17:31:46 UTC

Will they remain there?
If so, are they not counted as "finished" when the batch ends (even though they are most likely valid)?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3352 - Posted: 13 May 2016, 21:21:54 UTC - in response to Message 3351.  

Will they remain there?
If so, are they not counted as "finished" when the batch ends (even though they are most likely valid)?

As far as Dashboard is concerned, yes, but as far as CMS is concerned they should be considered valid result files. Our next step is to automatically get result files off the data bridge, probably with a validation step, and onto a CMS GRID site. I don't think that can be done with my CRAB jobs, as I don't (yet) subscribe them to the central database, but I have moved them manually in the past. Hassen's WMAgent batches are registered, I believe, and it was an action from our latest Dynamic Resources meeting on Wednesday to try to move one of his datasets with a PhEDEx[1] "subscription" to one of our Tier-2 analysis sites.

At this point, we're not actually looking for a central validation (as, e.g., S@H does by comparing results) because, by the nature of the jobs we are envisaging, this will be best done by the scientist requesting what will be non-mainstream simulations ("exotica", which I guess includes long-lived neutral particles, mini black holes, etc., rather than the "normal" searches for Supersymmetry, Dark Matter, and so on).

[1] https://indico.fnal.gov/getFile.py/access?contribId=29&sessionId=36&resId=0&materialId=slides&confId=3586 or gwgl if that's restricted
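As a purely hypothetical skeleton of that fetch-validate-stage step (every directory name and the checksum manifest are invented placeholders; the real transfer would go through CMS tooling such as PhEDEx rather than a plain copy):

# Hypothetical skeleton only: keep result files that are non-empty and match
# an expected checksum, then stage them for transfer. All names are placeholders.
import hashlib
import shutil
from pathlib import Path

def sha256(path):
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def validate_and_stage(downloaded, expected_checksums, staging):
    downloaded, staging = Path(downloaded), Path(staging)
    staging.mkdir(parents=True, exist_ok=True)
    staged = []
    for f in sorted(downloaded.glob("*.root")):
        if f.stat().st_size == 0:
            continue                       # empty output: fails validation
        expected = expected_checksums.get(f.name)
        if expected is not None and expected != sha256(f):
            continue                       # checksum mismatch: fails validation
        shutil.copy2(f, staging / f.name)  # stand-in for the real Grid transfer
        staged.append(f.name)
    return staged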
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3353 - Posted: 13 May 2016, 21:35:35 UTC
Last modified: 13 May 2016, 21:36:45 UTC

So, in very simple terms:
You make a lot of simulations. Then you compare them to real-world events.
The closer the match, the better the model.

Then, if you have a theory about a new mechanism or particle, you simulate it.
Then you compare this result to the real world and see if it shows something similar, or whether it is happening at all.

Is this, very, very simplified, what you are doing?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3358 - Posted: 14 May 2016, 9:36:52 UTC - in response to Message 3353.  

So, in very simple terms:
You make a lot of simulations. Then you compare them to real-world events.
The closer the match, the better the model.

Then, if you have a theory about a new mechanism or particle, you simulate it.
Then you compare this result to the real world and see if it shows something similar, or whether it is happening at all.

Is this, very, very simplified, what you are doing?

That's pretty much it, yes. But we also look for deviations from the model, which can tell us if there is something we've not included in it, like the hints of a new phenomenon at around 750 GeV that both main experiments saw in the diphoton* channel last year. Getting better experimental statistics on that is one of our main goals this year, while the theorists struggle to find out which of several hundred competing theories might explain it.

*diphoton -- literally two photons. If we see two correlated photons we can add up their energy and get the mass of the parent particle they may have decayed from; if we start seeing an excess of these at one particular mass, then we might have found a new particle.
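For the curious, the arithmetic behind that footnote is the standard invariant-mass formula: for (massless) photons, m^2 = (E1 + E2)^2 - |p1 + p2|^2. A small worked sketch with made-up numbers:

# Worked illustration of the footnote: invariant mass of a photon pair,
# m^2 = (E1+E2)^2 - |p1+p2|^2. The numbers below are purely illustrative.
import math

def diphoton_mass(e1, p1, e2, p2):
    """e: photon energy in GeV; p: (px, py, pz) momentum in GeV."""
    e = e1 + e2
    px, py, pz = (a + b for a, b in zip(p1, p2))
    m2 = e * e - (px * px + py * py + pz * pz)
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding

# Two back-to-back 375 GeV photons reconstruct to a 750 GeV parent:
print(diphoton_mass(375.0, (375.0, 0.0, 0.0), 375.0, (-375.0, 0.0, 0.0)))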
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,541
RAC: 270
Message 3527 - Posted: 1 Jun 2016, 14:49:53 UTC
Last modified: 1 Jun 2016, 14:53:30 UTC

Just to note here that the high error rate between 0200 and 0400 UTC this morning was due to problems accessing the site-local-info file. I've just had it confirmed that this was a global problem, due to changes in the Git repository at CERN used to store the SITECONF info, so we weren't the only ones affected.
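For illustration, a quick parse check on the site-local configuration of the sort that would catch this kind of problem early; the path below is assumed to be the conventional CVMFS location and is illustrative rather than a documented interface:

# Illustration only: check that the site-local configuration can be read and
# parsed. The path is an assumed/conventional location, not a documented API.
import sys
import xml.etree.ElementTree as ET

SITE_LOCAL_CONFIG = "/cvmfs/cms.cern.ch/SITECONF/local/JobConfig/site-local-config.xml"

try:
    tree = ET.parse(SITE_LOCAL_CONFIG)
except (OSError, ET.ParseError) as err:
    print("site-local-config problem:", err)
    sys.exit(1)
print("site-local-config OK, root element:", tree.getroot().tag)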