Message boards : CMS Application : Stageout failures

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 154
Message 4581 - Posted: 22 Dec 2016, 15:29:57 UTC - in response to Message 4580.  

The CMS jobs graph shows a rising number of CRAB jobs, but the errors are staying the same (or even falling).
That is good news.

Has anything changed?

It's just recovery from when we lost Condor on Tuesday night, I think. The CRAB jobs are going just to LHC@Home now (-dev and Laurence's cluster are munching on the WMAgent jobs, which show up as "unknown" for some unknown reason). Many hosts would have quota-ed out due to continued failure, so they will be gradually getting their quota back, while those that fell back to Test4Theory wouldn't return to us for the 18-20 hours their tasks run. We're only now back up to the 750-800 running CRAB jobs that we had before the disturbance.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4582 - Posted: 23 Dec 2016, 10:51:11 UTC
Last modified: 23 Dec 2016, 10:55:02 UTC

Tasks have stopped (no CPU activity), and the CMS jobs graphs are not accessible either.
The following is in the stderr.log:

INFO:root:Beginning report processing for step logArch1
ERROR:root:Cannot find report for step logArch1 in space /var/lib/condor/execute/dir_10487/job/WMTaskSpace/logArch1
INFO:root:Beginning report processing for step cmsRun1
/var/lib/condor/execute/dir_10487/startup_environment.sh: line 2: BASHOPTS: readonly variable
/var/lib/condor/execute/dir_10487/startup_environment.sh: line 9: BASH_VERSINFO: readonly variable
/var/lib/condor/execute/dir_10487/startup_environment.sh: line 13: EUID: readonly variable
/var/lib/condor/execute/dir_10487/startup_environment.sh: line 25: PPID: readonly variable
/var/lib/condor/execute/dir_10487/startup_environment.sh: line 29: SHELLOPTS: readonly variable
/var/lib/condor/execute/dir_10487/startup_environment.sh: line 36: UID: readonly variable
***ERROR*** in propagateError, *newErr is not NULL impossible to overwrite ... old error wasHTTP 404 : File not found
***ERROR*** in propagateError, *newErr is not NULL impossible to overwrite ... old error wasHTTP 404 : File not found
***ERROR*** in propagateError, *newErr is not NULL impossible to overwrite ... old error wasHTTP 404 : File not found
***ERROR*** in propagateError, *newErr is not NULL impossible to overwrite ... old error wasHTTP 404 : File not found
gfal-copy error: 13 (Permission denied) - TRANSFER Authentication error, reached maximum number of attempts
/var/lib/condor/execute/dir_9939/startup_environment.sh: line 2: BASHOPTS: readonly variable
/var/lib/condor/execute/dir_9939/startup_environment.sh: line 9: BASH_VERSINFO: readonly variable
/var/lib/condor/execute/dir_9939/startup_environment.sh: line 13: EUID: readonly variable
/var/lib/condor/execute/dir_9939/startup_environment.sh: line 25: PPID: readonly variable
/var/lib/condor/execute/dir_9939/startup_environment.sh: line 29: SHELLOPTS: readonly variable
/var/lib/condor/execute/dir_9939/startup_environment.sh: line 36: UID: readonly variable
***ERROR*** in propagateError, *newErr is not NULL impossible to overwrite ... old error wasHTTP 404 : File not found
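
For anyone wading through a stderr.log like the one above, here is a minimal sketch of a triage script that counts the lines by type. The category names and regular expressions are hypothetical and were chosen only to match the lines quoted in this post; they are not part of any CMS or BOINC tool. It also assumes the "readonly variable" lines are shell noise rather than the cause of the stall, which is not confirmed anywhere in this thread.

```python
import re
from collections import Counter

# Hypothetical categories: the patterns only mirror the lines quoted above.
PATTERNS = [
    # bash complaining that a sourced script tried to assign a readonly variable
    ("bash_readonly", re.compile(r": readonly variable$")),
    # repeated gfal error lines that carry an HTTP 404
    ("gfal_404", re.compile(r"propagateError.*HTTP 404")),
    # final gfal-copy failure, with its numeric error code
    ("gfal_copy_error", re.compile(r"^gfal-copy error: \d+")),
    # job report missing for a step
    ("missing_report", re.compile(r"^ERROR:root:Cannot find report for step \w+")),
]


def classify(line):
    """Return the first matching category name, or 'other'."""
    for name, pattern in PATTERNS:
        if pattern.search(line):
            return name
    return "other"


def summarize(lines):
    """Count non-empty lines per category."""
    return Counter(classify(line.strip()) for line in lines if line.strip())


SAMPLE = """\
INFO:root:Beginning report processing for step logArch1
ERROR:root:Cannot find report for step logArch1 in space /var/lib/condor/execute/dir_10487/job/WMTaskSpace/logArch1
/var/lib/condor/execute/dir_10487/startup_environment.sh: line 2: BASHOPTS: readonly variable
***ERROR*** in propagateError, *newErr is not NULL impossible to overwrite ... old error wasHTTP 404 : File not found
gfal-copy error: 13 (Permission denied) - TRANSFER Authentication error, reached maximum number of attempts
"""

if __name__ == "__main__":
    for category, count in summarize(SAMPLE.splitlines()).most_common():
        print(category, count)
```

To run it on a real log, pass the open file to `summarize` instead of `SAMPLE.splitlines()`. The split between noise and stage-out errors is only a reading aid; the counts say nothing about which line actually caused the task to stop.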
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,874,101
RAC: 154
Message 4583 - Posted: 23 Dec 2016, 12:09:05 UTC - in response to Message 4582.  
Last modified: 23 Dec 2016, 12:09:42 UTC

Things look OK here at the moment, tho' there seems to have been a glitch at LHC@Home overnight, which appears to have led to a spike of "echo" failures once the problem cleared. I'll keep checking for a while, but I discovered when I came in this morning that the University is actually closed today, so I'll probably declare it POET'S Day[1] soon.

[1] Traditional Aussie approach to any Friday, not just the one before Christmas -- P*ss Off Early, Tomorrow's Saturday.