Message boards : CMS Application : WMAgent failures.
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 223
Message 2667 - Posted: 12 Apr 2016, 0:44:14 UTC
Last modified: 12 Apr 2016, 0:47:44 UTC

Now that the experts are away...

Received some WMAgent jobs for the first time, and as far as I can see, they all fail.

From the end of stderr...

2016-04-11 03:56:25 (3516): Guest Log: WMAgent bootstrap : Mon Apr 11 02:26:09 UTC 2016 : WMAgent is finished. The job had an exit code of 0
2016-04-11 03:56:25 (3516): Guest Log: WMAgent bootstrap : Mon Apr 11 02:26:36 UTC 2016 : starting...
2016-04-11 03:56:25 (3516): Guest Log: .....
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 93774 Apr 11 03:38 logArchive.tar.gz
2016-04-11 03:56:25 (3516): Guest Log: WMTaskSpace/stageOut1:
2016-04-11 03:56:25 (3516): Guest Log: total 12
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 3075 Apr 11 03:38 Report.pkl
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 343 Apr 11 03:26 __init__.py
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 611 Apr 11 03:38 __init__.pyc
2016-04-11 03:56:25 (3516): Guest Log: WMAgent bootstrap : Mon Apr 11 02:39:16 UTC 2016 : WMAgent is finished. The job had an exit code of 0
2016-04-11 03:56:25 (3516): Guest Log: WMAgent bootstrap : Mon Apr 11 02:39:36 UTC 2016 : starting...
2016-04-11 03:56:25 (3516): Guest Log: .....
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 92959 Apr 11 03:51 logArchive.tar.gz
2016-04-11 03:56:25 (3516): Guest Log: WMTaskSpace/stageOut1:
2016-04-11 03:56:25 (3516): Guest Log: total 12
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 3074 Apr 11 03:51 Report.pkl
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 343 Apr 11 03:39 __init__.py
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 611 Apr 11 03:51 __init__.pyc
2016-04-11 03:56:25 (3516): Guest Log: WMAgent bootstrap : Mon Apr 11 02:51:58 UTC 2016 : WMAgent is finished. The job had an exit code of 0
2016-04-11 03:56:25 (3516): Guest Log: [INFO] 8 jobs failed out of the last 8 jobs. Shutting down!


The full versions from different hosts...

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=137276

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=137392

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=136978

These hosts had been stopped for 18 hours or so before booting, so they would still have had the previous (non-WMAgent) job loaded. That is far too long a gap for the job to resume, so perhaps the abandoned job isn't being completely cleaned up. There are no problems with non-WMAgent jobs, which are running OK as I write.

Any ideas, anyone?
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2784 - Posted: 15 Apr 2016, 21:43:19 UTC
Last modified: 15 Apr 2016, 21:45:21 UTC

I think that WMAgent jobs cause the task to shut down with jobs marked as failed, because the failure detection is designed to work only on CMS jobs.
Correct?

I checked some of the jobs individually and they all pass.
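
For anyone who wants to repeat that check on their own stderr, here is a rough sketch in Python. It relies only on the two line formats quoted earlier in this thread ("WMAgent is finished. The job had an exit code of N" and "[INFO] N jobs failed out of the last N jobs"); the function name `summarize` and the trimmed-down sample text are illustrative, not part of the project's code.

```python
import re

# Line formats copied from the stderr excerpts quoted in this thread.
FINISHED = re.compile(r"WMAgent is finished\. The job had an exit code of (\d+)")
SUMMARY = re.compile(r"\[INFO\] (\d+) jobs failed out of the last (\d+) jobs")


def summarize(stderr_text):
    """Return (exit_codes, reported_failed, reported_total) for one stderr dump."""
    exit_codes = [int(m.group(1)) for m in FINISHED.finditer(stderr_text)]
    summary = SUMMARY.search(stderr_text)
    if summary is None:
        # No summary line found: report only the exit codes.
        return exit_codes, None, None
    return exit_codes, int(summary.group(1)), int(summary.group(2))


# Three lines taken from the excerpt above; the full stderr has more jobs.
sample = """\
2016-04-11 03:56:25 (3516): Guest Log: WMAgent bootstrap : Mon Apr 11 02:39:16 UTC 2016 : WMAgent is finished. The job had an exit code of 0
2016-04-11 03:56:25 (3516): Guest Log: WMAgent bootstrap : Mon Apr 11 02:51:58 UTC 2016 : WMAgent is finished. The job had an exit code of 0
2016-04-11 03:56:25 (3516): Guest Log: [INFO] 8 jobs failed out of the last 8 jobs. Shutting down!
"""

codes, failed, total = summarize(sample)
nonzero = [c for c in codes if c != 0]
print(f"exit codes seen: {codes}, non-zero: {len(nonzero)}, reported failed: {failed}/{total}")
```

If every exit code is 0 while the summary line still reports all jobs as failed, that points at the failure counting rather than at the jobs themselves.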
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Jan 15
Posts: 1129
Credit: 7,876,718
RAC: 266
Message 2786 - Posted: 15 Apr 2016, 21:49:55 UTC - in response to Message 2784.  

"I think that WMAgent jobs cause the task to shut down with jobs marked as failed, because the failure detection is designed to work only on CMS jobs.
Correct?

I checked some of the jobs individually and they all pass."

Could be -- a matter of liaison between Laurence and Hassen, perhaps. (I guess you mean CRAB rather than CMS, and yes, some of the detection code I'm familiar with does assume CRAB job termination syntax.)
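
To illustrate the suspected mechanism, here is a minimal sketch, not the project's actual detection code. It assumes a detector that counts a job as failed unless it finds one specific termination line. The real CRAB termination syntax is not shown in this thread, so `HYPOTHETICAL_EXPECTED_DONE` and its `JOB_DONE_OK` marker are made-up stand-ins, and `count_failures` is an invented name.

```python
import re

# Made-up stand-in for "the termination line the detector expects".
# This is NOT the real CRAB syntax, which is not shown in this thread.
HYPOTHETICAL_EXPECTED_DONE = re.compile(r"JOB_DONE_OK exit=(\d+)")


def count_failures(job_logs, done_pattern):
    """Count a job as failed unless its log has a matching line with exit code 0."""
    failed = 0
    for log in job_logs:
        match = done_pattern.search(log)
        if match is None or int(match.group(1)) != 0:
            failed += 1
    return failed


# Termination line as it appears in the WMAgent stderr quoted in this thread.
wmagent_logs = ["WMAgent is finished. The job had an exit code of 0"] * 8

failed = count_failures(wmagent_logs, HYPOTHETICAL_EXPECTED_DONE)
print(f"[INFO] {failed} jobs failed out of the last {len(wmagent_logs)} jobs.")
```

Under that assumption, eight WMAgent jobs that all exit with code 0 would still be counted as eight failures, simply because none of their lines match the expected pattern. Passing a pattern that does match the WMAgent line, for example `re.compile(r"exit code of (\d+)")`, brings the count down to zero.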
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2787 - Posted: 15 Apr 2016, 21:53:42 UTC

"I guess you mean CRAB rather than CMS"

Yes, I meant the jobs that you submitted and that this project is primarily running.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1182
Credit: 815,528
RAC: 214
Message 2791 - Posted: 16 Apr 2016, 6:29:56 UTC
Last modified: 16 Apr 2016, 6:35:06 UTC

I have 2 VMs running, and both show every boinc-owned process twice: cmsRun, condor_master, condor_starter, condor_exec.exe, etc.

I can't find in the logs or on the dashboard what jobs those should be.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1182
Credit: 815,528
RAC: 214
Message 2802 - Posted: 16 Apr 2016, 15:03:39 UTC

There is now also some info from WMAgent jobs in the result.


2016-04-16 14:10:04 (1284): Guest Log: [INFO] CMS glidein Run 1 ended
2016-04-16 14:10:14 (1284): Guest Log: Log extracts for Run 1 jobs
2016-04-16 14:10:14 (1284): Guest Log: WMAgent bootstrap : Sat Apr 16 09:54:43 UTC 2016 : starting...
2016-04-16 14:10:14 (1284): Guest Log: .....
2016-04-16 14:10:14 (1284): Guest Log: GLITE_LOCATION=/cvmfs/grid.cern.ch/emi-wn-3.15.3-1_sl6v1/usr
2016-04-16 14:10:14 (1284): Guest Log: GLIDEIN_LOCAL_TMP_DIR=/tmp/glide_boinc_5kS8N0
2016-04-16 14:10:14 (1284): Guest Log: _CONDOR_JOB_IWD=/home/boinc/CMSRun/glide_v96lW1/execute/dir_8056
2016-04-16 14:10:14 (1284): Guest Log: SRM_PATH=/cvmfs/grid.cern.ch/emi-wn-3.15.3-1_sl6v1/usr/share/srm
2016-04-16 14:10:14 (1284): Guest Log: X509_USER_KEY=/home/boinc/CMSRun/glide_v96lW1/hostkey.pem
2016-04-16 14:10:14 (1284): Guest Log: CONDOR_PROCD_ADDRESS=/home/boinc/CMSRun/glide_v96lW1/log/procd_address
2016-04-16 14:10:14 (1284): Guest Log: _=/usr/bin/env
2016-04-16 14:10:14 (1284): Guest Log: WMAgent bootstrap : Sat Apr 16 09:54:47 UTC 2016 : WMAgent is now running the job...
2016-04-16 14:10:14 (1284): Guest Log: [INFO] 1 jobs failed out of the last 1 jobs. Shutting down!
2016-04-16 14:10:14 (1284): VM Completion File Detected.
2016-04-16 14:10:14 (1284): Powering off VM.
