WMAgent failures.
Author | Message |
---|---|
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 223 |
Now that the experts are away... I received some WMAgent jobs for the first time and, as far as I can see, they all fail. From the end of stderr:

2016-04-11 03:56:25 (3516): Guest Log: WMAgent bootstrap : Mon Apr 11 02:26:09 UTC 2016 : WMAgent is finished. The job had an exit code of 0
2016-04-11 03:56:25 (3516): Guest Log: WMAgent bootstrap : Mon Apr 11 02:26:36 UTC 2016 : starting...
2016-04-11 03:56:25 (3516): Guest Log: .....
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 93774 Apr 11 03:38 logArchive.tar.gz
2016-04-11 03:56:25 (3516): Guest Log: WMTaskSpace/stageOut1:
2016-04-11 03:56:25 (3516): Guest Log: total 12
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 3075 Apr 11 03:38 Report.pkl
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 343 Apr 11 03:26 __init__.py
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 611 Apr 11 03:38 __init__.pyc
2016-04-11 03:56:25 (3516): Guest Log: WMAgent bootstrap : Mon Apr 11 02:39:16 UTC 2016 : WMAgent is finished. The job had an exit code of 0
2016-04-11 03:56:25 (3516): Guest Log: WMAgent bootstrap : Mon Apr 11 02:39:36 UTC 2016 : starting...
2016-04-11 03:56:25 (3516): Guest Log: .....
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 92959 Apr 11 03:51 logArchive.tar.gz
2016-04-11 03:56:25 (3516): Guest Log: WMTaskSpace/stageOut1:
2016-04-11 03:56:25 (3516): Guest Log: total 12
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 3074 Apr 11 03:51 Report.pkl
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 343 Apr 11 03:39 __init__.py
2016-04-11 03:56:25 (3516): Guest Log: -rw-r--r-- 1 boinc boinc 611 Apr 11 03:51 __init__.pyc
2016-04-11 03:56:25 (3516): Guest Log: WMAgent bootstrap : Mon Apr 11 02:51:58 UTC 2016 : WMAgent is finished. The job had an exit code of 0
2016-04-11 03:56:25 (3516): Guest Log: [INFO] 8 jobs failed out of the last 8 jobs. Shutting down!

The full versions from different hosts:

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=137276
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=137392
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=136978

These hosts will have been stopped for 18 hours or so before booting, so they would have had the previous (non-WMAgent) job still loaded. That is far too long to resume, so perhaps this abandoned job isn't being completely cleaned up. There are no problems with non-WMAgent jobs, which are running OK as I write. Any ideas, anyone? |
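One way to check the mismatch in the stderr extract above is to compare the exit codes reported by the WMAgent bootstrap lines with the summary line at the end. The following is a minimal Python sketch, assuming the log lines look exactly like the ones quoted in this post; the function name `summarize` and both regular expressions are illustrative and not part of BOINC or WMAgent.

```python
import re

# Patterns derived only from the stderr extract quoted in the post above;
# other log formats are not covered.
EXIT_RE = re.compile(r"WMAgent is finished\. The job had an exit code of (-?\d+)")
SUMMARY_RE = re.compile(r"\[INFO\] (\d+) jobs failed out of the last (\d+) jobs")


def summarize(stderr_text):
    """Return (exit_codes, reported_failed, reported_total) for one stderr dump."""
    exit_codes = [int(m.group(1)) for m in EXIT_RE.finditer(stderr_text)]
    summary = SUMMARY_RE.search(stderr_text)
    if summary is None:
        return exit_codes, None, None
    return exit_codes, int(summary.group(1)), int(summary.group(2))


if __name__ == "__main__":
    sample = (
        "Guest Log: WMAgent bootstrap : Mon Apr 11 02:39:16 UTC 2016 : "
        "WMAgent is finished. The job had an exit code of 0\n"
        "Guest Log: WMAgent bootstrap : Mon Apr 11 02:51:58 UTC 2016 : "
        "WMAgent is finished. The job had an exit code of 0\n"
        "Guest Log: [INFO] 8 jobs failed out of the last 8 jobs. Shutting down!\n"
    )
    codes, failed, total = summarize(sample)
    nonzero = sum(1 for c in codes if c != 0)
    print(f"{len(codes)} exit codes seen, {nonzero} non-zero; "
          f"summary reports {failed} failed of {total}")
```

If every bootstrap exit code is 0 while the summary still reports all jobs as failed, the failure count is presumably coming from a different check than the bootstrap exit code.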
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I think WMAgent jobs cause the task to shut down with jobs marked as failed, because the failure detection is designed to work only on CMS jobs. Correct? I checked some of the jobs individually and they all pass. |
Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
"I think WMAgent jobs cause the task to shut down with jobs marked as failed, because the failure detection is designed to work only on CMS jobs."

Could be, perhaps a matter of liaison between Laurence and Hassen. (I guess you mean CRAB rather than CMS, and yes, some of the detection code I'm familiar with does assume CRAB job termination syntax.) |
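To illustrate the hypothesis above (this is not the project's actual detection code, which is not shown in this thread), here is a minimal Python sketch of a checker that recognises only one termination format. `CRAB_STYLE_OK` is a made-up placeholder and not real CRAB output; only the WMAgent line is taken from the stderr extract quoted earlier in the thread.

```python
import re

# PLACEHOLDER: stands in for whatever termination syntax the real detection
# code expects from CRAB jobs. It is NOT actual CRAB output.
CRAB_STYLE_OK = re.compile(r"PLACEHOLDER_CRAB_JOB_OK")

# Taken from the WMAgent stderr extract quoted earlier in this thread.
WMAGENT_OK = re.compile(r"WMAgent is finished\. The job had an exit code of 0\b")


def count_failures(job_logs, success_patterns):
    """Count jobs whose log matches none of the given success patterns."""
    return sum(
        1 for log in job_logs
        if not any(p.search(log) for p in success_patterns)
    )


wmagent_logs = ["WMAgent is finished. The job had an exit code of 0"] * 8

# A detector that only knows the placeholder CRAB-style marker flags every job.
print(count_failures(wmagent_logs, [CRAB_STYLE_OK]))              # 8
# Adding the WMAgent line as a second success pattern clears them.
print(count_failures(wmagent_logs, [CRAB_STYLE_OK, WMAGENT_OK]))  # 0
```

Under that assumption, a run of eight jobs that all exited with code 0 would still be reported as "8 jobs failed out of the last 8 jobs".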
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
"I guess you mean CRAB rather than CMS,"

Yes, I meant the jobs you submitted, which this project is primarily running. |
Joined: 13 Feb 15 Posts: 1182 Credit: 815,528 RAC: 214 |
I have 2 VMs running, and both show every boinc-owned process twice (cmsRun, condor_master, condor_starter, condor_exec.exe, etc.). I can't find in the logs or on the dashboard which jobs those should be. |
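A quick way to confirm this kind of duplication is to count the process names owned by the boinc user inside the VM. The sketch below assumes a Linux guest with the procps `ps` command and an account named `boinc` (as the file listings earlier in the thread suggest); the helper names `count_commands` and `duplicated` are illustrative.

```python
import subprocess
from collections import Counter


def count_commands(ps_output):
    """Count how many times each command name appears in `ps -o comm=` output."""
    names = [line.strip() for line in ps_output.splitlines() if line.strip()]
    return Counter(names)


def duplicated(counts):
    """Return the command names that appear more than once, sorted."""
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    # Assumes procps `ps` in the guest: `-u boinc` selects by user,
    # `-o comm=` prints only the command name, with no header line.
    out = subprocess.run(
        ["ps", "-u", "boinc", "-o", "comm="],
        capture_output=True, text=True, check=False,
    ).stdout
    for name in duplicated(count_commands(out)):
        print(name)
```

Some daemons may legitimately run more than one process, so a count of two is a hint that two jobs are active rather than proof.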
Joined: 13 Feb 15 Posts: 1182 Credit: 815,528 RAC: 214 |
There is also some info in the result now from WMAgent jobs:

2016-04-16 14:10:04 (1284): Guest Log: [INFO] CMS glidein Run 1 ended
2016-04-16 14:10:14 (1284): Guest Log: Log extracts for Run 1 jobs
2016-04-16 14:10:14 (1284): Guest Log: WMAgent bootstrap : Sat Apr 16 09:54:43 UTC 2016 : starting...
2016-04-16 14:10:14 (1284): Guest Log: .....
2016-04-16 14:10:14 (1284): Guest Log: GLITE_LOCATION=/cvmfs/grid.cern.ch/emi-wn-3.15.3-1_sl6v1/usr
2016-04-16 14:10:14 (1284): Guest Log: GLIDEIN_LOCAL_TMP_DIR=/tmp/glide_boinc_5kS8N0
2016-04-16 14:10:14 (1284): Guest Log: _CONDOR_JOB_IWD=/home/boinc/CMSRun/glide_v96lW1/execute/dir_8056
2016-04-16 14:10:14 (1284): Guest Log: SRM_PATH=/cvmfs/grid.cern.ch/emi-wn-3.15.3-1_sl6v1/usr/share/srm
2016-04-16 14:10:14 (1284): Guest Log: X509_USER_KEY=/home/boinc/CMSRun/glide_v96lW1/hostkey.pem
2016-04-16 14:10:14 (1284): Guest Log: CONDOR_PROCD_ADDRESS=/home/boinc/CMSRun/glide_v96lW1/log/procd_address
2016-04-16 14:10:14 (1284): Guest Log: _=/usr/bin/env
2016-04-16 14:10:14 (1284): Guest Log: WMAgent bootstrap : Sat Apr 16 09:54:47 UTC 2016 : WMAgent is now running the job...
2016-04-16 14:10:14 (1284): Guest Log: [INFO] 1 jobs failed out of the last 1 jobs. Shutting down!
2016-04-16 14:10:14 (1284): VM Completion File Detected.
2016-04-16 14:10:14 (1284): Powering off VM. |
©2024 CERN