Thread 'Error Codes'

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0
Message 3127 - Posted: 30 Apr 2016, 17:51:59 UTC Last modified: 4 May 2016, 8:03:03 UTC

At the beginning of next week work will start on providing consistent error codes and behaviour for all applications. Three error codes will be used:

* EXIT_INIT_FAILURE (When an error is detected on contextualizing the VM or setting up the job environment)
* EXIT_NO_JOBS (When the job queues are empty)
* EXIT_JOB_FAILURE (When an error is detected that caused all jobs to fail or all jobs have failed)

Any of these errors should cause the BOINC client to back-off. If you see any errors and one of these codes is not used, please let us know.

EDIT: To get this into the upstream release the codes have changed to:

* EXIT_NO_SUB_TASKS
* EXIT_TASK_FAILURE
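
A minimal sketch of how the three conditions described above could map to the exit code names. This is not the project's actual implementation: the function `choose_exit_code` and its parameters are hypothetical, no numeric values are assumed, and treating EXIT_TASK_FAILURE as the renamed EXIT_JOB_FAILURE is an inference from the EDIT.

```python
from enum import Enum
from typing import Optional


class ExitCode(Enum):
    # Names come from the post; numeric values are deliberately not assumed.
    EXIT_INIT_FAILURE = "EXIT_INIT_FAILURE"
    EXIT_NO_SUB_TASKS = "EXIT_NO_SUB_TASKS"  # announced first as EXIT_NO_JOBS
    EXIT_TASK_FAILURE = "EXIT_TASK_FAILURE"  # presumably the renamed EXIT_JOB_FAILURE


def choose_exit_code(init_ok: bool, jobs_received: int, jobs_failed: int) -> Optional[ExitCode]:
    """Hypothetical helper: pick an error code, or None for a normal finish."""
    if not init_ok:
        # VM contextualization or job environment setup failed.
        return ExitCode.EXIT_INIT_FAILURE
    if jobs_received == 0:
        # The job queues were empty.
        return ExitCode.EXIT_NO_SUB_TASKS
    if jobs_failed >= jobs_received:
        # Every job that ran has failed.
        return ExitCode.EXIT_TASK_FAILURE
    return None


if __name__ == "__main__":
    print(choose_exit_code(init_ok=True, jobs_received=0, jobs_failed=0))
```

Any non-None result here corresponds to a case where the BOINC client is expected to back off.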

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1280 Credit: 1,047,486 RAC: 56
Message 3168 - Posted: 2 May 2016, 16:44:23 UTC - in response to Message 3127.

Not sure where those error codes should appear. 2 tasks stopped prematurely, maybe because of lack of jobs. None of the error codes found.

    05/02/16 18:13:27 Process exited, pid=2474, status=0
    05/02/16 18:13:27 About to exec Post script: /var/lib/condor/execute/dir_2470/tarOutput.sh 2016-556008-224
    05/02/16 18:13:27 Create_Process succeeded, pid=26865
    05/02/16 18:13:28 Process exited, pid=26865, status=0
    05/02/16 18:13:28 condor_write(): Socket closed when trying to write 583 bytes to <188.184.187.167:9618>, fd is 11
    05/02/16 18:13:28 Buf::write(): condor_write() failed
    05/02/16 18:13:28 condor_write(): Socket closed when trying to write 366 bytes to <188.184.187.167:9618>, fd is 11
    05/02/16 18:13:28 Buf::write(): condor_write() failed
    05/02/16 18:13:28 Failed to send job exit status to shadow
    05/02/16 18:13:28 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
    05/02/16 18:30:52 Got SIGQUIT. Performing fast shutdown.
    05/02/16 18:30:52 ShutdownFast all jobs.
    05/02/16 18:30:52 condor_write(): Socket closed when trying to write 366 bytes to <188.184.187.167:9618>, fd is 11
    05/02/16 18:30:52 Buf::write(): condor_write() failed
    05/02/16 18:30:52 Failed to send job exit status to shadow
    05/02/16 18:30:52 JobExit() failed, waiting for job lease to expire or for a reconnect attempt

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0
Message 3170 - Posted: 2 May 2016, 19:07:31 UTC - in response to Message 3168.

Take a look at your last two tasks that both started around 11:30.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=168396
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=168324

First notice the new logging entries. :)

    2016-05-02 11:36:31 (5272): Guest Log: [INFO] New Job Starting
    2016-05-02 11:36:41 (5272): Guest Log: [INFO] MCPlots JobID: 29942769
    2016-05-02 12:45:41 (5272): Guest Log: [INFO] Job Finished

It seems that around 17:45 the tasks were paused and resumed after a few minutes. Task 168324 seemed to resume OK and then shut down normally.

    2016-05-02 18:13:51 (15008): Guest Log: [INFO] Job Finished
    2016-05-02 18:37:48 (15008): Guest Log: [INFO] Condor exited with 0
    2016-05-02 18:37:48 (15008): Guest Log: [INFO] Shutting Down.

Task 168396, however, did not show that the job had finished.

    2016-05-02 18:35:41 (9356): Guest Log: [INFO] Condor exited with 0
    2016-05-02 18:35:41 (9356): Guest Log: [INFO] Shutting Down.

11:30 - 18:30 is only 7 hours, so I would have thought they would run for longer. It will be interesting to see how your future tasks go.
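
The new "New Job Starting" / "Job Finished" entries make it possible to work out job run times from stderr output. The following is a rough sketch only: the regular expression is inferred from the three sample lines quoted above rather than from any documented log format, and `job_durations` is a hypothetical helper name.

```python
import re
from datetime import datetime, timedelta

# Pattern inferred from the sample lines in the post; the real format may vary.
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): Guest Log: \[INFO\] (.+)$"
)


def job_durations(lines):
    """Pair each 'New Job Starting' with the next 'Job Finished' entry."""
    durations = []
    started = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not an [INFO] guest log line
        when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        message = match.group(2)
        if message == "New Job Starting":
            started = when
        elif message == "Job Finished" and started is not None:
            durations.append(when - started)
            started = None
    return durations


sample = [
    "2016-05-02 11:36:31 (5272): Guest Log: [INFO] New Job Starting",
    "2016-05-02 11:36:41 (5272): Guest Log: [INFO] MCPlots JobID: 29942769",
    "2016-05-02 12:45:41 (5272): Guest Log: [INFO] Job Finished",
]

if __name__ == "__main__":
    for duration in job_durations(sample):
        print(duration)  # 1:09:10 for the sample above
```

A "Job Finished" entry without a preceding start is ignored, which also covers logs that begin mid-job.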

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1280 Credit: 1,047,486 RAC: 56
Message 3173 - Posted: 2 May 2016, 20:01:17 UTC - in response to Message 3170. Last modified: 2 May 2016, 20:36:52 UTC

> First notice the new logging entries. :)

Nice!

> It seems that around 17:45 the tasks were paused and resumed after a few minutes.

BOINC's version change was the reason:

    02 May 17:48:25 Version change (7.6.29 -> 7.6.32)

Not only paused, but also saved to disk.

> 11:30 - 18:30 is only 7 hours so I would have thought they would run for longer.

During task 168324 no new job started. It was just idling too long:

    2016-05-02 18:13:51 (15008): Guest Log: [INFO] Job Finished
    2016-05-02 18:37:48 (15008): Guest Log: [INFO] Condor exited with 0

But during task 168396 the last MCPlots JobID and Job Finished entries are missing.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0
Message 3176 - Posted: 2 May 2016, 20:15:22 UTC Last modified: 2 May 2016, 20:23:39 UTC

    2016-05-02 13:47:48 (1008): Guest Log: [INFO] New Job Starting
    2016-05-02 13:47:59 (1008): Guest Log: [INFO] MCPlots JobID: 29924967
    2016-05-02 14:49:48 (1008): Guest Log: [INFO] Job Finished

Nice feature! ~~Could you add a time stamp to it?~~ Never mind.

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0
Message 3179 - Posted: 2 May 2016, 20:25:41 UTC - in response to Message 3176.

It was a request from Ben.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1280 Credit: 1,047,486 RAC: 56
Message 3184 - Posted: 3 May 2016, 6:58:50 UTC - in response to Message 3170.

> It will be interesting to see how your future tasks go.

The next 2 results had an elapsed time of 12.5 and 13.5 hours and a normal finish.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=168814
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=169004

Why does it take 5 minutes or more between the last job finish and condor exiting and VM shutdown?

What's the meaning of the number, e.g.: Guest Log: 287436

I had 2 tasks with the new error code: 206 (0xce) EXIT_INIT_FAILURE (server or network problems)

btw Typo in stderr.log: Guest Log: [ERROR] Cloud not get an x509 credential

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0
Message 3187 - Posted: 3 May 2016, 7:55:34 UTC

Same questions:

What's the meaning of the number, e.g.: Guest Log: 287436

Why does it take 5 minutes or more between the last job finish and condor exiting and VM shutdown?

What is the meaning of "Condor JobID: 1", if it is the same for all jobs?

    2016-05-03 02:37:15 (3700): Guest Log: [INFO] Job Finished
    2016-05-03 02:37:45 (3700): Guest Log: [INFO] New Job Starting
    2016-05-03 02:37:45 (3700): Guest Log: [INFO] Condor JobID: 1
    2016-05-03 02:37:45 (3700): Guest Log: 286286
    2016-05-03 02:37:55 (3700): Guest Log: [INFO] MCPlots JobID: 29930843
    2016-05-03 03:17:14 (3700): Guest Log: [INFO] Job Finished
    2016-05-03 03:17:14 (3700): Guest Log: [INFO] New Job Starting
    2016-05-03 03:17:14 (3700): Guest Log: [INFO] Condor JobID: 1
    2016-05-03 03:17:14 (3700): Guest Log: 287333
    2016-05-03 03:17:24 (3700): Guest Log: [INFO] MCPlots JobID: 29971685
    2016-05-03 03:29:45 (3700): Guest Log: [INFO] Job Finished
    2016-05-03 03:30:36 (3700): Guest Log: [INFO] New Job Starting
    2016-05-03 03:30:36 (3700): Guest Log: [INFO] Condor JobID: 1
    2016-05-03 03:30:36 (3700): Guest Log: 287343
    2016-05-03 03:30:46 (3700): Guest Log: [INFO] MCPlots JobID: 2997

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0
Message 3190 - Posted: 3 May 2016, 9:11:38 UTC - in response to Message 3184. Last modified: 3 May 2016, 9:12:23 UTC

> Why does it take 5 minutes or more between the last job finish and condor exiting and VM shutdown?

After 12 hours, it will not get new jobs. Automatic shutdown occurs if no job has been received within 5 mins. This could be optimized but is good enough for now.

> What's the meaning of the number, e.g.: Guest Log: 287436

It is the CondorJobID.

> I had 2 tasks with the new error code: 206 (0xce) EXIT_INIT_FAILURE (server or network problems)

Perfect! Working as designed. I don't know why it temporarily failed though.

> btw Typo in stderr.log: Guest Log: [ERROR] Cloud not get an x509 credential

Thanks.
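
The two timing rules in this reply (no new jobs after 12 hours, shutdown once no job has arrived for 5 minutes) can be sketched as below. Only the two durations come from the post; the function names, and the assumption that the idle time is measured from the last job finishing, are illustrative and not the VM's actual logic.

```python
from datetime import timedelta

# Both values are taken from the reply above.
MAX_FETCH_AGE = timedelta(hours=12)  # after this, the VM gets no new jobs
IDLE_TIMEOUT = timedelta(minutes=5)  # shut down if no job arrives within this


def may_fetch_new_job(vm_uptime: timedelta) -> bool:
    """Hypothetical check: is the VM still young enough to accept work?"""
    return vm_uptime < MAX_FETCH_AGE


def should_shut_down(time_without_job: timedelta) -> bool:
    """Hypothetical check: has the VM been idle for the full timeout?"""
    return time_without_job >= IDLE_TIMEOUT


if __name__ == "__main__":
    uptime = timedelta(hours=12, minutes=30)
    idle = timedelta(minutes=5)
    print(may_fetch_new_job(uptime), should_shut_down(idle))
```

Under these assumptions, a VM past the 12-hour mark finishes its current job, receives nothing further, and shuts down roughly 5 minutes later, which matches the delay asked about.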

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0
Message 3191 - Posted: 3 May 2016, 9:14:08 UTC - in response to Message 3187. Last modified: 3 May 2016, 9:18:12 UTC

    2016-05-03 02:37:45 (3700): Guest Log: [INFO] Condor JobID: 1
    2016-05-03 02:37:45 (3700): Guest Log: 286286

This should be the same line. Need to investigate.

EDIT: Fix published, available in about one hour.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1280 Credit: 1,047,486 RAC: 56
Message 3228 - Posted: 4 May 2016, 6:35:04 UTC - in response to Message 3127.

> * EXIT_NO_JOBS (When the job queues are empty)

After an initial start of a new task, the VM didn't get jobs. Should I not have seen the above error code?

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=169237

I was able to catch the StartLog before the VM was killed. Maybe not to EOF.

    05/04/16 07:19:04 ****************************************************
    05/04/16 07:19:04 condor_startd (CONDOR_STARTD) STARTING UP
    05/04/16 07:19:04 /usr/sbin/condor_startd
    05/04/16 07:19:04 SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
    05/04/16 07:19:04 Configuration: subsystem:STARTD local:<NONE> class:DAEMON
    05/04/16 07:19:04 $CondorVersion: 8.0.6 Feb 01 2014 BuildID: 225363 $
    05/04/16 07:19:04 $CondorPlatform: x86_64_RedHat6 $
    05/04/16 07:19:04 PID = 4268
    05/04/16 07:19:04 Log last touched time unavailable (No such file or directory)
    05/04/16 07:19:04 ****************************************************
    05/04/16 07:19:04 Using config source: /etc/condor/condor_config
    05/04/16 07:19:04 Using local config sources:
    05/04/16 07:19:04 /etc/condor/config.d/10_security.config
    05/04/16 07:19:04 /etc/condor/config.d/14_network.config
    05/04/16 07:19:04 /etc/condor/config.d/20_workernode.config
    05/04/16 07:19:04 /etc/condor/config.d/30_lease.config
    05/04/16 07:19:04 /etc/condor/config.d/35_theory.config
    05/04/16 07:19:04 /etc/condor/config.d/40_ccb.config
    05/04/16 07:19:04 /etc/condor/condor_config.local
    05/04/16 07:19:04 Daemon Log is logging: D_ALWAYS D_ERROR
    05/04/16 07:19:04 DaemonCore: command socket at <10.0.2.15:46360?noUDP>
    05/04/16 07:19:04 DaemonCore: private command socket at <10.0.2.15:46360>
    05/04/16 07:19:26 CCBListener: heartbeat disabled because interval is configured to be 0
    05/04/16 07:19:26 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#37354
    05/04/16 07:19:26 HibernationSupportedStates invalid '' in ad from hibernation plugin /usr/libexec/condor/condor_power_state
    05/04/16 07:19:32 VM-gahp server reported an internal error
    05/04/16 07:19:32 VM universe will be tested to check if it is available
    05/04/16 07:19:32 History file rotation is enabled.
    05/04/16 07:19:32 Maximum history file size is: 20971520 bytes
    05/04/16 07:19:32 Number of rotated history files is: 2
    slot type 0: Cpus: 1, Memory: auto, Swap: auto, Disk: auto
    slot type 0: Cpus: 1, Memory: 1500, Swap: 100.00%, Disk: 100.00%
    05/04/16 07:19:32 New machine resource allocated
    05/04/16 07:19:32 CronJobList: Adding job 'mips'
    05/04/16 07:19:32 CronJobList: Adding job 'kflops'
    05/04/16 07:19:32 CronJob: Initializing job 'mips' (/usr/libexec/condor/condor_mips)
    05/04/16 07:19:32 CronJob: Initializing job 'kflops' (/usr/libexec/condor/condor_kflops)
    05/04/16 07:19:32 State change: IS_OWNER is false
    05/04/16 07:19:32 Changing state: Owner -> Unclaimed
    05/04/16 07:19:32 State change: RunBenchmarks is TRUE
    05/04/16 07:19:32 Changing activity: Idle -> Benchmarking
    05/04/16 07:19:32 BenchMgr:StartBenchmarks()
    05/04/16 07:19:54 State change: benchmarks completed
    05/04/16 07:19:54 Changing activity: Benchmarking -> Idle

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1161 Credit: 342,328 RAC: 0
Message 3231 - Posted: 4 May 2016, 8:01:38 UTC - in response to Message 3228.

Yes, you should have seen that error code, although to get this back into the upstream release of BOINC it is now EXIT_NO_SUB_TASKS.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0
Message 3233 - Posted: 4 May 2016, 8:10:48 UTC - in response to Message 3231.

I had a CMS task exiting for no reason.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=169678

In addition, the stderr log does not show the job id (dashboard) any more.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1280 Credit: 1,047,486 RAC: 56
Message 3383 - Posted: 18 May 2016, 6:54:07 UTC - in response to Message 3127.

> EDIT: To get this into the upstream release the codes have changed to:
> EXIT_NO_SUB_TASKS
> EXIT_TASK_FAILURE

I didn't get a single job in this task

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=175593

and should have seen the exit code: EXIT_NO_SUB_TASKS

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 25
Message 3384 - Posted: 18 May 2016, 7:46:28 UTC - in response to Message 3383.

> > EDIT: To get this into the upstream release the codes have changed to:
> > EXIT_NO_SUB_TASKS
> > EXIT_TASK_FAILURE
>
> I didn't get a single job in this task
> http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=175593
> and should have seen the exit code: EXIT_NO_SUB_TASKS

That's the mysterious "Condor shuts down immediately" bug. The error signalling mechanism may not have even started -- as a wild guess.

Development for LHC@home