Error Codes
Author | Message |
---|---|
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
At the beginning of next week work will start on providing consistent error codes and behaviour for all applications. Three error codes will be used:
|
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 15 |
Not sure where those error codes should appear. 2 tasks stopped prematurely, maybe because of lack of jobs. None of the error codes found.
05/02/16 18:13:27 Process exited, pid=2474, status=0
05/02/16 18:13:27 About to exec Post script: /var/lib/condor/execute/dir_2470/tarOutput.sh 2016-556008-224
05/02/16 18:13:27 Create_Process succeeded, pid=26865
05/02/16 18:13:28 Process exited, pid=26865, status=0
05/02/16 18:13:28 condor_write(): Socket closed when trying to write 583 bytes to <188.184.187.167:9618>, fd is 11
05/02/16 18:13:28 Buf::write(): condor_write() failed
05/02/16 18:13:28 condor_write(): Socket closed when trying to write 366 bytes to <188.184.187.167:9618>, fd is 11
05/02/16 18:13:28 Buf::write(): condor_write() failed
05/02/16 18:13:28 Failed to send job exit status to shadow
05/02/16 18:13:28 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
05/02/16 18:30:52 Got SIGQUIT. Performing fast shutdown.
05/02/16 18:30:52 ShutdownFast all jobs.
05/02/16 18:30:52 condor_write(): Socket closed when trying to write 366 bytes to <188.184.187.167:9618>, fd is 11
05/02/16 18:30:52 Buf::write(): condor_write() failed
05/02/16 18:30:52 Failed to send job exit status to shadow
05/02/16 18:30:52 JobExit() failed, waiting for job lease to expire or for a reconnect attempt |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Take a look at your last two tasks, which both started around 11:30.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=168396
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=168324
First notice the new logging entries. :)
2016-05-02 11:36:31 (5272): Guest Log: [INFO] New Job Starting
It seems that around 17:45 the tasks were paused and resumed after a few minutes. Task 168324 seemed to resume OK and then shut down normally.
2016-05-02 18:13:51 (15008): Guest Log: [INFO] Job Finished
Task 168396, however, did not show that the job had finished.
2016-05-02 18:35:41 (9356): Guest Log: [INFO] Condor exited with 0
2016-05-02 18:35:41 (9356): Guest Log: [INFO] Shutting Down.
11:30 - 18:30 is only 7 hours so I would have thought they would run for longer. It will be interesting to see how your future tasks go. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 15 |
"First notice the new logging entries. :)"
Nice!
"It seems that around 17:45 the tasks were paused and resumed after a few minutes."
BOINC's version change was the reason:
02 May 17:48:25 Version change (7.6.29 -> 7.6.32)
Not only paused, but also saved to disk.
"11:30 - 18:30 is only 7 hours so I would have thought they would run for longer."
During task 168324 no new job started. It was just idling too long:
2016-05-02 18:13:51 (15008): Guest Log: [INFO] Job Finished
2016-05-02 18:37:48 (15008): Guest Log: [INFO] Condor exited with 0
But in task 168396 the last MCPlots JobID and Job Finished entries are missing. |
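The idle period between the two Guest Log lines quoted for task 168324 can be worked out directly from their timestamps. A minimal sketch, assuming every Guest Log line starts with a 19-character `YYYY-MM-DD HH:MM:SS` timestamp; the helper names `parse_stamp` and `idle_gap` are hypothetical and not part of any BOINC tool:

```python
from datetime import datetime, timedelta


def parse_stamp(line: str) -> datetime:
    """Parse the leading 'YYYY-MM-DD HH:MM:SS' timestamp of a log line."""
    return datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")


def idle_gap(finished_line: str, exited_line: str) -> timedelta:
    """Time elapsed between a 'Job Finished' line and a 'Condor exited' line."""
    return parse_stamp(exited_line) - parse_stamp(finished_line)


# The two lines quoted in the post for task 168324.
finished = "2016-05-02 18:13:51 (15008): Guest Log: [INFO] Job Finished"
exited = "2016-05-02 18:37:48 (15008): Guest Log: [INFO] Condor exited with 0"

gap = idle_gap(finished, exited)
print(gap)  # 0:23:57
```

For these two lines the gap is just under 24 minutes, which is the idling the post describes.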
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
2016-05-02 13:47:48 (1008): Guest Log: [INFO] New Job Starting
Nice feature! |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
It was a request from Ben. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 15 |
"It will be interesting to see how your future tasks go."
The next 2 results had elapsed times of 12.5 and 13.5 hours and a normal finish.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=168814
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=169004
Why does it take 5 minutes or more between the last job finish and condor exiting and VM shutdown?
What's the meaning of the number, e.g.: Guest Log: 287436
I had 2 tasks with the new error code: 206 (0xce) EXIT_INIT_FAILURE (server or network problems)
btw, typo in stderr.log: Guest Log: [ERROR] Cloud not get an x509 credential |
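The error code above is shown in both decimal and hexadecimal, and 206 and 0xce are the same value. A small sketch that reproduces that notation; the function name `format_exit_code` is made up for illustration, and only the value 206 and the name EXIT_INIT_FAILURE come from the post:

```python
def format_exit_code(code: int, name: str) -> str:
    """Render an exit code as 'decimal (0xhex) NAME'."""
    return f"{code} (0x{code:x}) {name}"


print(format_exit_code(206, "EXIT_INIT_FAILURE"))  # 206 (0xce) EXIT_INIT_FAILURE
```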
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Same questions:
What's the meaning of the number, e.g.: Guest Log: 287436
Why does it take 5 minutes or more between the last job finish and condor exiting and VM shutdown?
What is the meaning of "Condor JobID: 1", if it is the same for all jobs?
2016-05-03 02:37:15 (3700): Guest Log: [INFO] Job Finished |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
"Why does it take 5 minutes or more between the last job finish and condor exiting and VM shutdown?"
After 12 hours, it will not get new jobs. Automatic shutdown occurs if no job has been received within 5 minutes. This could be optimized, but it is good enough for now.
It is the CondorJobID
Perfect! Working as designed. I don't know why it temporarily failed though.
Thanks. |
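The shutdown rule described in this reply can be sketched as two checks. This is an illustration only, not the project's actual code: the constant and function names are hypothetical, and the exact boundary handling (strict versus inclusive comparison) is an assumption. Only the 12-hour and 5-minute values come from the reply.

```python
from datetime import timedelta

# Values taken from the reply; the names are hypothetical.
MAX_JOB_INTAKE = timedelta(hours=12)  # after this uptime, no new jobs are fetched
IDLE_SHUTDOWN = timedelta(minutes=5)  # idle time that triggers automatic shutdown


def accepts_new_jobs(uptime: timedelta) -> bool:
    """True while the VM is still young enough to fetch another job."""
    return uptime < MAX_JOB_INTAKE


def should_shut_down(idle_time: timedelta) -> bool:
    """True once no job has been received for the idle limit."""
    return idle_time >= IDLE_SHUTDOWN


# A VM that has been up for 12.5 hours and idle for 6 minutes:
print(accepts_new_jobs(timedelta(hours=12, minutes=30)))  # False
print(should_shut_down(timedelta(minutes=6)))             # True
```

Under these assumptions, a task whose last job finishes after the 12-hour mark sits idle for about 5 minutes before the VM shuts down, which would account for the delay asked about.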
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
This should be the same line. Need to investigate.
EDIT: Fix published, available in about one hour. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 15 |
"* EXIT_NO_JOBS (When the job queues are empty)"
After an initial start of a new task, the VM didn't get jobs. Should I not have seen the above error code?
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=169237
I was able to catch the StartLog before the VM was killed. Maybe not to EOF.
05/04/16 07:19:04 ******************************************************
05/04/16 07:19:04 ** condor_startd (CONDOR_STARTD) STARTING UP
05/04/16 07:19:04 ** /usr/sbin/condor_startd
05/04/16 07:19:04 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
05/04/16 07:19:04 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
05/04/16 07:19:04 ** $CondorVersion: 8.0.6 Feb 01 2014 BuildID: 225363 $
05/04/16 07:19:04 ** $CondorPlatform: x86_64_RedHat6 $
05/04/16 07:19:04 ** PID = 4268
05/04/16 07:19:04 ** Log last touched time unavailable (No such file or directory)
05/04/16 07:19:04 ******************************************************
05/04/16 07:19:04 Using config source: /etc/condor/condor_config
05/04/16 07:19:04 Using local config sources:
05/04/16 07:19:04 /etc/condor/config.d/10_security.config
05/04/16 07:19:04 /etc/condor/config.d/14_network.config
05/04/16 07:19:04 /etc/condor/config.d/20_workernode.config
05/04/16 07:19:04 /etc/condor/config.d/30_lease.config
05/04/16 07:19:04 /etc/condor/config.d/35_theory.config
05/04/16 07:19:04 /etc/condor/config.d/40_ccb.config
05/04/16 07:19:04 /etc/condor/condor_config.local
05/04/16 07:19:04 Daemon Log is logging: D_ALWAYS D_ERROR
05/04/16 07:19:04 DaemonCore: command socket at <10.0.2.15:46360?noUDP>
05/04/16 07:19:04 DaemonCore: private command socket at <10.0.2.15:46360>
05/04/16 07:19:26 CCBListener: heartbeat disabled because interval is configured to be 0
05/04/16 07:19:26 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#37354
05/04/16 07:19:26 HibernationSupportedStates invalid '' in ad from hibernation plugin /usr/libexec/condor/condor_power_state
05/04/16 07:19:32 VM-gahp server reported an internal error
05/04/16 07:19:32 VM universe will be tested to check if it is available
05/04/16 07:19:32 History file rotation is enabled.
05/04/16 07:19:32 Maximum history file size is: 20971520 bytes
05/04/16 07:19:32 Number of rotated history files is: 2
slot type 0: Cpus: 1, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1, Memory: 1500, Swap: 100.00%, Disk: 100.00%
05/04/16 07:19:32 New machine resource allocated
05/04/16 07:19:32 CronJobList: Adding job 'mips'
05/04/16 07:19:32 CronJobList: Adding job 'kflops'
05/04/16 07:19:32 CronJob: Initializing job 'mips' (/usr/libexec/condor/condor_mips)
05/04/16 07:19:32 CronJob: Initializing job 'kflops' (/usr/libexec/condor/condor_kflops)
05/04/16 07:19:32 State change: IS_OWNER is false
05/04/16 07:19:32 Changing state: Owner -> Unclaimed
05/04/16 07:19:32 State change: RunBenchmarks is TRUE
05/04/16 07:19:32 Changing activity: Idle -> Benchmarking
05/04/16 07:19:32 BenchMgr:StartBenchmarks()
05/04/16 07:19:54 State change: benchmarks completed
05/04/16 07:19:54 Changing activity: Benchmarking -> Idle |
Send message Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
Yes, you should have seen that error code, although it has now been renamed to EXIT_NO_SUB_TASKS so that it can go back into the upstream release of BOINC. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I had a CMS task exiting for no reason. http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=169678 In addition, the stderr log does not show the job id (dashboard) any more. |
Send message Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 15 |
"EDIT: To get this into the upstream release the codes have changed to:"
I didn't get a single job in this task
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=175593
and should have seen the exit code: EXIT_NO_SUB_TASKS |
Send message Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
"EDIT: To get this into the upstream release the codes have changed to:"
That's the mysterious "Condor shuts down immediately" bug. The error signalling mechanism may not have even started -- as a wild guess. |
©2024 CERN