Message boards : News : Error Codes

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 3127 - Posted: 30 Apr 2016, 17:51:59 UTC
Last modified: 4 May 2016, 8:03:03 UTC

At the beginning of next week, work will start on providing consistent error codes and behaviour for all applications. Three error codes will be used:

    * EXIT_INIT_FAILURE (When an error is detected on contextualizing the VM or setting up the job environment)
    * EXIT_NO_JOBS (When the job queues are empty)
    * EXIT_JOB_FAILURE (When an error is detected that causes all jobs to fail, or when all jobs have failed)


Any of these errors should cause the BOINC client to back off. If you see an error that does not use one of these codes, please let us know.

EDIT: To get this into the upstream release, the codes have changed to:

    * EXIT_NO_SUB_TASKS
    * EXIT_TASK_FAILURE
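
As a rough illustration of the scheme described in this post, the three conditions could be mapped to exit codes along these lines. This is only a sketch under stated assumptions, not the project's actual code: the value 206 for EXIT_INIT_FAILURE is the one reported further down this thread, while the other two numeric values and the `choose_exit_code` helper are hypothetical placeholders.

```python
from enum import IntEnum
from typing import Optional


class ExitCode(IntEnum):
    EXIT_INIT_FAILURE = 206  # value reported later in this thread (0xce)
    EXIT_NO_SUB_TASKS = 207  # assumed value, for illustration only
    EXIT_TASK_FAILURE = 208  # assumed value, for illustration only


def choose_exit_code(init_ok: bool, jobs_received: int, jobs_failed: int) -> Optional[ExitCode]:
    """Return the exit code for a finished task, or None for a normal finish.

    Hypothetical helper: the real wrapper may detect these conditions differently.
    """
    if not init_ok:
        # VM contextualization or job environment setup failed.
        return ExitCode.EXIT_INIT_FAILURE
    if jobs_received == 0:
        # The job queues were empty.
        return ExitCode.EXIT_NO_SUB_TASKS
    if jobs_failed == jobs_received:
        # Every job that ran has failed.
        return ExitCode.EXIT_TASK_FAILURE
    return None
```

With these definitions, a task that never received a job would return `ExitCode.EXIT_NO_SUB_TASKS`, and a task with at least one successful job would return `None`.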

ID: 3127 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1010
Credit: 591,548
RAC: 2
Message 3168 - Posted: 2 May 2016, 16:44:23 UTC - in response to Message 3127.  

Not sure where those error codes should appear.
2 tasks stopped prematurely, maybe because of a lack of jobs.
None of the error codes were found.

05/02/16 18:13:27 Process exited, pid=2474, status=0
05/02/16 18:13:27 About to exec Post script: /var/lib/condor/execute/dir_2470/tarOutput.sh 2016-556008-224
05/02/16 18:13:27 Create_Process succeeded, pid=26865
05/02/16 18:13:28 Process exited, pid=26865, status=0
05/02/16 18:13:28 condor_write(): Socket closed when trying to write 583 bytes to <188.184.187.167:9618>, fd is 11
05/02/16 18:13:28 Buf::write(): condor_write() failed
05/02/16 18:13:28 condor_write(): Socket closed when trying to write 366 bytes to <188.184.187.167:9618>, fd is 11
05/02/16 18:13:28 Buf::write(): condor_write() failed
05/02/16 18:13:28 Failed to send job exit status to shadow
05/02/16 18:13:28 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
05/02/16 18:30:52 Got SIGQUIT. Performing fast shutdown.
05/02/16 18:30:52 ShutdownFast all jobs.
05/02/16 18:30:52 condor_write(): Socket closed when trying to write 366 bytes to <188.184.187.167:9618>, fd is 11
05/02/16 18:30:52 Buf::write(): condor_write() failed
05/02/16 18:30:52 Failed to send job exit status to shadow
05/02/16 18:30:52 JobExit() failed, waiting for job lease to expire or for a reconnect attempt
ID: 3168 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 3170 - Posted: 2 May 2016, 19:07:31 UTC - in response to Message 3168.  

Take a look at your last two tasks that both started around 11:30.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=168396
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=168324

First notice the new logging entries. :)

2016-05-02 11:36:31 (5272): Guest Log: [INFO] New Job Starting
2016-05-02 11:36:41 (5272): Guest Log: [INFO] MCPlots JobID: 29942769
2016-05-02 12:45:41 (5272): Guest Log: [INFO] Job Finished


It seems that around 17:45 the tasks were paused and resumed after a few minutes.

Task 168324 seemed to resume OK and then shut down normally.

2016-05-02 18:13:51 (15008): Guest Log: [INFO] Job Finished
2016-05-02 18:37:48 (15008): Guest Log: [INFO] Condor exited with 0
2016-05-02 18:37:48 (15008): Guest Log: [INFO] Shutting Down.


Task 168396, by contrast, did not show that the job had finished.

2016-05-02 18:35:41 (9356): Guest Log: [INFO] Condor exited with 0
2016-05-02 18:35:41 (9356): Guest Log: [INFO] Shutting Down.

11:30 - 18:30 is only 7 hours, so I would have thought they would run for longer.

It will be interesting to see how your future tasks go.
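
For anyone who wants to check their own stderr output for the pattern seen in task 168396 (a job started but never reported as finished), something like the following could be used. This is a minimal sketch that assumes the log lines follow exactly the "Guest Log: [INFO]" format shown in this post; the function name `count_unfinished_jobs` is made up for illustration.

```python
import re

# Matches lines such as:
# 2016-05-02 11:36:31 (5272): Guest Log: [INFO] New Job Starting
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \(\d+\): "
    r"Guest Log: \[INFO\] (?P<msg>.*)$"
)


def count_unfinished_jobs(log_text: str) -> int:
    """Count 'New Job Starting' entries with no later 'Job Finished' entry."""
    open_jobs = 0
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not an [INFO] guest log line
        msg = match.group("msg").strip()
        if msg == "New Job Starting":
            open_jobs += 1
        elif msg == "Job Finished" and open_jobs > 0:
            open_jobs -= 1
    return open_jobs
```

A result greater than zero means the log ended while a job was still marked as running.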
ID: 3170 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1010
Credit: 591,548
RAC: 2
Message 3173 - Posted: 2 May 2016, 20:01:17 UTC - in response to Message 3170.  
Last modified: 2 May 2016, 20:36:52 UTC

"First notice the new logging entries. :)"

Nice!

"It seems that around 17:45 the tasks were paused and resumed after a few minutes."

BOINC's version change was the reason: 02 May 17:48:25 Version change (7.6.29 -> 7.6.32)
Not only paused, but also saved to disk.

"11:30 - 18:30 is only 7 hours, so I would have thought they would run for longer."

During task 168324 no new job started. It was just idling too long:

2016-05-02 18:13:51 (15008): Guest Log: [INFO] Job Finished
2016-05-02 18:37:48 (15008): Guest Log: [INFO] Condor exited with 0


But for task 168396, the last MCPlots JobID and Job Finished entries are missing.
ID: 3173 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 965
Credit: 1,201,381
RAC: 0
Message 3176 - Posted: 2 May 2016, 20:15:22 UTC
Last modified: 2 May 2016, 20:23:39 UTC

2016-05-02 13:47:48 (1008): Guest Log: [INFO] New Job Starting
2016-05-02 13:47:59 (1008): Guest Log: [INFO] MCPlots JobID: 29924967
2016-05-02 14:49:48 (1008): Guest Log: [INFO] Job Finished



Nice feature!
Could you add a time stamp to it? Never mind.
ID: 3176 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 3179 - Posted: 2 May 2016, 20:25:41 UTC - in response to Message 3176.  

It was a request from Ben.
ID: 3179 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1010
Credit: 591,548
RAC: 2
Message 3184 - Posted: 3 May 2016, 6:58:50 UTC - in response to Message 3170.  

"It will be interesting to see how your future tasks go."

The next 2 results had elapsed times of 12.5 and 13.5 hours and finished normally.
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=168814
http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=169004

Why does it take 5 minutes or more between the last job finishing and Condor exiting and the VM shutting down?
What's the meaning of the number, e.g. Guest Log: 287436?

I had 2 tasks with the new error code: 206 (0xce) EXIT_INIT_FAILURE (server or network problems)
By the way, there is a typo in stderr.log ("Cloud" should be "Could"): Guest Log: [ERROR] Cloud not get an x509 credential
ID: 3184 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 965
Credit: 1,201,381
RAC: 0
Message 3187 - Posted: 3 May 2016, 7:55:34 UTC

Same questions:
"What's the meaning of the number, e.g. Guest Log: 287436?"

"Why does it take 5 minutes or more between the last job finishing and Condor exiting and the VM shutting down?"


What is the meaning of "Condor JobID: 1", if it is the same for all jobs?


2016-05-03 02:37:15 (3700): Guest Log: [INFO] Job Finished
2016-05-03 02:37:45 (3700): Guest Log: [INFO] New Job Starting
2016-05-03 02:37:45 (3700): Guest Log: [INFO] Condor JobID: 1
2016-05-03 02:37:45 (3700): Guest Log: 286286
2016-05-03 02:37:55 (3700): Guest Log: [INFO] MCPlots JobID: 29930843
2016-05-03 03:17:14 (3700): Guest Log: [INFO] Job Finished
2016-05-03 03:17:14 (3700): Guest Log: [INFO] New Job Starting
2016-05-03 03:17:14 (3700): Guest Log: [INFO] Condor JobID: 1
2016-05-03 03:17:14 (3700): Guest Log: 287333
2016-05-03 03:17:24 (3700): Guest Log: [INFO] MCPlots JobID: 29971685
2016-05-03 03:29:45 (3700): Guest Log: [INFO] Job Finished
2016-05-03 03:30:36 (3700): Guest Log: [INFO] New Job Starting
2016-05-03 03:30:36 (3700): Guest Log: [INFO] Condor JobID: 1
2016-05-03 03:30:36 (3700): Guest Log: 287343
2016-05-03 03:30:46 (3700): Guest Log: [INFO] MCPlots JobID: 2997
ID: 3187 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 3190 - Posted: 3 May 2016, 9:11:38 UTC - in response to Message 3184.  
Last modified: 3 May 2016, 9:12:23 UTC

"Why does it take 5 minutes or more between the last job finishing and Condor exiting and the VM shutting down?"

After 12 hours, it will not get new jobs. Automatic shutdown occurs if no job has been received within 5 minutes. This could be optimized, but it is good enough for now.
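
The rule described above can be written down roughly as follows. This is only a sketch of the two timing conditions, not the code running in the VM: the constant names, the function names, and the choice to measure idle time from the last job activity are all assumptions made for illustration.

```python
from datetime import datetime, timedelta

MAX_RUNTIME = timedelta(hours=12)  # after this, no new jobs are requested
IDLE_LIMIT = timedelta(minutes=5)  # shut down if no job arrives in this window


def accepts_new_jobs(vm_start: datetime, now: datetime) -> bool:
    """True while the VM is still inside its 12-hour window for new jobs."""
    return now - vm_start < MAX_RUNTIME


def should_shut_down(last_job_activity: datetime, now: datetime) -> bool:
    """True once no job has been received for the idle limit."""
    return now - last_job_activity >= IDLE_LIMIT
```

Under these assumptions, a VM that finishes its last job after the 12-hour mark would keep idling for about 5 minutes before shutting down, which is the delay asked about above.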

"What's the meaning of the number, e.g. Guest Log: 287436?"

It is the CondorJobID.

"I had 2 tasks with the new error code: 206 (0xce) EXIT_INIT_FAILURE (server or network problems)"

Perfect! Working as designed. I don't know why it temporarily failed though.

"btw Typo in stderr.log: Guest Log: [ERROR] Cloud not get an x509 credential"


Thanks.
ID: 3190 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 3191 - Posted: 3 May 2016, 9:14:08 UTC - in response to Message 3187.  
Last modified: 3 May 2016, 9:18:12 UTC


2016-05-03 02:37:45 (3700): Guest Log: [INFO] Condor JobID: 1
2016-05-03 02:37:45 (3700): Guest Log: 286286


These two entries should be on the same line. Need to investigate.

EDIT: Fix published; it should be available in about one hour.
ID: 3191 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1010
Credit: 591,548
RAC: 2
Message 3228 - Posted: 4 May 2016, 6:35:04 UTC - in response to Message 3127.  

* EXIT_NO_JOBS (When the job queues are empty)

After the initial start of a new task, the VM didn't get any jobs.
Shouldn't I have seen the above error code?

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=169237

I was able to catch the StartLog before the VM was killed, though maybe not all the way to EOF.

05/04/16 07:19:04 ******************************************************
05/04/16 07:19:04 ** condor_startd (CONDOR_STARTD) STARTING UP
05/04/16 07:19:04 ** /usr/sbin/condor_startd
05/04/16 07:19:04 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
05/04/16 07:19:04 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
05/04/16 07:19:04 ** $CondorVersion: 8.0.6 Feb 01 2014 BuildID: 225363 $
05/04/16 07:19:04 ** $CondorPlatform: x86_64_RedHat6 $
05/04/16 07:19:04 ** PID = 4268
05/04/16 07:19:04 ** Log last touched time unavailable (No such file or directory)
05/04/16 07:19:04 ******************************************************
05/04/16 07:19:04 Using config source: /etc/condor/condor_config
05/04/16 07:19:04 Using local config sources:
05/04/16 07:19:04 /etc/condor/config.d/10_security.config
05/04/16 07:19:04 /etc/condor/config.d/14_network.config
05/04/16 07:19:04 /etc/condor/config.d/20_workernode.config
05/04/16 07:19:04 /etc/condor/config.d/30_lease.config
05/04/16 07:19:04 /etc/condor/config.d/35_theory.config
05/04/16 07:19:04 /etc/condor/config.d/40_ccb.config
05/04/16 07:19:04 /etc/condor/condor_config.local
05/04/16 07:19:04 Daemon Log is logging: D_ALWAYS D_ERROR
05/04/16 07:19:04 DaemonCore: command socket at <10.0.2.15:46360?noUDP>
05/04/16 07:19:04 DaemonCore: private command socket at <10.0.2.15:46360>
05/04/16 07:19:26 CCBListener: heartbeat disabled because interval is configured to be 0
05/04/16 07:19:26 CCBListener: registered with CCB server alicondor01.cern.ch as ccbid 188.184.129.127:9618?addrs=188.184.129.127-9618&noUDP&sock=collector#37354
05/04/16 07:19:26 HibernationSupportedStates invalid '' in ad from hibernation plugin /usr/libexec/condor/condor_power_state
05/04/16 07:19:32 VM-gahp server reported an internal error
05/04/16 07:19:32 VM universe will be tested to check if it is available
05/04/16 07:19:32 History file rotation is enabled.
05/04/16 07:19:32 Maximum history file size is: 20971520 bytes
05/04/16 07:19:32 Number of rotated history files is: 2
slot type 0: Cpus: 1, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1, Memory: 1500, Swap: 100.00%, Disk: 100.00%
05/04/16 07:19:32 New machine resource allocated
05/04/16 07:19:32 CronJobList: Adding job 'mips'
05/04/16 07:19:32 CronJobList: Adding job 'kflops'
05/04/16 07:19:32 CronJob: Initializing job 'mips' (/usr/libexec/condor/condor_mips)
05/04/16 07:19:32 CronJob: Initializing job 'kflops' (/usr/libexec/condor/condor_kflops)
05/04/16 07:19:32 State change: IS_OWNER is false
05/04/16 07:19:32 Changing state: Owner -> Unclaimed
05/04/16 07:19:32 State change: RunBenchmarks is TRUE
05/04/16 07:19:32 Changing activity: Idle -> Benchmarking
05/04/16 07:19:32 BenchMgr:StartBenchmarks()
05/04/16 07:19:54 State change: benchmarks completed
05/04/16 07:19:54 Changing activity: Benchmarking -> Idle
ID: 3228 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 3231 - Posted: 4 May 2016, 8:01:38 UTC - in response to Message 3228.  

Yes, you should have seen that error code, although it is now called EXIT_NO_SUB_TASKS so that it can go back into the upstream release of BOINC.
ID: 3231 · Rating: 0
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 965
Credit: 1,201,381
RAC: 0
Message 3233 - Posted: 4 May 2016, 8:10:48 UTC - in response to Message 3231.  

I had a CMS task exit for no apparent reason.

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=169678


In addition, the stderr log does not show the job id (dashboard) any more.
ID: 3233 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1010
Credit: 591,548
RAC: 2
Message 3383 - Posted: 18 May 2016, 6:54:07 UTC - in response to Message 3127.  

EDIT: To get this into the upstream release the codes have changed to:

    * EXIT_NO_SUB_TASKS
    * EXIT_TASK_FAILURE


I didn't get a single job in this task

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=175593

and should have seen the exit code: EXIT_NO_SUB_TASKS
ID: 3383 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1093
Credit: 6,893,316
RAC: 0
Message 3384 - Posted: 18 May 2016, 7:46:28 UTC - in response to Message 3383.  

EDIT: To get this into the upstream release the codes have changed to:

    * EXIT_NO_SUB_TASKS
    * EXIT_TASK_FAILURE


I didn't get a single job in this task

http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=175593

and should have seen the exit code: EXIT_NO_SUB_TASKS

That's the mysterious "Condor shuts down immediately" bug. The error signalling mechanism may not have even started -- as a wild guess.
ID: 3384 · Rating: 0

©2020 CERN