CMS Application: Dip?
Author | Message |
---|---|
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 3 |
As I mentioned before: I see you have 4 "unclaimed" slots on the Condor server[*] and have just had a couple of task failures. Are you trying to run a 4-core VM at the moment? If you don't have an app_config.xml, does BOINC download a new one for you? [*] Name OpSys Arch State Activity LoadAv Mem ActvtyTime slot1@277-1518-781 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:07:36 slot2@277-1518-781 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:08:06 slot3@277-1518-781 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:08:07 slot4@277-1518-781 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:07:38 |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 3 |
Hmm, now you have 8 slots, two claimed & busy, but no apparent task in progress: slot4@277-1518-781 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:07:38 slot4@277-1518-167 LINUX X86_64 Unclaimed Idle 0.120 2625 0+00:01:38 slot3@277-1518-781 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:08:07 slot3@277-1518-167 LINUX X86_64 Claimed Busy 0.120 2625 0+00:01:38 slot2@277-1518-781 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:08:06 slot2@277-1518-167 LINUX X86_64 Unclaimed Idle 1.000 2625 0+00:01:36 slot1@277-1518-781 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:07:36 slot1@277-1518-167 LINUX X86_64 Claimed Busy 0.120 2625 0+00:01:36 |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 3 |
Make that 12... slot4@277-1518-781 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:07:38 slot4@277-1518-213 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:01:37 slot4@277-1518-167 LINUX X86_64 Unclaimed Idle 0.120 2625 0+00:01:38 slot3@277-1518-781 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:08:07 slot3@277-1518-213 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:01:36 slot3@277-1518-167 LINUX X86_64 Claimed Busy 0.120 2625 0+00:01:38 slot2@277-1518-781 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:08:06 slot2@277-1518-213 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:01:35 slot2@277-1518-167 LINUX X86_64 Unclaimed Idle 1.000 2625 0+00:01:36 slot1@277-1518-781 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:07:36 slot1@277-1518-213 LINUX X86_64 Unclaimed Idle 0.310 1875 0+00:01:07 slot1@277-1518-167 LINUX X86_64 Claimed Busy 0.120 2625 0+00:01:36 |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I am trying to figure out how much memory I need for a 4-core task. So far, it has to be more than 5120MB and less than 7168MB. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 3 |
"I am trying to figure out how much memory I need for a 4-core task." Some of the Condor slots have now gone away; only the slot1@277-1518-16709 ones are left, two of them busy. I traced one to JobNo 2143 so I'm watching its log file. Hmm, gone now, and no change to the log file. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I let one task with 4 cores and 4608MB of memory error out. I usually shut them down before that happens, so as not to exhaust the quota. You are also just running single-core tasks; on purpose? 5248MB for a 4-core task is working! |
Joined: 18 Aug 15 Posts: 14 Credit: 125,335 RAC: 0 |
Ivan asked: "Actually... Can you please post your app_config.xml so that the more-experienced volunteers can have a chance to critique it? Thanks." I was not using an app_config.xml. Ivan also asked in a recent thread: "If you don't have an app_config.xml, does BOINC download a new one for you?" BOINC did not download an app_config.xml for me. I just tried another CMS task and got the same result. Some relevant messages below: <message> The filename or extension is too long. (0xce) - exit code 206 (0xce) </message> 2016-12-01 07:43:34 (10148): Setting Memory Size for VM. (2048MB) 2016-12-01 07:43:34 (10148): Setting CPU Count for VM. (3) 2016-12-01 07:45:42 (10148): Guest Log: [INFO] CMS application starting. Check log files. 2016-12-01 07:45:42 (10148): Guest Log: [DEBUG] HTCondor ping 2016-12-01 07:45:52 (10148): Guest Log: [DEBUG] 0 2016-12-01 07:56:53 (10148): Guest Log: [ERROR] Condor exited after 673s without running a job. 2016-12-01 07:56:53 (10148): Guest Log: [INFO] Shutting Down Please let me know if I can provide more info. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 3 |
"Ivan also asked in a recent thread:" :-( I just stopped and restarted BOINC in the middle of some tests, and lo and behold! -- a new app_config.xml was downloaded. |
Joined: 13 Feb 15 Posts: 1188 Credit: 877,389 RAC: 503 |
":-( I just stopped and restarted BOINC in the middle of some tests, and lo and behold! -- a new app_config.xml was downloaded." Which application did you request and with which URL are you connected? I had a fresh https-attach to the dev-project (only CMS selected) and didn't get an app_config.xml (no BOINC (re)start). |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 3 |
":-( I just stopped and restarted BOINC in the middle of some tests, and lo and behold! -- a new app_config.xml was downloaded." I've only got CMS selected for my work locale, but I see this machine is still using the http: URL in project properties. The tasks have https: in boinccmd --get_tasks but the master .xml file seems screwed up. I'll fix that tomorrow when the two current tasks finish. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 3 |
Looks like the CRAB CouchDB database is up again, we're getting green in our jobs graphs again. IIRC they are now shadowing it with a second instance, and will be trying to move it to an Oracle implementation next week. Let's hope that cures our recent problems... |
Joined: 28 Jul 16 Posts: 485 Credit: 394,839 RAC: 0 |
Just to inform the admins. The Dashboard has turned red again and my most recent WU shows a couple of messages like "Job finished in slotx with 151". |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 3 |
"Just to inform the admins." Yes, it looks like this is the usual failure of retries because the result file actually exists from the first job, which was erroneously flagged as a failure due to the earlier CRAB problems. The failures are .2. versions, and .3. retries are being set up. It's annoying for me as well as you, but at least we grant credit based on CPU time, not files received. :-/ |
Joined: 13 Feb 15 Posts: 1188 Credit: 877,389 RAC: 503 |
"Just to inform the admins." Over the last twelve hours I returned 12 jobs. 3 exited with the successful exit code and 9 with exit code 151: "Stageout wrapper finished with exit code 60311. Will report failure to Dashboard." |
Joined: 28 Jul 16 Posts: 485 Credit: 394,839 RAC: 0 |
"... failure of retries because the result file actually exists ..." What happens if the BOINC client suspends CMS and runs another project just during the upload? Will the upload continue after CMS is rescheduled, or do we run into an error? BTW: I noticed that CMS uses (or used) the PUT method for uploads while most other uploads use the POST method. I'm not an HTTP specialist, but IIRC the two methods behave differently if a file already exists on the server. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 3 |
Err..., umm..., Laurence? |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Problems? Can't get jobs. Running jobs numbers falling. 2016-12-14 21:45:35 (5266): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev |
Joined: 12 Sep 14 Posts: 1069 Credit: 334,882 RAC: 0 |
"Problems?" Maybe, looks like the server at RAL is down. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 3 |
"Problems?" Server is up, but Condor is down(?): [cms005@lcggwms02:~] > condor_status Error: communication error CEDAR:6001:Failed to connect to <130.246.180.120:9618> Wed Dec 14 23:47:57 [cms005@lcggwms02:~] > condor_q -- Failed to fetch ads from: <130.246.180.120:9818?noUDP&sock=6553_2151> : lcggwms02.gridpp.rl.ac.uk CEDAR:6001:Failed to connect to <130.246.180.120:9818?noUDP&sock=6553_2151> Wed Dec 14 23:48:06 Have to wait for RAL IT to react, I'm afraid. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 3 |
The disk that generic Condor logs go to filled up -- unfortunately not the one we use that I monitor closely. Andrew has cleared off space pro tem and will do a full clean-out when he has the chance. Sorry, I could have noticed that and taken action myself. |
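
For anyone wanting to catch this sort of full-disk failure earlier, here is a minimal sketch of a periodic check in Python. It assumes the log areas are ordinary mounted filesystems; the function names, the example path and the 90% threshold are all hypothetical and are not taken from the RAL setup.

```python
import shutil


def disk_usage_report(path, warn_fraction=0.9):
    """Return (used_fraction, over_threshold) for the filesystem holding path."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    if usage.total == 0:
        # Pseudo-filesystems can report zero size; treat them as empty.
        return 0.0, False
    used_fraction = usage.used / usage.total
    return used_fraction, used_fraction >= warn_fraction


def check_paths(paths, warn_fraction=0.9):
    """Return (path, used_fraction) for every path at or above the threshold."""
    nearly_full = []
    for path in paths:
        fraction, over = disk_usage_report(path, warn_fraction)
        if over:
            nearly_full.append((path, fraction))
    return nearly_full


if __name__ == "__main__":
    # Hypothetical list: replace with every volume that receives logs,
    # including the ones nobody normally looks at.
    for path, fraction in check_paths(["/"], warn_fraction=0.9):
        print(f"WARNING: {path} is {fraction:.0%} full")
```

Run from cron or a similar scheduler, the point is to list every volume that logs are written to, not only the one already being watched by hand.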