Thread 'Dip?'

Author	Message
ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4377 - Posted: 1 Dec 2016, 11:45:47 UTC - in response to Message 4375.
Quote: As I mentioned before: http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=306&postid=4301#4301 NO MATTER WHAT THE "Max # of CPUs for this project" SETTINGS ARE, I ALWAYS GET A 1 CORE 1896MB TASK. (when not using an app_config.xml)
I see you have 4 "unclaimed" slots on the Condor server [*] and have just had a couple of task failures. Are you trying to run a 4-core VM at the moment? If you don't have an app_config.xml, does BOINC download a new one for you?
[*]
Name                OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
slot1@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:07:36
slot2@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:08:06
slot3@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:08:07
slot4@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:07:38

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4378 - Posted: 1 Dec 2016, 11:50:33 UTC - in response to Message 4377.
Hmm, now you have 8 slots, two claimed & busy, but no apparent task in progress:
slot4@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:07:38
slot4@277-1518-167  LINUX  X86_64  Unclaimed  Idle      0.120   2625  0+00:01:38
slot3@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:08:07
slot3@277-1518-167  LINUX  X86_64  Claimed    Busy      0.120   2625  0+00:01:38
slot2@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:08:06
slot2@277-1518-167  LINUX  X86_64  Unclaimed  Idle      1.000   2625  0+00:01:36
slot1@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:07:36
slot1@277-1518-167  LINUX  X86_64  Claimed    Busy      0.120   2625  0+00:01:36

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4379 - Posted: 1 Dec 2016, 11:52:42 UTC - in response to Message 4378.
Make that 12...
slot4@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:07:38
slot4@277-1518-213  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:01:37
slot4@277-1518-167  LINUX  X86_64  Unclaimed  Idle      0.120   2625  0+00:01:38
slot3@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:08:07
slot3@277-1518-213  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:01:36
slot3@277-1518-167  LINUX  X86_64  Claimed    Busy      0.120   2625  0+00:01:38
slot2@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:08:06
slot2@277-1518-213  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:01:35
slot2@277-1518-167  LINUX  X86_64  Unclaimed  Idle      1.000   2625  0+00:01:36
slot1@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:07:36
slot1@277-1518-213  LINUX  X86_64  Unclaimed  Idle      0.310   1875  0+00:01:07
slot1@277-1518-167  LINUX  X86_64  Claimed    Busy      0.120   2625  0+00:01:36
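Slot listings like the ones in the posts above can be tallied mechanically instead of by eye. The following is only a minimal sketch, assuming the eight whitespace-separated columns shown in the thread (Name, OpSys, Arch, State, Activity, LoadAv, Mem, ActvtyTime); `summarize_slots` is a hypothetical helper, not part of HTCondor, and the sample rows are copied from the listing above.

```python
from collections import Counter


def summarize_slots(listing: str) -> dict:
    """Count slots per (machine, state) from condor_status-style text.

    Assumes each data row has 8 whitespace-separated columns and that the
    first column looks like 'slotN@machine'. Other lines are ignored.
    """
    counts = Counter()
    for line in listing.splitlines():
        fields = line.split()
        # Skip blank lines, header rows, and anything that is not a slot row.
        if len(fields) != 8 or "@" not in fields[0]:
            continue
        machine = fields[0].split("@", 1)[1]
        state = fields[3]
        counts[(machine, state)] += 1
    return dict(counts)


# Rows taken from the listing quoted in the thread.
sample = """\
slot1@277-1518-781 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:07:36
slot1@277-1518-167 LINUX X86_64 Claimed Busy 0.120 2625 0+00:01:36
slot2@277-1518-167 LINUX X86_64 Unclaimed Idle 1.000 2625 0+00:01:36
slot3@277-1518-167 LINUX X86_64 Claimed Busy 0.120 2625 0+00:01:38
"""

for (machine, state), n in sorted(summarize_slots(sample).items()):
    print(f"{machine:14s} {state:10s} {n}")
```

Grouping by the part after the `@` makes it easier to see that the 12 slots belong to three separate 4-slot VMs rather than to one large one.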

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 4380 - Posted: 1 Dec 2016, 11:54:43 UTC - in response to Message 4378.
I am trying to figure out how much memory I need for a 4-core task. So far, it has to be more than 5120MB and less than 7168MB.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4381 - Posted: 1 Dec 2016, 12:30:24 UTC - in response to Message 4380.
Quote: I am trying to figure out how much memory I need for a 4-core task. So far, it has to be more than 5120MB and less than 7168MB.
Some of the Condor slots have now gone away; only the slot1@277-1518-16709 ones are left, two of them busy. I traced one to JobNo 2143, so I'm watching its log file. Hmm, gone now, and no change to the log file.

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 4382 - Posted: 1 Dec 2016, 12:35:32 UTC - in response to Message 4381. Last modified: 1 Dec 2016, 12:42:05 UTC
I let one task with 4 cores and 4608MB of memory error out. I usually shut them down before that happens, so as not to exhaust the quota. You are also running only single-core tasks; on purpose?
5248MB for a 4-core task is working!
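The narrowing-down described in these posts (one size known to fail, one known to work, then try sizes in between) is a bisection. A minimal sketch follows, with loud assumptions: the 128MB step is chosen only for illustration, `task_runs` stands in for actually launching a VM and watching whether it errors out, and the stand-in predicate merely encodes what the thread reports (5120MB failed, 5248MB worked). It also assumes that adding memory never turns a working task into a failing one.

```python
from typing import Callable


def min_working_memory(fails_mb: int, works_mb: int, step_mb: int,
                       task_runs: Callable[[int], bool]) -> int:
    """Bisect between a known-failing and a known-working memory size.

    Returns the smallest size of the form fails_mb + k * step_mb for which
    task_runs() is True, assuming task_runs is monotonic in memory.
    """
    span = works_mb - fails_mb
    if span <= 0 or span % step_mb != 0:
        raise ValueError("works_mb must exceed fails_mb by a multiple of step_mb")
    lo, hi = 0, span // step_mb          # invariant: lo fails, hi works
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if task_runs(fails_mb + mid * step_mb):
            hi = mid
        else:
            lo = mid
    return fails_mb + hi * step_mb


# Hypothetical stand-in for a real test run, consistent with the thread:
# 5120MB was not enough, 5248MB was.
def pretend_task_runs(mem_mb: int) -> bool:
    return mem_mb >= 5248


print(min_working_memory(5120, 7168, 128, pretend_task_runs))
```

With the bounds from the thread (5120MB and 7168MB) and a 128MB step, this takes four trial runs instead of up to sixteen, which matters when each failed run counts against the task quota.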

captainjack Joined: 18 Aug 15 Posts: 14 Credit: 137,212 RAC: 0	Message 4383 - Posted: 1 Dec 2016, 14:05:45 UTC
Ivan asked:
Quote: Actually... Can you please post your app_config.xml so that the more-experienced volunteers can have a chance to critique it? Thanks.
I was not using an app_config.xml.
Ivan also asked in a recent thread:
Quote: If you don't have an app_config.xml, does BOINC download a new one for you?
BOINC did not download an app_config.xml for me.
I just tried another CMS task and got the same result. Some relevant messages below:
<message>
The filename or extension is too long. (0xce) - exit code 206 (0xce)
</message>
2016-12-01 07:43:34 (10148): Setting Memory Size for VM. (2048MB)
2016-12-01 07:43:34 (10148): Setting CPU Count for VM. (3)
2016-12-01 07:45:42 (10148): Guest Log: [INFO] CMS application starting. Check log files.
2016-12-01 07:45:42 (10148): Guest Log: [DEBUG] HTCondor ping
2016-12-01 07:45:52 (10148): Guest Log: [DEBUG] 0
2016-12-01 07:56:53 (10148): Guest Log: [ERROR] Condor exited after 673s without running a job.
2016-12-01 07:56:53 (10148): Guest Log: [INFO] Shutting Down
Please let me know if I can provide more info.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4387 - Posted: 1 Dec 2016, 17:40:21 UTC - in response to Message 4383.
Quote: Ivan also asked in a recent thread: If you don't have an app_config.xml, does BOINC download a new one for you? BOINC did not download an app_config.xml for me.
:-( I just stopped and restarted BOINC in the middle of some tests, and lo and behold! -- a new app_config.xml was downloaded.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,857 RAC: 94	Message 4389 - Posted: 1 Dec 2016, 18:05:28 UTC - in response to Message 4387.
Quote: :-( I just stopped and restarted BOINC in the middle of some tests, and lo and behold! -- a new app_config.xml was downloaded.
Which application did you request and with which URL are you connected?
I had a fresh https-attach to the dev-project (only CMS selected) and didn't get an app_config.xml (no BOINC (re)start).

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4393 - Posted: 1 Dec 2016, 18:24:45 UTC - in response to Message 4389.
Quote: :-( I just stopped and restarted BOINC in the middle of some tests, and lo and behold! -- a new app_config.xml was downloaded.
Quote: Which application did you request and with which URL are you connected? I had a fresh https-attach to the dev-project (only CMS selected) and didn't get an app_config.xml (no BOINC (re)start).
I've only got CMS selected for my work locale, but I see this machine is still using the http: URL in project properties. The tasks have https: in boinccmd --get_tasks, but the master .xml file seems screwed up. I'll fix that tomorrow when the two current tasks finish.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4402 - Posted: 2 Dec 2016, 9:12:40 UTC
Looks like the CRAB CouchDB database is up again; we're getting green in our jobs graphs once more. IIRC they are now shadowing it with a second instance, and will be trying to move it to an Oracle implementation next week. Let's hope that cures our recent problems...

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 543 Credit: 400,710 RAC: 0	Message 4420 - Posted: 3 Dec 2016, 11:17:04 UTC
Just to inform the admins: the Dashboard is turning red again, and my most recent WU shows a couple of messages like "Job finished in slotx with 151".

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4422 - Posted: 3 Dec 2016, 12:39:13 UTC - in response to Message 4420.
Quote: Just to inform the admins: the Dashboard is turning red again, and my most recent WU shows a couple of messages like "Job finished in slotx with 151".
Yes, it looks like this is the usual failure of retries because the result file actually exists from the first job, which was erroneously flagged as a failure due to the earlier CRAB problems. The failures are .2. versions, and .3. retries are being set up. It's annoying for me as well as you, but at least we grant credit based on CPU time, not files received. :-/

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,045,857 RAC: 94	Message 4423 - Posted: 3 Dec 2016, 12:42:12 UTC - in response to Message 4420.
Quote: Just to inform the admins: the Dashboard is turning red again, and my most recent WU shows a couple of messages like "Job finished in slotx with 151".
Over the last twelve hours I returned 12 jobs. Of those, 3 exited with the successful exit code and 9 with exit code 151.
Stageout wrapper finished with exit code 60311. Will report failure to Dashboard.

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Joined: 28 Jul 16 Posts: 543 Credit: 400,710 RAC: 0	Message 4448 - Posted: 4 Dec 2016, 20:20:22 UTC - in response to Message 4422.
Quote: ... failure of retries because the result file actually exists ...
What happens if the BOINC client suspends CMS and runs another project right during the upload? Will the upload continue after CMS is rescheduled, or do we run into an error?
BTW: I noticed that CMS uses (or used) the PUT method for uploads while most other uploads use the POST method. I'm not an HTTP specialist, but IIRC the two methods behave differently if a file already exists on the server.
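On the PUT versus POST point: in general HTTP semantics, PUT asks the server to create or replace the resource at exactly the URL given, so repeating the same PUT is idempotent, while POST hands the data to a resource for processing, and an upload endpoint will often create a new resource on every call. What a real server does when the target file already exists (overwrite, reject, rename) is its own policy, and nothing here describes the CMS stage-out server. The toy in-memory model below is only a sketch of the generic semantics; `ToyStore` and its paths are hypothetical.

```python
class ToyStore:
    """In-memory stand-in for a file server, to contrast PUT and POST."""

    def __init__(self):
        self.files = {}      # path -> content
        self._next_id = 1

    def put(self, path: str, data: bytes) -> int:
        """PUT: create or replace the resource at exactly `path`."""
        created = path not in self.files
        self.files[path] = data
        return 201 if created else 200   # Created vs. OK (replaced)

    def post(self, collection: str, data: bytes):
        """POST: let the server pick a new path under `collection`."""
        path = f"{collection}/{self._next_id}"
        self._next_id += 1
        self.files[path] = data
        return 201, path


store = ToyStore()

# Repeating a PUT leaves exactly one file behind.
print(store.put("/results/job-a.root", b"first attempt"))
print(store.put("/results/job-a.root", b"retry"))
print(len(store.files))

# Repeating a POST creates a second, separate resource.
print(store.post("/uploads", b"first attempt"))
print(store.post("/uploads", b"retry"))
print(len(store.files))
```

In this model a retried PUT silently replaces the earlier file, whereas a server that refuses to overwrite would instead return an error for the second PUT; that policy difference is the kind of thing that could make a retry fail when the result file already exists.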

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4451 - Posted: 4 Dec 2016, 21:32:46 UTC - in response to Message 4448.
Err..., umm..., Laurence?

Rasputin42 Volunteer tester Joined: 16 Aug 15 Posts: 967 Credit: 1,216,795 RAC: 0	Message 4492 - Posted: 14 Dec 2016, 19:55:31 UTC Last modified: 14 Dec 2016, 20:50:41 UTC
Problems? Can't get jobs. The number of running jobs is falling.
2016-12-14 21:45:35 (5266): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev
2016-12-14 21:45:36 (5266): Guest Log: [INFO] CMS application starting. Check log files.
2016-12-14 21:45:36 (5266): Guest Log: [DEBUG] HTCondor ping
2016-12-14 21:45:36 (5266): Guest Log: [DEBUG] 1
2016-12-14 21:45:36 (5266): Guest Log: [DEBUG] 12/14/16 21:45:36 recognized DC_NOP as command name, using command 60011.
2016-12-14 21:45:36 (5266): Guest Log: 12/14/16 21:45:36 attempt to connect to <130.246.180.120:9623> failed: Connection refused (connect errno = 111).
2016-12-14 21:45:36 (5266): Guest Log: ERROR: failed to make connection to <130.246.180.120:9623>
2016-12-14 21:45:36 (5266): Guest Log: [ERROR] Could not ping HTCondor.
2016-12-14 21:45:36 (5266): Guest Log: [INFO] Shutting Down.
2016-12-14 21:45:36 (5266): VM Completion File Detected.
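"Connection refused (connect errno = 111)" in a log like the one above means the host answered but nothing was accepting TCP connections on that port. A rough way to make the same kind of reachability check by hand is sketched below, using only Python's standard `socket` module. `can_connect` is a hypothetical helper, and the demo deliberately uses a throwaway listener on 127.0.0.1 rather than the project's server.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Local demo: open a listener on a free port, probe it, close it, probe again.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))          # port 0 = let the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

print("listener up:  ", can_connect("127.0.0.1", demo_port, timeout=2.0))
listener.close()
print("listener down:", can_connect("127.0.0.1", demo_port, timeout=2.0))
```

A successful TCP connection only shows that something is listening; it does not prove that the HTCondor daemons behind the port are healthy.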

Laurence CERN Project administrator Project developer Project tester Joined: 12 Sep 14 Posts: 1159 Credit: 342,328 RAC: 0	Message 4493 - Posted: 14 Dec 2016, 23:17:38 UTC - in response to Message 4492.
Quote: Problems?
Maybe, looks like the server at RAL is down.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4494 - Posted: 14 Dec 2016, 23:57:46 UTC - in response to Message 4493.
Quote: Problems?
Quote: Maybe, looks like the server at RAL is down.
Server is up, but Condor is down(?):
[cms005@lcggwms02:~] > condor_status
Error: communication error
CEDAR:6001:Failed to connect to <130.246.180.120:9618>
Wed Dec 14 23:47:57
[cms005@lcggwms02:~] > condor_q
-- Failed to fetch ads from: <130.246.180.120:9818?noUDP&sock=6553_2151> : lcggwms02.gridpp.rl.ac.uk
CEDAR:6001:Failed to connect to <130.246.180.120:9818?noUDP&sock=6553_2151>
Wed Dec 14 23:48:06
Have to wait for RAL IT to react, I'm afraid.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 182	Message 4495 - Posted: 15 Dec 2016, 9:09:12 UTC - in response to Message 4494.
The disk that generic Condor logs go to filled up -- unfortunately not the one we use that I monitor closely. Andrew has cleared off space pro tem and will do a full clean-out when he has the chance. Sorry, I could have noticed that and taken action myself.
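A full log disk like the one described above is the sort of thing a small periodic check can catch early. The sketch below uses `shutil.disk_usage` from the Python standard library; the 90% threshold and the watched paths are purely illustrative assumptions, since the thread does not say where the generic Condor logs live.

```python
import shutil


def percent_used(path: str) -> float:
    """Return how full the filesystem containing `path` is, in percent."""
    usage = shutil.disk_usage(path)      # named tuple: total, used, free (bytes)
    return 100.0 * usage.used / usage.total


def is_nearly_full(path: str, threshold_percent: float = 90.0) -> bool:
    """True if the filesystem holding `path` is at or above the threshold."""
    return percent_used(path) >= threshold_percent


# Hypothetical list of locations to watch; "." is used so the demo runs anywhere.
watched_paths = ["."]

for p in watched_paths:
    status = "WARNING" if is_nearly_full(p) else "ok"
    print(f"{p}: {percent_used(p):.1f}% used [{status}]")
```

Run from cron or a monitoring agent against every filesystem the service writes to, not only the one holding the project's own files, a check like this would flag the problem before Condor stops accepting connections.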

Development for LHC@home