Message boards : CMS Application : Dip?

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4377 - Posted: 1 Dec 2016, 11:45:47 UTC - in response to Message 4375.  

As I mentioned before:

http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=306&postid=4301#4301

NO MATTER WHAT THE "Max # of CPUs for this project" SETTINGS ARE, I ALWAYS GET A 1 CORE 1896MB TASK.

(when not using an app_config.xml)

I see you have 4 "unclaimed" slots on the Condor server[*] and have just had a couple of task failures. Are you trying to run a 4-core VM at the moment?
If you don't have an app_config.xml, does BOINC download a new one for you?

[*]
Name                OpSys  Arch    State      Activity  LoadAv  Mem   ActvtyTime
slot1@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:07:36
slot2@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:08:06
slot3@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:08:07
slot4@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:07:38


ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4378 - Posted: 1 Dec 2016, 11:50:33 UTC - in response to Message 4377.  

Hmm, now you have 8 slots, two claimed & busy, but no apparent task in progress:

slot4@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:07:38
slot4@277-1518-167  LINUX  X86_64  Unclaimed  Idle      0.120   2625  0+00:01:38
slot3@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:08:07
slot3@277-1518-167  LINUX  X86_64  Claimed    Busy      0.120   2625  0+00:01:38
slot2@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:08:06
slot2@277-1518-167  LINUX  X86_64  Unclaimed  Idle      1.000   2625  0+00:01:36
slot1@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:07:36
slot1@277-1518-167  LINUX  X86_64  Claimed    Busy      0.120   2625  0+00:01:36

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4379 - Posted: 1 Dec 2016, 11:52:42 UTC - in response to Message 4378.  

Make that 12...

slot4@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:07:38
slot4@277-1518-213  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:01:37
slot4@277-1518-167  LINUX  X86_64  Unclaimed  Idle      0.120   2625  0+00:01:38
slot3@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:08:07
slot3@277-1518-213  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:01:36
slot3@277-1518-167  LINUX  X86_64  Claimed    Busy      0.120   2625  0+00:01:38
slot2@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:08:06
slot2@277-1518-213  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:01:35
slot2@277-1518-167  LINUX  X86_64  Unclaimed  Idle      1.000   2625  0+00:01:36
slot1@277-1518-781  LINUX  X86_64  Unclaimed  Idle      0.000   1875  0+00:07:36
slot1@277-1518-213  LINUX  X86_64  Unclaimed  Idle      0.310   1875  0+00:01:07
slot1@277-1518-167  LINUX  X86_64  Claimed    Busy      0.120   2625  0+00:01:36
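
Listings like this can be tallied with a few lines of Python. This is only a minimal sketch: it assumes the eight whitespace-separated columns of the condor_status header quoted earlier in the thread (Name, OpSys, Arch, State, Activity, LoadAv, Mem, ActvtyTime), and the function name `summarize_slots` is made up for illustration.

```python
from collections import Counter

# A few rows copied from the listing above.
SAMPLE = """\
slot1@277-1518-781 LINUX X86_64 Unclaimed Idle 0.000 1875 0+00:07:36
slot1@277-1518-213 LINUX X86_64 Unclaimed Idle 0.310 1875 0+00:01:07
slot1@277-1518-167 LINUX X86_64 Claimed Busy 0.120 2625 0+00:01:36
slot3@277-1518-167 LINUX X86_64 Claimed Busy 0.120 2625 0+00:01:38
"""


def summarize_slots(text):
    """Count slots per (machine, state) from pasted condor_status rows."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, headers, and anything that is not a slot row.
        if len(fields) != 8 or "@" not in fields[0]:
            continue
        machine = fields[0].split("@", 1)[1]
        counts[(machine, fields[3])] += 1
    return counts


for (machine, state), n in sorted(summarize_slots(SAMPLE).items()):
    print(machine, state, n)
```

Grouping by the part of the slot name after the "@" makes it easier to see which VM instance the claimed slots belong to.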

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4380 - Posted: 1 Dec 2016, 11:54:43 UTC - in response to Message 4378.  

I am trying to figure out how much memory I need for a 4-core task.
So far, it has to be more than 5120MB and less than 7168MB.
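
The bracketing above is essentially a bisection. A minimal sketch, assuming memory is tried in 128MB steps and that larger sizes always work once one size works. The callback `task_runs(mem_mb)` is hypothetical: in practice it means launching a task at that size and waiting to see whether it survives, and nothing in BOINC provides such a call. The 5200MB threshold in the demo is invented purely to exercise the search and is not a measured value.

```python
def smallest_working_memory(fails_mb, works_mb, task_runs, step_mb=128):
    """Bisect between a size known to fail and a size known to work.

    Returns the smallest multiple of step_mb for which task_runs() is True,
    assuming every larger size works as well.
    """
    lo, hi = fails_mb // step_mb, works_mb // step_mb
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if task_runs(mid * step_mb):
            hi = mid  # mid works: the answer is mid or smaller
        else:
            lo = mid  # mid fails: the answer is larger
    return hi * step_mb


# Demo with a made-up requirement of 5200MB (hypothetical, for illustration).
print(smallest_working_memory(5120, 7168, lambda mb: mb >= 5200))
```

With the 5120MB and 7168MB bounds this needs four trial runs instead of up to sixteen when stepping through every 128MB increment.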

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4381 - Posted: 1 Dec 2016, 12:30:24 UTC - in response to Message 4380.  

I am trying to figure out how much memory I need for a 4-core task.
So far, it has to be more than 5120MB and less than 7168MB.

Some of the Condor slots have now gone away; only the slot1@277-1518-16709 ones are left, two of them busy. I traced one to JobNo 2143, so I'm watching its log file.
Hmm, gone now, and no change to the log file.

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4382 - Posted: 1 Dec 2016, 12:35:32 UTC - in response to Message 4381.  
Last modified: 1 Dec 2016, 12:42:05 UTC

I let 1 task with 4 cores and 4608MB of memory error out.
I usually shut them down before that happens, so as not to exhaust the quota.

You are also just running single-core tasks; is that on purpose?

5248MB for a 4-core task is working!
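
For anyone who wants to pin those values, here is a minimal sketch that writes a matching app_config.xml. The `<app_version>` elements (`app_name`, `plan_class`, `avg_ncpus`, `cmdline`) are the standard BOINC ones, and `--nthreads` and `--memory_size_mb` are vboxwrapper options. The app name "CMS" and the plan class "vbox64_mt_mcore_cms" are assumptions, so check client_state.xml for the names your client actually reports.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name, plan_class, ncpus, memory_mb):
    """Build an app_config.xml tree with a single <app_version> override."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # The command line is handed to vboxwrapper when the task starts.
    ET.SubElement(version, "cmdline").text = (
        f"--nthreads {ncpus} --memory_size_mb {memory_mb}"
    )
    return root


# "CMS" and "vbox64_mt_mcore_cms" are assumed names, not confirmed in this thread.
config = build_app_config("CMS", "vbox64_mt_mcore_cms", ncpus=4, memory_mb=5248)
ET.indent(config)  # pretty-printing helper, available from Python 3.9
print(ET.tostring(config, encoding="unicode"))
```

The printed XML goes into the project's directory under the BOINC data folder; the client picks it up after "Read config files" or a restart.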

captainjack
Joined: 18 Aug 15
Posts: 14
Credit: 117,668
RAC: 1,309
Message 4383 - Posted: 1 Dec 2016, 14:05:45 UTC

Ivan asked:

Actually... Can you please post your app_config.xml so that the more-experienced volunteers can have a chance to critique it? Thanks.


I was not using an app_config.xml.

Ivan also asked in a recent thread:

If you don't have an app_config.xml, does BOINC download a new one for you?


BOINC did not download an app_config.xml for me.

I just tried another CMS task and got the same result. Some relevant messages below:

<message>
The filename or extension is too long.
(0xce) - exit code 206 (0xce)
</message>

2016-12-01 07:43:34 (10148): Setting Memory Size for VM. (2048MB)
2016-12-01 07:43:34 (10148): Setting CPU Count for VM. (3)

2016-12-01 07:45:42 (10148): Guest Log: [INFO] CMS application starting. Check log files.
2016-12-01 07:45:42 (10148): Guest Log: [DEBUG] HTCondor ping
2016-12-01 07:45:52 (10148): Guest Log: [DEBUG] 0
2016-12-01 07:56:53 (10148): Guest Log: [ERROR] Condor exited after 673s without running a job.
2016-12-01 07:56:53 (10148): Guest Log: [INFO] Shutting Down

Please let me know if I can provide more info.
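
When comparing several task logs it can help to pull out just the VM settings and the guest errors. A minimal sketch, assuming the line formats shown in the excerpt above; the function name `summarize_log` is made up for illustration.

```python
import re

# Lines copied from the log excerpt above.
LOG = """\
2016-12-01 07:43:34 (10148): Setting Memory Size for VM. (2048MB)
2016-12-01 07:43:34 (10148): Setting CPU Count for VM. (3)
2016-12-01 07:56:53 (10148): Guest Log: [ERROR] Condor exited after 673s without running a job.
"""


def summarize_log(text):
    """Extract VM memory, CPU count, and guest [ERROR] messages from a task log."""
    summary = {"memory_mb": None, "cpus": None, "errors": []}
    for line in text.splitlines():
        if m := re.search(r"Setting Memory Size for VM\. \((\d+)MB\)", line):
            summary["memory_mb"] = int(m.group(1))
        elif m := re.search(r"Setting CPU Count for VM\. \((\d+)\)", line):
            summary["cpus"] = int(m.group(1))
        elif "[ERROR]" in line:
            summary["errors"].append(line.split("[ERROR]", 1)[1].strip())
    return summary


print(summarize_log(LOG))
```

For this log it reports a 3-CPU VM with 2048MB, which may be relevant given the memory figures others found necessary for multi-core tasks earlier in the thread.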

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4387 - Posted: 1 Dec 2016, 17:40:21 UTC - in response to Message 4383.  

Ivan also asked in a recent thread:

If you don't have an app_config.xml, does BOINC download a new one for you?


BOINC did not download an app_config.xml for me.


:-( I just stopped and restarted BOINC in the middle of some tests, and lo and behold! -- a new app_config.xml was downloaded.

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1178
Credit: 810,202
RAC: 2,083
Message 4389 - Posted: 1 Dec 2016, 18:05:28 UTC - in response to Message 4387.  

:-( I just stopped and restarted BOINC in the middle of some tests, and lo and behold! -- a new app_config.xml was downloaded.

Which application did you request and with which URL are you connected?
I had a fresh https-attach to the dev-project (only CMS selected) and didn't get an app_config.xml (no BOINC (re)start).

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4393 - Posted: 1 Dec 2016, 18:24:45 UTC - in response to Message 4389.  

:-( I just stopped and restarted BOINC in the middle of some tests, and lo and behold! -- a new app_config.xml was downloaded.

Which application did you request and with which URL are you connected?
I had a fresh https-attach to the dev-project (only CMS selected) and didn't get an app_config.xml (no BOINC (re)start).

I've only got CMS selected for my work locale, but I see this machine is still using the http: URL in project properties. The tasks have https: in boinccmd --get_tasks, but the master .xml file seems screwed up. I'll fix that tomorrow when the two current tasks finish.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4402 - Posted: 2 Dec 2016, 9:12:40 UTC

Looks like the CRAB CouchDB database is up again; we're getting green in our jobs graphs again. IIRC they are now shadowing it with a second instance, and will be trying to move it to an Oracle implementation next week. Let's hope that cures our recent problems...

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 555
Message 4420 - Posted: 3 Dec 2016, 11:17:04 UTC

Just to inform the admins.

The Dashboard is turning red again, and my most recent WU shows a couple of messages like
"Job finished in slotx with 151".

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4422 - Posted: 3 Dec 2016, 12:39:13 UTC - in response to Message 4420.  

Just to inform the admins.

The Dashboard is turning red again, and my most recent WU shows a couple of messages like
"Job finished in slotx with 151".

Yes, it looks like this is the usual failure of retries because the result file actually exists from the first job which was erroneously flagged as a failure due to the earlier CRAB problems. The failures are .2. versions, and .3. retries are being set up. It's annoying for me as well as you, but at least we grant credit based on CPU time, not files received. :-/

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1178
Credit: 810,202
RAC: 2,083
Message 4423 - Posted: 3 Dec 2016, 12:42:12 UTC - in response to Message 4420.  

Just to inform the admins.

The Dashboard is turning red again, and my most recent WU shows a couple of messages like
"Job finished in slotx with 151".

Over the last twelve hours I returned 12 jobs.
3 exited with the successful exit code and 9 with exit code 151.
Stageout wrapper finished with exit code 60311. Will report failure to Dashboard.

computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 555
Message 4448 - Posted: 4 Dec 2016, 20:20:22 UTC - in response to Message 4422.  

... failure of retries because the result file actually exists ...

What happens if the BOINC client suspends CMS and runs another project right in the middle of the upload? Will the upload continue once CMS is rescheduled, or do we run into an error?

BTW:
I noticed that CMS uses (or used) the PUT method for uploads, while most other uploads use the POST method.
I'm not an HTTP specialist, but IIRC the two methods behave differently if the file already exists on the server.
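
As a rough illustration of the difference in HTTP semantics: PUT addresses one specific URL and creates or replaces whatever is stored there, while POST hands the body to the server, which decides what to do with it and often creates a new resource each time. The toy class below is an in-memory stand-in written for this post; it is not the CMS stage-out service, and how that service really treats an existing file is not known from this thread. A real server is also free to reject an overwriting PUT.

```python
class ToyServer:
    """In-memory stand-in for an upload server (hypothetical, for illustration)."""

    def __init__(self):
        self.files = {}

    def put(self, path, body):
        # PUT targets a specific URL: create it, or replace what is there.
        created = path not in self.files
        self.files[path] = body
        return 201 if created else 200

    def post(self, collection, body):
        # POST leaves naming to the server, so each call adds a new entry.
        path = f"{collection}/{len(self.files) + 1}"
        self.files[path] = body
        return 201, path


server = ToyServer()
print(server.put("/results/job.root", b"first"))  # new file
print(server.put("/results/job.root", b"retry"))  # same URL, content replaced
print(server.post("/results", b"first"))
print(server.post("/results", b"retry"))  # a second, separate entry
```

Repeating the same PUT leaves a single file behind, which is why a retry that finds the file already present can be handled very differently from a repeated POST.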

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4451 - Posted: 4 Dec 2016, 21:32:46 UTC - in response to Message 4448.  

Err..., umm..., Laurence?

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 4492 - Posted: 14 Dec 2016, 19:55:31 UTC
Last modified: 14 Dec 2016, 20:50:41 UTC

Problems?

Can't get jobs. The number of running jobs is falling.

2016-12-14 21:45:35 (5266): Guest Log: [INFO] Requesting an X509 credential from vLHC@home-dev
2016-12-14 21:45:36 (5266): Guest Log: [INFO] CMS application starting. Check log files.
2016-12-14 21:45:36 (5266): Guest Log: [DEBUG] HTCondor ping
2016-12-14 21:45:36 (5266): Guest Log: [DEBUG] 1
2016-12-14 21:45:36 (5266): Guest Log: [DEBUG] 12/14/16 21:45:36 recognized DC_NOP as command name, using command 60011.
2016-12-14 21:45:36 (5266): Guest Log: 12/14/16 21:45:36 attempt to connect to <130.246.180.120:9623> failed: Connection refused (connect errno = 111).
2016-12-14 21:45:36 (5266): Guest Log: ERROR: failed to make connection to <130.246.180.120:9623>
2016-12-14 21:45:36 (5266): Guest Log: [ERROR] Could not ping HTCondor.
2016-12-14 21:45:36 (5266): Guest Log: [INFO] Shutting Down.
2016-12-14 21:45:36 (5266): VM Completion File Detected.
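
The "Connection refused" in the log can be checked from outside the VM with a plain TCP connect. A minimal sketch using only the Python standard library; the helper name `port_is_open` is made up for illustration.

```python
import socket


def port_is_open(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused" (errno 111 on Linux), timeouts,
        # unreachable networks, and name-resolution failures.
        return False
```

Calling `port_is_open("130.246.180.120", 9623)` with the address and port from the log tells you whether anything is listening there. A False result does not distinguish a stopped daemon from a firewall or a routing problem.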

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,885
RAC: 273
Message 4493 - Posted: 14 Dec 2016, 23:17:38 UTC - in response to Message 4492.  

Problems?


Maybe, looks like the server at RAL is down.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4494 - Posted: 14 Dec 2016, 23:57:46 UTC - in response to Message 4493.  

Problems?


Maybe, looks like the server at RAL is down.


Server is up, but Condor is down(?):

[cms005@lcggwms02:~] > condor_status
Error: communication error
CEDAR:6001:Failed to connect to <130.246.180.120:9618>

Wed Dec 14 23:47:57
[cms005@lcggwms02:~] > condor_q
-- Failed to fetch ads from: <130.246.180.120:9818?noUDP&sock=6553_2151> : lcggwms02.gridpp.rl.ac.uk
CEDAR:6001:Failed to connect to <130.246.180.120:9818?noUDP&sock=6553_2151>

Wed Dec 14 23:48:06

Have to wait for RAL IT to react, I'm afraid.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 4495 - Posted: 15 Dec 2016, 9:09:12 UTC - in response to Message 4494.  

The disk that generic Condor logs go to filled up -- unfortunately not the one we use that I monitor closely. Andrew has cleared off space pro tem and will do a full clean-out when he has the chance. Sorry, I could have noticed that and taken action myself.
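
A full disk like this is easy to watch for with a small periodic check. A minimal sketch using the standard library's `shutil.disk_usage`; the 90% threshold and the "." path are placeholders, since the real Condor log directories on the server are not given in this thread.

```python
import shutil


def disk_fraction_used(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0  # pseudo-filesystems can report a zero size
    return usage.used / usage.total


def check_disks(paths, threshold=0.90):
    """Return the paths whose filesystems are fuller than threshold."""
    return [p for p in paths if disk_fraction_used(p) > threshold]


# "." is a placeholder; list every log partition here, not only the monitored one.
print(check_disks(["."]))
```

Run from cron or a similar scheduler, a non-empty result is the cue to clear space before the daemons stop writing.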


©2024 CERN