Message boards : LHCb Application : New version v1.03

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 5487 - Posted: 3 Sep 2018, 9:09:11 UTC

CVMFS configuration improvements
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 5488 - Posted: 3 Sep 2018, 13:02:41 UTC

Some comments regarding a 2-core WU that is currently in progress.


CVMFS
Works perfectly with openhtc.io as well as with the local proxy.
As a result the startup time is rather short.
2018-09-03 13:36:58 (9955): vboxwrapper (7.7.26196): starting
2018-09-03 13:38:16 (9955): Guest Log: [DEBUG] Detected squid proxy http://<hostname_censored_by_volunteer/>:3128
2018-09-03 13:39:23 (9955): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2018-09-03 13:39:23 (9955): Guest Log: 2.4.4.0 3529 1 25696 7069 2 1 1734417 10240001 2 65024 0 15 100 1 2 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch http://<local_IP_censored_by_volunteer/>:3128 1
2018-09-03 13:41:26 (9955): Guest Log: [INFO] New Job Starting in slot1
2018-09-03 13:41:26 (9955): Guest Log: [INFO] New Job Starting in slot2
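For readers who want to check the numbers in the status line, here is a minimal sketch that pairs the header row with the value row from the log above. It assumes the two rows are whitespace-separated and have the same number of columns (the output appears to come from `cvmfs_config stat`, but the log does not say so). The function name `parse_stat` is only illustrative.

```python
# Header and value rows copied from the Guest Log above (timestamps stripped).
HEADER = (
    "VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS "
    "CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) "
    "RX(K) SPEED(K/S) HOST PROXY ONLINE"
)
VALUES = (
    "2.4.4.0 3529 1 25696 7069 2 1 1734417 10240001 2 65024 0 15 100 1 2 "
    "http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch "
    "http://<local_IP_censored_by_volunteer/>:3128 1"
)


def parse_stat(header: str, values: str) -> dict:
    """Pair each column name with its value (hypothetical helper)."""
    keys = header.split()
    vals = values.split()
    if len(keys) != len(vals):
        raise ValueError(f"{len(keys)} columns but {len(vals)} values")
    return dict(zip(keys, vals))


stat = parse_stat(HEADER, VALUES)

# Fraction of the CVMFS cache in use, from the two cache columns.
cache_fill = int(stat["CACHEUSE(K)"]) / int(stat["CACHEMAX(K)"])

print("hit rate (%):", stat["HITRATE(%)"])
print("I/O errors:  ", stat["NOIOERR"])
print("proxy:       ", stat["PROXY"])
print(f"cache fill:   {cache_fill:.1%}")
```

The interesting columns here are HITRATE(%), NOIOERR, PROXY and ONLINE, which together show that the local proxy is being used and the repository is reachable.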



Multicore Delay
A very short job in slot1 caused the same delay that is described in the non-dev message board:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4790&postid=36609
2018-09-03 13:50:36 (9955): Guest Log: [INFO] Job finished in slot1 with .
2018-09-03 14:01:46 (9955): Guest Log: [INFO] New Job Starting in slot1
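To put a number on the delay, a small sketch that subtracts the two timestamps from the log lines above. The helper name `gap_seconds` is made up for this example; the timestamp format is taken from the log.

```python
from datetime import datetime

# Timestamp layout used by the vboxwrapper log lines above.
FMT = "%Y-%m-%d %H:%M:%S"


def gap_seconds(finished: str, started: str) -> float:
    """Seconds between a 'Job finished' line and the next 'New Job Starting' line."""
    delta = datetime.strptime(started, FMT) - datetime.strptime(finished, FMT)
    return delta.total_seconds()


idle = gap_seconds("2018-09-03 13:50:36", "2018-09-03 14:01:46")
print(f"slot1 idle for {idle:.0f} s ({idle / 60:.1f} min)")
```

So slot1 sat idle for a bit more than 11 minutes while slot2 kept running.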



VM's RAM Setting
Much lower than in the non-dev version.
Thus kswapd0 uses lots of CPU cycles (6 min within 1:15 runtime).
I suggest setting it higher for the final version, or at least giving users with enough RAM a hint to tune the RAM setting via app_config.xml.
computezrmle
Message 5489 - Posted: 4 Sep 2018, 5:18:28 UTC

It looks like the LHCb VM uses the same default RAM size as the Theory VM (1-core: 730 MB; 2-core: 830 MB).
This is much too low as every single job inside the VM needs around 1.3 GB.
It causes lots of swapping activity and finally a crash:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2362165


2 other tasks at 2432 MB (1-core) and 4864 MB (2-core) are running fine and are both close to finishing:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2362161
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2362166
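For volunteers who want to try the larger 2-core setting through app_config.xml (mentioned earlier in the thread), here is a rough sketch that writes such a file's content. The app name and plan class below are placeholders, not the project's real values, and must be replaced with the ones your client shows for this application. The `--memory_size_mb` option is passed to vboxwrapper through `<cmdline>`.

```python
import xml.etree.ElementTree as ET


def build_app_config(app_name: str, plan_class: str, ncpus: int, memory_mb: int) -> str:
    """Return app_config.xml content for one app version (illustrative helper)."""
    root = ET.Element("app_config")
    version = ET.SubElement(root, "app_version")
    ET.SubElement(version, "app_name").text = app_name
    ET.SubElement(version, "plan_class").text = plan_class
    ET.SubElement(version, "avg_ncpus").text = str(ncpus)
    # vboxwrapper reads the VM memory size from its command line.
    ET.SubElement(version, "cmdline").text = f"--memory_size_mb {memory_mb}"
    ET.indent(root)  # pretty-print; needs Python 3.9 or newer
    return ET.tostring(root, encoding="unicode")


# "LHCb" and "example_plan_class" are ASSUMED placeholders, not confirmed names.
xml_text = build_app_config("LHCb", "example_plan_class", ncpus=2, memory_mb=4864)
print(xml_text)
```

The resulting text would go into app_config.xml in the project's directory, after which the client has to re-read its config files.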
maeax
Joined: 22 Apr 16
Posts: 659
Credit: 1,719,716
RAC: 3,335
Message 5490 - Posted: 4 Sep 2018, 5:26:49 UTC

Thinking that this two-core LHCb task was not running well, I upgraded BOINC (7.12.1) and VirtualBox (5.2.18).
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2362164
Windows 10 Pro is also updating because of the AMD-Meltdown corrections.
https://support.microsoft.com/en-us/help/4346783/windows-10-update-kb4346783
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 5491 - Posted: 4 Sep 2018, 9:19:06 UTC

I was not having any problems with the previous version but now they are trying to d/l another HUGE vdi so that will take many hours for each host I use here.

v1.03 is 968.76MB just for that vdi and d/l'ing at a SLOW 8.6kbps and this is after 2am here so that is when I have the fastest time on my end.

So far I see 2 hosts d/l'ing this at the same time and one is at 85% after over 21 HOURS

Ok after looking at host #3 it is at 77% after 15 HOURS

And the one I am on is at 24% after 14 HOURS

I hope that closest one gets finished soon so maybe the other 2 will speed up.

One thing for sure is I won't be d/l'ing this on my other 8-core pc's I have running over at LHC
Mad Scientist For Life
Laurence
Message 5492 - Posted: 5 Sep 2018, 7:39:21 UTC - in response to Message 5489.  

Yes, this is using the Theory plan class. We will need to create an LHCb one. What are good values for the base memory and memory per CPU?
Magic Quantum Mechanic
Message 5493 - Posted: 5 Sep 2018, 7:45:38 UTC - in response to Message 5492.  
Last modified: 5 Sep 2018, 7:46:55 UTC

https://lhcathomedev.cern.ch/lhcathome-dev/results.php?userid=192

Only 2 Valids and now nothing but these Invalids

09/05/18 09:21:28 recognized DC_NOP as command name, using command 60011.
2018-09-05 00:21:40 (3496): Guest Log: 09/05/18 09:21:40 Condor GSI authentication failure
2018-09-05 00:21:40 (3496): Guest Log: GSS Major Status: Authentication Failed
2018-09-05 00:21:40 (3496): Guest Log: GSS Minor Status Error Chain:
2018-09-05 00:21:40 (3496): Guest Log: globus_gss_assist: Error during context initialization
2018-09-05 00:21:40 (3496): Guest Log: globus_gsi_callback_module: Could not verify credential
(3496): Guest Log: globus_gsi_callback_module: Invalid CRL: The available CRL has expired
2018-09-05 00:21:40 (3496): Guest Log: 09/05/18 09:21:41 SECMAN: required authentication with local collector failed, so aborting command DC_SEC_QUERY.
2018-09-05 00:22:25 (3496): Guest Log: [ERROR] Could not ping HTCondor.
computezrmle
Message 5494 - Posted: 5 Sep 2018, 8:55:16 UTC - in response to Message 5492.  

Yes, this is using the Theory plan class. We will need to create an LHCb one. What are good values for the base memory and memory per CPU?

A good starting point for a 1-core setup would be 2048 MB as this value is defined in the LHCb_2017_05_05.xml.
Maybe a bit more to avoid swapping.
My single-core VMs at the production site use 2432 MB and swap out roughly 12 MB.

Recent jobs need up to 1.3 GB per core but IIRC there were jobs with larger requests in the past.
The project scientists should know what could be expected in the future.
That value should be added per additional core.
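The suggestion above can be sketched as a simple formula: a base value for the first core plus a fixed amount for each additional core. The defaults below are taken from this post (2048 MB base, about 1.3 GB per extra core); treating 1.3 GB as 1.3 × 1024 MB is an assumption, and the values in the project's real plan class may differ.

```python
def vm_memory_mb(cores: int, base_mb: int = 2048, per_extra_core_mb: int = 1331) -> int:
    """Estimate VM RAM: base for the first core plus a fixed amount per extra core.

    per_extra_core_mb defaults to round(1.3 * 1024), an ASSUMED conversion
    of the "1.3 GB per core" figure quoted in the post.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return base_mb + (cores - 1) * per_extra_core_mb


for n in (1, 2, 4):
    print(f"{n}-core VM: {vm_memory_mb(n)} MB")
```

A slightly higher base, such as the 2432 MB used on the production site, can be passed as `base_mb` to leave headroom against swapping.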
Laurence
Message 5495 - Posted: 5 Sep 2018, 9:16:48 UTC - in response to Message 5494.  

I have added a plan class for LHCb to reflect those values. Please let me know how it goes.
Magic Quantum Mechanic
Message 5496 - Posted: 5 Sep 2018, 9:28:53 UTC

I guess I will have to add that I suspended all mine here. I have been running most of these LHCb multi's and they do not work with this new version, AND you are welcome to check anyone else's errors with this new version because they are not working.

Since I have been running most of these I already have 30 of these Errors I mentioned and checking the few that are run by other members I see they are all the same Error.

I never had any problems as far as Ram with the previous version of these tasks. v1.02.......hundreds of those tasks worked with no problems.
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1178
Credit: 810,985
RAC: 2,009
Message 5497 - Posted: 5 Sep 2018, 12:36:18 UTC - in response to Message 5495.  
Last modified: 5 Sep 2018, 13:05:16 UTC

I have added a plan class for LHCb to reflect those values. Please let me know how it goes.

The VMs are shut down. => "Could not ping HTCondor"

Something wrong with ?? Invalid CRL: The available CRL has expired ??

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2362317
Magic Quantum Mechanic
Message 5498 - Posted: 5 Sep 2018, 21:32:39 UTC

Sometimes I wonder if my posts ever get read here........I suggest that certain people pay attention to what is said by the member that does MOST of the testing here.

Not wait for a member that has only done one task of any of these new versions.
maeax
Message 5499 - Posted: 5 Sep 2018, 23:42:52 UTC

Is it possible to go back to the old .vdi 1.02?
Magic Quantum Mechanic
Message 5500 - Posted: 6 Sep 2018, 19:03:54 UTC - in response to Message 5499.  

Is it possible to go back to the old .vdi 1.02?


Laurence
Message 5501 - Posted: 7 Sep 2018, 8:27:28 UTC - in response to Message 5499.  

Is it possible to go back to the old .vdi 1.02?

No. The purpose here is to test things rather than maintain a stable running service. We should be looking towards v1.04. The issue is that the Certificate Revocation Lists (CRLs) are not being updated. They were current when the VM was built. Now they are stale, but they should be refreshed over CVMFS. The CVMFS configuration is therefore not broken, but it is not working as intended. We have to investigate. The purpose of this change is to move to openhtc.io like we have done for CMS and Theory.
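As a rough illustration of why a CRL that was current at build time later triggers "The available CRL has expired": each CRL carries a next-update time, and a verifier rejects it once that time has passed. The sketch below only models that comparison; the dates are invented for the example and are not taken from the VM image.

```python
from datetime import datetime, timezone
from typing import Optional


def crl_is_stale(next_update: datetime, now: Optional[datetime] = None) -> bool:
    """True once the CRL's next-update time has passed (simplified model)."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= next_update


# HYPOTHETICAL dates: a CRL valid at build time, checked again some days later.
next_update = datetime(2018, 9, 1, tzinfo=timezone.utc)
at_build = datetime(2018, 8, 28, tzinfo=timezone.utc)
at_run = datetime(2018, 9, 5, tzinfo=timezone.utc)

print("stale at build time:", crl_is_stale(next_update, at_build))
print("stale at run time:  ", crl_is_stale(next_update, at_run))
```

Unless fresh CRLs reach the VM after boot, every authentication attempt fails once the bundled ones pass their next-update time, which matches the errors reported in this thread.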
maeax
Message 5502 - Posted: 7 Sep 2018, 8:33:18 UTC - in response to Message 5501.  
Last modified: 7 Sep 2018, 9:04:02 UTC

Thank you, Laurence.
Laurence
Message 5503 - Posted: 7 Sep 2018, 8:33:41 UTC - in response to Message 5498.  

Sometimes I wonder if my posts ever get read here........I suggest that certain people pay attention to what is said by the member that does MOST of the testing here.

Not wait for a member that has only done one task of any of these new versions.


Your posts always get read! But not all may get an answer. Yesterday was a public holiday.
Magic Quantum Mechanic
Message 5505 - Posted: 7 Sep 2018, 12:29:30 UTC - in response to Message 5503.  


Your posts always get read! But not all may get an answer. Yesterday was a public holiday.


Scroll back and see that I said this new version does not work in BOLD text and you posted here after and before that and said nothing in reply to that.

I posted it the first time minutes after you posted something and I started typing ALL of the facts 2 minutes after your post so you would see everything.

Then I said it again and you posted about something else.

The ones that should get an answer/reply are the ones you get from a member who does most of the testing here of those multi-core Theory and LHCb tasks not questions or statements from members who have not even been running these tasks.

And btw those stats pages STILL are not repaired and so I have to go through the members one by one to find ANYONE else who might have run ANY of these tasks to make a comparison.

It as usual will say many members have been running these tasks yet if you take a look they have not been here at all for over a year and never ran one single multi core task as far as the Theory and LHCb or even CMS.

It is only two members doing that and I have done most of them yet that doesn't seem to be the tasks that get checked here to see if things are working.

I did thousands of the Theory multi-cores before they finally got moved over to LHC and I was the only one that did that and now 500 of these LHCb's Valid and working fine yet that doesn't seem to be how we test things here.

I am the only member that has been running all the testing for Cern since day one and 24/7 since VB started in 2011 including the Atlas-Alpha testing before they even came over here to -dev.

The stats pages here are still all wrong so I have to check everyone on the first page to see who actually is here running any of these LHCb's and the same when I did the thousands of Theory multi-cores.

Funny how it still has members who have not been here for over a year and never ran a single multi-core Theory or LHCb task up at the top of a stats page and I am willing to bet I am the only one that does these tasks and then digs through that stats list trying to find ONE single member here running these tasks while I do and the funny part is I asked him to run some so it wasn't just me doing this for a comparison........is that my job here too?

I am even typing this out at 5:30am so I know what time it is over there (and no I didn't get up early either) and I have never pulled "it's a holiday or a vacation time" out of my pocket either.

So I will just check all 9 of my computers that I have running the same Theory multi-core at LHC and then finally get some sleep.
(and as usual the only problem there is at the Cern server end)

goodnight