Message boards :
LHCb Application :
New version v1.03
Author | Message |
---|---|
Joined: 12 Sep 14 Posts: 1064 Credit: 328,405 RAC: 158 |
CVMFS configuration improvements |
Joined: 28 Jul 16 Posts: 473 Credit: 389,411 RAC: 34 |
Some comments regarding a 2-core WU that is currently in progress. CVMFS: It works perfectly together with openhtc.io as well as with the local proxy. As a result the startup time is rather short. 2018-09-03 13:36:58 (9955): vboxwrapper (7.7.26196): starting 2018-09-03 13:38:16 (9955): Guest Log: [DEBUG] Detected squid proxy http://<hostname_censored_by_volunteer/>:3128 2018-09-03 13:39:23 (9955): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE 2018-09-03 13:39:23 (9955): Guest Log: 2.4.4.0 3529 1 25696 7069 2 1 1734417 10240001 2 65024 0 15 100 1 2 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch http://<local_IP_censored_by_volunteer/>:3128 1 2018-09-03 13:41:26 (9955): Guest Log: [INFO] New Job Starting in slot1 2018-09-03 13:41:26 (9955): Guest Log: [INFO] New Job Starting in slot2 Multicore delay: A very short job in slot1 caused the same delay that is described on the non-dev message board: https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=4790&postid=36609 2018-09-03 13:50:36 (9955): Guest Log: [INFO] Job finished in slot1 with . 2018-09-03 14:01:46 (9955): Guest Log: [INFO] New Job Starting in slot1 VM's RAM setting: It is much lower than in the non-dev version. As a result kswapd0 uses lots of CPU cycles (6 min within 1:15 runtime). I suggest setting it higher for the final version, or at least giving users with enough RAM a hint to tune the RAM setting via app_config.xml. |
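The two Guest Log lines with the CVMFS statistics are hard to read when flattened. A minimal Python sketch that pairs each column name with its value is shown here. The header and value strings are copied from the log above; the function name `parse_stat` and the derived cache-usage figure are illustrative assumptions, not part of any CVMFS tool.

```python
# Header and value rows copied verbatim from the Guest Log lines above.
HEADER = (
    "VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS "
    "CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN "
    "HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE"
)
VALUES = (
    "2.4.4.0 3529 1 25696 7069 2 1 1734417 10240001 2 65024 0 15 100 "
    "1 2 http://s1cern-cvmfs.openhtc.io/cvmfs/grid.cern.ch "
    "http://<local_IP_censored_by_volunteer/>:3128 1"
)


def parse_stat(header: str, values: str) -> dict:
    """Pair whitespace-separated column names with their values.

    Hypothetical helper: it assumes no field contains a space.
    """
    names = header.split()
    fields = values.split()
    if len(names) != len(fields):
        raise ValueError(f"{len(names)} columns but {len(fields)} values")
    return dict(zip(names, fields))


stats = parse_stat(HEADER, VALUES)

# Share of the CVMFS cache in use, derived from the two cache columns.
cache_used_pct = 100 * int(stats["CACHEUSE(K)"]) / int(stats["CACHEMAX(K)"])

for name in ("VERSION", "HITRATE(%)", "HOST", "PROXY", "ONLINE"):
    print(f"{name:12} {stats[name]}")
print(f"{'cache used':12} {cache_used_pct:.1f} %")
```

Read this way, the line shows the openhtc.io host and the local proxy both in use, with a 100% hit rate reported shortly after startup.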
Joined: 28 Jul 16 Posts: 473 Credit: 389,411 RAC: 34 |
It looks like the LHCb VM uses the same default RAM size as the Theory VM (1-core: 730 MB; 2-core: 830 MB). This is much too low, as every single job inside the VM needs around 1.3 GB. It causes lots of swapping activity and finally a crash: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2362165 Two other tasks, at 2432 MB (1-core) and 4864 MB (2-core), are running fine and are both close to finishing: https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2362161 https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2362166 |
Joined: 22 Apr 16 Posts: 664 Credit: 1,807,614 RAC: 2,394 |
Thinking that this two-core LHCb task was not running well, I upgraded BOINC (7.12.1) and VirtualBox (5.2.18). https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2362164 Windows 10 Pro is also updating because of the AMD Meltdown fixes. https://support.microsoft.com/en-us/help/4346783/windows-10-update-kb4346783 |
Joined: 8 Apr 15 Posts: 751 Credit: 11,609,314 RAC: 1,490 |
I was not having any problems with the previous version, but now they are trying to d/l another HUGE vdi, so that will take many hours for each host I use here. v1.03 is 968.76MB just for that vdi and it is d/l'ing at a SLOW 8.6kbps, and this is after 2am here, which is when I have the fastest time on my end. So far I see 2 hosts d/l'ing this at the same time and one is at 85% after over 21 HOURS. Ok, after looking at host #3, it is at 77% after 15 HOURS. And the one I am on is at 24% after 14 HOURS. I hope that closest one gets finished soon so maybe the other 2 will speed up. One thing for sure is I won't be d/l'ing this on my other 8-core PCs I have running over at LHC. Mad Scientist For Life |
Joined: 12 Sep 14 Posts: 1064 Credit: 328,405 RAC: 158 |
Yes, this is using the Theory plan class. We will need to create one for LHCb. What are good values for the base memory and the memory per CPU? |
Joined: 8 Apr 15 Posts: 751 Credit: 11,609,314 RAC: 1,490 |
https://lhcathomedev.cern.ch/lhcathome-dev/results.php?userid=192 Only 2 Valids and now nothing but these Invalids 09/05/18 09:21:28 recognized DC_NOP as command name, using command 60011. 2018-09-05 00:21:40 (3496): Guest Log: 09/05/18 09:21:40 Condor GSI authentication failure 2018-09-05 00:21:40 (3496): Guest Log: GSS Major Status: Authentication Failed 2018-09-05 00:21:40 (3496): Guest Log: GSS Minor Status Error Chain: 2018-09-05 00:21:40 (3496): Guest Log: globus_gss_assist: Error during context initialization 2018-09-05 00:21:40 (3496): Guest Log: globus_gsi_callback_module: Could not verify credential (3496): Guest Log: globus_gsi_callback_module: Invalid CRL: The available CRL has expired 2018-09-05 00:21:40 (3496): Guest Log: 09/05/18 09:21:41 SECMAN: required authentication with local collector failed, so aborting command DC_SEC_QUERY. 2018-09-05 00:22:25 (3496): Guest Log: [ERROR] Could not ping HTCondor. |
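For anyone comparing several failed results, the error lines above can be picked out of a task's stderr automatically. This is a minimal Python sketch: the signature strings are copied from the log excerpt in this post, while the labels and the function name `classify` are made up for illustration.

```python
# Failure signatures copied from the Guest Log excerpt above.
# The labels on the left are illustrative names, not official error codes.
SIGNATURES = {
    "expired_crl": "Invalid CRL: The available CRL has expired",
    "gsi_auth_failure": "Condor GSI authentication failure",
    "condor_unreachable": "[ERROR] Could not ping HTCondor",
}

# Shortened sample built from lines quoted in the post.
SAMPLE_LOG = """\
2018-09-05 00:21:40 (3496): Guest Log: 09/05/18 09:21:40 Condor GSI authentication failure
2018-09-05 00:21:40 (3496): Guest Log: GSS Major Status: Authentication Failed
(3496): Guest Log: globus_gsi_callback_module: Invalid CRL: The available CRL has expired
2018-09-05 00:22:25 (3496): Guest Log: [ERROR] Could not ping HTCondor.
"""


def classify(log_text: str) -> list:
    """Return the labels of all known signatures found in the log text."""
    return [label for label, needle in SIGNATURES.items() if needle in log_text]


print(classify(SAMPLE_LOG))
```

If every failed task reports the expired CRL together with the HTCondor ping error, that points to a common cause on the VM image side rather than to individual hosts.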
Joined: 28 Jul 16 Posts: 473 Credit: 389,411 RAC: 34 |
Yes this is using the theory plan class. We will need to create an LHCb. What are good values for the base memory and memory per cpu. A good starting point for a 1-core setup would be 2048 MB, as this value is defined in the LHCb_2017_05_05.xml. Maybe a bit more to avoid swapping. My single-core VMs at the production site use 2432 MB and swap out roughly 12 MB. Recent jobs need up to 1.3 GB per core, but IIRC there were jobs with larger requests in the past. The project scientists should know what could be expected in the future. The same value should be added for each additional core. |
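The sizing rule discussed here (a base amount plus a fixed amount per CPU) can be checked against the numbers in this thread with a few lines of Python. This is only a sketch: the function name `vm_memory_mb` is hypothetical, and the Theory split into 630 MB base plus 100 MB per core is inferred from the two defaults quoted earlier (730 MB and 830 MB), assuming the plan class scales linearly.

```python
def vm_memory_mb(ncores: int, base_mb: int, per_core_mb: int) -> int:
    """VM memory under the assumed linear rule: base + per-core * cores."""
    if ncores < 1:
        raise ValueError("ncores must be at least 1")
    return base_mb + per_core_mb * ncores


# Roughly 1.3 GB per job, one job per core (figure taken from the thread).
JOB_NEED_MB = 1300

# Theory defaults quoted in the thread: 730 MB (1-core), 830 MB (2-core).
# The 630/100 split is an inference from those two data points.
theory = {n: vm_memory_mb(n, base_mb=630, per_core_mb=100) for n in (1, 2)}

# Suggested LHCb rule: no separate base, the same value for every core.
lhcb_2048 = {n: vm_memory_mb(n, base_mb=0, per_core_mb=2048) for n in (1, 2)}
lhcb_2432 = {n: vm_memory_mb(n, base_mb=0, per_core_mb=2432) for n in (1, 2)}

for name, sizes in (("Theory", theory), ("LHCb 2048", lhcb_2048), ("LHCb 2432", lhcb_2432)):
    for ncores, mb in sizes.items():
        headroom = mb - JOB_NEED_MB * ncores
        print(f"{name:10} {ncores}-core: {mb:5d} MB, headroom {headroom:+d} MB")
```

The negative headroom for the Theory defaults matches the heavy swapping reported above, while 2432 MB per core reproduces the 2432 MB and 4864 MB settings that ran fine.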
Joined: 12 Sep 14 Posts: 1064 Credit: 328,405 RAC: 158 |
I have added a plan class for LHCb to reflect those values. Please let me know how it goes. |
Joined: 8 Apr 15 Posts: 751 Credit: 11,609,314 RAC: 1,490 |
I guess I will have to add that I suspended all mine here. I have been running most of these LHCb multi's and they do not work with this new version, AND you are welcome to check anyone else's errors with this new version because they are not working. Since I have been running most of these, I already have 30 of these Errors I mentioned, and checking the few that are run by other members I see they are all the same Error. I never had any problems as far as RAM with the previous version of these tasks. v1.02.......hundreds of those tasks worked with no problems. Mad Scientist For Life |
Joined: 13 Feb 15 Posts: 1180 Credit: 815,336 RAC: 238 |
I have added a plan class for LHCb to reflect those values. Please let me know how it goes. The VMs are shut down. => "Could not ping HTCondor" Something wrong with ?? Invalid CRL: The available CRL has expired ?? https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2362317 |
Joined: 8 Apr 15 Posts: 751 Credit: 11,609,314 RAC: 1,490 |
Sometimes I wonder if my posts ever get read here........I suggest that certain people pay attention to what is said by the member that does MOST of the testing here. Not wait for a member that has only done one task of any of these new versions. |
Joined: 22 Apr 16 Posts: 664 Credit: 1,807,614 RAC: 2,394 |
Is it possible to go back to the old .vdi 1.02? |
Joined: 8 Apr 15 Posts: 751 Credit: 11,609,314 RAC: 1,490 |
Is it possible to go back to the old .vdi 1.02? |
Joined: 12 Sep 14 Posts: 1064 Credit: 328,405 RAC: 158 |
Is it possible to go back to the old .vdi 1.02? No. The purpose here is to test things rather than maintain a stable running service. We should be looking towards v1.04. The issue is this: the Certificate Revocation Lists (CRLs) are not being updated. When the VM was built they were current. Now they are stale, but they should be refreshed over CVMFS. So the CVMFS configuration is not necessarily broken, but the refresh is not working. We have to investigate. The purpose of this change is to move to openhtc.io like we have done for CMS and Theory. |
Joined: 22 Apr 16 Posts: 664 Credit: 1,807,614 RAC: 2,394 |
Thank you Laurence, |
Joined: 12 Sep 14 Posts: 1064 Credit: 328,405 RAC: 158 |
Sometimes I wonder if my posts ever get read here........I suggest that certain people pay attention to what is said by the member that does MOST of the testing here. Your posts always get read! But not all may get an answer. Yesterday was a public holiday. |
Joined: 8 Apr 15 Posts: 751 Credit: 11,609,314 RAC: 1,490 |
Scroll back and see that I said this new version does not work in BOLD text and you posted here after and before that and said nothing in reply to that. I posted it the first time minutes after you posted something and I started typing ALL of the facts 2 minutes after your post so you would see everything. Then I said it again and you posted about something else. The ones that should get an answer/reply are the ones you get from a member who does most of the testing here of those multi-core Theory and LHCb tasks, not questions or statements from members who have not even been running these tasks. And btw those stats pages STILL are not repaired and so I have to go through the members one by one to find ANYONE else who might have run ANY of these tasks to make a comparison. It as usual will say many members have been running these tasks, yet if you take a look they have not been here at all for over a year and never ran one single multi-core task as far as the Theory and LHCb or even CMS. It is only two members doing that and I have done most of them, yet those don't seem to be the tasks that get checked here to see if things are working. I did thousands of the Theory multi-cores before they finally got moved over to LHC and I was the only one that did that, and now 500 of these LHCb's Valid and working fine, yet that doesn't seem to be how we test things here. I am the only member that has been running all the testing for CERN since day one and 24/7 since VB started in 2011, including the Atlas-Alpha testing before they even came over here to -dev. The stats pages here are still all wrong so I have to check everyone on the first page to see who actually is here running any of these LHCb's, and the same when I did the thousands of Theory multi-cores. 
Funny how it still has members who have not been here for over a year and never ran a single multi-core Theory or LHCb task up at the top of a stats page and I am willing to bet I am the only one that does these tasks and then digs through that stats list trying to find ONE single member here running these tasks while I do and the funny part is I asked him to run some so it wasn't just me doing this for a comparison........is that my job here too? I am even typing this out at 5:30am so I know what time it is over there (and no I didn't get up early either) and I have never pulled "its a holiday or a vacation time" out of my pocket either. So I will just check all 9 of my computers that I have running the same Theory multi-core at LHC and then finally get some sleep. (and as usual the only problem there is at the Cern server end) goodnight Mad Scientist For Life |