Message boards :
CMS Application :
CMS versions aborts
Author | Message |
---|---|
Send message Joined: 28 Sep 15 Posts: 5 Credit: 430,183 RAC: 908 |
Sometimes I try to run this subproject, but every time the WU aborts after about 10 minutes of elapsed time. http://lhcathomedev.cern.ch/vLHCathome-dev/results.php?userid=310 |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Hi zioriga. Did you ever run vbox tasks successfully? Do you have vt-x enabled in the BIOS? |
Send message Joined: 28 Sep 15 Posts: 5 Credit: 430,183 RAC: 908 |
yes |
Send message Joined: 28 Sep 15 Posts: 5 Credit: 430,183 RAC: 908 |
If VT-x is not enabled, it is impossible to install VirtualBox. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 329,348 RAC: 230 |
Without hardware-assisted virtualization you can install VirtualBox, but you can't run 64-bit VMs. Please could you read this post and let me know if it solves the problem. |
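On Linux, whether the CPU advertises hardware-assisted virtualization can be checked from the flags in /proc/cpuinfo. A minimal sketch (note the flag can be present yet still disabled in the BIOS/UEFI, which is a separate check):

```shell
# check_hw_virt: report whether a /proc/cpuinfo-style file advertises
# hardware-assisted virtualisation (Intel VT-x -> "vmx", AMD-V -> "svm").
# The flag being present does NOT prove it is enabled in the BIOS/UEFI.
check_hw_virt() {
  if grep -q -E '^flags.*\b(vmx|svm)\b' "${1:-/proc/cpuinfo}"; then
    echo "hardware virtualisation flag present"
  else
    echo "no vmx/svm flag found"
  fi
}

# Typical use on a Linux host:
# check_hw_virt /proc/cpuinfo
```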
Send message Joined: 28 Sep 15 Posts: 5 Credit: 430,183 RAC: 908 |
LeoMoon is only for Windows; my PC is an Intel 5960 running Linux Mint. |
Send message Joined: 28 Sep 15 Posts: 5 Credit: 430,183 RAC: 908 |
I corrected the errors in my settings. Now it's working. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Credentials not working? EDIT: Too many volunteers connected? The startup screen scroll (console) is very slow. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Seems to have fixed itself. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
As far as I know, the squid proxy we use to access the CVMFS database is capable enough that it wouldn't even notice the doubling (so far) of our volunteers' accesses to it. This may not be the case for the (hopefully small) number of machines which persist in accessing CVMFS directly rather than via the proxy server. We thought we'd changed our implementation to avoid direct access, but I've seen several hosts which still do.

[Edit] Hmm, and you are one of the people using direct access:

2016-07-23 15:53:10 (3436): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2016-07-23 15:53:10 (3436): Guest Log: 2.2.0.0 3445 1 19616 3019 14 1 1254782 10240001 2 65024 0 17 94.1176 11873 1 http://cvmfs-stratum-one.cern.ch/cvmfs/grid.cern.ch DIRECT 1

[/Edit] |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I actually had a warning: project needs xxxxx MB, you only have yyyyy. What effect would that have? Should it not have refused to start the BOINC task? The task seems to start fine. I was 500 MB short (and I had 1.3 cores selected in app_config, which it did not like). |
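For reference, CPU counts for VirtualBox apps need to be whole numbers, which is presumably why the 1.3-core setting was rejected. A minimal app_config.xml sketch; the app name and plan class below are placeholders and assumptions, and the real values must be copied from client_state.xml or the project's applications page:

```xml
<app_config>
  <app_version>
    <!-- placeholder: copy the real app name from client_state.xml -->
    <app_name>CMS_app_name_here</app_name>
    <!-- assumption: the usual plan class for 64-bit VirtualBox apps -->
    <plan_class>vbox64</plan_class>
    <!-- whole CPUs: a VirtualBox VM cannot use 1.3 cores -->
    <avg_ncpus>1</avg_ncpus>
  </app_version>
</app_config>
```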
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 223 |
We thought we'd changed our implementation to avoid direct access, but I've seen several hosts which still do. Er... found this on one of mine:

http://cvmfs.fnal.gov/cvmfs/grid.cern.ch DIRECT 1

Edit: It's not the only one. Is it a problem (for us, that is)? |
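Judging by the guest-log excerpts in this thread, the proxy in use is the second-to-last field of the CVMFS status line, so spotting DIRECT hosts can be scripted. A small sketch, assuming that field layout (... HOST PROXY ONLINE):

```shell
# proxy_mode: given one cvmfs status line (as seen in the guest log above),
# report whether the client went through a proxy or connected DIRECT.
# Assumes the column layout shown in this thread: ... HOST PROXY ONLINE
proxy_mode() {
  proxy=$(printf '%s\n' "$1" | awk '{print $(NF-1)}')
  if [ "$proxy" = "DIRECT" ]; then
    echo "direct access (bypassing the squid proxy)"
  else
    echo "via proxy: $proxy"
  fi
}
```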
Send message Joined: 22 Apr 16 Posts: 667 Credit: 1,807,614 RAC: 2,394 |
Saw this at the moment, with thousands of lines:

*------- End PYTHIA Flag + Mode + Parm + Word + FVec + MVec + PVec Settings ------------------------------------*
-------- PYTHIA Particle Data Table (changed only) ------------------------------------------------------------------------------
id name antiName spn chg col m0 mWidth mMin mMax tau0 res dec ext vis wid
no onMode bRatio meMode products
no particle data has been changed from its default value
-------- End PYTHIA Particle Data Table -----------------------------------------------------------------------------------------
EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
EvtGen:Will take first kinematically allowed decay in the decay table
EvtGen:Could not decay:pi0 with mass:0 will throw event away!
EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
EvtGen:Will take first kinematically allowed decay in the decay table |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
Yes, that's an unfortunate effect that I've not been able to completely eliminate, despite advice from one of the Pythia developers. I believe I've minimised its effect (and it affects the server more than it does your host!) but for now it's the best I can do. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
We thought we'd changed our implementation to avoid direct access, but I've seen several hosts which still do. It's been associated with the big spike we had in job failures earlier this month. Laurence thought he'd found the reason and fixed it; the failures have abated so maybe a real problem has been eliminated, but I'm still curious as to why a supposedly standardised VM should bypass our designated squid proxy at will. |
Send message Joined: 22 Apr 16 Posts: 667 Credit: 1,807,614 RAC: 2,394 |
Thank you Ivan for your answer. For nine hours now a new CMS task in LHC-dev has been running with normal CPU use, but no work is done. Sorry, it's Sunday. The Master log shows:

07/24/16 07:59:10 attempt to connect to <130.246.180.120:9623> failed: timed out after 20 seconds.
07/24/16 07:59:10 ERROR: SECMAN:2003:TCP connection to collector lcggwms02.gridpp.rl.ac.uk:9623 failed.
07/24/16 07:59:10 Failed to start non-blocking update to <130.246.180.120:9623>.
07/24/16 08:00:43 CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9623
07/24/16 08:00:43 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
07/24/16 08:01:43 CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#386234 |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
Thank you Ivan for your answer. OK, I'm not a Condor expert, but I get to shepherd our server lcggwms02. It's a bit like herding cats, really. :-/ It appears that the temporary failure to connect at 08:00:43 was rectified by 08:01:43, or am I missing something? FWIW, here's your finishing record on the last two job batches:

[cms005@lcggwms02:~] > zgrep 'on 378-' 160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out* | grep FINISHING
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.170.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Sun Jul 17 12:32:27 GMT 2016 on 378-1165-30157 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.4570.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Thu Jul 21 10:35:23 GMT 2016 on 378-1165-21378 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.4759.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Thu Jul 21 14:06:01 GMT 2016 on 378-1165-21378 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.4933.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Thu Jul 21 17:04:19 GMT 2016 on 378-1165-21378 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.7565.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Jul 23 12:33:22 GMT 2016 on 378-1165-60 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.8049.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Jul 23 16:30:02 GMT 2016 on 378-1165-60 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.8588.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Jul 23 22:26:54 GMT 2016 on 378-1165-60 with (short) status 134 =======
[cms005@lcggwms02:~] > zgrep 'on 378-' 160722_133716:ireid_crab_BPH-RunIISummer15GS-00046_J/job_out* | grep FINISHING
...nothing...

That doesn't look quite so good.
Can you look in boincmgr, in the Tasks tab, select the task and click on the "Show VM Console" button (I forget its exact text) and look at VM activity on the Alt-F3 console display? If it's not showing cmsRun as the top-running executable, I'd suggest aborting the task to see how a new one runs.

Looking further, something went dramatically wrong in that last job with exit status 134 -- the end of the log just has these lines repeating:

== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
== CMSSW: EvtGen:Will take first kinematically allowed decay in the decay table

which may suggest a problem with the VM image (you're probably aware that a pi0 is not massless -- it may be its own antiparticle, but it does have mass!). If aborting the current task doesn't fix things, the next thing to try is to reset the project, which will download a fresh VM image. Best of luck! |
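For volunteers running headless, the same two recovery steps can be driven with boinccmd instead of boincmgr. A minimal sketch that only prints the commands (dry run); the project URL is taken from this thread, and the task name is a placeholder you would read from `boinccmd --get_tasks`:

```shell
# vm_recovery_cmds: print the boinccmd invocations for the two recovery
# steps suggested above (abort the stuck task; if that fails, reset the
# project to fetch a fresh VM image). Printed rather than executed so this
# stays a dry run; $1 is the task name from `boinccmd --get_tasks`.
PROJECT_URL="http://lhcathomedev.cern.ch/vLHCathome-dev/"

vm_recovery_cmds() {
  echo "boinccmd --task $PROJECT_URL $1 abort"
  echo "boinccmd --project $PROJECT_URL reset"
}

# Example (task name is hypothetical):
# vm_recovery_cmds "some_task_name"
```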
Send message Joined: 22 Apr 16 Posts: 667 Credit: 1,807,614 RAC: 2,394 |
The task from last night ended today at 16:30 UTC with cobblestones. The LHC-dev project is reset now and a new CMS task is running. The Alt-F3 console now shows cmsRun as the first task. The cats are herded :-)). Thank you |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
I'm pleased. Of course... :-0! |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I have actually tested running CMS tasks on the real cores rather than letting it decide for itself which ones to use. It is actually more than 10% faster. That is not surprising, but I thought I'd mention it. |
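On hyper-threaded Intel CPUs, "real cores" usually means one logical CPU per physical core. A sketch that derives that list from a /proc/cpuinfo-style file, assuming the usual field layout (processor, then physical id, then core id per entry); the resulting list could then be fed to `taskset`:

```shell
# first_thread_per_core: print the first logical CPU id for each distinct
# (physical id, core id) pair in a /proc/cpuinfo-style file, i.e. one
# hyper-thread per physical core.
first_thread_per_core() {
  awk -F':' '
    /^processor/   { cpu  = $2 + 0 }
    /^physical id/ { phys = $2 + 0 }
    /^core id/     { key = phys "," ($2 + 0)
                     if (!(key in seen)) { seen[key] = 1; print cpu } }
  ' "$1"
}

# Hypothetical use: pin a VM process to those cores (pid is a placeholder):
# taskset -cp "$(first_thread_per_core /proc/cpuinfo | paste -sd, -)" <pid>
```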
©2024 CERN