Message boards :
CMS Application :
CMS versions aborts
Author | Message |
---|---|
Send message Joined: 28 Sep 15 Posts: 5 Credit: 430,183 RAC: 908 |
Sometimes I try to run this subproject, but every time the WU aborts after about 10 minutes of elapsed time. http://lhcathomedev.cern.ch/vLHCathome-dev/results.php?userid=310 |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Hi zioriga. Did you ever run vbox tasks successfully? Do you have vt-x enabled in the BIOS? |
Send message Joined: 28 Sep 15 Posts: 5 Credit: 430,183 RAC: 908 |
yes |
Send message Joined: 28 Sep 15 Posts: 5 Credit: 430,183 RAC: 908 |
If VT-x is not enabled, it is impossible to install VirtualBox. |
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 329,348 RAC: 230 |
Without hardware-assisted virtualization you can install VirtualBox, but you can't run 64-bit VMs. Please could you read this post and let me know if it solves the problem. |
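On Linux, whether the CPU advertises hardware-assisted virtualization can be checked from the flags in /proc/cpuinfo. A minimal sketch (note the flag can be present yet still disabled in the BIOS/UEFI, which is a separate check):

```shell
# check_hw_virt: report whether a /proc/cpuinfo-style file advertises
# hardware-assisted virtualisation (Intel VT-x -> "vmx", AMD-V -> "svm").
# The flag being present does NOT prove it is enabled in the BIOS/UEFI.
check_hw_virt() {
  if grep -q -E '^flags.*\b(vmx|svm)\b' "${1:-/proc/cpuinfo}"; then
    echo "hardware virtualisation flag present"
  else
    echo "no vmx/svm flag found"
  fi
}

# Typical use on a Linux host:
# check_hw_virt /proc/cpuinfo
```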
Send message Joined: 28 Sep 15 Posts: 5 Credit: 430,183 RAC: 908 |
LeoMoon is only for Windows; my PC is an Intel 5960 running Linux Mint. |
Send message Joined: 28 Sep 15 Posts: 5 Credit: 430,183 RAC: 908 |
I corrected the errors in my settings. Now it's working. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Credentials not working? EDIT: Too many volunteers connected? The startup screen scroll (console) is very slow. |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
Seems to have fixed itself. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
As far as I know, the squid proxy we use to access the CVMFS database is capable enough that it wouldn't even notice the doubling (so far) of our volunteers' accesses to it. This may not be the case for the (hopefully small) number of machines which persist in accessing CVMFS directly rather than via the proxy server. We thought we'd changed our implementation to avoid direct access, but I've seen several hosts which still do.

[Edit] Hmm, and you are one of the people using direct access:

2016-07-23 15:53:10 (3436): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2016-07-23 15:53:10 (3436): Guest Log: 2.2.0.0 3445 1 19616 3019 14 1 1254782 10240001 2 65024 0 17 94.1176 11873 1 http://cvmfs-stratum-one.cern.ch/cvmfs/grid.cern.ch DIRECT 1

[/Edit] |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I actually had a warning: project needs xxxxx MB, you only have yyyyy. What effect would that have? Should it not have refused to start the BOINC task? The task seems to start fine. I was 500 MB short (and I had 1.3 cores selected in app_config, which it did not like). |
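For reference, CPU counts for VirtualBox apps need to be whole numbers, which is presumably why the 1.3-core setting was rejected. A minimal app_config.xml sketch; the app name and plan class below are placeholders and assumptions, and the real values must be copied from client_state.xml or the project's applications page:

```xml
<app_config>
  <app_version>
    <!-- placeholder: copy the real app name from client_state.xml -->
    <app_name>CMS_app_name_here</app_name>
    <!-- assumption: the usual plan class for 64-bit VirtualBox apps -->
    <plan_class>vbox64</plan_class>
    <!-- whole CPUs: a VirtualBox VM cannot use 1.3 cores -->
    <avg_ncpus>1</avg_ncpus>
  </app_version>
</app_config>
```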
Send message Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 223 |
We thought we'd changed our implementation to avoid direct access, but I've seen several hosts which still do. Er... found this on one of mine:

http://cvmfs.fnal.gov/cvmfs/grid.cern.ch DIRECT 1

Edit: It's not the only one. Is it a problem (for us, that is)? |
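Judging by the guest-log excerpts in this thread, the proxy in use is the second-to-last field of the CVMFS status line, so spotting DIRECT hosts can be scripted. A small sketch, assuming that field layout (... HOST PROXY ONLINE):

```shell
# proxy_mode: given one cvmfs status line (as seen in the guest log above),
# report whether the client went through a proxy or connected DIRECT.
# Assumes the column layout shown in this thread: ... HOST PROXY ONLINE
proxy_mode() {
  proxy=$(printf '%s\n' "$1" | awk '{print $(NF-1)}')
  if [ "$proxy" = "DIRECT" ]; then
    echo "direct access (bypassing the squid proxy)"
  else
    echo "via proxy: $proxy"
  fi
}
```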
Send message Joined: 22 Apr 16 Posts: 667 Credit: 1,807,614 RAC: 2,394 |
Saw this at the moment, with thousands of lines:

*------- End PYTHIA Flag + Mode + Parm + Word + FVec + MVec + PVec Settings ------------------------------------*
-------- PYTHIA Particle Data Table (changed only) ------------------------------------------------------------------------------
id name antiName spn chg col m0 mWidth mMin mMax tau0 res dec ext vis wid
no onMode bRatio meMode products
no particle data has been changed from its default value
-------- End PYTHIA Particle Data Table -----------------------------------------------------------------------------------------
EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
EvtGen:Will take first kinematically allowed decay in the decay table
EvtGen:Could not decay:pi0 with mass:0 will throw event away!
EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
EvtGen:Will take first kinematically allowed decay in the decay table |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
Yes, that's an unfortunate effect that I've not been able to completely eliminate, despite advice from one of the Pythia developers. I believe I've minimised its effect (and it affects the server more than it does your host!) but for now it's the best I can do. |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
We thought we'd changed our implementation to avoid direct access, but I've seen several hosts which still do. It's been associated with the big spike we had in job failures earlier this month. Laurence thought he'd found the reason and fixed it; the failures have abated so maybe a real problem has been eliminated, but I'm still curious as to why a supposedly standardised VM should bypass our designated squid proxy at will. |
Send message Joined: 22 Apr 16 Posts: 667 Credit: 1,807,614 RAC: 2,394 |
Thank you Ivan for your answer. For nine hours now a new CMS task in LHC-dev has been running with normal CPU use, but no work is done. Sorry, it's Sunday. The Master log shows:

07/24/16 07:59:10 attempt to connect to <130.246.180.120:9623> failed: timed out after 20 seconds.
07/24/16 07:59:10 ERROR: SECMAN:2003:TCP connection to collector lcggwms02.gridpp.rl.ac.uk:9623 failed.
07/24/16 07:59:10 Failed to start non-blocking update to <130.246.180.120:9623>.
07/24/16 08:00:43 CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9623
07/24/16 08:00:43 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
07/24/16 08:01:43 CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#386234 |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
Thank you Ivan for your answer. OK, I'm not a Condor expert, but I get to shepherd our server lcggwms02. It's a bit like herding cats, really. :-/ It appears that the temporary failure to connect at 08:00:43 was rectified by 08:01:43, or am I missing something? FWIW, here's your finishing record on the last two job batches:

[cms005@lcggwms02:~] > zgrep 'on 378-' 160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out* | grep FINISHING
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.170.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Sun Jul 17 12:32:27 GMT 2016 on 378-1165-30157 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.4570.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Thu Jul 21 10:35:23 GMT 2016 on 378-1165-21378 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.4759.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Thu Jul 21 14:06:01 GMT 2016 on 378-1165-21378 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.4933.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Thu Jul 21 17:04:19 GMT 2016 on 378-1165-21378 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.7565.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Jul 23 12:33:22 GMT 2016 on 378-1165-60 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.8049.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Jul 23 16:30:02 GMT 2016 on 378-1165-60 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.8588.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Jul 23 22:26:54 GMT 2016 on 378-1165-60 with (short) status 134 =======
[cms005@lcggwms02:~] > zgrep 'on 378-' 160722_133716:ireid_crab_BPH-RunIISummer15GS-00046_J/job_out* | grep FINISHING
...nothing...

That doesn't look quite so good.
Can you look in boincmgr, in the Tasks tab, select the task and click on the "Show VM Console" button (I forget its exact text) and look at VM activity on the Alt-F3 console display? If it's not showing cmsRun as the top-running executable, I'd suggest aborting the task to see how a new one runs.

Looking further, something went dramatically wrong in that last job with exit status 134 -- the end of the log just has these lines repeating:

== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
== CMSSW: EvtGen:Will take first kinematically allowed decay in the decay table

which may suggest a problem with the VM image (you're probably aware that a pi0 is not massless -- it may be its own antiparticle, but it does have mass!). If aborting the current task doesn't fix things, the next thing to try is to reset the project, which will download a fresh VM image. Best of luck! |
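For volunteers running headless, the same two recovery steps can be driven with boinccmd instead of boincmgr. A minimal sketch that only prints the commands (dry run); the project URL is taken from this thread, and the task name is a placeholder you would read from `boinccmd --get_tasks`:

```shell
# vm_recovery_cmds: print the boinccmd invocations for the two recovery
# steps suggested above (abort the stuck task; if that fails, reset the
# project to fetch a fresh VM image). Printed rather than executed so this
# stays a dry run; $1 is the task name from `boinccmd --get_tasks`.
PROJECT_URL="http://lhcathomedev.cern.ch/vLHCathome-dev/"

vm_recovery_cmds() {
  echo "boinccmd --task $PROJECT_URL $1 abort"
  echo "boinccmd --project $PROJECT_URL reset"
}

# Example (task name is hypothetical):
# vm_recovery_cmds "some_task_name"
```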
Send message Joined: 22 Apr 16 Posts: 667 Credit: 1,807,614 RAC: 2,394 |
The task from last night ended today at 16:30 UTC with cobblestones. The LHC-dev project is reset now and a new CMS task is running. The Alt-F3 console now shows cmsRun as the first task. The cats are herded :-)). Thank you |
Send message Joined: 20 Jan 15 Posts: 1129 Credit: 7,876,718 RAC: 266 |
I'm pleased. Of course... :-0! |
Send message Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I have actually tested running CMS tasks on the real cores rather than letting it decide for itself which ones to use. It is actually more than 10% faster. That is not surprising, but I thought I'd mention it. |
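On hyper-threaded Intel CPUs, "real cores" usually means one logical CPU per physical core. A sketch that derives that list from a /proc/cpuinfo-style file, assuming the usual field layout (processor, then physical id, then core id per entry); the resulting list could then be fed to `taskset`:

```shell
# first_thread_per_core: print the first logical CPU id for each distinct
# (physical id, core id) pair in a /proc/cpuinfo-style file, i.e. one
# hyper-thread per physical core.
first_thread_per_core() {
  awk -F':' '
    /^processor/   { cpu  = $2 + 0 }
    /^physical id/ { phys = $2 + 0 }
    /^core id/     { key = phys "," ($2 + 0)
                     if (!(key in seen)) { seen[key] = 1; print cpu } }
  ' "$1"
}

# Hypothetical use: pin a VM process to those cores (pid is a placeholder):
# taskset -cp "$(first_thread_per_core /proc/cpuinfo | paste -sd, -)" <pid>
```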
©2024 CERN