Message boards : CMS Application : CMS versions aborts

zioriga

Joined: 28 Sep 15
Posts: 5
Credit: 376,835
RAC: 2,498
Message 3655 - Posted: 13 Jul 2016, 6:00:36 UTC

Sometimes I try to run this subproject, but every time the WU aborts after about 10 minutes of elapsed time.
http://lhcathomedev.cern.ch/vLHCathome-dev/results.php?userid=310

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3656 - Posted: 13 Jul 2016, 7:22:53 UTC - in response to Message 3655.  

Hi zioriga.
Did you ever run VBox tasks successfully?
Do you have VT-x enabled in the BIOS?

zioriga

Joined: 28 Sep 15
Posts: 5
Credit: 376,835
RAC: 2,498
Message 3657 - Posted: 13 Jul 2016, 9:00:40 UTC

yes

zioriga

Joined: 28 Sep 15
Posts: 5
Credit: 376,835
RAC: 2,498
Message 3658 - Posted: 13 Jul 2016, 9:02:17 UTC

If VT-x is not enabled, it is impossible to install VirtualBox.

Laurence
Project administrator
Project developer
Project tester

Joined: 12 Sep 14
Posts: 1064
Credit: 325,950
RAC: 278
Message 3659 - Posted: 13 Jul 2016, 9:19:23 UTC - in response to Message 3658.  
Last modified: 13 Jul 2016, 9:19:40 UTC

Without hardware-assisted virtualization you can install VirtualBox but can't run 64-bit VMs. Please could you read this post and let me know if it solves the problem.
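
As a generic check on Linux (a rough sketch, not specific to this project), the following shows whether the CPU advertises VT-x/AMD-V and whether the kernel noticed it being switched off in the firmware:

# Generic Linux checks; nothing here is project-specific.
egrep -c '(vmx|svm)' /proc/cpuinfo    # >0 means the CPU supports VT-x/AMD-V
lscpu | grep -i virtualization        # should report "VT-x" (Intel) or "AMD-V"
dmesg | grep -i 'disabled by bios'    # "kvm: disabled by bios" means it is off in the BIOS/UEFI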

zioriga

Joined: 28 Sep 15
Posts: 5
Credit: 376,835
RAC: 2,498
Message 3660 - Posted: 13 Jul 2016, 9:47:14 UTC

leomoon is only for Windows; my PC is an Intel 5960 running Linux Mint.

zioriga

Joined: 28 Sep 15
Posts: 5
Credit: 376,835
RAC: 2,498
Message 3662 - Posted: 13 Jul 2016, 11:35:12 UTC

I corrected the errors in my settings

Now it's working

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3784 - Posted: 23 Jul 2016, 12:12:06 UTC
Last modified: 23 Jul 2016, 12:29:44 UTC

Credentials not working?

EDIT: Too many volunteers connected?

The startup screen (console) scrolls very slowly.

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3785 - Posted: 23 Jul 2016, 14:07:12 UTC - in response to Message 3784.  

Seems to have fixed itself.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 3786 - Posted: 23 Jul 2016, 15:13:36 UTC - in response to Message 3785.  
Last modified: 23 Jul 2016, 15:18:22 UTC

As far as I know, the squid proxy we use to access the CVMFS database is capable enough that it wouldn't even notice the doubling (so far) of our volunteers' accesses to it. This may not be the case for the (hopefully small) number of machines which persist in accessing CVMFS directly rather than via the proxy server. We thought we'd changed our implementation to avoid direct access, but I've seen several hosts which still do.

[Edit] Hmm, and you are one of the people using direct access:
2016-07-23 15:53:10 (3436): Guest Log: VERSION PID UPTIME(M) MEM(K) REVISION EXPIRES(M) NOCATALOGS CACHEUSE(K) CACHEMAX(K) NOFDUSE NOFDMAX NOIOERR NOOPEN HITRATE(%) RX(K) SPEED(K/S) HOST PROXY ONLINE
2016-07-23 15:53:10 (3436): Guest Log: 2.2.0.0 3445 1 19616 3019 14 1 1254782 10240001 2 65024 0 17 94.1176 11873 1 http://cvmfs-stratum-one.cern.ch/cvmfs/grid.cern.ch DIRECT 1

[/Edit]
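
For anyone curious whether their own VM is doing the same, a rough sketch (standard CVMFS commands; the repository name grid.cern.ch comes from the log above, and the paths are CVMFS defaults that may differ in this image), run from the VM console:

cvmfs_config stat grid.cern.ch          # the HOST and PROXY columns show DIRECT vs. a squid
grep -r CVMFS_HTTP_PROXY /etc/cvmfs/    # the proxy list the client was configured with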

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3787 - Posted: 23 Jul 2016, 15:35:51 UTC

I actually had a warning: project needs xxxxx MB, you only have yyyyy.
What effect would that have? Should it not have refused to start the BOINC task? The task seems to start fine.
I was 500 MB short.

(And I had 1.3 cores selected in app_config, which it did not like.)
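
For reference, a minimal app_config.xml sketch that requests a whole core instead of a fractional one. The app name, plan class and project directory below are guesses, not taken from this thread; check client_state.xml for the real values:

# Hypothetical sketch only; names and paths are assumptions.
cat > ~/BOINC/projects/lhcathomedev.cern.ch_vLHCathome-dev/app_config.xml <<'EOF'
<app_config>
  <app_version>
    <app_name>CMS_2015</app_name>      <!-- assumed; check client_state.xml -->
    <plan_class>vbox64</plan_class>    <!-- assumed; check client_state.xml -->
    <avg_ncpus>1</avg_ncpus>           <!-- a whole number of cores -->
  </app_version>
</app_config>
EOF
boinccmd --read_cc_config              # or Options -> Read config files in the Manager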

m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 874,518
RAC: 399
Message 3788 - Posted: 23 Jul 2016, 17:12:52 UTC - in response to Message 3786.  
Last modified: 23 Jul 2016, 17:25:20 UTC

We thought we'd changed our implementation to avoid direct access, but I've seen several hosts which still do.

Er... found this on one of mine:-

http://cvmfs.fnal.gov/cvmfs/grid.cern.ch DIRECT 1

Edit:- It's not the only one.

Is it a problem (to us, that is)?

maeax

Joined: 22 Apr 16
Posts: 659
Credit: 1,719,912
RAC: 3,195
Message 3789 - Posted: 23 Jul 2016, 20:55:47 UTC

I'm seeing this at the moment, repeated over thousands of lines:

*------- End PYTHIA Flag + Mode + Parm + Word + FVec + MVec + PVec Settings ------------------------------------*

-------- PYTHIA Particle Data Table (changed only) ------------------------------------------------------------------------------

id name antiName spn chg col m0 mWidth mMin mMax tau0 res dec ext vis wid
no onMode bRatio meMode products

no particle data has been changed from its default value

-------- End PYTHIA Particle Data Table -----------------------------------------------------------------------------------------

EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
EvtGen:Will take first kinematically allowed decay in the decay table
EvtGen:Could not decay:pi0 with mass:0 will throw event away!
EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
EvtGen:Will take first kinematically allowed decay in the decay table

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 3790 - Posted: 23 Jul 2016, 21:09:07 UTC - in response to Message 3789.  

Yes, that's an unfortunate effect that I've not been able to completely eliminate, despite advice from one of the Pythia developers. I believe I've minimised its effect (and it affects the server more than it does your host!) but for now it's the best I can do.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 3791 - Posted: 23 Jul 2016, 21:13:18 UTC - in response to Message 3788.  

We thought we'd changed our implementation to avoid direct access, but I've seen several hosts which still do.

Er... found this on one of mine:-

http://cvmfs.fnal.gov/cvmfs/grid.cern.ch DIRECT 1

Edit:- It's not the only one.

Is it a problem (to us, that is)?

It's been associated with the big spike we had in job failures earlier this month. Laurence thought he'd found the reason and fixed it; the failures have abated so maybe a real problem has been eliminated, but I'm still curious as to why a supposedly standardised VM should bypass our designated squid proxy at will.

maeax

Joined: 22 Apr 16
Posts: 659
Credit: 1,719,912
RAC: 3,195
Message 3792 - Posted: 24 Jul 2016, 7:25:18 UTC
Last modified: 24 Jul 2016, 7:25:45 UTC

Thank you, Ivan, for your answer.

For the last 9 hours a new CMS task in LHC-dev has been running with normal CPU use, but no work is being done.

Sorry, it's Sunday. The master log shows:

07/24/16 07:59:10 attempt to connect to <130.246.180.120:9623> failed: timed out after 20 seconds.
07/24/16 07:59:10 ERROR: SECMAN:2003:TCP connection to collector lcggwms02.gridpp.rl.ac.uk:9623 failed.
07/24/16 07:59:10 Failed to start non-blocking update to <130.246.180.120:9623>.
07/24/16 08:00:43 CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9623
07/24/16 08:00:43 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
07/24/16 08:01:43 CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#386234

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 3793 - Posted: 24 Jul 2016, 10:55:21 UTC - in response to Message 3792.  
Last modified: 24 Jul 2016, 10:57:31 UTC

Thank you, Ivan, for your answer.

For the last 9 hours a new CMS task in LHC-dev has been running with normal CPU use, but no work is being done.

Sorry, it's Sunday. The master log shows:

07/24/16 07:59:10 attempt to connect to <130.246.180.120:9623> failed: timed out after 20 seconds.
07/24/16 07:59:10 ERROR: SECMAN:2003:TCP connection to collector lcggwms02.gridpp.rl.ac.uk:9623 failed.
07/24/16 07:59:10 Failed to start non-blocking update to <130.246.180.120:9623>.
07/24/16 08:00:43 CCBListener: failed to receive message from CCB server lcggwms02.gridpp.rl.ac.uk:9623
07/24/16 08:00:43 CCBListener: connection to CCB server lcggwms02.gridpp.rl.ac.uk:9623 failed; will try to reconnect in 60 seconds.
07/24/16 08:01:43 CCBListener: registered with CCB server lcggwms02.gridpp.rl.ac.uk:9623 as ccbid 130.246.180.120:9623#386234

OK, I'm not a Condor expert, but I get to shepherd our server lcggwms02. It's a bit like herding cats, really. :-/
It appears that the temporary failure to connect at 08:00:43 was rectified by 08:01:43, or am I missing something?
FWIW, here's your finishing record on the last two job batches:
[cms005@lcggwms02:~] > zgrep 'on 378-' 160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out*| grep FINISHING
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.170.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Sun Jul 17 12:32:27 GMT 2016 on 378-1165-30157 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.4570.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Thu Jul 21 10:35:23 GMT 2016 on 378-1165-21378 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.4759.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Thu Jul 21 14:06:01 GMT 2016 on 378-1165-21378 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.4933.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Thu Jul 21 17:04:19 GMT 2016 on 378-1165-21378 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.7565.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Jul 23 12:33:22 GMT 2016 on 378-1165-60 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.8049.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Jul 23 16:30:02 GMT 2016 on 378-1165-60 with (short) status 0 ========
160716_182518:ireid_crab_BPH-RunIISummer15GS-00046_I/job_out.8588.0.txt.gz:======== gWMS-CMSRunAnalysis.sh FINISHING at Sat Jul 23 22:26:54 GMT 2016 on 378-1165-60 with (short) status 134 =======

[cms005@lcggwms02:~] > zgrep 'on 378-' 160722_133716:ireid_crab_BPH-RunIISummer15GS-00046_J/job_out*| grep FINISHING
...nothing...

That doesn't look quite so good. Can you look in boincmgr, in the Tasks tab, select the task, click on the "Show VM console" button (I forget its exact text), and look at the VM activity on the Alt-F3 console display? If it's not showing cmsRun as the top-running executable, I'd suggest aborting the task to see how a new one runs.

Looking further, something went dramatically wrong in that last job with exit status 134 -- the end of the log just has these lines repeating:
== CMSSW: EvtGen:Could not decay:pi0 with mass:0 will throw event away!
== CMSSW: EvtGen:Tried 10000000 times to generate decay of pi0 with mass=0
== CMSSW: EvtGen:Will take first kinematically allowed decay in the decay table

which may suggest a problem with the VM image (you're probably aware that a pi0 is not massless -- it may be its own antiparticle, but it does have mass!). If aborting the current task doesn't fix things, the next thing to try is to reset the project, which will download a fresh VM image.
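
For hosts run without the Manager, a sketch of the same two steps with boinccmd (the project URL is this dev project's; the task name is a placeholder to be filled in from the first command):

boinccmd --get_tasks                                                    # find the exact task name
boinccmd --task http://lhcathomedev.cern.ch/vLHCathome-dev/ <task_name> abort
boinccmd --project http://lhcathomedev.cern.ch/vLHCathome-dev/ reset
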
Best of luck!

maeax

Joined: 22 Apr 16
Posts: 659
Credit: 1,719,912
RAC: 3,195
Message 3794 - Posted: 24 Jul 2016, 17:18:55 UTC

The task from last night ended today at 16:30 UTC with cobblestones.

The LHC-dev project has been reset now and a new CMS task is running.

The Alt-F3 console now shows cmsRun as the top task.

The cats are :-)).

Thank you.

ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist

Joined: 20 Jan 15
Posts: 1128
Credit: 7,870,419
RAC: 595
Message 3796 - Posted: 24 Jul 2016, 20:36:25 UTC - in response to Message 3794.  

I'm pleased. Of course... :-0!

Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 3799 - Posted: 24 Jul 2016, 21:29:39 UTC

I have actually tested running CMS tasks on the real cores rather than letting it decide for itself which ones to use. It is actually more than 10% faster.
That is not surprising, but I thought I'd mention it.
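
One way to do that pinning on Linux, as a rough sketch (it assumes the VM runs as a VBoxHeadless process and that logical CPUs 0-7 are the distinct physical cores; check both on your own host):

lscpu -e=CPU,CORE                        # shows which logical CPUs share a physical core
for pid in $(pgrep -f VBoxHeadless); do taskset -cp 0-7 "$pid"; done   # pin to the assumed physical cores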