Message boards : LHCb Application : New version v1.0

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 4336 - Posted: 18 Nov 2016, 9:13:46 UTC

This version now uses HTCondor, the same as Theory and CMS. The image is on the larger side: 1.6GB compressed and 4GB uncompressed.
ID: 4336 · Rating: 0

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 965
Credit: 1,201,381
RAC: 0
Message 4345 - Posted: 21 Nov 2016, 10:48:00 UTC

VM Completion Message: Condor exited after 50729s without running a job.


How can that be?


http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=287924
ID: 4345 · Rating: 0

Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1010
Credit: 591,548
RAC: 0
Message 4346 - Posted: 21 Nov 2016, 14:16:24 UTC

I don't know where the settings come from (project prefs or app_config), but the RAM looks too low for running a dual-core LHCb task.

2016-11-20 18:51:10 (32375): Setting Memory Size for VM. (1920MB)
2016-11-20 18:51:10 (32375): Setting CPU Count for VM. (2)
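To put a number on "too low", here is a minimal Python sketch that pulls the two values out of the log lines quoted above and works out the memory per core. The helper name `vm_settings` is made up for illustration, and the thread does not state how much memory per core LHCb actually needs.

```python
import re

# The two vboxwrapper log lines quoted in the post above.
LOG = """\
2016-11-20 18:51:10 (32375): Setting Memory Size for VM. (1920MB)
2016-11-20 18:51:10 (32375): Setting CPU Count for VM. (2)
"""


def vm_settings(log: str) -> tuple[int, int]:
    """Return (memory_mb, cpu_count) parsed from the log text."""
    mem = re.search(r"Setting Memory Size for VM\. \((\d+)MB\)", log)
    cpus = re.search(r"Setting CPU Count for VM\. \((\d+)\)", log)
    if mem is None or cpus is None:
        raise ValueError("log does not contain both VM settings")
    return int(mem.group(1)), int(cpus.group(1))


memory_mb, cpu_count = vm_settings(LOG)
print(f"{memory_mb} MB / {cpu_count} cores = {memory_mb / cpu_count:.0f} MB per core")
```

For this task that comes to less than 1 GB per core.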
ID: 4346 · Rating: 0

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 965
Credit: 1,201,381
RAC: 0
Message 4347 - Posted: 21 Nov 2016, 17:26:46 UTC - in response to Message 4346.  
Last modified: 21 Nov 2016, 17:29:14 UTC

If RAM is too low, the task should abort.
It ran all the way to the end and then stated that it did not run any jobs?

Even though I watched it upload things every couple of hours.

BTW, the disk-space requirement is enormous: 4GB for the image and another 6-6.5 GB for every task (BOINC slot).
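The disk figures above add up quickly on a multi-slot host. A rough Python estimate, assuming the image is stored once and every running task adds its own 6 to 6.5 GB (the function name `disk_needed_gb` is hypothetical):

```python
IMAGE_GB = 4.0            # uncompressed image size quoted in this thread
PER_SLOT_GB = (6.0, 6.5)  # extra space per running task (BOINC slot), as reported above


def disk_needed_gb(slots: int, per_slot_gb: float, image_gb: float = IMAGE_GB) -> float:
    """Estimate total disk use: one shared image plus per-slot working space."""
    if slots < 0:
        raise ValueError("slots must be non-negative")
    return image_gb + slots * per_slot_gb


for n in (1, 2, 4):
    low, high = (disk_needed_gb(n, g) for g in PER_SLOT_GB)
    print(f"{n} task(s): {low:.1f} to {high:.1f} GB")
```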
ID: 4347 · Rating: 0

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 965
Credit: 1,201,381
RAC: 0
Message 4356 - Posted: 24 Nov 2016, 10:12:43 UTC
Last modified: 24 Nov 2016, 10:14:45 UTC

Condor exited after 51287s without running a job.


http://lhcathomedev.cern.ch/vLHCathome-dev/result.php?resultid=288158

I got the same error with a 4-core task and 5120MB of RAM.
How much memory does it need? (If that is what is causing the problem.)

One task succeeded with 6 GB of RAM.
ID: 4356 · Rating: 0

Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 965
Credit: 1,201,381
RAC: 0
Message 4361 - Posted: 27 Nov 2016, 13:16:18 UTC

Nearly every task that runs to the end (no manual shutdown) fails with:

Guest Log: [ERROR] Condor exited after 49816s without running a job.


Tasks ended by manual shutdown are valid?
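For scale, the run times in the "Condor exited after ...s" messages reported in this thread can be converted to hours with a few lines of Python. This is only a unit conversion and says nothing about why no job ran.

```python
from datetime import timedelta

# Run times (in seconds) copied from the "Condor exited after ...s" messages in this thread.
RUNTIMES_S = [50729, 51287, 49816]


def as_hms(seconds: int) -> str:
    """Format a duration in seconds as H:MM:SS."""
    return str(timedelta(seconds=seconds))


for s in RUNTIMES_S:
    print(f"{s}s = {as_hms(s)}")
```

All three tasks ran for roughly 14 hours before exiting.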
ID: 4361 · Rating: 0

maeax
Joined: 22 Apr 16
Posts: 431
Credit: 1,260,319
RAC: 1
Message 5044 - Posted: 14 Jul 2017, 5:53:06 UTC

Over the past five days: 550 successful version 1.00 tasks and four with errors (version 1.00).

One successful with version 1.01 (multicore).

Preferences have not changed since the start five days ago.
ID: 5044 · Rating: 0

maeax
Joined: 22 Apr 16
Posts: 431
Credit: 1,260,319
RAC: 1
Message 5056 - Posted: 25 Jul 2017, 8:35:06 UTC - in response to Message 5044.  

Over the past five days: 550 successful version 1.00 tasks and four with errors (version 1.00).


Now 15 days: 1550 successful version 1.00 tasks and five with errors (only one more!)

Let it crunch
ID: 5056 · Rating: 0

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 538
Credit: 7,547,069
RAC: 1,614
Message 5064 - Posted: 10 Aug 2017, 21:30:59 UTC

Glad I just happened to come inside and check my 9 computers, since as of today I was still running LHCb Simulation v1.01 (vbox64_mt_mcore) windows_x86_64 tasks that I got earlier today.

Because I just caught the first 8-core starting to download this new 1.6GB .vdi.


https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=356354

Can't be doing that x7, so I will switch back to maybe Theory tasks.

(For over 2 months now the website has been running as if it were still June 1st.)
ID: 5064 · Rating: 0

maeax
Joined: 22 Apr 16
Posts: 431
Credit: 1,260,319
RAC: 1
Message 5065 - Posted: 11 Aug 2017, 7:05:59 UTC

Saw this certificate error:

Will update the output file pilot.cfg
PEM file has expired /etc/grid-security/hostcert.pem is not valid after 2017-06-30 09:21:27

2017-08-11 06:57:47 UTC INFO [Pilot] Command LHCbConfigureArchitecture instantiated from LHCbPilotCommands
2017-08-11 06:57:47 UTC INFO [LHCbConfigureArchitecture] Executing command dirac-architecture -o /DIRAC/Security/UseServerCertificate=yes pilot.cfg
PEM file has expired /etc/grid-security/hostcert.pem is not valid after 2017-06-30 09:21:27
Could not send mail via central Notification service Cannot get URL for Framework/Notification in setup LHCb-Production: RuntimeError('Option /DIRAC/Setups/LHCb-Production/Framework is not defined',)
ERROR: OS compatibility info not found

2017-08-11 06:57:49 UTC ERROR [LHCbConfigureArchitecture] There was an error updating the platform [ERROR 1]
2017-08-11 06:57:49 UTC INFO [LHCbConfigureArchitecture] List of child processes of current PID:
2017-08-11 06:57:49 UTC INFO [LHCbConfigureArchitecture] Executing command ps --forest -o pid,%cpu,%mem,tty,stat,time,cmd -g 4284
PID %CPU %MEM TT STAT TIME CMD
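The log above already says when the host certificate stopped being valid. A small Python sketch of how long it had been expired when this task ran, assuming both timestamps are UTC (the log marks only its own timestamps as UTC) and using a made-up helper name `expiry_from_line`:

```python
import re
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Copied from the log above.
EXPIRY_LINE = ("PEM file has expired /etc/grid-security/hostcert.pem "
               "is not valid after 2017-06-30 09:21:27")
LOG_TIME = "2017-08-11 06:57:47"  # timestamp of the first log entry quoted above


def expiry_from_line(line: str) -> datetime:
    """Extract the 'not valid after' timestamp from the PEM warning line."""
    m = re.search(r"is not valid after (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})", line)
    if m is None:
        raise ValueError("no expiry timestamp found")
    return datetime.strptime(m.group(1), FMT)


overdue = datetime.strptime(LOG_TIME, FMT) - expiry_from_line(EXPIRY_LINE)
print(f"certificate had been expired for {overdue.days} full days")
```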
ID: 5065 · Rating: 0

maeax
Joined: 22 Apr 16
Posts: 431
Credit: 1,260,319
RAC: 1
Message 5066 - Posted: 11 Aug 2017, 7:11:37 UTC - in response to Message 5056.  

Now 30 days: 3k successful version 1.00 tasks and only seven with errors!

Let it crunch!
ID: 5066 · Rating: 0

maeax
Joined: 22 Apr 16
Posts: 431
Credit: 1,260,319
RAC: 1
Message 5067 - Posted: 11 Aug 2017, 7:30:42 UTC - in response to Message 5064.  

(For over 2 months now the website has been running as if it were still June 1st.)


I see it too; resetting the dev project does not refresh the computer's contact time.
ID: 5067 · Rating: 0

maeax
Joined: 22 Apr 16
Posts: 431
Credit: 1,260,319
RAC: 1
Message 5106 - Posted: 28 Aug 2017, 9:59:13 UTC

Got a multicore task yesterday at 10 UTC:

https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=347143

I only have version 1.00 selected in my preferences.

Is it possible to upgrade version 1.00 to run more than TWO jobs, maybe 5 or 10?
ID: 5106 · Rating: 0

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1021
Credit: 274,753
RAC: 0
Message 5107 - Posted: 29 Aug 2017, 7:50:03 UTC - in response to Message 5106.  

The VM will only accept new jobs for the first 12 hours. This VM started at 11:57:48 but shut down at 12:13:08 due to a problem. It started again at 15:59:25. The first job ran for 6h 15m and the second for 6h 30m, taking it over the 12-hour limit.
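The arithmetic behind the two-job limit can be sketched in Python. This assumes jobs run back to back from the VM restart with no idle gaps and that a job is accepted only if it starts inside the 12-hour window. The function name `jobs_started` is hypothetical, not part of the VM's actual logic.

```python
from datetime import timedelta

ACCEPT_WINDOW = timedelta(hours=12)  # the VM accepts new jobs only during this period

# Job run times quoted in the post above.
job_durations = [timedelta(hours=6, minutes=15), timedelta(hours=6, minutes=30)]


def jobs_started(durations, window=ACCEPT_WINDOW):
    """Count jobs that start inside the window, running back to back from VM start."""
    elapsed = timedelta(0)
    started = 0
    for duration in durations:
        if elapsed >= window:
            break  # window closed: no further job is accepted
        started += 1
        elapsed += duration
    return started, elapsed


count, total = jobs_started(job_durations)
print(f"{count} jobs started, VM busy for {total}")
```

After the second job the VM has been busy for 12h 45m, so a third job would not be accepted.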
ID: 5107 · Rating: 0

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 538
Credit: 7,547,069
RAC: 1,614
Message 5109 - Posted: 31 Aug 2017, 6:11:15 UTC - in response to Message 5107.  
Last modified: 31 Aug 2017, 7:10:01 UTC

I have had that problem many times over at LHC with the Theory tasks.
DC_NOP failed!

And they will happen along with several Valids on the same host.

And THIS is always my major problem here with Theory tasks or over at LHC and I have run thousands of them (and was even tempted to pm you about this before I stopped by here)

AUTHENTICATE:1006:exceeded 1503828075 deadline during authentication
Guest Log: AUTHENTICATE:1004:Failed to authenticate using GSI
Guest Log: GSI:5004:Failed to authenticate. Globus is reporting error (17367040:0)
Guest Log: 08/27/17 12:00:48 recognized DC_NOP as command name, using command 60011.
2017-08-27 12:01:32 (11124): Guest Log: 08/27/17 12:01:15 condor_read(): timeout reading 5 bytes from local collector.
: Guest Log: 08/27/17 12:01:15 IO: Failed to read packet header
: Guest Log: 08/27/17 12:01:15 relisock_gsi_get (read from socket) failure
Guest Log: 08/27/17 12:01:15 Condor GSI authentication failure
: Guest Log: globus_gss_assist token :-1: read failure: Operation not permitted
: Guest Log: 08/27/17 12:01:15 SECMAN: required authentication with local collector failed, so aborting command DC_SEC_QUERY.
Guest Log: [ERROR] Could not ping HTCondor.


That *authentication failure* is ALWAYS because, for whatever reason, the CERN server will not authenticate the starting task and do the server handshake that gives me the *Credentials* and the HTCondor Ping so the task can start running.

And the reason is that the CERN server needs us to have an internet speed of close to 3Mbps or better; if it gets lower than that, it is pure luck whether I get *Credentials* and then the HTCondor Ping.

We talked about this before and you did change the 10 minute wall into 20 minutes which helped most of the time BUT I have to check the internet speed first and can NOT ever run these tasks on Auto

In fact I am loading them one at a time right now so I can get all of mine running again and have to watch the VM Console and see if they do or will start and get beyond the Credentials........if they don't then I know the internet speed is not fast enough so I have to wait until later......and of course after running these VB tasks over 6 years 24/7 I tend to never give up trying to keep these running.

I even changed from my slow DSL to a satellite dish which does give me high-speed UNTIL I use up my 80GB of data transfer I pay for.....well these Cern tasks burn that up in 6 or 7 days........so for the last 3 weeks of the month I have to try checking the speed all the time and usually until 4am daily.
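As a back-of-the-envelope check on those figures, this Python sketch converts "80GB in 6 or 7 days" into an average transfer rate. It assumes 1 GB = 10^9 bytes and a steady rate over the whole period; the function name `average_mbit_per_s` is made up for illustration.

```python
def average_mbit_per_s(gigabytes: float, days: float) -> float:
    """Average rate in Mbit/s needed to move `gigabytes` in `days` (1 GB = 1e9 bytes)."""
    bits = gigabytes * 1e9 * 8
    seconds = days * 86400
    return bits / seconds / 1e6


for days in (6, 7):
    print(f"80 GB in {days} days is about {average_mbit_per_s(80, days):.2f} Mbit/s on average")
```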

There has to be a way to run these VB tasks without having to beg for Credentials just to make it to HTCondor Ping thousands of times for the thousands of tasks (and the alpha-test tasks are even harder to get started)

I realize we have hundreds of members and some are rookies so I guess they need to be checked all the time BUT I am not one of them.....in fact since I am the only one that never quit since T4T day one until the final seconds and all the other projects it is a bit annoying having to beg for Credentials and everything else on page 2 of the VB Console (of course I have many copies of those)

Imagine if we all still had dial-up.....none of the tasks would start unless we lived in Geneva and were directly connected to CERN's servers.

So right now at 11pm I am trying to get my 6 tasks running (2-core and 3-core tasks) and then try to get the other 24 LHC Theory tasks to start one at a time until they are all running. (btw why are those LHC tasks STILL not running multi-core??).....I proved they work as far as Theory tasks........and CMS and LHCb
And this internet speed problem does not exist when running SixTrack or GPU tasks, since they don't use the dreaded VirtualBox, and I can even unplug all of the ethernet cables when just running those.

Unless a cable company decides to run a cable down my long road and up my 300ft driveway I will never get a better speed than this for the entire month.

Ok I guess this is a long enough tale of internet speed

And as usual pray to the lead ion god that the server will not beg me for Credentials thousands of times

(Since I am still in the middle of trying to start up the 24 LHC Theory tasks, I figured I should add this snapshot, since this is where the trouble begins: it will not get from page one of the VB Remote to this screen, where the internet speed then decides whether it will do this BEFORE the time is up.)


ID: 5109 · Rating: 0

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 538
Credit: 7,547,069
RAC: 1,614
Message 5111 - Posted: 31 Aug 2017, 7:31:11 UTC

Part Two

As usual......I got the first 6 tasks of the 24 running in about 10mins each but then starting the next 3 tasks up (one on each of the three 8-cores) they hit that 20 minute wall and I had to abort all of them.

I did a speed test and it had dropped to 550Kps, which means there is no way they will get to the start of page 2 of the VM Console (never leaving page one).

And this is after midnight.......which means I still have 18 more tasks to TRY to get started so they will actually be running after I finally give up and fall asleep around 4am.......yes I am probably the only person in LHC history to do this. (for 13 years)

I need a permanent connection to the CERN servers or to move my 5 acres on a big ship to Geneva.

OK time to watch The Twilight Zone
ID: 5111 · Rating: 0

maeax
Joined: 22 Apr 16
Posts: 431
Credit: 1,260,319
RAC: 1
Message 5176 - Posted: 6 Oct 2017, 5:07:41 UTC

Since yesterday, LHCb version 1.00 has been crashing with an x509 error:

2017-10-06 07:02:42 (2112): Guest Log: [DEBUG]
2017-10-06 07:02:42 (2112): Guest Log: ERROR: Couldn't read proxy from: /tmp/x509up_u0
2017-10-06 07:02:42 (2112): Guest Log: globus_credential: Error reading proxy credential
2017-10-06 07:02:42 (2112): Guest Log: globus_credential: Error reading proxy credential: Couldn't read PEM from bio
2017-10-06 07:02:42 (2112): Guest Log: OpenSSL Error: pem_lib.c:703: in library: PEM routines, function PEM_read_bio: no start line
2017-10-06 07:02:42 (2112): Guest Log: Use -debug for further information.
2017-10-06 07:02:42 (2112): Guest Log: [ERROR] Could not get an x509 credential
2017-10-06 07:02:48 (2112): Guest Log: [ERROR] The x509 proxy creation failed.
2017-10-06 07:02:48 (2112): Guest Log: [INFO] Shutting Down.
2017-10-06 07:02:48 (2112): VM Completion File Detected.
2017-10-06 07:02:48 (2112): VM Completion Message: The x509 proxy creation failed.
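When a log like the one above gets long, the bracketed `[ERROR]` entries are usually the quickest thing to look for. A minimal Python sketch, assuming the line layout seen in this excerpt (timestamp, process id in parentheses, `Guest Log:`, bracketed level); the name `guest_errors` is hypothetical and the pattern is not an official log format specification.

```python
import re

# A few lines copied from the log above.
LOG = """\
2017-10-06 07:02:42 (2112): Guest Log: ERROR: Couldn't read proxy from: /tmp/x509up_u0
2017-10-06 07:02:42 (2112): Guest Log: [ERROR] Could not get an x509 credential
2017-10-06 07:02:48 (2112): Guest Log: [ERROR] The x509 proxy creation failed.
2017-10-06 07:02:48 (2112): Guest Log: [INFO] Shutting Down.
2017-10-06 07:02:48 (2112): VM Completion Message: The x509 proxy creation failed.
"""

# Layout assumed from the excerpt: "<timestamp> (<pid>): Guest Log: [<LEVEL>] <message>"
LINE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((?P<pid>\d+)\): "
    r"Guest Log: \[(?P<level>[A-Z]+)\] ?(?P<msg>.*)$"
)


def guest_errors(log: str) -> list[tuple[str, str]]:
    """Return (timestamp, message) for every bracketed [ERROR] guest log line."""
    errors = []
    for line in log.splitlines():
        m = LINE.match(line)
        if m and m.group("level") == "ERROR":
            errors.append((m.group("ts"), m.group("msg")))
    return errors


for ts, msg in guest_errors(LOG):
    print(ts, msg)
```

Lines without a bracketed level, such as the raw `ERROR: Couldn't read proxy` output, are deliberately skipped by this pattern.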
ID: 5176 · Rating: 0

maeax
Joined: 22 Apr 16
Posts: 431
Credit: 1,260,319
RAC: 1
Message 5180 - Posted: 7 Oct 2017, 9:53:45 UTC - in response to Message 5176.  

Since yesterday, LHCb version 1.00 has been crashing with an x509 error:


This is now OK, but I get no more tasks for LHCb!
ID: 5180 · Rating: 0

maeax
Joined: 22 Apr 16
Posts: 431
Credit: 1,260,319
RAC: 1
Message 5181 - Posted: 8 Oct 2017, 8:35:19 UTC - in response to Message 5180.  

After a manual refresh, tasks for LHCb (version 1.00) are running at the moment. Thank you.
ID: 5181 · Rating: 0

maeax
Joined: 22 Apr 16
Posts: 431
Credit: 1,260,319
RAC: 1
Message 5205 - Posted: 17 Oct 2017, 10:11:46 UTC

Two days ago, there was a constant flow of work for LHCb version 1.00.
Now it only downloads after a manual refresh.
ID: 5205 · Rating: 0