Author | Message |
Laurence Project administrator Project developer Project tester
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129
|
This version updates the CernVM cache and now uses OpenHTC.io for CVMFS.
|
|
maeax
Send message Joined: 22 Apr 16 Posts: 672 Credit: 1,900,822 RAC: 5,149
|
Thanks for updating.
What about putting this multicore Theory into production?
Is there a timeline?
|
|
Magic Quantum Mechanic
Send message Joined: 8 Apr 15 Posts: 754 Credit: 11,752,312 RAC: 9,092
|
https://openhtc.io/
I will download this version later tonight on my dev computers, but I wish I knew how big the vdi is going to be before doing this.
And yeah, I have been wondering why these multi-cores have not already been moved to LHC, since I have about 4000 Valids myself here.
They never fail.
Mad Scientist For Life
|
|
maeax
Send message Joined: 22 Apr 16 Posts: 672 Credit: 1,900,822 RAC: 5,149
|
vdi 293 MByte, 43 sec. download ;-)
|
|
maeax
Send message Joined: 22 Apr 16 Posts: 672 Credit: 1,900,822 RAC: 5,149
|
This Sherpa is now in a loop:
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=1827692
Event 700 ( 1m 26s elapsed / 3h 23m 22s left ) -> ETA: Tue Jun 26 11:52
700 events processed
dumping histograms...
Event 800 ( 1m 35s elapsed / 3h 17m 50s left ) -> ETA: Tue Jun 26 11:46
800 events processed
dumping histograms...
Updating display...
Display update finished (9 histograms, 800 events).
Error in Splitting_Tools::ConstructKinematics(kt = -nan, z = 0.595853, y = 0.50943).
Error in Splitting_Tools::ConstructKinematics(kt = -nan, z = 0.610011, y = 0.495165).
Event 900 ( 1m 47s elapsed / 3h 17m 35s left ) -> ETA: Tue Jun 26 11:46
900 events processed
dumping histograms...
Updating display...
Display update finished (9 histograms, 900 events).
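For anyone who wants to check a saved job log for the same symptom, here is a minimal Python sketch. It assumes the log lines look like the excerpt above; the function name `count_nan_kinematics` and the regular expressions are illustrative only and are not part of Sherpa or the project tooling.

```python
import re

# Matches lines such as:
# Error in Splitting_Tools::ConstructKinematics(kt = -nan, z = 0.595853, y = 0.50943).
NAN_ERROR = re.compile(
    r"Error in Splitting_Tools::ConstructKinematics\(kt = -?nan", re.IGNORECASE
)
# Matches progress lines such as "800 events processed".
PROGRESS = re.compile(r"^\s*(\d+) events processed")


def count_nan_kinematics(lines):
    """Return (number of NaN kinematics errors, last reported event count)."""
    errors = 0
    last_events = None
    for line in lines:
        if NAN_ERROR.search(line):
            errors += 1
        match = PROGRESS.match(line)
        if match:
            last_events = int(match.group(1))
    return errors, last_events


# Sample lines copied from the excerpt above.
sample = [
    "800 events processed",
    "Error in Splitting_Tools::ConstructKinematics(kt = -nan, z = 0.595853, y = 0.50943).",
    "Error in Splitting_Tools::ConstructKinematics(kt = -nan, z = 0.610011, y = 0.495165).",
    "900 events processed",
]
result = count_nan_kinematics(sample)
print(result)
```

A rising error count with an event counter that still advances would suggest the job is progressing rather than truly stuck, but that reading is an assumption and not something the log excerpt confirms.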
|
|
Laurence Project administrator Project developer Project tester
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129
|
I will put it in production Wednesday or Thursday, once we have verified there are no issues here.
|
|
Laurence Project administrator Project developer Project tester
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129
|
Forgot to restart the server. Done now.
|
|
Magic Quantum Mechanic
Send message Joined: 8 Apr 15 Posts: 754 Credit: 11,752,312 RAC: 9,092
|
I only have 3 so far, but maybe after the rest of the 3.07's are done I will get the rest of the new version.
All of my -dev computers are always set to auto-load, so they would all have had the new version if the server ........
|
|
Magic Quantum Mechanic
Send message Joined: 8 Apr 15 Posts: 754 Credit: 11,752,312 RAC: 9,092
|
OK, I connected all my computers to Axel's ISP, so I have the new version on all my -dev PCs now.
|
|
maeax
Send message Joined: 22 Apr 16 Posts: 672 Credit: 1,900,822 RAC: 5,149
|
10k miles aircable RJ45 are needed.
|
|
Magic Quantum Mechanic
Send message Joined: 8 Apr 15 Posts: 754 Credit: 11,752,312 RAC: 9,092
|
10k miles aircable RJ45 are needed.
Well, I thought of doing that, since it takes about as long to download those Atlas tasks as it would for me to roll out the 10k miles of cable. But with these smaller Theory vdi's I decided to test my new Tesla wireless internet, which bounces the signal off the clouds back down to Earth, where it hits hard ground and bounces back up to the clouds, travelling those 10k miles at close to the speed of light.
I was going to try my home-made linear particle accelerator, but I thought that might melt all of our computers.
|
|
Magic Quantum Mechanic
Send message Joined: 8 Apr 15 Posts: 754 Credit: 11,752,312 RAC: 9,092
|
Not much luck with these so far compared to the previous version.
Valids 9 - Errors 11 so far. (and 4 of those Valids are short one hour tasks each)
Most are the usual *VM Completion Message: Condor exited after 1036s without running a job*
|
|
maeax
Send message Joined: 22 Apr 16 Posts: 672 Credit: 1,900,822 RAC: 5,149
|
Have one with defeats Status. Now running again.
There were network problems with Atlas downloads last night.
5 hours of downloading and then it stalled.
|
|
Laurence Project administrator Project developer Project tester
Send message Joined: 12 Sep 14 Posts: 1067 Credit: 329,589 RAC: 129
|
This is now available on the production project.
|
|
maeax
Send message Joined: 22 Apr 16 Posts: 672 Credit: 1,900,822 RAC: 5,149
|
Laurence,
thank you, and thanks also to the team.
|
|
Magic Quantum Mechanic
Send message Joined: 8 Apr 15 Posts: 754 Credit: 11,752,312 RAC: 9,092
|
|
|
maeax
Send message Joined: 22 Apr 16 Posts: 672 Credit: 1,900,822 RAC: 5,149
|
Your link only works for the user himself; for everyone else: access denied!
But via the computer list the message list is shown.
There were no jobs for this Theory task. Linux was started. The Condor ping was successful.
17.7 Benchmark HPSEC. Wow.
My Ryzen has 10 HPSEC.
Your VirtualBox is 5.1.22. Mine is 5.2.12, with BOINC 7.10.2.
Oracle has ended support for 5.1.xx.
I will tune my Ryzen to see 17 HPSEC!!
|
|
Magic Quantum Mechanic
Send message Joined: 8 Apr 15 Posts: 754 Credit: 11,752,312 RAC: 9,092
|
Yeah, I always forget these URLs only work for the member after logging in ........ I sure wish we could just use one URL to show a member's tasks page.
I just want to show how the 4 hosts I use here have task errors (I see a couple of Valids this morning, but there should be more).
I would rather not post 10 URLs to show 10 separate task stderr outputs.
But mine aren't hidden, so I guess I will just post one and say they are (or were) all doing this.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=1901631
5 Valids on July 1st, so maybe some are just not finished yet today.
(Hey, it is morning and I haven't gone up the stairs to look at the desktops yet.)
The previous version was not doing this all the time.
(I also haven't checked my LHC hosts yet this morning.)
Here I have VB Version: 5.2.2 - 5.1.22
|
|
Crystal Pellet Volunteer tester
Send message Joined: 13 Feb 15 Posts: 1185 Credit: 849,545 RAC: 1,472
|
This version does not handle resuming the VM from a snapshot very well.
I have now seen this here at -dev and also on the production LHC@home.
Console Alt-F2 shows e.g. 52000 events processed, but no processes (user nobody) are busy (the VM is idling),
and the VM is also not killed by the shutdown file.
07/03/18 19:37:40 ******************************************************
07/03/18 19:37:40 ** condor_startd (CONDOR_STARTD) STARTING UP
07/03/18 19:37:40 ** /usr/sbin/condor_startd
07/03/18 19:37:40 ** SubsystemInfo: name=STARTD type=STARTD(7) class=DAEMON(1)
07/03/18 19:37:40 ** Configuration: subsystem:STARTD local:<NONE> class:DAEMON
07/03/18 19:37:40 ** $CondorVersion: 8.6.10 Mar 12 2018 BuildID: 435200 $
07/03/18 19:37:40 ** $CondorPlatform: x86_64_RedHat6 $
07/03/18 19:37:40 ** PID = 4089
07/03/18 19:37:40 ** Log last touched time unavailable (No such file or directory)
07/03/18 19:37:40 ******************************************************
07/03/18 19:37:40 Using config source: /etc/condor/condor_config
07/03/18 19:37:40 Using local config sources:
07/03/18 19:37:40 /etc/condor/config.d/10_security.config
07/03/18 19:37:40 /etc/condor/config.d/14_network.config
07/03/18 19:37:40 /etc/condor/config.d/20_workernode.config
07/03/18 19:37:40 /etc/condor/config.d/30_lease.config
07/03/18 19:37:40 /etc/condor/config.d/35_theory.config
07/03/18 19:37:40 /etc/condor/config.d/40_ccb.config
07/03/18 19:37:40 /etc/condor/config.d/62-benchmark.conf
07/03/18 19:37:40 /etc/condor/condor_config.local
07/03/18 19:37:40 config Macros = 160, Sorted = 160, StringBytes = 5549, TablesBytes = 5864
07/03/18 19:37:40 CLASSAD_CACHING is ENABLED
07/03/18 19:37:40 Daemon Log is logging: D_ALWAYS D_ERROR
07/03/18 19:37:40 Daemoncore: Listening at <10.0.2.15:37368> on TCP (ReliSock).
07/03/18 19:37:40 DaemonCore: command socket at <10.0.2.15:37368?addrs=10.0.2.15-37368&noUDP>
07/03/18 19:37:40 DaemonCore: private command socket at <10.0.2.15:37368?addrs=10.0.2.15-37368>
07/03/18 19:37:41 WARNING: forward resolution of 167.142.142.128.in-addr.arpa doesn't match 128.142.142.167!
07/03/18 19:37:41 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#40450682
07/03/18 19:37:42 HibernationSupportedStates invalid '' in ad from hibernation plugin /usr/libexec/condor/condor_power_state
07/03/18 19:37:42 VM-gahp server reported an internal error
07/03/18 19:37:42 VM universe will be tested to check if it is available
07/03/18 19:37:42 History file rotation is enabled.
07/03/18 19:37:42 Maximum history file size is: 20971520 bytes
07/03/18 19:37:42 Number of rotated history files is: 2
07/03/18 19:37:42 Allocating auto shares for slot type 0: Cpus: auto, Memory: auto, Swap: auto, Disk: auto
slot type 0: Cpus: 1.000000, Memory: 1500, Swap: 100.00%, Disk: 100.00%
07/03/18 19:37:42 New machine resource allocated
07/03/18 19:37:42 Setting up slot pairings
07/03/18 19:37:42 CronJobList: Adding job 'multicore'
07/03/18 19:37:42 CronJob: Initializing job 'multicore' (/usr/local/bin/multicore-shutdown)
07/03/18 19:37:42 CronJobList: Adding job 'mips'
07/03/18 19:37:42 CronJobList: Adding job 'kflops'
07/03/18 19:37:42 CronJob: Initializing job 'mips' (/usr/libexec/condor/condor_mips)
07/03/18 19:37:42 CronJob: Initializing job 'kflops' (/usr/libexec/condor/condor_kflops)
07/03/18 19:37:42 State change: IS_OWNER is false
07/03/18 19:37:42 Changing state: Owner -> Unclaimed
07/03/18 19:37:42 State change: RunBenchmarks is TRUE
07/03/18 19:37:42 Changing activity: Idle -> Benchmarking
07/03/18 19:37:42 BenchMgr:StartBenchmarks()
07/03/18 19:37:45 Initial update sent to collector(s)
07/03/18 19:37:45 Sending DC_SET_READY message to master <10.0.2.15:55685?addrs=10.0.2.15-55685>
07/03/18 19:37:46 WARNING: forward resolution of 167.142.142.128.in-addr.arpa doesn't match 128.142.142.167!
07/03/18 19:37:57 State change: benchmarks completed
07/03/18 19:37:57 Changing activity: Benchmarking -> Idle
07/03/18 19:38:16 Request accepted.
07/03/18 19:38:16 Remote owner is test4theory@cern.ch
07/03/18 19:38:16 State change: claiming protocol successful
07/03/18 19:38:16 Changing state: Unclaimed -> Claimed
07/03/18 19:38:17 Got activate_claim request from shadow (188.184.94.254)
07/03/18 19:38:17 Remote job ID is 387629.129
07/03/18 19:38:17 Got universe "VANILLA" (5) from request classad
07/03/18 19:38:17 State change: claim-activation protocol successful
07/03/18 19:38:17 Changing activity: Idle -> Busy
07/03/18 20:29:26 CCBListener: no activity from CCB server in 2169s; assuming connection is dead.
07/03/18 20:29:26 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/03/18 20:29:59 condor_write(): Socket closed when trying to write 4096 bytes to collector vccondor01.cern.ch, fd is 8, errno=104 Connection reset by peer
07/03/18 20:29:59 Buf::write(): condor_write() failed
07/03/18 20:30:27 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#40450682
07/03/18 20:36:09 condor_read() failed: recv(fd=9) returned -1, errno = 104 Connection reset by peer, reading 21 bytes from collector vccondor01.cern.ch.
07/03/18 20:36:09 IO: Failed to read packet header
07/03/18 20:36:09 CCBListener: failed to receive message from CCB server vccondor01.cern.ch
07/03/18 20:36:09 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/03/18 20:36:12 condor_write(): Socket closed when trying to write 4096 bytes to collector vccondor01.cern.ch, fd is 6, errno=104 Connection reset by peer
07/03/18 20:36:12 Buf::write(): condor_write() failed
07/03/18 20:37:10 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#40450682
07/03/18 20:37:40 PERMISSION DENIED to condor@38-37-9348 from host 10.0.2.15 for command 448 (GIVE_STATE), access level READ: reason: READ authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.2.15,10.0.2.15, hostname size = 1, original ip address = 10.0.2.15
07/03/18 20:37:40 DC_AUTHENTICATE: Command not authorized, done!
07/04/18 06:51:19 CCBListener: no activity from CCB server in 36249s; assuming connection is dead.
07/04/18 06:51:19 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/04/18 06:51:22 condor_write(): Socket closed when trying to write 4096 bytes to collector vccondor01.cern.ch, fd is 6, errno=104 Connection reset by peer
07/04/18 06:51:22 Buf::write(): condor_write() failed
07/04/18 06:52:02 Starter pid 4124 exited with status 2
07/04/18 06:52:02 State change: starter exited
07/04/18 06:52:02 Changing activity: Busy -> Idle
07/04/18 06:52:02 State change: claim lease expired (condor_schedd gone?), evicting claim
07/04/18 06:52:02 Changing state and activity: Claimed/Idle -> Preempting/Killing
07/04/18 06:52:02 State change: No preempting claim, returning to owner
07/04/18 06:52:02 Changing state and activity: Preempting/Killing -> Owner/Idle
07/04/18 06:52:02 State change: IS_OWNER is false
07/04/18 06:52:02 Changing state: Owner -> Unclaimed
07/04/18 06:52:20 WARNING: forward resolution of 167.142.142.128.in-addr.arpa doesn't match 128.142.142.167!
07/04/18 06:52:20 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#40450682
07/04/18 06:52:33 Request accepted.
07/04/18 06:52:33 Remote owner is test4theory@cern.ch
07/04/18 06:52:33 State change: claiming protocol successful
07/04/18 06:52:33 Changing state: Unclaimed -> Claimed
07/04/18 06:52:33 Got activate_claim request from shadow (188.184.94.254)
07/04/18 06:52:33 Remote job ID is 429810.134
07/04/18 06:52:33 Got universe "VANILLA" (5) from request classad
07/04/18 06:52:33 State change: claim-activation protocol successful
07/04/18 06:52:33 Changing activity: Idle -> Busy
07/04/18 07:03:56 Called deactivate_claim_forcibly()
07/04/18 07:03:56 Starter pid 8323 exited with status 0
07/04/18 07:03:56 State change: starter exited
07/04/18 07:03:56 Changing activity: Busy -> Idle
07/04/18 07:03:56 Got activate_claim request from shadow (188.184.94.254)
07/04/18 07:03:56 Remote job ID is 429813.13
07/04/18 07:03:56 Got universe "VANILLA" (5) from request classad
07/04/18 07:03:56 State change: claim-activation protocol successful
07/04/18 07:03:56 Changing activity: Idle -> Busy
07/04/18 07:20:40 Called deactivate_claim_forcibly()
07/04/18 07:20:40 Starter pid 9949 exited with status 0
07/04/18 07:20:40 State change: starter exited
07/04/18 07:20:40 Changing activity: Busy -> Idle
07/04/18 07:20:40 Got activate_claim request from shadow (188.184.94.254)
07/04/18 07:20:40 Remote job ID is 429813.90
07/04/18 07:20:40 Got universe "VANILLA" (5) from request classad
07/04/18 07:20:40 State change: claim-activation protocol successful
07/04/18 07:20:40 Changing activity: Idle -> Busy
07/04/18 18:06:52 CCBListener: no activity from CCB server in 36071s; assuming connection is dead.
07/04/18 18:06:52 CCBListener: connection to CCB server vccondor01.cern.ch failed; will try to reconnect in 60 seconds.
07/04/18 18:06:57 condor_write(): Socket closed when trying to write 4096 bytes to collector vccondor01.cern.ch, fd is 6, errno=104 Connection reset by peer
07/04/18 18:06:57 Buf::write(): condor_write() failed
07/04/18 18:07:52 CCBListener: registered with CCB server vccondor01.cern.ch as ccbid 128.142.142.167:9618?addrs=128.142.142.167-9618+[2001-1458-301-98--100-99]-9618#40450682
07/04/18 20:37:40 PERMISSION DENIED to condor@38-37-9348 from host 10.0.2.15 for command 448 (GIVE_STATE), access level READ: reason: READ authorization policy contains no matching ALLOW entry for this request; identifiers used for this host: 10.0.2.15,10.0.2.15, hostname size = 1, original ip address = 10.0.2.15
07/04/18 20:37:40 DC_AUTHENTICATE: Command not authorized, done!
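A log of this length is easier to read once the connection drops and slot transitions are pulled out. The following is a rough Python sketch for doing that, assuming every relevant line starts with a `MM/DD/YY HH:MM:SS` timestamp as in the excerpt above. The function name `summarize_startlog` and the patterns are hypothetical helpers written for this post, not part of HTCondor.

```python
import re
from datetime import datetime

# Timestamp layout used by the log excerpt above, e.g. "07/03/18 20:29:26".
STAMP = "%m/%d/%y %H:%M:%S"
STAMP_LEN = 17  # 8 characters of date + 1 space + 8 characters of time

CCB_DEAD = re.compile(r"CCBListener: no activity from CCB server in (\d+)s")
TRANSITION = re.compile(r"Changing (?:state and activity|state|activity): (.+)$")


def summarize_startlog(lines):
    """Collect CCB inactivity gaps and slot state/activity transitions."""
    gaps, transitions = [], []
    for line in lines:
        try:
            when = datetime.strptime(line[:STAMP_LEN], STAMP)
        except ValueError:
            continue  # continuation line without a leading timestamp
        text = line[STAMP_LEN:].strip()
        match = CCB_DEAD.search(text)
        if match:
            gaps.append((when, int(match.group(1))))
        match = TRANSITION.search(text)
        if match:
            transitions.append((when, match.group(1)))
    return gaps, transitions


# Sample lines copied from the log above.
sample = [
    "07/03/18 19:37:42 Changing state: Owner -> Unclaimed",
    "slot type 0: Cpus: 1.000000, Memory: 1500, Swap: 100.00%, Disk: 100.00%",
    "07/03/18 20:29:26 CCBListener: no activity from CCB server in 2169s; assuming connection is dead.",
    "07/04/18 06:51:19 CCBListener: no activity from CCB server in 36249s; assuming connection is dead.",
    "07/04/18 06:52:02 Changing state and activity: Claimed/Idle -> Preempting/Killing",
]
gaps, transitions = summarize_startlog(sample)
for when, seconds in gaps:
    print(when, f"CCB silent for {seconds / 3600:.1f} h")
for when, change in transitions:
    print(when, change)
```

In the log above, the 36249s gap corresponds to roughly ten hours without contact with the CCB server, which lines up with the period in which the VM appeared to be idling. Whether that gap is the cause or only a side effect of the snapshot resume is not something this sketch can tell.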
|
|