Thread 'New version 49.00'

Author	Message
maeax Send message Joined: 22 Apr 16 Posts: 749 Credit: 2,976,995 RAC: 27,301	Message 6274 - Posted: 28 Mar 2019, 5:52:49 UTC 2019-03-27 15:20:07 (1392): Guest Log: VBoxService 5.2.6 r120293 (verbosity: 0) linux.amd64 (Jan 15 2018 14:51:00) release log Is it possible to upgrade to 5.2.8 (Official from Boinc) or last Version.5.2.26 for the next applet? ID: 6274 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 805 Credit: 14,339,071 RAC: 7,975	Message 6276 - Posted: 28 Mar 2019, 19:28:15 UTC Got 3 of these in a row Guest Log: [ERROR] Condor ended after 1153 seconds. Now 2 more running and since they are at 4hrs run time I hope they will continue to the end with jobs My other hosts are back running these again after a light-year of time to d/l this version (it seemed like it) ID: 6276 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 805 Credit: 14,339,071 RAC: 7,975	Message 6277 - Posted: 3 Apr 2019, 9:38:11 UTC - in response to Message 6276. https://lhcathomedev.cern.ch/lhcathome-dev/results.php?userid=192 Mad Scientist For Life ID: 6277 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1145 Credit: 8,310,612 RAC: 0	Message 6278 - Posted: 3 Apr 2019, 20:04:50 UTC - in response to Message 6277. https://lhcathomedev.cern.ch/lhcathome-dev/results.php?userid=192 Link's not working for me; check your privacy settings. ID: 6278 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 805 Credit: 14,339,071 RAC: 7,975	Message 6279 - Posted: 5 Apr 2019, 21:44:41 UTC - in response to Message 6278. Last modified: 5 Apr 2019, 22:02:52 UTC (I NEVER have liked the damn way they use urls' on these boinc pages) After no problems here from the start with this new CMS version the last few days mostly problems and several versions. [ERROR] Could not connect to lhchomeproxy.cern.ch on port 3125 2019-04-05 12:43:02 (6004): Guest Log: [INFO] Shutting Down. 2019-04-05 12:43:02 (6004): VM Completion File Detected. 2019-04-05 12:43:02 (6004): VM Completion Message: Could not connect to lhchomeproxy.cern.ch on port 3125 [ERROR] Condor ended after 1216 seconds. 2019-04-04 18:39:16 (6000): Guest Log: [INFO] Shutting Down. 2019-04-04 18:39:16 (6000): VM Completion File Detected. 2019-04-04 18:39:16 (6000): VM Completion Message: Condor ended after 1216 seconds. and this mess https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2762565 Here is that snapshot where it looks like to me it as usual prefers a server that is a stones throw from Geneva and not across the Atlantic and then across North America since I watch them and see if they are fast (sort of) or typical slow........which should not be making it such a problem just to start a task considering we have to spend hours to d/l the tasks and the original vdi (no squids please) ID: 6279 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 805 Credit: 14,339,071 RAC: 7,975	Message 6281 - Posted: 7 Apr 2019, 0:21:11 UTC https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2765653 At least this one only wasted 8mins before 1 (0x00000001) Unknown error code as I watched the log and VM console run Same old thing......6mins for the server to get to - Theory application starting. Check log files. 00:09:01.102259 VMMDev: Guest Log: [DEBUG] HTCondor ping 00:09:07.889805 VMMDev: Guest Log: [DEBUG] 0 And then sit there waiting for the job to start in the slots long enough for it to end up an Error instead of VMMDev: Guest Log: [INFO] New Job Starting in slot1 ....on most of them and then several other Error versions. And since I have seen this hundreds of times it is just because the server does not go through all the same old checking of my computers fast enough and actually never just goes right to work. Sure would be nice if when I already have the files and vdi and tasks d/l here that they could just start running instead of hoping Cern can do the server hand shake and get to work instead of just sitting there.at Theory application starting. Check log files. VMMDev: Guest Log: [DEBUG] HTCondor ping VMMDev: Guest Log: [DEBUG] 0.....and waiting to start the jobs Here is the one I just watched as I am typing this...... two in a row the same thing https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2765653 and several more of these on another pc - [ERROR] Condor ended after 1272 seconds. Mad Scientist For Life ID: 6281 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 805 Credit: 14,339,071 RAC: 7,975	Message 6301 - Posted: 27 Apr 2019, 6:30:29 UTC Well since Ivan was getting all those Valid CMS tasks I decided to give it another try and got 2 Valids followed by 3 of these time wasting Errors https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2770411 I have 2 new ones started but I will have to set the manager back to no new tasks......and see if these 2 make it beyond 1hr 40mins. ID: 6301 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1145 Credit: 8,310,612 RAC: 0	Message 6302 - Posted: 27 Apr 2019, 7:54:30 UTC - in response to Message 6301. You may have been unlucky with your timing. Since the resolution of the Easter problem ("quota exceeded", I've yet to get a full explanation) there have been a number of "unknown" entries showing up in the job plots. So, I submitted a new batch last night, waited until it showed up in my monitor, then aborted the previous batch -- this is also why there is a large "cancelled" peak in the plots. Let's wait a few more hours to see if things settle down, though I'm concerned that I'm still seeing unknowns in the plots. ID: 6302 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 805 Credit: 14,339,071 RAC: 7,975	Message 6303 - Posted: 27 Apr 2019, 9:57:02 UTC - in response to Message 6302. Ok Ivan I did get 2 more Invalids so I will try a couple more in my morning (noon) saturday ID: 6303 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 805 Credit: 14,339,071 RAC: 7,975	Message 6304 - Posted: 28 Apr 2019, 3:42:07 UTC No luck here Ivan Another 10 and probably soon 12 Errors so I will switch back to some Theory tasks and wait for this problem to be fixed here. ID: 6304 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1242 Credit: 964,086 RAC: 490	Message 6305 - Posted: 28 Apr 2019, 10:06:58 UTC - in response to Message 6304. Last modified: 28 Apr 2019, 10:51:05 UTC Another 10 and probably soon 12 Errors so I will switch back to some Theory tasks and wait for this problem to be fixed here. Picked up 1 task to test how it's going here. It's running fine. 4th job (40,000 events) inside the VM running. ID: 6305 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 805 Credit: 14,339,071 RAC: 7,975	Message 6306 - Posted: 28 Apr 2019, 22:09:40 UTC - in response to Message 6305. https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=1891941 It takes more than one CP ID: 6306 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1145 Credit: 8,310,612 RAC: 0	Message 6307 - Posted: 28 Apr 2019, 22:27:42 UTC - in response to Message 6306. https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=1891941 It takes more than one CP That looks like (a small) progress to me. Did you change anything? I'm still perplexed as to why the job plots are showing so many "unknown" status jobs. For the record, WMAgent monitoring suggests no failures in Volunteer MonteCarlo jobs (i.e. no job has failed more than 3 (5?) attempts at running). OTOH, operations change the core submissions software quite often, they may have changed criteria... ID: 6307 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 805 Credit: 14,339,071 RAC: 7,975	Message 6308 - Posted: 28 Apr 2019, 23:24:15 UTC - in response to Message 6307. Last modified: 28 Apr 2019, 23:45:47 UTC https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=1891827 The very next task after that Valid one is another one of these. (no I didn't change anything) 36 Valids 31 Errors so far and usually the same error ID: 6308 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1145 Credit: 8,310,612 RAC: 0	Message 6309 - Posted: 29 Apr 2019, 7:59:54 UTC - in response to Message 6308. https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=1891827 The very next task after that Valid one is another one of these. (no I didn't change anything) 36 Valids 31 Errors so far and usually the same error Looking at some of the timings suggests that something is hanging for quite a while, then after Condor is contacted it times out at ~20 minutes without returning a job. ID: 6309 · Rating: 0 · rate: / Reply Quote

Crystal Pellet Volunteer tester Send message Joined: 13 Feb 15 Posts: 1242 Credit: 964,086 RAC: 490	Message 6310 - Posted: 29 Apr 2019, 8:48:41 UTC It's probably not a problem, but in the vdi coming from the project, the log-files MasterLog, StartLog and StarterLog exist and aren't empty. There are remnants of logging's from March 25th in it. ID: 6310 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 805 Credit: 14,339,071 RAC: 7,975	Message 6311 - Posted: 29 Apr 2019, 23:05:48 UTC https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2771779 After midnight PDT I started a CMS hoping the internet speed was running fast enough to get the job/slots started and watched it and when it was running several hours I knew it was going to complete and it just finished Valid Run time 12 hours 54 min 30 sec CPU time 14 hours 42 min 22 sec using 2 cores And noticed CP started a CMS 10 minutes after I did and his was a short one but also Valid https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2771626 So late tonight I will fire up two of the 2-core CMS and see if they are going to run ( you can see all about that in a pm to you over at LHC Ivan) And I took a couple new snapshots of this one last night just in case ( I probably have other ones in my stash) ID: 6311 · Rating: 0 · rate: / Reply Quote

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Send message Joined: 20 Jan 15 Posts: 1145 Credit: 8,310,612 RAC: 0	Message 6312 - Posted: 30 Apr 2019, 2:06:18 UTC - in response to Message 6310. It's probably not a problem, but in the vdi coming from the project, the log-files MasterLog, StartLog and StarterLog exist and aren't empty. There are remnants of logging's from March 25th in it. Ah, so that's what that is! I had two tasks fail tonight and I noticed the strange time-stamps in the output. ID: 6312 · Rating: 0 · rate: / Reply Quote

Magic Quantum Mechanic Send message Joined: 8 Apr 15 Posts: 805 Credit: 14,339,071 RAC: 7,975	Message 6314 - Posted: 1 May 2019, 4:39:12 UTC Last modified: 1 May 2019, 4:40:34 UTC https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2772069 Well when I started two of the 2-core tasks after midnight where I knew the isp speed would be fast enough and both ran complete Valid tasks. But today I had to try starting a new one when the internet speed was not quite as fast and watched it run and after about 3 hours it finally stopped with the usual Guest Log: [ERROR] Condor ended after 8517 seconds This is always a problem connecting and communicating with the Cern server from where I am and I have watched thousands of logs and VB Console runs of these tasks. If the speed is fast enough (like when you live withing 500 miles of Geneva) these tasks start in just a few minutes but like this one I watched it take 28 minutes just to get to HTCondor Ping which is a sign that there is no way this will run a complete Valid and in this case running for 3 hours before it is an Error. Watching the VB Console I see Setting up install process epl/primary_db takes too long followed by the sl-security/primary_db taking too long with task run time over 20 )....and finally HTCondor Ping at 28mins and then run for another 2.5 hours I know every time that security check and the EPL File Extension run slow and finally finish that the task will never run Valid Just hangs up there and always end up the same. So I will do the same thing and start up a couple new ones after midnight (I always run a fast speed test just to be sure) ID: 6314 · Rating: 0 · rate: / Reply Quote

computezrmle Volunteer moderator Project tester Volunteer developer Volunteer tester Help desk expert Send message Joined: 28 Jul 16 Posts: 520 Credit: 400,710 RAC: 0	Message 6315 - Posted: 1 May 2019, 7:23:23 UTC - in response to Message 6314. Indeed a very poor download rate. This may partly be caused by an overloaded local internet connection, but especially in case of the EPEL downloads it could also be caused by a slow mirror server. LHC@home doesn't host the files shown in the graphics on their own servers. Instead they link to the public EPEL repository. The local downloader then pics one of the mirrors from this list: https://admin.fedoraproject.org/mirrormanager/mirrors/EPEL The mirror's bandwidth vary between poor 50 kbit/s and excellent 40000 kbit/s and for some unknown reason it's always a very slow mirror that is tried first (own experience). What can be done? 1. LHC@home could mirror the required files on their own systems and distribute them via fast networks, e.g. own CVMFS or openhtc.io. 2. Volunteers could configure their local firewall to reject connections to slow mirror servers. The local downloader would then immediately pic another mirror from the list. 3. Volunteers using a local proxy could do the following: 3.1 Reject connections to slow mirrors and to mirrors using HTTPS. 3.2 Pic 1 fast and reliable mirror from the list and rewrite all URLs to other mirrors to point to that fast mirror. As a result all requests would be served from the proxy cache except outdated files that would be downloaded from the remaining fast mirror. It should be mentioned that the total size of the EPEL downloads is tiny compared to the total downloads a CMS VM requires until it is completely set up. The total size is far more than 100 MB and nearly all of it could be served by a local proxy. ID: 6315 · Rating: 0 · rate: / Reply Quote

Development for LHC@home