41) Message boards : CMS Application : New version 49.00 (Message 6566)
Posted 19 Aug 2019 by Profile ivan
Post:
@#%$@%^&&^%$$#@*

Well as usual I get about 40 Valids in a row and then........

Might as well show you it isn't just me https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=1054

that evil server always picks the wrong night to do this too

Well, in the morning (it's 2:45 am right now) I will be at the bowling alley and every pin will have a picture of my imaginary CERN server on it.

Sorry, there was a huge increase in CMS jobs being run last night, so the queue drained before I could replenish it. New batch sent, should be OK in a few minutes.
42) Message boards : CMS Application : New version 49.00 (Message 6563)
Posted 17 Aug 2019 by Profile ivan
Post:
Each of your failed VMs requested 4584 MB of RAM, which is close to 60% of the computer's total RAM.
IIRC the default amount of RAM a BOINC client allows its tasks to use is 60%.
Did you try to increase the allowed RAM percentage?

Hah! Didn't think about that. It didn't like 3 cores either -- come to think of it, the error at one point was "waiting for memory"...
Yes, it was set to 50/90%, I changed to 90/90%.
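For anyone hitting the same limit: the 50/90% figures above are BOINC's memory preferences for "while computer is in use" / "while computer is idle". They can be changed on the project website, or locally via a global_prefs_override.xml in the BOINC data directory. A minimal sketch mirroring the 90/90% change described above (the 90.0 values are just this example, pick what suits your machine):

```xml
<!-- global_prefs_override.xml, placed in the BOINC data directory.
     Percent of system RAM the client may use while the computer
     is in use (busy) and while it is idle. -->
<global_preferences>
    <ram_max_used_busy_pct>90.0</ram_max_used_busy_pct>
    <ram_max_used_idle_pct>90.0</ram_max_used_idle_pct>
</global_preferences>
```

After saving, use "Options → Read local prefs file" in the BOINC Manager (or restart the client) to apply it.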
43) Message boards : CMS Application : New version 49.00 (Message 6553)
Posted 15 Aug 2019 by Profile ivan
Post:
Well, my little £130 Celeron J1900 certainly didn't like trying to run a 4-core VM! Continually timed out "Waiting for memory"! I've dropped down to 3 cores to see if that runs.
44) Message boards : CMS Application : New version 49.00 (Message 6548)
Posted 14 Aug 2019 by Profile ivan
Post:
OK, I just ran all of those and they are all open (TRUE).

I have never had any blocked ports before, so I knew that was not the problem; since these hosts have run many of these before, I knew it wasn't a port issue, and I don't even use a firewall since these are CERN-only computers.

One did the 4 Valids and the other 2 PCs next to it on the same system failed.

You are welcome to check my stats: last month, up to the 22nd, there were many Valids, then they started failing for me and Ivan. This month his started working again, while mine would run for 5 or 6 hours before crashing, and they had all been running jobs up to that point.

I will run 4 more 2-core tasks on the one host that did get Valids and just let the others get back to running Valid Theory VBs over at LHC.

Edit: after these 4 new tasks have been running between 30 minutes and 1 hour, I see this in the VB log:

Giving up catch-up attempt at a 60 047 182 552 ns lag; new total: 240 055 516 373 ns

I think my one Win10 failure was after I upgraded the memory to 8 GB because 4 GB wasn't quite enough and I guess the VM got confused when it restarted. I'm now running 1x 2-core VM on it with no apparent problem.
Googling that error message turns up some interesting things. At the moment your i7-3770 seems to be running 4x 2-core VMs (so you must have hyperthreading enabled); is that right? One of the comments I saw was that it's best to keep one core free to run the VM. Others suggested time-outs to slow peripheral storage, mismatch with Guest Addition modules, and a few other more exotic things.
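For reference, the lag figures in that VirtualBox log line are in nanoseconds; a quick conversion (just arithmetic, nothing VirtualBox-specific) shows the guest clock had fallen about a minute behind the host on that attempt, with about four minutes of accumulated lag:

```python
# The VirtualBox "catch-up" log reports timing lag in nanoseconds.
# Convert the figures quoted above to seconds to see the drift.
lag_ns = 60_047_182_552        # lag at the failed catch-up attempt
total_ns = 240_055_516_373     # accumulated total lag

lag_s = lag_ns / 1e9
total_s = total_ns / 1e9
print(f"lag: {lag_s:.1f} s, total: {total_s:.1f} s")  # ~60 s and ~240 s
```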
45) Message boards : News : CMS@Home: Disruption to our condor server next Monday (Message 6519)
Posted 7 Aug 2019 by Profile ivan
Post:
There's been a slight change in plans.
"Given that we do not need to redeploy the agent, but only kill jobs in condor and let them get recreated with the JobSubmitter/schedd changes, I think you can go ahead and submit another workflow to [keep] volunteers happy."
So, I'll continue to submit smaller batches and you can resume new tasks.
46) Message boards : News : CMS@Home: Disruption to our condor server next Monday (Message 6518)
Posted 7 Aug 2019 by Profile ivan
Post:
We've run down the CMS job queue to make some changes to the submission environment. Please set No New Tasks so that you don't have excessive churning waiting for jobs that are not available.
47) Message boards : News : CMS@Home: Disruption to our condor server next Monday (Message 6509)
Posted 1 Aug 2019 by Profile ivan
Post:
OK, we eventually found the problem and jobs are flowing again.
48) Message boards : CMS Application : New version 49.00 (Message 6503)
Posted 25 Jul 2019 by Profile ivan
Post:
Tried another test this morning and still these will not start because of the Cern server problems.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2792997

Yes, we're having condor problems. WMStats shows jobs pending, but the condor schedd isn't sending any out. Must be a ClassAd mismatch that arose since the reboots.
49) Message boards : Cafe : LHC asleep all day (Message 6494)
Posted 23 Jul 2019 by Profile ivan
Post:
Good luck with your house projects Ivan and hope your ISP gives you a new modem without making you pay too.
(you must have a cell phone to be online....I never have owned one but it seems like everyone else in the world does)

Thanks. It's what they call a "semi-detached" built by the Council in 1967. I should have had it renovated when I moved in 15 years ago, but as you say, over 60 you start losing enthusiasm for hard work (it's not the bending down that's the problem, it's the straightening back up that hurts). I've never owned a cell/mobile/handy either. I leave the ringer on the landline switched off to avoid chimney-sweeps touting for business...
50) Message boards : Cafe : LHC asleep all day (Message 6492)
Posted 23 Jul 2019 by Profile ivan
Post:
At least you're connected. My broadband modem died in last Saturday's flood, so no news, no BBC iPlayer, no crossword, limited Android games. That's why I'm in my office before 7am. I also had to cancel a 2-day CMS-UK meeting in Oxfordshire while I try to work out a way out of the mess.
51) Message boards : News : CMS@Home: Disruption to our condor server next Monday (Message 6489)
Posted 22 Jul 2019 by Profile ivan
Post:
I have jobs coming in from -dev now, but none from the main project yet -- it still has a lot of services, including the feeder, showing as not running.

Hmm, but some of them are dying, VirtualBox is reporting inaccessible vdis.
52) Message boards : News : CMS@Home: Disruption to our condor server next Monday (Message 6442)
Posted 17 Jul 2019 by Profile ivan
Post:
https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5087#39376
53) Message boards : News : Using a local proxy to reduce network traffic for CMS (Message 6397)
Posted 7 Jun 2019 by Profile ivan
Post:
Thanks to computezrmle, with additional work from Laurence and a couple of CMS experts (and my adding one line to the site-local-config file), there is now a way to set up a local caching proxy to greatly reduce your network traffic. Each job instance that runs within a CMS BOINC task must retrieve a lot of set-up data from our database. This data doesn't change very often, so if you keep a local copy the job can access that rather than going over the network every time.
Instructions on how to do this are available at https://lhcathomedev.cern.ch/lhcathome-dev/forum_thread.php?id=475&postid=6396 or https://lhcathome.cern.ch/lhcathome/forum_thread.php?id=5052&postid=39072
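The linked threads carry the authoritative, tested instructions; purely as an illustration of the idea, a caching proxy such as Squid sits between your hosts and CERN, serving repeated requests from its local cache. A rough squid.conf sketch (all values here are hypothetical examples, not the recommended settings from computezrmle's guide):

```
# squid.conf sketch -- illustrative only; follow the linked forum
# instructions for the real, tested configuration.
http_port 3128                               # port the clients point at
acl localnet src 192.168.0.0/16              # only serve the local network
http_access allow localnet
http_access deny all
cache_mem 256 MB                             # in-memory cache for hot objects
cache_dir ufs /var/spool/squid 5000 16 256   # ~5 GB on-disk cache
```

The BOINC client is then pointed at the proxy via its HTTP proxy setting, so every VM on the LAN shares the same cached copies of the set-up data.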
54) Message boards : News : CMS -- Please set "no new tasks" (Message 6368)
Posted 15 May 2019 by Profile ivan
Post:
Intervention over, CMS jobs are available again.
55) Message boards : News : CMS -- Please set "no new tasks" (Message 6367)
Posted 15 May 2019 by Profile ivan
Post:
Yes, it was OK to let running tasks finish (sorry I only saw your message just now). I let the queue run down to less than 30 "running" tasks before aborting the workflow and giving Alan free rein to update the WMAgent. As yet it hasn't re-appeared and he hasn't let me know the update is over. Once I know that I'll submit a new workflow and when it appears on WMStats I'll issue the go-ahead to resume tasks.
56) Message boards : News : CMS -- Please set "no new tasks" (Message 6365)
Posted 14 May 2019 by Profile ivan
Post:
Hi to all CMS-ers. We need to drain the job queue so that a new version of the WMAgent can be installed.
Can you please set No New Tasks so that your current tasks can run out and no new jobs start? If you have any tasks waiting to run, please suspend or abort them.
Thanks, I'll let you know as soon as the change is done.
57) Message boards : CMS Application : New version 49.00 (Message 6312)
Posted 30 Apr 2019 by Profile ivan
Post:
It's probably not a problem, but in the vdi coming from the project, the log-files MasterLog, StartLog and StarterLog exist and aren't empty.
There are remnants of logging from March 25th in it.

Ah, so that's what that is! I had two tasks fail tonight and I noticed the strange time-stamps in the output.
58) Message boards : CMS Application : New version 49.00 (Message 6309)
Posted 29 Apr 2019 by Profile ivan
Post:
https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=1891827

The very next task after that Valid one is another one of these.

(no I didn't change anything)

36 Valids, 31 Errors so far, and usually the same error.

Looking at some of the timings suggests that something is hanging for quite a while, then after Condor is contacted it times out at ~20 minutes without returning a job.
59) Message boards : CMS Application : New version 49.00 (Message 6307)
Posted 28 Apr 2019 by Profile ivan
Post:
https://lhcathomedev.cern.ch/lhcathome-dev/workunit.php?wuid=1891941

It takes more than one CP

That looks like (a small) progress to me. Did you change anything?
I'm still perplexed as to why the job plots are showing so many "unknown" status jobs. For the record, WMAgent monitoring suggests no failures in Volunteer MonteCarlo jobs (i.e. no job has failed more than 3 (5?) attempts at running). OTOH, operations change the core submissions software quite often, they may have changed criteria...
60) Message boards : CMS Application : New version 49.00 (Message 6302)
Posted 27 Apr 2019 by Profile ivan
Post:
You *may* have been unlucky with your timing. Since the resolution of the Easter problem ("quota exceeded", I've yet to get a full explanation) there have been a number of "unknown" entries showing up in the job plots. So, I submitted a new batch last night, waited until it showed up in my monitor, then aborted the previous batch -- this is also why there is a large "cancelled" peak in the plots.
Let's wait a few more hours to see if things settle down, though I'm concerned that I'm still seeing unknowns in the plots.




©2024 CERN