Message boards : Theory Application : Theory v.5.20

Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 7022 - Posted: 25 Apr 2020, 0:11:08 UTC

Well, another one of those days... The internet slowed down since they are busy letting kids by the millions tweet and fakebook and netflix all day, so several of my Running tasks turned to Failed (and kept running anyway), and I aborted them.

Then one PC got 60 new tasks (even though it was set not to get new tasks).
Good thing I checked, or it would be running 8 Failed tasks at a time.

And a few changed to this:

[image not preserved]

Yesterday's tasks ALL ran Valid with no problems, but as usual this is the event generator testing our internet speeds along with the LHC showers.

As I have mentioned a few hundred times, these tasks either Run or Fail in the first 3 minutes, yet they keep running even when they Fail in those first 3 minutes. It can't be that difficult to set them all to Abort when that happens, instead of waiting until a member happens to catch them.
(And I don't mean by a long piece of code for each OS that the average member has to add to files to maybe fix that waste of time.)

The only good thing was that I did get to run a few Monte Carlo event generators we don't see often.
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 7024 - Posted: 27 Apr 2020, 5:56:01 UTC

All Theory tasks make lots of HTTP requests to CVMFS repositories until the scientific app has been set up and starts calculating events (see ALT-F2).
After that, until the task finishes, there are only small refresh requests to cernvm-prod.cern.ch/.cvmfspublished and sft.cern.ch/.cvmfspublished, which happen every 4 minutes.

Other repositories, e.g. grid.cern.ch or alice.cern.ch, are disconnected during this phase but may need to be reconnected at the end.
If internet response times are too high, even just temporarily, while a refresh or a reconnect happens, this may cause a task to fail.
Typical message: "cvmfs_config probe xyz.cern.ch failed"
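
For anyone who would rather not watch the console by hand, here is a minimal sketch of spotting that message in captured console or log text. It assumes the failure line has exactly the wording quoted above; the function name `failed_repositories` and the sample text are made up for illustration and are not part of the Theory app.

```python
import re
from typing import List

# Pattern taken from the message quoted above. The real console wording
# may differ, so treat this regular expression as an assumption.
PROBE_FAILED = re.compile(r"cvmfs_config probe (\S+) failed")


def failed_repositories(console_text: str) -> List[str]:
    """Return each repository named in a probe-failure line, once, in order of first appearance."""
    seen: List[str] = []
    for match in PROBE_FAILED.finditer(console_text):
        repo = match.group(1)
        if repo not in seen:
            seen.append(repo)
    return seen


# Hypothetical sample text, NOT real console output.
sample = (
    "some earlier line\n"
    "cvmfs_config probe grid.cern.ch failed\n"
    "cvmfs_config probe grid.cern.ch failed\n"
)
print(failed_repositories(sample))
```

How the console text is captured is left open here; the sketch only covers recognizing the failure line once you have the text.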

In most cases, high response times occur on the first section, between the home router and the ISP's connection node.
They usually occur when that section is busy with lots of other requests.
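
To get a rough feel for those response times, something like the following sketch could time a single small HTTP request. The URL is passed in by the user, and the 2-second limit is a made-up placeholder, since no exact limit is stated here; `response_time` and `is_too_slow` are illustrative names, not part of any CVMFS tool.

```python
import sys
import time
import urllib.request
from typing import Callable, Optional


def fetch(url: str, timeout: float) -> None:
    """Download the URL once; raises OSError (or a subclass) on timeout or HTTP error."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read()


def response_time(url: str, timeout: float = 10.0,
                  fetcher: Callable[[str, float], None] = fetch) -> Optional[float]:
    """Return the seconds taken to fetch url, or None if the request failed."""
    start = time.monotonic()
    try:
        fetcher(url, timeout)
    except OSError:
        # urllib's URLError/HTTPError and socket timeouts all derive from OSError.
        return None
    return time.monotonic() - start


def is_too_slow(seconds: Optional[float], limit: float) -> bool:
    """Treat a failed request (None) or one slower than limit as a problem."""
    return seconds is None or seconds > limit


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python probe.py <full URL of a small file to request>
    elapsed = response_time(sys.argv[1])
    # 2.0 seconds is a hypothetical limit chosen only for illustration.
    print(elapsed, "too slow" if is_too_slow(elapsed, 2.0) else "ok")
```

Running it a few times while the line is busy and again while it is idle should show whether the first section is the bottleneck. A single request is only a snapshot, so repeated measurements are more telling than one.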
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 7025 - Posted: 27 Apr 2020, 14:15:01 UTC - in response to Message 7024.  
Last modified: 27 Apr 2020, 14:21:51 UTC

Yes, I know all of that and watch it happen, since I watch every task start up. They have to do all of that in the first 3 minutes or they are another Failed task. Any task that reaches 4 minutes without getting all the way to the final line *cranky: [INFO] ==> [runRivet]* is a Failed task, since, as you can see if you watch it happen, it has gone beyond the time limit (I have watched thousands of these tasks start).

So when I start these after my monthly high-speed allowance is used up, I can only start 2 at a time, and I suspend them after they have run for between 4.5 and 5 minutes or more, so that they will then restart without any problems.

I only have one PC plugged into the ethernet at a time while doing this. Once its tasks are all started up and suspended, I do the same on the next PC until I have them all ready (at least as many as I feel like doing).

But even when I am running at 30 Mbps there is still no way to let them all just start and run on their own, since there will be lots of those "cvmfs_config probe xyz.cern.ch failed" tasks that just keep on running until you happen to check the VM Console, see that, and abort them.

I realize the average member would not do this at all, and no magic wand or squid will make this work on its own with a 2-way satellite connection. This is nothing like d/ling a vdi.

I am never asking for help or reasons why... I already know, and did know when it was T4T. But I do have to point out, even if it is several times, that Sherpa wastes hundreds of hours. Seeing that here and over at LHC all the time, it seems like they would not be sent all the time, since no matter how many cores you have, nobody likes having them run for 10 days for no reason.

As for that internet speed problem, a few years ago with CMS I did at least get the start time changed to 20 minutes just to get around that start-up problem, but that isn't what we have with Theory-dev.

You can see just by checking my tasks that once I get them started and then suspended, I can restart ALL of them and they will ALL be Valid... well, unless my Hughes satellite connection slows down to almost nothing for enough minutes to make them all crash into that same FAILED that I always find in the VM Console. At times I only find it several hours after it happened, since I can't just watch them all 24/7.
Edit: that "Failed *Getting time from pool*" starts in the first 20 seconds or less, so it is also good to abort those, since they do not stop running and will be Invalid.

Same old story... and I can't wait for those CMS tasks to return here too.
Oy vey, 7am is too early for me to type this much.