Message boards : CMS Application : New Version 50.00

Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,870,629
RAC: 576
Message 7184 - Posted: 14 May 2021, 9:29:01 UTC - in response to Message 7183.  

First investigations suggest that this is due to the condor jobs not matching VMs with more than one CPU, hence the familiar "no jobs" message. I'll need to get more specialists involved to find out when and why this changed -- it looks like it comes from the WMAgent side rather than CMS@home-dev itself.

Ah, it's the infamous Ascension Day long weekend at CERN (it always catches me out) so I don't expect much response from anyone until Monday at the earliest.
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,539
RAC: 1,940
Message 7185 - Posted: 14 May 2021, 10:08:55 UTC

Thanks for the update, Ivan.

I will just keep running the single-core CMS here until I can try multi-core again.
(I also run a few over at the other site.)
Mad Scientist For Life
mikey

Joined: 18 Sep 16
Posts: 17
Credit: 682,881
RAC: 349
Message 7186 - Posted: 16 May 2021, 23:37:12 UTC - in response to Message 7183.  

First investigations suggest that this is due to the condor jobs not matching VMs with more than one CPU, hence the familiar "no jobs" message. I'll need to get more specialists involved to find out when and why this changed -- it looks like it comes from the WMAgent side rather than CMS@home-dev itself.


Thank you very much. So far, 3.5 hours in, it's still working using a single CPU core; in the past, using 4 CPU cores, it would have errored out long before this. Fingers crossed it finishes within the estimated 9 hours remaining.
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,539
RAC: 1,940
Message 7187 - Posted: 16 May 2021, 23:57:21 UTC - in response to Message 7186.  
Last modified: 17 May 2021, 0:06:12 UTC

Yeah Mikey, the single-core tasks have always been working, and now you know why yours failed.
I have over 100 Valids in a row with single-core in just the last 4 days
(more than 200 this month).

They have worked fine for a few months, but this version is supposed to be for testing multi-core, which is why I let Ivan (and Laurence) know about this.

You can run as many single-cores as you like.
https://lhcathomedev.cern.ch/lhcathome-dev/results.php?userid=192
Mad Scientist For Life
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,539
RAC: 1,940
Message 7188 - Posted: 18 May 2021, 4:33:19 UTC

Hmmmm, well, we ran out of CMS and now I see we have a reload, so I wonder if the multi-core problem was fixed.
I took Sunday night off and was going to reload 24 of these tonight and get them all running when my high-speed starts at 2am.
Since I am anti-Invalids, I think I will just run 24 singles and see if we have any news tomorrow.
Mad Scientist For Life
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,539
RAC: 1,940
Message 7189 - Posted: 18 May 2021, 9:39:41 UTC

Well, it looks like the single-core tasks don't want to run now.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2969349

I didn't trust it, so I watched this one before starting the rest the usual way. I guess I will save the high-speed internet for later; 2:30am is as long as I need to spend on this tonight.

CMS can never be trusted... hundreds of Valids, and then if you let them run on their own you can get hundreds of Invalids.
Mad Scientist For Life
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,539
RAC: 1,940
Message 7190 - Posted: 18 May 2021, 18:56:29 UTC

I tried another one and they won't get beyond here.



That usually takes about 2 minutes.
Mad Scientist For Life
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,539
RAC: 1,940
Message 7192 - Posted: 19 May 2021, 11:29:03 UTC

Well, the mystery has gone away again and I got another 24 tasks to run as usual.
They are single-core, since I will still have to try a single multi-core task with 2 cores to see if that was fixed before I try running 12 of those.
Mad Scientist For Life
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,539
RAC: 1,940
Message 7193 - Posted: 25 May 2021, 3:29:24 UTC

It looks like we are having a CMS single-core problem again, so I stopped mine from getting more work. I see one of our hidden members is having the same problem and still running more; hopefully they will see this and suspend.

https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=4424

It could be that somebody got around to making these run multi-core CMS so we can actually test some of those, but we have had no update here about that since Ivan stopped by and sent them a message.

Maybe I will run some Theory (and check over at Sixtrack to see if they have any CMS problem).
Mad Scientist For Life
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,539
RAC: 1,940
Message 7194 - Posted: 25 May 2021, 23:44:36 UTC

Oy
Glad I only tried one new CMS


I have seen these before and they usually run for an hour and then crash.
......back to Theory
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,539
RAC: 1,940
Message 7195 - Posted: 26 May 2021, 17:22:47 UTC

It looks like single-core CMS is working again.

Multi-core is another question... I will give one another try.
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,539
RAC: 1,940
Message 7196 - Posted: 26 May 2021, 19:06:01 UTC - in response to Message 7195.  

Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,539
RAC: 1,940
Message 7198 - Posted: 27 May 2021, 15:16:43 UTC

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2970375

This may be just luck, but I had a single-core CMS task at about 5.5 hours when this laptop froze up, and I had to turn it off and back on to get it to work again. For some reason the task started running again just like they do when you first start them, but the log still showed the first 5.5 hours and all of the next 13+ hours (CPU time 9 hours 45 min).
So they must be working again, and I'll start up a new one.
maeax

Joined: 22 Apr 16
Posts: 660
Credit: 1,720,123
RAC: 3,061
Message 7199 - Posted: 27 May 2021, 20:03:37 UTC - in response to Message 7196.  

Multi-core tasks are still not running.

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2970119


I have tested with 2 cores; there are no subtasks for multi-core:
2021-05-27 21:28:04 (10896): Guest Log: 05/27/21 21:27:56 **** condor_startd (condor_STARTD) pid 10794 EXITING WITH STATUS 0
2021-05-27 21:28:04 (10896): Guest Log: [ERROR] No jobs were available to run.
2021-05-27 21:28:04 (10896): Guest Log: [INFO] Shutting Down.
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2970365
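
For anyone who wants to check a saved task log for this failure without reading it line by line, here is a minimal sketch. It assumes the log uses the same `timestamp (pid): Guest Log: message` layout as the three lines quoted above; the function names and the regular expression are illustrative only and are not part of BOINC or vboxwrapper.

```python
import re

# Assumed line layout, taken from the log excerpt quoted in this post:
#   2021-05-27 21:28:04 (10896): Guest Log: [ERROR] No jobs were available to run.
GUEST_LOG = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((?P<pid>\d+)\): "
    r"Guest Log: (?P<msg>.*)$"
)


def guest_messages(log_text):
    """Yield (timestamp, message) for every 'Guest Log:' line in the text."""
    for line in log_text.splitlines():
        match = GUEST_LOG.match(line.strip())
        if match:
            yield match.group("ts"), match.group("msg")


def found_no_jobs(log_text):
    """Return True if the guest reported that no jobs were available."""
    return any(
        "No jobs were available to run" in msg
        for _, msg in guest_messages(log_text)
    )


# The three lines from the post above, used as sample input.
SAMPLE = """\
2021-05-27 21:28:04 (10896): Guest Log: 05/27/21 21:27:56 **** condor_startd (condor_STARTD) pid 10794 EXITING WITH STATUS 0
2021-05-27 21:28:04 (10896): Guest Log: [ERROR] No jobs were available to run.
2021-05-27 21:28:04 (10896): Guest Log: [INFO] Shutting Down.
"""

if __name__ == "__main__":
    print(found_no_jobs(SAMPLE))
```

Paste the stderr text from a result page into a string (or read it from a file) and pass it to `found_no_jobs` to see whether the task hit the "no jobs" shutdown.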
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,539
RAC: 1,940
Message 7201 - Posted: 3 Jun 2021, 16:51:14 UTC

A couple of days ago I asked for the 2 tasks I run on this host and I got this:

6/2/2021 2:12:13 AM | lhcathome-dev | Scheduler request failed: Failure when receiving data from the peer
6/2/2021 2:12:14 AM | | Project communication failed: attempting access to reference site
6/2/2021 2:12:17 AM | | Internet access OK - project servers may be temporarily down.

But when I checked my account, it said I did get 2 tasks, even though nothing was sent to me.
So I tried again, and this time I did get 2, but now it says I have 4 tasks.
I wonder where they actually went, since after doing another update I get the usual 2 tasks but my account still says I have 4.
https://lhcathomedev.cern.ch/lhcathome-dev/results.php?userid=192

Maybe on the due date for those two, the 9th, they will disappear or give me "Timed out - no response".
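
As a rough aid for spotting failures like the one quoted above in a longer client event log, here is a small sketch. It assumes each line has the `date time | project | message` layout shown in this post, with a US-style month/day/year date and an English AM/PM marker; the helper names are made up for illustration and are not part of the BOINC client.

```python
from datetime import datetime


def parse_event_line(line):
    """Split one event-log line into (datetime, project, message).

    Returns None when the line does not have the assumed three fields.
    """
    parts = [part.strip() for part in line.split("|", 2)]
    if len(parts) != 3:
        return None
    stamp, project, message = parts
    # Assumed timestamp format, e.g. "6/2/2021 2:12:13 AM" (month/day/year).
    when = datetime.strptime(stamp, "%m/%d/%Y %I:%M:%S %p")
    return when, project, message


def failed_events(lines):
    """Return the parsed entries whose message mentions a failure."""
    parsed = (parse_event_line(line) for line in lines)
    return [entry for entry in parsed if entry and "failed" in entry[2].lower()]


# The three lines from the post above, used as sample input.
SAMPLE_LINES = [
    "6/2/2021 2:12:13 AM | lhcathome-dev | Scheduler request failed: Failure when receiving data from the peer",
    "6/2/2021 2:12:14 AM | | Project communication failed: attempting access to reference site",
    "6/2/2021 2:12:17 AM | | Internet access OK - project servers may be temporarily down.",
]

if __name__ == "__main__":
    for when, project, message in failed_events(SAMPLE_LINES):
        print(when.isoformat(), project or "(client)", message)
```

The empty project field on the second and third lines is kept as an empty string, which is how the client appears to mark messages that are not tied to a single project.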
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 738
Credit: 11,558,539
RAC: 1,940
Message 7202 - Posted: 21 Jun 2021, 10:45:21 UTC

11th year running VB tasks and still a pain in the ass



I can get 300 Valids in a row, and then they just run when they feel like it and are even nice enough to run 13 hours before crashing. And no, all of my computers didn't fail at the same time, my internet speed is over 30Mbps, and I even start up only one or two at a time.

And one started within 5 minutes of the other one

....almost 4am.....and this isn't a question