Message boards : CMS Application : New Version 50.00
Author | Message |
---|---|
Joined: 20 Jan 15 Posts: 1129 Credit: 7,901,648 RAC: 2,120 |
First investigations suggest that this is due to the condor jobs not matching VMs with more than one CPU, hence the familiar "no jobs" message. I'll need to get more specialists involved to find out when and why this changed; it looks like it comes from the WMAgent side rather than CMS@home-dev itself. Ah, it's the infamous Ascension Day long weekend at CERN (it always catches me out), so I don't expect much response from anyone until Monday at the earliest. |
Joined: 8 Apr 15 Posts: 751 Credit: 11,614,380 RAC: 1,286 |
Thanks for the update, Ivan. I will just keep running the single-core CMS here until I can try multi-core again. (I also run a few over at the other site.) Mad Scientist For Life |
Joined: 18 Sep 16 Posts: 17 Credit: 707,698 RAC: 352 |
"First investigations suggest that this is due to the condor jobs not matching VMs with more than one CPU, hence the familiar 'no jobs' message. I'll need to get more specialists involved to find out when and why this changed; it looks like it comes from the WMAgent side rather than CMS@home-dev itself." Thank you very much. So far, 3.5 hours in, it's still working using a single CPU core; in the past, using 4 CPU cores, it would have errored out long before this. Fingers crossed it finishes within the next guesstimate of 9 hours. |
Joined: 8 Apr 15 Posts: 751 Credit: 11,614,380 RAC: 1,286 |
Yeah Mikey, the single-cores have always been working, and now you know why yours failed. I have over 100 Valids in a row with single-core in just the last 4 days (more than 200 this month). They have worked fine for a few months, but this version is supposed to be for testing multi-core, which is why I let Ivan (and Laurence) know about this. You can run as many single-cores as you like. https://lhcathomedev.cern.ch/lhcathome-dev/results.php?userid=192 Mad Scientist For Life |
Joined: 8 Apr 15 Posts: 751 Credit: 11,614,380 RAC: 1,286 |
Hmmmm, well, we ran out of CMS and now I see we have a reload, so I wonder if the multi-core problem was fixed. I took Sunday night off and was going to reload 24 of these tonight and get them all running when my high-speed starts at 2am. Since I am anti-Invalids, I think I will just run 24 singles and see if we have any news tomorrow. Mad Scientist For Life |
Joined: 8 Apr 15 Posts: 751 Credit: 11,614,380 RAC: 1,286 |
Well, it looks like the single-cores don't want to run now. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2969349 I didn't trust it, so I watched this one before starting the rest the usual way, and I guess I'll even save the high-speed internet for later; 2:30am is as long as I need to do this tonight. CMS can never be trusted... hundreds of Valids, and then if you let them run on their own you can get hundreds of Invalids. Mad Scientist For Life |
Joined: 8 Apr 15 Posts: 751 Credit: 11,614,380 RAC: 1,286 |
I tried another one and they won't get beyond here. That usually takes about 2 minutes. Mad Scientist For Life |
Joined: 8 Apr 15 Posts: 751 Credit: 11,614,380 RAC: 1,286 |
Well, the mystery goes away again and I got another 24 tasks to run as usual. Single-cores, since I will still have to try a single multi-core task with 2 cores to see if that was fixed before I try running 12 of those. Mad Scientist For Life |
Joined: 8 Apr 15 Posts: 751 Credit: 11,614,380 RAC: 1,286 |
It looks like we are having a CMS single-core problem again, so I stopped mine from getting more work. I see one of our hidden members is having the same problem and still running more; hopefully they will see this and suspend. https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=4424 It could be that somebody got around to making these run multi-core CMS so we can actually test some of those, but we have had no update here about that since Ivan stopped by and sent them a message. Maybe I will run some Theory (and check over at Sixtrack to see if they have any CMS problem). Mad Scientist For Life |
Joined: 8 Apr 15 Posts: 751 Credit: 11,614,380 RAC: 1,286 |
Oy. Glad I only tried one new CMS. I have seen these before, and they usually run for an hour and then crash... back to Theory. |
Joined: 8 Apr 15 Posts: 751 Credit: 11,614,380 RAC: 1,286 |
It looks like single-core CMS is working again. Multi-core is another question... I will give one another try. |
Joined: 8 Apr 15 Posts: 751 Credit: 11,614,380 RAC: 1,286 |
Multi-core tasks are still not running. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2970119 |
Joined: 8 Apr 15 Posts: 751 Credit: 11,614,380 RAC: 1,286 |
https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2970375 This may be just luck, but I had a single-core CMS at about 5.5 hours when this laptop froze up, and I had to turn it off and back on to get it to work again. For some reason the task started running again just like they do when you first start them, but the log still showed the first 5.5 hours and all of the next 13+ hours (CPU time 9 hours 45 min). So they must be working again, and I'll start up a new one. |
Joined: 22 Apr 16 Posts: 671 Credit: 1,882,462 RAC: 7,096 |
"Multi-core are still not running." I have tested with 2 cores; no subtasks for multi-core: 2021-05-27 21:28:04 (10896): Guest Log: 05/27/21 21:27:56 **** condor_startd (condor_STARTD) pid 10794 EXITING WITH STATUS 0 2021-05-27 21:28:04 (10896): Guest Log: [ERROR] No jobs were available to run. 2021-05-27 21:28:04 (10896): Guest Log: [INFO] Shutting Down. https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2970365 |
Joined: 8 Apr 15 Posts: 751 Credit: 11,614,380 RAC: 1,286 |
A couple of days ago I asked for the 2 tasks I run on this host and I got this: 6/2/2021 2:12:13 AM | lhcathome-dev | Scheduler request failed: Failure when receiving data from the peer 6/2/2021 2:12:14 AM | | Project communication failed: attempting access to reference site 6/2/2021 2:12:17 AM | | Internet access OK - project servers may be temporarily down. But when I checked my account it said I did get 2 tasks, although nothing was sent to me. So I tried again, and this time I did get 2, but now it says I have 4 tasks. I wonder where they actually went, since after doing another update I get the usual 2 tasks but my account still says I have 4. https://lhcathomedev.cern.ch/lhcathome-dev/results.php?userid=192 Maybe on the due date for those two, the 9th, they will disappear or give me "Timed out - no response". |
Joined: 8 Apr 15 Posts: 751 Credit: 11,614,380 RAC: 1,286 |
11th year running VB tasks and they are still a pain in the ass. I can get 300 Valids in a row, and then they just run when they feel like it, and are even nice enough to run 13 hours before crashing. And no, all of my computers didn't fail at the same time, my internet speed is 30 Mbps or more, and I even start up only one or two at a time. And one started within 5 minutes of the other one... almost 4am... and this isn't a question. |
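
For anyone who wants to check a batch of task logs for the "no jobs" symptom reported in this thread rather than opening each result page by hand, here is a minimal sketch. It assumes the stderr text has already been saved locally and that every guest line follows the `<date> <time> (<pid>): Guest Log: <message>` shape seen in the lines quoted above; the function name `find_no_jobs` and the sample text are made up for illustration and are not part of BOINC or the CMS application.

```python
import re
from datetime import datetime

# Line shape assumed from the log excerpt quoted in this thread:
#   "<YYYY-MM-DD HH:MM:SS> (<pid>): Guest Log: <message>"
GUEST_LINE = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \((\d+)\): Guest Log: (.*)"
)

# Exact message text as it appears in the quoted log.
NO_JOBS = "[ERROR] No jobs were available to run."


def find_no_jobs(log_text):
    """Return the timestamps of every 'no jobs' guest-log line (hypothetical helper)."""
    hits = []
    for line in log_text.splitlines():
        match = GUEST_LINE.match(line.strip())
        if match and match.group(3).strip() == NO_JOBS:
            hits.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S"))
    return hits


# Sample built from the three log lines posted above.
sample = (
    "2021-05-27 21:28:04 (10896): Guest Log: 05/27/21 21:27:56 **** "
    "condor_startd (condor_STARTD) pid 10794 EXITING WITH STATUS 0\n"
    "2021-05-27 21:28:04 (10896): Guest Log: [ERROR] No jobs were available to run.\n"
    "2021-05-27 21:28:04 (10896): Guest Log: [INFO] Shutting Down.\n"
)

for stamp in find_no_jobs(sample):
    print("no jobs reported at", stamp.isoformat(sep=" "))
```

An empty result only means the exact message was not found; a task can still fail for other reasons, so this is a quick filter and not a diagnosis.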