Message boards : CMS Application : New version 49.00
emoga

Joined: 26 Apr 15
Posts: 6
Credit: 10,043,919
RAC: 0
Message 6588 - Posted: 1 Sep 2019, 1:17:52 UTC

Although the stderr claims it's a '1 core' task, it's actually a 32-core task (or the max core count of whatever computer I'm running, with the 'Unlimited' setting in the preferences) that was changed to 1 core via app_config. I like to squeeze in some Theory Native tasks, and since the 32-core (or 16-core) CMS task won't actually use all the cores all the time, this lets me juggle both sub-projects.

I actually think this may have fixed my problems, as I was running too many single CMS work units at a time and perhaps my network was overloaded (no local proxy yet).
Or maybe something was changed on the server side? Still getting '1 (0x00000001) Unknown error code' on my larger 72-core systems, so everything isn't perfect yet.

Hope it works out for you Magic,
I would hate for you to stay up till 2 am just to get invalids.


Cheers
ID: 6588 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6589 - Posted: 2 Sep 2019, 2:19:10 UTC - in response to Message 6588.  

Thanks neighbor from the north

Yeah I don't have any machines here with more than 8 cores or 24GB ram (and Hughes satellite isp) so it can start and then end up here once in a while

Double
ID: 6589 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6606 - Posted: 11 Sep 2019, 6:12:34 UTC

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2819586

I have seen these EXIT_NO_SUB_TASKS before but the rest of this stderr is one I have never seen before.
It looks more like the Linux version I have seen others get but not on a Windows OS task.

The one just before this was the typical error: Unknown error code, with [ERROR] Condor ended after 686 seconds.
I have another task on this host (24GB RAM, running 3-core CMS tasks) that I may start later tonight to see if it actually runs.
ID: 6606 · Rating: 0
computezrmle
Volunteer moderator
Project tester
Volunteer developer
Volunteer tester
Help desk expert
Joined: 28 Jul 16
Posts: 467
Credit: 389,411
RAC: 503
Message 6607 - Posted: 11 Sep 2019, 6:43:16 UTC - in response to Message 6606.  

Everything that is marked as "Guest Log:" is a copy from a logfile inside the VM.
That is why it has a Linux flavour.

Most lines are taken from the VM's MasterLog:
2019-09-10 22:32:53 (4144): Guest Log: Here is the MasterLog
.
.
.


Either "207 (0x000000CF) EXIT_NO_SUB_TASKS" or "1 (0x00000001) Unknown error code" is reported when the Condor queue is dry, but it's unclear to me why it's sometimes the first and sometimes the second error message.
It might be (just a guess) that both are reported but the error handler forwards only the one that arrives first.

Nonetheless it's not an error on the client side.
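
For anyone sorting through a lot of results, the exit status strings quoted above can be split into their parts with a few lines of Python. This is only a sketch: the names `ExitStatus`, `parse_exit_status` and `may_be_dry_queue` are made up for illustration, and flagging both statuses as a possible dry queue just restates the guess above, not documented project behaviour.

```python
import re
from typing import NamedTuple


class ExitStatus(NamedTuple):
    code: int       # signed decimal value, e.g. 207
    hex_text: str   # the hex form exactly as printed, e.g. "0x000000CF"
    label: str      # trailing text, e.g. "EXIT_NO_SUB_TASKS"


# Matches strings such as "207 (0x000000CF) EXIT_NO_SUB_TASKS".
_STATUS_RE = re.compile(r"(-?\d+)\s+\((0x[0-9A-Fa-f]{8})\)\s+(.+)")


def parse_exit_status(text: str) -> ExitStatus:
    """Split an exit status string into code, hex text and label."""
    match = _STATUS_RE.search(text)
    if match is None:
        raise ValueError(f"no exit status found in {text!r}")
    code = int(match.group(1))
    hex_text = match.group(2)
    # The hex form is the same value viewed as an unsigned 32-bit number.
    if (code & 0xFFFFFFFF) != int(hex_text, 16):
        raise ValueError(f"decimal and hex disagree in {text!r}")
    return ExitStatus(code, hex_text, match.group(3).strip())


def may_be_dry_queue(status: ExitStatus) -> bool:
    """Assumption taken from this thread: either label MAY mean an empty queue."""
    return status.label in {"EXIT_NO_SUB_TASKS", "Unknown error code"}


for line in ("207 (0x000000CF) EXIT_NO_SUB_TASKS",
             "1 (0x00000001) Unknown error code"):
    status = parse_exit_status(line)
    print(status.code, status.label, may_be_dry_queue(status))
```

A match here only says the result is worth a look on the server side; "Unknown error code" can of course have other causes.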
ID: 6607 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6635 - Posted: 17 Sep 2019, 6:53:50 UTC
Last modified: 17 Sep 2019, 6:55:07 UTC

Well, after finally getting 12 Valids in a row here, I was going to ask Ivan for more of these (without all of them going to those PCs with *no limit* set).

BUT since I was here checking the server, I see that today I did have an *unknown error* task after 5 hours
(VM Completion Message: Condor ended after 17938 seconds) again.
And then another typical 27-minute one, VM Completion Message: Condor ended after 681 seconds.

(sounds like my luck at the bowling alley today)

This time I know it had nothing to do with my ISP speed, since it is only day 3 of my new month (and I have been running mostly Sixtracks with the ethernet unplugged to see how long I can make the high speed last).

If I get all of them finished, I was going to get them back here to run CMS-dev tasks, since it seemed they finally were dependable... still haven't run any over at LHC.
(I'm not asking any questions, just want another CMS for this PC since it was getting Valids every day)

Any of you make the trip to Cern this weekend (ok that is a question)
ID: 6635 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6722 - Posted: 1 Oct 2019, 0:59:57 UTC

Grrrrrrrr

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2827065

This is how a laptop can waste 5.5 hours: the battery goes dead enough to shut it down, and VB likes to crash unless it is suspended first.

The damn jack where the AC/DC power plugs in to charge the battery is the only part of this laptop that was close to pitiful (the rest, as far as CPU, RAM and GPU go, were pretty good considering it has run 24/7 since June 2012).
I even took this laptop apart before, and the way that jack is installed is terrible and it is put in the wrong place, so now if the plug is not in at the right angle the thing stops charging and the battery goes dead in mere minutes (and it is new)... so time to start up a new pair of CMS X2 tasks and hope it doesn't happen again.
ID: 6722 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6754 - Posted: 11 Oct 2019, 9:37:18 UTC
Last modified: 11 Oct 2019, 9:38:23 UTC

EXIT_NO_SUB_TASKS

We are all having a problem with the CMS server once again, and I just happened to be on to catch it happening (as usual).
So we need you to send them a message for us, Ivan, and we all probably need to suspend our CMS tasks so we don't all get a few pages of Invalids.

(Or Laurence)

I see it on all of Ivan's and also on the host wHewitt has running these... and mine of course.

They had been working perfectly for a while, but I still have to watch closely since I never trust a server.
Mad Scientist For Life
ID: 6754 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6757 - Posted: 12 Oct 2019, 3:48:16 UTC

OK it looks like we are up and running CMS again.

I have a few running and one has 3 hours running time so far.
ID: 6757 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6759 - Posted: 12 Oct 2019, 13:30:24 UTC
Last modified: 12 Oct 2019, 13:31:58 UTC

I guess not up and running.

I got one to run a Valid task but since then .....

EXIT_INIT_FAILURE and a couple [ERROR] Condor ended after 720 seconds.

And a [ERROR] Condor ended after 30008 seconds. after running 8 hours 38 min 42 sec

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2829582

Back to Suspension

(Ivan is also having no luck with his)
ID: 6759 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6760 - Posted: 13 Oct 2019, 7:30:39 UTC

OK we are officially back to work here again.
ID: 6760 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6761 - Posted: 13 Oct 2019, 22:13:42 UTC

I decided to add another pc to CMS here so I did the 12 hour+ d/l of that vdi and then tried to d/l a couple tasks BUT it refused to d/l a task several times and since I am anti-invalids and errors I sent them back and switched it back to just Sixtracks (and Einstein GPU's since they always work)

It would start saying it was d/ling on the *Tasks* page but on the *Transfers* page it was blank every time.

Exit status -186 (0xFFFFFF46) ERR_RESULT_DOWNLOAD

<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
app_version download error: couldn't get input files:
<file_xfer_error>
<file_name>CMS_2019_03_25.vdi</file_name>
<error_code>-105 (fwrite() failed)</error_code>
</file_xfer_error>
</message>
]]>
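
The error block above can be picked apart mechanically if you want to collect these across several failed results. Below is a rough Python sketch; `parse_xfer_errors` is a hypothetical helper name, and it uses a regular expression instead of an XML parser because the pasted stderr is a fragment, not a complete XML document. The message text suggests a local write failed while saving the file, but nothing in this thread confirms the actual cause.

```python
import re

# One <file_xfer_error> entry: file name, numeric code, and the text in parentheses.
_XFER_RE = re.compile(
    r"<file_xfer_error>\s*"
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)\s*\((?P<reason>[^<]*)\)</error_code>"
)


def parse_xfer_errors(stderr_text: str) -> list[tuple[str, int, str]]:
    """Return (file name, error code, reason text) for each transfer error found."""
    return [
        (m["name"].strip(), int(m["code"]), m["reason"])
        for m in _XFER_RE.finditer(stderr_text)
    ]


# Sample copied from the stderr output quoted in this post.
sample = """<core_client_version>7.14.2</core_client_version>
<![CDATA[
<message>
app_version download error: couldn't get input files:
<file_xfer_error>
<file_name>CMS_2019_03_25.vdi</file_name>
<error_code>-105 (fwrite() failed)</error_code>
</file_xfer_error>
</message>
]]>"""

for name, code, reason in parse_xfer_errors(sample):
    print(f"{name}: {code} ({reason})")
```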

When I get a chance I will look at that vdi file and see what or where it says that is......since I watched it d/l that vdi
ID: 6761 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6766 - Posted: 19 Oct 2019, 21:19:52 UTC
Last modified: 19 Oct 2019, 21:30:07 UTC

Not this again????




I decided to start running these again on my fastest pc with the most ram along with some Sixtracks and now I get this happening again with the CMS

Running this startup for 40 minutes so far and this is as far as it got.

This was fixed back in May as I remember, and I sure hope the other 2 hosts I have running here don't do this again when they finish their running tasks.
(which means I will have to start up new ones after 2am so I have the fastest ISP speed possible)

I guess I will add this to show that it happens on a Windows OS even if you live in London
https://lhcathomedev.cern.ch/lhcathome-dev/results.php?hostid=3907

Of course it is Saturday and only certain people work 24/7 for 15 years non-stop for CERN...
ID: 6766 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6767 - Posted: 19 Oct 2019, 23:05:05 UTC - in response to Message 6766.  
Last modified: 19 Oct 2019, 23:52:40 UTC

Guess Ivan got that EXIT_NO_SUB_TASKS again.

I got the Unknown error code... started another one, so I won't know for another 30 mins if this will get back to normal, but I still haven't tried that other host yet (doing 15 things at the same time).

Edit: trying a new one on that host again. It was running these 8 days ago, but that is when we had a problem that was taken care of 2 days later, so we will see if it was just an update that it didn't get on the 13th. So far it is running the RDC normally this time.
It made it to HTCondor ping 0 in just over 12 mins, so that should work...

YEP that was what happened

ID: 6767 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6797 - Posted: 1 Nov 2019, 20:19:21 UTC

For some reason, for the last few hours the CERN server and my satellite have been running at the speed of a snail up a cactus.

The tasks that have been running for hours are OK, but the host I am trying to start a new task on is not getting very far (it does have 109 Valids).

Usually over the years, if I caught this happening I would just abort it and start a new one, but it failed twice, along with these other 2 that ended up as 194 (0x000000C2) EXIT_ABORTED_BY_CLIENT, which means the Server aborted it; if it is aborted by me it is 203 (0x000000CB) EXIT_ABORTED_VIA_GUI.

They started with the typical *VM Heartbeat file specified, but missing file system status* but then came all these other errors which are not typical: Watcher ERROR [COM]: aRC=E_ACCESSDENIED and all the rest.

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2833902

As I am typing this, I decided to let one that started as Failed run to see what the server does this time, but I'm not allowing a new task just in case... (I don't trust my Hughes satellite dish any farther than I can throw it)



(and I just lost one that had run close to 5 hours on another host)

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2833812

As usual, when I get a long list of Valids with these, then this happens, which means they have to be watched like little baby CMS's. It even takes several minutes just to log in here and make it this far.
ID: 6797 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6809 - Posted: 9 Nov 2019, 22:08:08 UTC
Last modified: 9 Nov 2019, 22:17:10 UTC



It was getting close to 30 minutes running so I suspended it and restarted the task and seconds later......

ID: 6809 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6843 - Posted: 22 Nov 2019, 1:38:33 UTC
Last modified: 22 Nov 2019, 2:36:17 UTC

ID: 6843 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6844 - Posted: 25 Nov 2019, 2:30:44 UTC

Well, I got 3 Valids out of the 7 I tried last night, but that is better than nothing.
Maybe by Monday night we will have the jobs back to work here and over at LHC.

(maybe I will use CMS as my name at the bowling alley in the morning and see if that works)
ID: 6844 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6938 - Posted: 13 Jan 2020, 9:09:12 UTC
Last modified: 13 Jan 2020, 10:05:42 UTC

1/13/2020 1:03:21 AM | lhcathome-dev | Tasks won't finish in time: BOINC runs 99.9% of the time; computation is enabled 100.0% of that

Something I would remove from the server.

Just wasted one hour trying to get new tasks here since my high-speed for the month started one hour ago and instead of some CMS I get this. (I have ZERO other tasks loaded other than GPU Einsteins that have nothing to do with this)

every cuss word you can think of
It sure would be nice if we could EVER trust these to run and I'm willing to bet when this happens over at the public LHC site it will convince them to go back to a program that doesn't fail and run the same version over and over.



2am and they either do that or (excuse me while I switch over to another PC running these)


[ERROR] Condor ended after 663 seconds.

2020-01-13 01:50:10 (10388): Guest Log: [INFO] Shutting Down.

2020-01-13 01:50:10 (10388): VM Completion File Detected.
2020-01-13 01:50:10 (10388): VM Completion Message: Condor ended after 663 seconds.
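
The Condor runtimes in completion messages like the ones above are easier to compare once converted to hours and minutes. Here is a small Python sketch for that; `condor_runtimes` and `is_short_run` are illustrative names, and the 15-minute cutoff is purely an assumption chosen to separate the roughly 11-minute failures from longer runs. Runtime alone does not say whether a task was valid.

```python
import re
from datetime import timedelta

# Only the "VM Completion Message" line is matched, so the "[ERROR]" line
# that repeats the same number is not counted twice.
_COMPLETION_RE = re.compile(
    r"VM Completion Message: Condor ended after (\d+) seconds"
)


def condor_runtimes(log_text: str) -> list[timedelta]:
    """Return the Condor runtime of every completion message in the log."""
    return [timedelta(seconds=int(s)) for s in _COMPLETION_RE.findall(log_text)]


def is_short_run(runtime: timedelta,
                 cutoff: timedelta = timedelta(minutes=15)) -> bool:
    """Assumed rule of thumb: a run under the cutoff probably never got a job."""
    return runtime < cutoff


# Sample copied from the log lines quoted in this post.
log = """[ERROR] Condor ended after 663 seconds.
2020-01-13 01:50:10 (10388): Guest Log: [INFO] Shutting Down.
2020-01-13 01:50:10 (10388): VM Completion File Detected.
2020-01-13 01:50:10 (10388): VM Completion Message: Condor ended after 663 seconds.
"""

for runtime in condor_runtimes(log):
    print(runtime, "short" if is_short_run(runtime) else "long")
```

For the log above this gives a runtime of 0:11:03, flagged as short.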


2am so that is enough headache for even me so I will just save this high-speed until tomorrow
ID: 6938 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 734
Credit: 11,558,055
RAC: 2,030
Message 6970 - Posted: 29 Jan 2020, 19:45:58 UTC

https://lhcathomedev.cern.ch/lhcathome-dev/result.php?resultid=2862841

This is what happens if one of these VB tasks is running and then you get a page full of Sixtracks: the VB task is suspended by BOINC, and when it is time to restart that VB task it just crashes and goes to the Error file.
ID: 6970 · Rating: 0