Message boards : News : No new jobs

Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,683
RAC: 53
Message 1994 - Posted: 13 Feb 2016, 21:32:26 UTC - in response to Message 1985.  

Yes we can! It is the standard 80/20 split. We put 20% of the effort in to get it 80% working and will spend 80% of the effort to get the final 20% working. The focus now should shift towards looking at the failures, prioritizing them, then dealing with them.
ID: 1994 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1995 - Posted: 13 Feb 2016, 21:40:24 UTC - in response to Message 1994.  
Last modified: 13 Feb 2016, 21:42:35 UTC

Please have a look at:
http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=100&postid=1993

Single hosts or volunteers causing a large number of failures need to be stopped/fixed.

If you need the IP, I can PM it to you.
ID: 1995 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2018 - Posted: 16 Feb 2016, 17:40:51 UTC

Just a reminder, we probably need a new batch within the next 12-15h.
ID: 2018 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,970,502
RAC: 2,962
Message 2025 - Posted: 16 Feb 2016, 20:03:36 UTC - in response to Message 2018.  

Just a reminder, we probably need a new batch within the next 12-15h.

Watching it.
ID: 2025 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1185
Credit: 850,198
RAC: 581
Message 2079 - Posted: 24 Feb 2016, 8:48:05 UTC

Pending 8193, Queued 0, and I'm not getting jobs.

Something wrong?
ID: 2079 · Rating: 0
Laurence
Project administrator
Project developer
Project tester
Joined: 12 Sep 14
Posts: 1067
Credit: 329,683
RAC: 53
Message 2080 - Posted: 24 Feb 2016, 9:25:46 UTC - in response to Message 2079.  

Possibly. I made a request yesterday to investigate why we had over 1K jobs running but only about 50 tasks. I guess someone is looking into it now.
ID: 2080 · Rating: 0
Crystal Pellet
Volunteer tester
Joined: 13 Feb 15
Posts: 1185
Credit: 850,198
RAC: 581
Message 2081 - Posted: 24 Feb 2016, 15:45:57 UTC - in response to Message 2080.  
Last modified: 24 Feb 2016, 16:44:28 UTC

I've a job running now from the wmagent queue: Batch riahi_TEST_VOLUNTEER_0224-T3_CH_VolunteerBackfill_160224_144611_7271
JobId 424438de-db04-11e5-b410-001dd8b71c94-11_0

No job.log and no Alt+F5 Console output.

Edit: This time returning the root-file (66MB) succeeded on the first attempt.
------ Overall job wallclock time 5969s
ID: 2081 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,970,502
RAC: 2,962
Message 2082 - Posted: 24 Feb 2016, 16:46:47 UTC

Sorry for the lack of news; I've had my head down since Monday with a cold or 'flu that's taking longer to clear than I anticipated. It seems the necessary security updates for the recent glibc problem have had some side-effects. I'm not sure how long it will take to clear, as it's not the best time of day to try to re-establish communications with the team. :-(
ID: 2082 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2083 - Posted: 24 Feb 2016, 16:52:59 UTC

Glad to hear that you are better now.

It seems riahi has his WMAgent job working.
It is producing valid results now.
ID: 2083 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,970,502
RAC: 2,962
Message 2084 - Posted: 24 Feb 2016, 18:47:08 UTC - in response to Message 2083.  

Glad to hear that you are better now.

It seems riahi has his WMAgent job working.
It is producing valid results now.

Oh, that is good news!
ID: 2084 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2085 - Posted: 25 Feb 2016, 19:19:19 UTC

Anything happening?
No jobs running.
No Info.
No nothing.
ID: 2085 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 757
Credit: 11,760,036
RAC: 3,674
Message 2086 - Posted: 25 Feb 2016, 21:40:35 UTC - in response to Message 2085.  
Last modified: 25 Feb 2016, 21:42:35 UTC

Anything happening?
No jobs running.
No Info.
No nothing.


I just sent one back and got a new one with no problem and the server status says 101 to send.

(only running one right now because.......)

About 43 minutes ago.

http://boincai05.cern.ch/CMS-dev/result.php?resultid=114848
Mad Scientist For Life
ID: 2086 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2087 - Posted: 25 Feb 2016, 23:35:19 UTC - in response to Message 2086.  

Thanks, Magic.
Nice finished task with absolutely no work done.
ID: 2087 · Rating: 0
m
Volunteer tester
Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 45
Message 2088 - Posted: 26 Feb 2016, 1:42:55 UTC - in response to Message 2087.  

...task with absolutely no work done.

Same here. Glidein keeps trying but is failing to load various files
(different hosts, different attempts, different files) from some server
at RAL. Goes to sleep, tries again etc.

For example:-
Fri Feb 26 01:24:59 GMT 2016 Failed to load file 'condor_config.dedicated_starter.ebsdRw.include' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7'.
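For anyone wanting to tally which files fail most often across hosts and attempts, here is a rough Python sketch. The regular expression is modelled only on the single log line quoted above, so treating other glidein messages as having the same layout is an assumption, and the helper names `parse_failure` and `count_failures_by_file` are made up for illustration.

```python
import re
from collections import Counter

# Pattern based on the one quoted log line; other glidein messages
# may be formatted differently (assumption).
FAILED_LOAD = re.compile(
    r"^(?P<when>.+?) Failed to load file '(?P<file>[^']+)' "
    r"from '(?P<url>[^']+)'\.?$"
)


def parse_failure(line):
    """Return a dict with 'when', 'file' and 'url', or None if the line does not match."""
    match = FAILED_LOAD.match(line.strip())
    return match.groupdict() if match else None


def count_failures_by_file(lines):
    """Count matching 'Failed to load file' lines per file name."""
    counts = Counter()
    for line in lines:
        parsed = parse_failure(line)
        if parsed is not None:
            counts[parsed["file"]] += 1
    return counts


sample = (
    "Fri Feb 26 01:24:59 GMT 2016 Failed to load file "
    "'condor_config.dedicated_starter.ebsdRw.include' from "
    "'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7'."
)
print(parse_failure(sample))
```

Feeding the lines of a saved log to `count_failures_by_file` would give a per-file count, which might help show whether the failures cluster on particular files or are spread evenly.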

Set NNW on everything. To bed.
ID: 2088 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 757
Credit: 11,760,036
RAC: 3,674
Message 2089 - Posted: 26 Feb 2016, 5:17:37 UTC
Last modified: 26 Feb 2016, 5:32:01 UTC

..........
ID: 2089 · Rating: 0
Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 757
Credit: 11,760,036
RAC: 3,674
Message 2090 - Posted: 26 Feb 2016, 5:20:45 UTC

Well, I tend NOT to sit and watch these run, but I did check a couple of times and they were running CMS glidein runs, ending one and starting another; the last one I saw was 21.

One running right now is at run CMS application 8 after glidein 7 ended.

Pretty much why I decided to just come back here and run only one host.

Since the beginning all I have heard here was other members complaining about what MY host is doing.

And just imagine how much I care about that.

Glad I have been with Einstein for 11 years since that never happens there.

It seems to just be part of ALL of the VB projects......even makes old Sixtrack look good.



But then the CMS Server is here in my house
Mad Scientist For Life
ID: 2090 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2349 - Posted: 11 Mar 2016, 18:58:33 UTC

We are going to need a new batch, soon.

What is it going to be? 2500 jobs?
ID: 2349 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,970,502
RAC: 2,962
Message 2352 - Posted: 11 Mar 2016, 22:05:46 UTC - in response to Message 2349.  

We are going to need a new batch, soon.

What is it going to be? 2500 jobs?

I'm watching it. Hopefully we get down to ~100 during civil hours, otherwise I'll have to jump the gun. Yes, we've got fewer hosts participating at the moment and we need to get some statistics on stable populations for the suspend/resume study, so 2500 sounds about right. Maybe 2000, but that increases my workload. :-)
ID: 2352 · Rating: 0
Rasputin42
Volunteer tester
Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 2354 - Posted: 11 Mar 2016, 23:13:49 UTC - in response to Message 2352.  
Last modified: 11 Mar 2016, 23:16:34 UTC

Thanks, Ivan.
So, do we want to declare the new batch a "clean run"?
Just to get some baseline figures on the errors.

That would mean no fiddling like aborting, detaching, suspending, restarting etc.
Unless it is absolutely necessary.
(Will be hardest for me, but it's for science.)

EDIT: And the same applies to server settings, fingers off!
ID: 2354 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,970,502
RAC: 2,962
Message 2358 - Posted: 11 Mar 2016, 23:52:11 UTC - in response to Message 2354.  

Thanks, Ivan.
So, do we want to declare the new batch a "clean run"?
Just to get some baseline figures on the errors.

That would mean no fiddling like aborting, detaching, suspending, restarting etc.
Unless it is absolutely necessary.
(Will be hardest for me, but it's for science.)

EDIT: And the same applies to server settings, fingers off!

Yeah, I'd hope so. Let's get a baseline and go on from there. More reason to go for 2,000 over 2,500 I guess; sqrt(N) isn't too different.
Oh, and if there's something I can change with my Condor modification script we can be sure it applies to only one batch at a time. :-)
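
To put rough numbers on the sqrt(N) remark, here is a small Python sketch. It assumes simple counting statistics, where the relative uncertainty on a count of N jobs scales as 1/sqrt(N); the thread does not say which statistic is meant, so that reading is an assumption, and the function name `relative_uncertainty` is made up for illustration.

```python
import math


def relative_uncertainty(n_jobs):
    """Relative statistical uncertainty 1/sqrt(N), assuming simple counting statistics."""
    return 1.0 / math.sqrt(n_jobs)


for n in (2000, 2500):
    print(f"N={n}: sqrt(N)={math.sqrt(n):.1f}, 1/sqrt(N)={relative_uncertainty(n):.2%}")

# Prints:
# N=2000: sqrt(N)=44.7, 1/sqrt(N)=2.24%
# N=2500: sqrt(N)=50.0, 1/sqrt(N)=2.00%
```

Under that assumption the smaller batch costs only about a quarter of a percentage point in relative precision, which fits the "isn't too different" comment.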
ID: 2358 · Rating: 0


©2024 CERN