Message boards : News : No new jobs
Author | Message |
---|---|
Joined: 12 Sep 14 Posts: 1071 Credit: 334,951 RAC: 3 | Yes we can! It is the standard 80/20 split: we put in 20% of the effort to get it 80% working, and will spend the other 80% of the effort getting the final 20% working. The focus should now shift to looking at the failures, prioritizing them, and then dealing with them. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 | Please have a look at: http://boincai05.cern.ch/CMS-dev/forum_thread.php?id=100&postid=1993 Single hosts or volunteers causing a large number of failures need to be stopped or fixed. If you need the IP, I can PM it to you. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 | Just a reminder, we probably need a new batch within the next 12-15h. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 | "Just a reminder, we probably need a new batch within the next 12-15h." Watching it. |
Joined: 13 Feb 15 Posts: 1188 Credit: 878,593 RAC: 33 | Pending is 8193 and Queued is 0, and I'm not getting jobs. Is something wrong? |
Joined: 12 Sep 14 Posts: 1071 Credit: 334,951 RAC: 3 | Possibly. I made a request yesterday to investigate why we had over 1K jobs running but only about 50 tasks. I guess someone is looking into it now. |
Joined: 13 Feb 15 Posts: 1188 Credit: 878,593 RAC: 33 | I have a job running now from the WMAgent queue. Batch: riahi_TEST_VOLUNTEER_0224-T3_CH_VolunteerBackfill_160224_144611_7271. JobId: 424438de-db04-11e5-b410-001dd8b71c94-11_0. There is no job.log and no Alt+F5 console output. Edit: This time, returning the root-file (66 MB) succeeded in 1 attempt. Overall job wallclock time: 5969 s. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 | Sorry for the lack of news; I've had my head down since Monday with a cold or 'flu that's taking longer to clear than I anticipated. It seems the necessary security updates for the recent glibc problem have had some side-effects. I'm not sure how long it will take to clear; it's not the best time of day to try to re-establish communications with the team. :-( |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 | Glad to hear that you are better now. It seems riahi has his WMAgent job working and producing valid results now. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 | "Glad to hear, that you are better, now." Oh, that is good news! |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 | Anything happening? No jobs running. No info. No nothing. |
Joined: 8 Apr 15 Posts: 789 Credit: 13,093,014 RAC: 12,207 | "Anything happening?" I just sent one back and got a new one with no problem, and the server status says 101 to send. (I'm only running one right now because...) That was about 43 minutes ago: http://boincai05.cern.ch/CMS-dev/result.php?resultid=114848 |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 | Thanks, Magic. Nice finished task with absolutely no work done. |
Joined: 20 Mar 15 Posts: 243 Credit: 886,442 RAC: 0 | "...task with absolutely no work done." Same here. Glidein keeps trying but fails to load various files (different hosts, different attempts, different files) from some server at RAL. It goes to sleep, tries again, etc. For example: Fri Feb 26 01:24:59 GMT 2016 Failed to load file 'condor_config.dedicated_starter.ebsdRw.include' from 'http://lcggwms01.gridpp.rl.ac.uk:8319/factory/stage/glidein_v3_2_7'. I've set NNW on everything. To bed. |
Joined: 8 Apr 15 Posts: 789 Credit: 13,093,014 RAC: 12,207 | .......... |
Joined: 8 Apr 15 Posts: 789 Credit: 13,093,014 RAC: 12,207 | Well, I tend NOT to sit and watch these run, but I did check a couple of times and they were running CMS glidein runs, ending and starting another; the last one I saw was 21. The one running right now is at run CMS application 8 after glidein 7 ended. That's pretty much why I decided to just come back here and run only one host. Since the beginning, all I have heard here is other members complaining about what MY host is doing. And just imagine how much I care about that. I'm glad I have been with Einstein for 11 years, since that never happens there. It seems to just be part of ALL of the VB projects... it even makes old Sixtrack look good. But then, the CMS Server is here in my house. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 | We are going to need a new batch soon. What is it going to be? 2500 jobs? |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 | "We are going to need a new batch, soon." I'm watching it. Hopefully we get down to ~100 during civil hours; otherwise I'll have to jump the gun. Yes, we've got fewer hosts participating at the moment and we need to get some statistics on stable populations for the suspend/resume study, so 2500 sounds about right. Maybe 2000, but that increases my workload. :-) |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 | Thanks, Ivan. So, do we want to declare the new batch a "clean run", just to get some baseline figures on the errors? That would mean no fiddling like aborting, detaching, suspending, restarting, etc., unless it is absolutely necessary. (It will be hardest for me, but it's for science.) EDIT: The same applies to server settings: fingers off! |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 0 | "Thanks, Ivan." Yeah, I'd hope so. Let's get a baseline and go on from there. More reason to go for 2,000 over 2,500, I guess; sqrt(N) isn't too different. Oh, and if there's something I can change with my Condor modification script, we can be sure it applies to only one batch at a time. :-) |