Message boards : Number crunching : issue of the day

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Message 2353 - Posted: 11 Mar 2016, 22:10:51 UTC - in response to Message 2343.  
Last modified: 11 Mar 2016, 22:24:35 UTC

OK, here's what's happening.

Ivan's proxy expires while the job is in the queue on the server. A VM requests a new job and the job fails. It does not even start, as the site is recorded as unknown. The job is resubmitted and Ivan's script eventually renews the proxy. You then get the good job.

There is nothing wrong with the volunteer side of things; it is just noise created by the proxy expiring.

Does that mean I have to renew the proxy sooner than the expected expiry? Rather than any time up to the actual time? Because it's not yet a script, and even if it were I still have to generate the new proxy manually. Is there something I can put into my "condor modifying" script (see job_commands.xxx)?
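
One way to frame the renewal question is as a lifetime check: renew whenever the proxy's remaining lifetime is shorter than the longest time a job could sit in the queue, plus some slack. A minimal Python sketch of that check follows. The function name, the queue-time figures, and the one-hour margin are all hypothetical, and generating the new proxy itself is left out since that step is manual here.

```python
from datetime import datetime, timedelta, timezone


def needs_renewal(expiry, now, max_queue_time, margin=timedelta(hours=1)):
    """Return True if a job queued at `now` could still be waiting when the proxy expires.

    All arguments are illustrative: `max_queue_time` is an assumed upper bound on
    how long a job waits on the server, `margin` is extra slack.
    """
    remaining = expiry - now
    return remaining < max_queue_time + margin


now = datetime(2016, 3, 11, 22, 0, tzinfo=timezone.utc)
expiry = now + timedelta(hours=6)

# Six hours left, but jobs may queue for twelve: renew now.
print(needs_renewal(expiry, now, max_queue_time=timedelta(hours=12)))
# Six hours left and jobs queue for at most two: no need yet.
print(needs_renewal(expiry, now, max_queue_time=timedelta(hours=2)))
```

Under these assumptions the answer to "sooner than the expected expiry?" would be yes: the proxy has to outlive the queue wait, not just the moment of submission.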

Oh, b*gger, we seem to have a maximum outage of some sort at work, I can log into the uni's gateway but not to any of my machines. I'm going to have to take a midnight stroll up the road to see if I can recover anything tonight...

#Baby it's cold outside,

Laurence (Project administrator, Project developer, Project tester)
Message 2355 - Posted: 11 Mar 2016, 23:27:55 UTC - in response to Message 2344.  

Rasputin,

If you look at the plots on the CMS Jobs page, as long as the running jobs and wall-time plots show less than 10% of jobs failing, I am not too worried. The pie chart shows the failure modes, and most are due to stage-out errors, which is not surprising; we can hopefully improve on that over time.

From the 5 job ids that you posted, two were failed jobs from different IPs and the other three were not sent.

I am still interested to chase down potential errors so please let me know if you think there is one machine that is behaving badly.

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Message 2356 - Posted: 11 Mar 2016, 23:45:54 UTC - in response to Message 2353.  

Seems to be a break in the network to our building; I had indications that intra-building connections were still up. The security chap was spectacularly uninterested until I explained about ten times that the CC has a Network Duty Officer on call 24/7, so eventually he deigned to call someone else to find out if they had the NDO's number. I left him with explicit instructions to tell the NDO about the outage when they found the number. (It can't be "if", can it, for such an important function? I'll be following up in high dudgeon Monday morning!)

Rasputin42 (Volunteer tester)
Message 2357 - Posted: 11 Mar 2016, 23:47:16 UTC - in response to Message 2355.  

Thanks Laurence.

I was just trying to point out that, by my interpretation, there was an IP address that was (is?) producing lots and lots of abandoned tasks, which would not show as errors on the dashboard or in any other statistics.

However, if there is some kind of mechanism that causes this, it should be eliminated.
There is no real loss to the project, but the volunteer would be wasting his resources.

I will keep an eye on it.

You have bigger fish to catch, so I will let it be for now.

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Message 2359 - Posted: 12 Mar 2016, 0:09:04 UTC - in response to Message 2357.  

We can track down such hosts (if it's stage-out errors it's usually multiple hosts on the same NAT address). Getting a response from PM or (as a last resort) email is problematic.

Rasputin42 (Volunteer tester)
Message 2360 - Posted: 12 Mar 2016, 0:43:05 UTC

I just had a thought. If people had not noticed the changeover to the new project name and continued, would they not produce lots of abandoned tasks?
Like in this case:
http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=78&postid=2345

Rasputin42 (Volunteer tester)
Message 2370 - Posted: 12 Mar 2016, 12:57:12 UTC

A 2nd(?) run started. Has this been fixed?


Now it is back to where it was. No 2nd run.

2016-03-12 13:18:19 (3052): Guest Log: [INFO] No more jobs. Shutting down!

Rasputin42 (Volunteer tester)
Message 2376 - Posted: 12 Mar 2016, 21:31:12 UTC

Apologies to all involved.
My posts about the aborted jobs all coming from the same IP: that was me.

I just discovered that a new IP was assigned to me.
It had been the same for a long time, so I never rechecked.

Mea culpa--will be more careful next time.

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Message 2377 - Posted: 12 Mar 2016, 22:46:26 UTC - in response to Message 2376.  

Dinnae fash yersel, Jimmy, it coud happen t'annyone.

Magic Quantum Mechanic
Message 2378 - Posted: 12 Mar 2016, 23:40:21 UTC

Mo chreach

Rasputin42 (Volunteer tester)
Message 2379 - Posted: 12 Mar 2016, 23:54:29 UTC

éí ííníyą́ąʼgo naaʼdidoolyééł

Magic Quantum Mechanic
Message 2380 - Posted: 13 Mar 2016, 1:09:25 UTC - in response to Message 2379.  
Last modified: 13 Mar 2016, 1:16:28 UTC

Navajo ???


Btw while I'm here I still get these in a batch after having one that claims to be Valid

http://atlasathome.cern.ch/result.php?resultid=4191239

VM Heartbeat file specified, but missing.
2016-03-12 16:37:08 (3616): VM Heartbeat file specified, but missing file system status. (errno = '2')

So I guess as far as my test host goes that new version only works when it feels like it (it has a heart I am pretty sure)

Oy Vey

I know VB is running since it has no problem with vLHC or Atlas

I NEVER tried a *suspend* or *reboot* just because I wouldn't mind if they worked all the time instead of how it does.

(Windows 10)

.....looks like Kansas will win this game

(it also says I have one in progress but I do not.....but it did this once before)

m (Volunteer tester)
Message 2381 - Posted: 13 Mar 2016, 10:58:31 UTC - in response to Message 2360.  

I just had a thought. If people had not noticed the changeover to the new project name and continued, would they not produce lots of abandoned tasks?
Like in this case:
http://lhcathomedev.cern.ch/vLHCathome-dev/forum_thread.php?id=78&postid=2345

Another thought. If someone is using an app_config file this will be lost when they detach and re-attach. Is the app name still the same (CMS)?

Crystal Pellet (Volunteer tester)
Message 2382 - Posted: 13 Mar 2016, 14:42:32 UTC - in response to Message 2381.  

If someone is using an app_config file this will be lost when they detach and re-attach. Is the app name still the same (CMS)?

The app names are CMS and LHCb.
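
For anyone who has to recreate the file after re-attaching, here is a small Python sketch that builds a basic app_config.xml body for those two app names. The `<app>`, `<name>` and `<max_concurrent>` elements are standard BOINC app_config fields; the helper name `build_app_config` and the limit of 1 task per app are illustrative assumptions, not project recommendations.

```python
import xml.etree.ElementTree as ET


def build_app_config(limits):
    """Build app_config.xml text from a {app_name: max_concurrent} mapping.

    `limits` is a hypothetical input format chosen for this sketch.
    """
    root = ET.Element("app_config")
    for app_name, max_concurrent in limits.items():
        app = ET.SubElement(root, "app")
        ET.SubElement(app, "name").text = app_name
        ET.SubElement(app, "max_concurrent").text = str(max_concurrent)
    return ET.tostring(root, encoding="unicode")


# App names as given in this thread; the limits are example values only.
print(build_app_config({"CMS": 1, "LHCb": 1}))
```

The printed text would then be saved as app_config.xml in the re-attached project's directory and the config files re-read from the BOINC client.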

Magic Quantum Mechanic
Message 5017 - Posted: 26 Jun 2017, 19:45:12 UTC

This is the first time I ever got these tasks on this 8-core computer



This was the first PC of the day for me to check, so I thought my eyes were still blurry. I hope the rest of them don't have 32-core tasks loaded.

Not sure why this happened or what will happen next and for some reason it is only running the one 2-core task and the others are just waiting (2,3,and 4 core tasks)

It had been running 2-core tasks since those started here.
Mad Scientist For Life

ivan (Volunteer moderator, Project administrator, Project developer, Project tester, Project scientist)
Message 5020 - Posted: 27 Jun 2017, 7:41:31 UTC - in response to Message 5017.  

What application/jobnames?

nikogianna (Project administrator, Project developer, Project tester)
Message 5021 - Posted: 27 Jun 2017, 7:47:48 UTC - in response to Message 5017.  

This is related to some tests we ran on the scheduler and at some point tasks that use the MAX_CPUS were sent. Please delete these tasks as they will not run. New tasks should be sent with the correct preferences.

Sorry for the inconvenience.

Crystal Pellet (Volunteer tester)
Message 5022 - Posted: 27 Jun 2017, 7:57:07 UTC - in response to Message 5017.  

Not sure why this happened or what will happen next and for some reason it is only running the one 2-core task and the others are just waiting (2,3,and 4 core tasks)

It had been running 2-core tasks since those started here.

Somehow the server (temporarily) thinks the machine has 32 cores, and it also ignored your preferences.
BOINC probably allocates the memory belonging to a 32-core task for the other tasks too, and therefore will not start a second task because it thinks there is not enough memory for such a memory-hungry task.

Magic Quantum Mechanic
Message 5023 - Posted: 27 Jun 2017, 9:34:40 UTC
Last modified: 27 Jun 2017, 9:45:34 UTC

These are Theory tasks, and of course they only ran for 25 minutes until they figured out there were no 32 cores ready for them (I only saw the first one start and do that). I let the 2-core task that was running finish, and while I was outside it tried the other two, sent them back Invalid, and went back to running and completing the 2-, 3-, and 4-core tasks I was testing.

After they were finished I switched this one to run 3-core tasks X2 on this 8-core.

Of course I went right up and checked the other 5 PCs I had running. Since I had both vLHC-dev and LHC Theory tasks (and SixTrack) running, they had not got to the point of getting new tasks here. So I let them finish all those tasks and reloaded them with vLHC Theory X2 tasks on those 32 cores (5 separate computers).

Back to normal, but it was a bit strange, which is why I had to take a snapshot of that.

(Edit: Oh, btw, while I am here: when I check my own account's Computers page here, it has not been updating them since June 23rd. I tend to use that page to see when a computer has contacted the server, so I can see that from this laptop instead of going upstairs to look at each one every time.)
Mad Scientist For Life

Tern
Message 5087 - Posted: 16 Aug 2017, 20:39:18 UTC

"Server error: feeder not running"