No new jobs
Author | Message |
---|---|
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
"automagically" I like that word, Ivan. It is very descriptive. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
"How long is this time-out? ..." Two hours. I can change it. At the moment it doesn't seem a big problem; if/when we start growing then I can do a statistical analysis on how many jobs time out as a function of LeaseTime, but at the moment it's in the noise. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
I'm afraid to say, I believe I got it from Jerry Pournelle, with whom I've had my differences... "automagically" But, it worked! Jobs are joining the queue without my intervention. Props to Marco and Hassen at CERN and Andrew at RAL who worked overtime to discover the problem, find a cure, and apply it. I'm not sure whether we lost jobs or not; the job numbers are in the 8,000s (out of 10,000) but maybe it'll cycle around back to the low numbers again. Bottom line is I think we've got enough queued to last the night; I can sweat the details in the morning. Thanks everyone for your patience, and remember that these are a different class of job to recent ones, so your run-times, etc., will be different. |
Joined: 4 May 15 Posts: 64 Credit: 55,584 RAC: 0 |
"automagically" It's been in use at SETI since at least August 2004, which was the test phase of their conversion to BOINC. So you may have caught it from someone else... |
Joined: 8 Apr 15 Posts: 781 Credit: 12,422,653 RAC: 2,032 |
(I only run these on 5 hosts but they always have CMS-dev work) Mad Scientist For Life |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
"automagically" Well, this was more like 1984, when he was writing a column in BYTE. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
Unfortunately the jobs were all returning errors so I removed them and submitted a new batch. These haven't found their way into the Condor queue yet. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
"Unfortunately the jobs were all returning errors so I removed them and submitted a new batch. These haven't found their way into the Condor queue yet." OK, a new batch is up and filling the queue. Thanks for your patience. |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 975 |
I might be wrong, but I don't think these new ones are any good either. They are going extremely fast, will look at logs to see what is going on... |
Joined: 13 Feb 15 Posts: 1188 Credit: 862,257 RAC: 15 |
Running well. 25 events / job |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 975 |
Have latched on to job 40, but only at 13:10 BST, some time after they appeared in the queue and you posted. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
"I might be wrong, but I don't think these new ones are any good either. They are going extremely fast, will look at logs to see what is going on..." They are taking between 18 minutes and two hours according to Dashboard. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
"Have latched on to job 40, but only at 13:10 BST, some time after they appeared in the queue and you posted." I posted at 13:06 BST; the submission was at 13:03, and job 40 is listed as running from 13:04 to 14:18. Mixing BST and UTC can get confusing; roll on 25/10/2015! |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 975 |
"Have latched on to job 40, but only at 13:10 BST, some time after they appeared in the queue and you posted." It was easier when I was sitting on a Pacific island! |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
I have 30 run directories. When the new lot was issued, it started to mix the results into the run-1 and run-2 directories. It would be easier to continue the run-directory numbering rather than reset it and mix the files into existing directories. And I suggest keeping everything in UTC (no BST or CEST), as all other worldwide projects do. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
The "running jobs" graphics are not really showing running jobs. They are showing computers that are running tasks, but not necessarily processing jobs. I think this should be labeled more appropriately. |
Joined: 20 Jan 15 Posts: 1139 Credit: 8,310,612 RAC: 75 |
"The 'running jobs' graphics are not really showing running jobs." No, that's not quite right. The statistics at the top come from the Condor queue; in full the summary line looks like: "1073 jobs; 0 completed, 0 removed, 1002 idle, 64 running, 7 held, 0 suspended". The graphs come from Dashboard -- http://dashboard.cern.ch/cms/ -- which knows nothing about our BOINC tasks, only what Condor reports back to it about job status. (Most of the links on that page require CMS credentials; I am told that you can get through to my jobs on the "CRAB3 User Summary" link, but, if you do, make sure you adjust "filters" to a longer time period than one day.) The plot is somehow averaging the Condor running-jobs statistic over one hour. The plots come from the "Historical view" link but I'm not sure if that's publicly accessible. |
Joined: 16 Aug 15 Posts: 966 Credit: 1,211,816 RAC: 0 |
So why does the graph not even show a dip when we know there was no work running for several hours? |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 975 |
Historical View is accessible and you can see those charts if you select T3s, then select T3_CH_Volunteer (click Done), then click on the 'Completed, Submitted, Pending, Running Jobs' button. The 'NEW: Historical View' menu item lower down in the list just times out for me. |
Joined: 20 May 15 Posts: 217 Credit: 6,193,119 RAC: 975 |
I did ask if I could break it... "Internal Server Error The server encountered an internal error or misconfiguration and was unable to complete your request. Please contact the server administrator, dashboard-alarms@cern.ch and inform them of the time the error occurred, and anything you might have done that may have caused the error. More information about this error may be available in the server error log. Apache/2.2.15 (Red Hat) Server at dashb-cms-job.cern.ch Port 80" I was just pressing things without knowing whether what I was doing was valid or not! |
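
For anyone who wants to track the queue figures themselves, the Condor summary line quoted earlier in the thread ("1073 jobs; 0 completed, ...") is regular enough to parse with a few lines of Python. This is only a sketch: the function name `parse_condor_summary` is made up for illustration, and the pattern assumes the exact "number label" layout of the one line quoted in this thread; real Condor output may differ between versions.

```python
import re

# The summary line exactly as quoted in the thread.
SUMMARY = ("1073 jobs; 0 completed, 0 removed, 1002 idle, "
           "64 running, 7 held, 0 suspended")


def parse_condor_summary(line):
    """Return a dict mapping each label in the summary line to its count.

    Hypothetical helper: it assumes every field is written as
    "<integer> <lowercase label>", as in the quoted line.
    """
    counts = {}
    for number, label in re.findall(r"(\d+)\s+([a-z]+)", line):
        counts[label] = int(number)
    return counts


counts = parse_condor_summary(SUMMARY)

# Sanity check: the per-state counts should add up to the job total.
states = sum(v for k, v in counts.items() if k != "jobs")

print(counts["running"], counts["idle"], states == counts["jobs"])
# prints: 64 1002 True
```

Note that this only reads what Condor reports; like Dashboard, it knows nothing about BOINC tasks, so a host running a task without a job would not show up in these numbers.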
©2024 CERN