Message boards : News : New developments
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1240 - Posted: 13 Oct 2015, 13:03:44 UTC
Last modified: 13 Oct 2015, 13:04:15 UTC

We're at the stage where we have to make disruptive changes to the workflow in order to get the results onto the Grid from the data-bridge. At some point soon we'll start getting errors for jobs in the current batch, at which time I'll ditch the rest and submit a small test batch. If we're lucky, that may be the end of it; we'll have to see.
Thanks in advance for your understanding.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1246 - Posted: 13 Oct 2015, 16:27:55 UTC - in response to Message 1240.  
Last modified: 13 Oct 2015, 17:50:32 UTC

Oops, we do seem to have broken it...
Waiting for feedback from the experts.

Later: modified script submitted.

Even later: unsuccessfully... :-(
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1247 - Posted: 13 Oct 2015, 19:53:23 UTC - in response to Message 1246.  

Oh, well, looks like a quiet night. Take a rest, everyone.
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1250 - Posted: 16 Oct 2015, 2:38:10 UTC - in response to Message 1247.  

Just a note, after a very long day (up at 0400 to go to CERN, back home at 2345, still "catching up" with things at 0330 [not my record, I used to do 36+ hours on the trot for several years at University Annual Balls, and once or twice as an Instrument Scientist when I wasn't quite sure that a new post-grad had grasped all the complexity of a very expensive experiment]).

You'll have probably noticed the strange behaviour of the "CMS Jobs" graphs. We made an (essential) change, but it's b0rked things. We're trying to work out what, and the best debug tool at the moment seems to be small batches of short jobs. Hence the weird Dashboard reportage.

As ever, we value your participation and your perseverance.
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1251 - Posted: 16 Oct 2015, 8:35:47 UTC

How did your presentation go?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1252 - Posted: 16 Oct 2015, 9:56:23 UTC - in response to Message 1251.  

How did your presentation go?

Reasonably well. As Ben said afterwards, "At least they didn't tell you to stop!" It's hard to get the discussion going in the way we want, there is a lot of concern about the validity of the results [e.g. malicious Trojan horses] and this skews the dialogue. Still, a lot of new ideas to think upon. You will probably see some changes in the weeks ahead, perhaps even progress on more-than-one-job-at-a-time.
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,466
Message 1253 - Posted: 16 Oct 2015, 10:07:38 UTC - in response to Message 1252.  

Reasonably well. As Ben said afterwards, "At least they didn't tell you to stop!"

It's good to hear that Ben is still in good shape ;)
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 1258 - Posted: 19 Oct 2015, 10:43:43 UTC

So, at the moment, are we running or not?

I see this in my logs:

12:18:01 +0200 2015-10-19 [INFO] CMS glidein Run 13 ended
12:19:01 +0200 2015-10-19 [INFO] Starting CMS Application - Run 14
12:19:01 +0200 2015-10-19 [INFO] Reading the BOINC volunteer's information
12:19:02 +0200 2015-10-19 [INFO] Volunteer: Yeti (250) Host: 495
12:19:02 +0200 2015-10-19 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90
12:19:02 +0200 2015-10-19 [INFO] Requesting an X509 credential
subject : /O=Volunteer Computing/O=CERN/CN=Yeti 250/CN=1181170921
issuer : /O=Volunteer Computing/O=CERN/CN=Yeti 250
identity : /O=Volunteer Computing/O=CERN/CN=Yeti 250
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 129:03:58 (5.4 days)
12:19:06 +0200 2015-10-19 [INFO] Downloading glidein
12:19:07 +0200 2015-10-19 [INFO] Running glidein (check logs)
12:25:01 +0200 2015-10-19 [INFO] CMS glidein Run 14 ended
12:26:01 +0200 2015-10-19 [INFO] Starting CMS Application - Run 15
12:26:01 +0200 2015-10-19 [INFO] Reading the BOINC volunteer's information
12:26:02 +0200 2015-10-19 [INFO] Volunteer: Yeti (250) Host: 495
12:26:02 +0200 2015-10-19 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90
12:26:02 +0200 2015-10-19 [INFO] Requesting an X509 credential
subject : /O=Volunteer Computing/O=CERN/CN=Yeti 250/CN=30085940
issuer : /O=Volunteer Computing/O=CERN/CN=Yeti 250
identity : /O=Volunteer Computing/O=CERN/CN=Yeti 250
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 130:00:00 (5.4 days)
12:26:02 +0200 2015-10-19 [INFO] Downloading glidein
12:26:03 +0200 2015-10-19 [INFO] Running glidein (check logs)
12:31:01 +0200 2015-10-19 [INFO] CMS glidein Run 15 ended
12:32:01 +0200 2015-10-19 [INFO] Starting CMS Application - Run 16
12:32:01 +0200 2015-10-19 [INFO] Reading the BOINC volunteer's information
12:32:02 +0200 2015-10-19 [INFO] Volunteer: Yeti (250) Host: 495
12:32:02 +0200 2015-10-19 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90
12:32:02 +0200 2015-10-19 [INFO] Requesting an X509 credential
12:32:03 +0200 2015-10-19 [ERROR] Proxy error
12:32:03 +0200 2015-10-19 [INFO] Going to sleep for 1 hour
12:33:01 +0200 2015-10-19 [INFO] Starting CMS Application - Run 17
12:33:01 +0200 2015-10-19 [INFO] Reading the BOINC volunteer's information
12:33:02 +0200 2015-10-19 [INFO] Volunteer: Yeti (250) Host: 495
12:33:02 +0200 2015-10-19 [INFO] VMID: a248a608-bb13-4ecc-8fba-70015f0a4b90
12:33:02 +0200 2015-10-19 [INFO] Requesting an X509 credential
subject : /O=Volunteer Computing/O=CERN/CN=Yeti 250/CN=30085940
issuer : /O=Volunteer Computing/O=CERN/CN=Yeti 250
identity : /O=Volunteer Computing/O=CERN/CN=Yeti 250
type : RFC 3820 compliant impersonation proxy
strength : 1024 bits
path : /tmp/x509up_u500
timeleft : 129:53:00 (5.4 days)
12:33:02 +0200 2015-10-19 [INFO] Downloading glidein
12:33:03 +0200 2015-10-19 [INFO] Running glidein (check logs)

and in the stderr:

ERROR: Couldn't read proxy from: /tmp/x509up_u500
globus_credential: Error reading proxy credential
globus_credential: Error reading proxy credential: Couldn't read PEM from bio
OpenSSL Error: pem_lib.c:703: in library: PEM routines, function PEM_read_bio: no start line

Use -debug for further information.

ERROR: Couldn't read proxy from: /tmp/x509up_u500
globus_credential: Error reading proxy credential
globus_credential: Error reading proxy credential: Couldn't read PEM from bio
OpenSSL Error: pem_lib.c:703: in library: PEM routines, function PEM_read_bio: no start line

Use -debug for further information.

ERROR: Couldn't read proxy from: /tmp/x509up_u500
globus_credential: Error reading proxy credential
globus_credential: Error reading proxy credential: Couldn't read PEM from bio
OpenSSL Error: pem_lib.c:703: in library: PEM routines, function PEM_read_bio: no start line

Use -debug for further information.

-----------------------------

Should we go on standby or stay online?
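The OpenSSL message "no start line" in the stderr above generally means that the file being read contained no PEM header line (a line of the form "-----BEGIN ...-----"). A minimal sketch of a local check for that condition follows. The function names are hypothetical and this is not part of the CMS scripts; the path is simply the one shown in the log, and the check says nothing about whether the credential itself is valid.

```python
import re
from pathlib import Path

# A PEM block opens with a header such as "-----BEGIN CERTIFICATE-----".
PEM_START = re.compile(r"^-----BEGIN [A-Z0-9 ]+-----\s*$", re.MULTILINE)


def has_pem_start_line(text: str) -> bool:
    """Return True if the text contains at least one PEM header line."""
    return PEM_START.search(text) is not None


def check_proxy_file(path: str) -> str:
    """Classify a proxy file (hypothetical helper, for illustration only).

    Returns one of: 'missing', 'empty', 'no start line', 'looks like PEM'.
    """
    p = Path(path)
    if not p.is_file():
        return "missing"
    text = p.read_text(errors="replace")
    if not text.strip():
        return "empty"
    if not has_pem_start_line(text):
        return "no start line"
    return "looks like PEM"


if __name__ == "__main__":
    # Path taken from the log above; adjust it for your own VM.
    print(check_proxy_file("/tmp/x509up_u500"))
```

An "empty" or "no start line" result would be consistent with the credential request having failed and left nothing usable in the file, which matches the "[ERROR] Proxy error" line in Run 16.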
Profile tullio

Joined: 17 Aug 15
Posts: 62
Credit: 296,695
RAC: 0
Message 1259 - Posted: 19 Oct 2015, 12:16:21 UTC
Last modified: 19 Oct 2015, 12:17:48 UTC

Since I cannot see the consoles due to RDP not existing on my Windows 10 Home edition, I cannot understand what is going or not going on. I see the two consoles on the Challenge tasks using Databridge and I see the charts on standard vLHC@home, nothing on Atlas@home. I used to see all on Windows 8.1 provided by HP, then Microsoft "updated" my Windows and all went down the river.
Tullio
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1260 - Posted: 19 Oct 2015, 14:53:32 UTC

Sorry, we are still having problems. There's a small batch of 5 running at the moment which is finishing jobs but failing on transfer to the databridge, so they are resubmitting to Condor. Awaiting the experts' decision on what to do next.
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 1262 - Posted: 19 Oct 2015, 14:58:27 UTC

Okay, I have set "No New Work" for CMS.

Let me know if it makes sense to return to crunching
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1263 - Posted: 20 Oct 2015, 1:43:56 UTC

OK, we are starting to get some successful jobs through now. However, there are a large number of curious failures leading to retries, so I'm not going to unleash a large batch just yet. I think I have to analyse the failures to see if there is any commonality -- it might be just one or two hosts misconfigured somehow, or short of the requirements. Many just stop within a few seconds of the main cmsRun process starting.
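One way to look for that kind of commonality is to count failures per host and flag hosts that fail nearly every job. The sketch below assumes each job record carries a host ID and a success flag; the record layout, function names, thresholds and sample data are all invented for illustration and are not taken from the project's real logs.

```python
from collections import Counter


def failure_share_by_host(jobs):
    """Return {host: (failures, total, failure_rate)} for every host seen."""
    totals = Counter()
    failures = Counter()
    for job in jobs:
        totals[job["host"]] += 1
        if not job["ok"]:
            failures[job["host"]] += 1
    return {h: (failures[h], totals[h], failures[h] / totals[h]) for h in totals}


def suspicious_hosts(jobs, min_jobs=3, min_rate=0.8):
    """Hosts with at least min_jobs jobs and a failure rate of at least min_rate."""
    stats = failure_share_by_host(jobs)
    return sorted(h for h, (_, n, rate) in stats.items()
                  if n >= min_jobs and rate >= min_rate)


# Invented records, purely for illustration.
jobs = [
    {"host": 101, "ok": True}, {"host": 101, "ok": True}, {"host": 101, "ok": False},
    {"host": 202, "ok": False}, {"host": 202, "ok": False}, {"host": 202, "ok": False},
    {"host": 303, "ok": True},
]
print(suspicious_hosts(jobs))  # prints [202]
```

If a handful of hosts account for most of the failures, misconfiguration on those machines is the likelier explanation; failures spread evenly across hosts would point back at the workflow itself.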
Rasputin42
Volunteer tester

Joined: 16 Aug 15
Posts: 966
Credit: 1,211,816
RAC: 0
Message 1291 - Posted: 22 Oct 2015, 19:48:30 UTC

It's hard to get the discussion going in the way we want, there is a lot of concern about the validity of the results [e.g. malicious Trojan horses] and this skews the dialogue


What about atlas? They do have the same situation. Is nobody concerned about that?
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1293 - Posted: 22 Oct 2015, 20:31:56 UTC - in response to Message 1291.  

It's hard to get the discussion going in the way we want, there is a lot of concern about the validity of the results [e.g. malicious Trojan horses] and this skews the dialogue

What about atlas? They do have the same situation. Is nobody concerned about that?

That's probably something for someone like Ben to comment on, but from what I've seen their "success rate" is also less than what we have when things are going tickety-boo (not tohuwabohu!).
I am finally getting some analyses done, but severely hampered by a low-bandwidth connexion to our NFS file-server and at least three jobs wanting to use all of that 100 Mbps link! At the moment, eyeballing a few output graphs, the differences between results from CMS@Home and jobs submitted to the Grid are probably statistically insignificant. This may finally be the chance to use my infamous Energy Test for 2-D histogram comparisons in anger. :-)
There is the point, though, that one proposed use of CMS@Home is to look for very rare events. Currently one rogue result is buried in the statistics of tens of thousands of other jobs; when you are looking for one event in 100 million that rogue job might become more significant.
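To make "statistically insignificant differences" between two sets of 2-D histograms a little more concrete, here is a minimal sketch of a plain binned chi-square comparison. This is explicitly not the Energy Test mentioned above; it is a simpler, standard two-sample statistic swapped in for illustration. The function name is hypothetical and the histograms are assumed to be nested lists of raw counts with identical binning.

```python
import math


def chi2_two_histograms(h1, h2):
    """Two-sample chi-square statistic for two 2-D count histograms.

    h1 and h2 are nested lists of non-negative counts with the same binning.
    The totals may differ: each histogram is rescaled by the square root of
    the ratio of totals. Bins that are empty in both histograms are skipped.
    Returns (chi2, number_of_bins_used).
    """
    flat1 = [c for row in h1 for c in row]
    flat2 = [c for row in h2 for c in row]
    if len(flat1) != len(flat2):
        raise ValueError("histograms must have the same number of bins")
    n1, n2 = sum(flat1), sum(flat2)
    if n1 == 0 or n2 == 0:
        raise ValueError("both histograms need at least one entry")
    k1, k2 = math.sqrt(n2 / n1), math.sqrt(n1 / n2)
    chi2, used = 0.0, 0
    for a, b in zip(flat1, flat2):
        if a == 0 and b == 0:
            continue  # no information in a bin that is empty in both
        chi2 += (k1 * a - k2 * b) ** 2 / (a + b)
        used += 1
    return chi2, used


# Invented counts, purely for illustration.
volunteer = [[10, 20], [30, 40]]
grid = [[10, 20], [30, 40]]
print(chi2_two_histograms(volunteer, grid))  # identical inputs give (0.0, 4)
```

As a rough guide, a statistic much larger than the number of bins used suggests the two histograms differ by more than statistical fluctuation. A binned test like this is insensitive to a single rogue entry, which is exactly the concern raised above for rare-event searches.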
Profile tullio

Joined: 17 Aug 15
Posts: 62
Credit: 296,695
RAC: 0
Message 1294 - Posted: 22 Oct 2015, 21:35:53 UTC
Last modified: 22 Oct 2015, 21:37:32 UTC

I got a computation error with the new wrapper. vLHC@home and Atlas@home tasks are running fine despite reboots by Windows 10 for unknown reasons, not related to updates.
Tullio
Profile ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,937,121
RAC: 3,148
Message 1295 - Posted: 22 Oct 2015, 21:38:57 UTC - in response to Message 1294.  

I got a computation error with the new wrapper. vLHC@home and Atlas@home tasks running fine despite reboots by Windows 10 for unknown reasons, not related to updates.
Tullio

Have you got a job number for that, Tullio, so I can check my logs?
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 754
Credit: 11,756,365
RAC: 8,667
Message 1304 - Posted: 23 Oct 2015, 4:29:48 UTC

http://boincai05.cern.ch/CMS-dev/results.php?userid=192

I have never had a problem here with these tasks using Win7 and Win10 and, in the past, Win8.1.

And Windows 10 does not do reboots every day, so that is another problem.

I have 3 machines running Windows 10 right now and they all run vLHC, LHC, CMS, and Atlas 24/7.
Mad Scientist For Life
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 754
Credit: 11,756,365
RAC: 8,667
Message 1306 - Posted: 23 Oct 2015, 7:46:07 UTC - in response to Message 1304.  

(I guess I should add that, after I said that, I decided to update my VB version and do some other updating on one of my Win7 machines, so I had to do a clean install. As usual, you may have to give it a couple of tries to get back up and running, so just ignore that part.)
Profile tullio

Joined: 17 Aug 15
Posts: 62
Credit: 296,695
RAC: 0
Message 1308 - Posted: 23 Oct 2015, 8:05:53 UTC - in response to Message 1295.  

68091
Profile Magic Quantum Mechanic
Joined: 8 Apr 15
Posts: 754
Credit: 11,756,365
RAC: 8,667
Message 1310 - Posted: 23 Oct 2015, 10:30:11 UTC - in response to Message 1308.  
Last modified: 23 Oct 2015, 10:43:15 UTC

68091

http://boincai05.cern.ch/CMS-dev/result.php?resultid=68091

Ah OK I see what you are trying to say.

Well it does start off by saying "Exit status 194 (0xc2) EXIT_ABORTED_BY_CLIENT"

And then Error Description: The session is not locked (session state: Unlocked)

Did you check the VB Manager before starting a new task?

VB never likes starting, stopping and reboots during tasks, and VB is famous for having problems before anything else on your PC. Did you do a security scan of any type recently?

Over the years I usually haven't used the newest versions of VB or BOINC, but lately, since I run so many VB tasks, I have been, just to see if they are more reliable.

No problems with the newest VB or Boinc lately.


©2024 CERN