21) Message boards : Theory Application : Disruption on Wednesday (Message 3531)
Posted 2 Jun 2016 by Ben Segal
Post:
The firewall is open but we are still having connectivity issues. It seems to be working internally but not at home.

You are right… this is what I'm getting from home:

06/02/16 10:18:47 Changing activity: Benchmarking -> Idle
06/02/16 10:20:41 attempt to connect to <128.142.141.53:9618> failed: Connection timed out (connect errno = 110). Will keep trying for 300 total seconds (172 to go).
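
If anyone else wants to check whether that Condor port is reachable from their own network before reporting, here is a minimal sketch (assuming Python 3 on the host; the address and port are simply the ones from the log line above):

import socket

# Quick reachability check for the Condor endpoint seen in the VM log above.
HOST, PORT = "128.142.141.53", 9618
try:
    with socket.create_connection((HOST, PORT), timeout=10):
        print("TCP connection to %s:%d succeeded" % (HOST, PORT))
except OSError as exc:
    # errno 110 (ETIMEDOUT) here matches the "Connection timed out" in the log
    print("TCP connection to %s:%d failed: %s" % (HOST, PORT, exc))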
22) Message boards : Theory Application : Errors in log (Message 3434)
Posted 20 May 2016 by Ben Segal
Post:
Don't worry, it's just simulating LHCb data for MCPlots.
23) Message boards : Theory Application : Endless Theory job (Message 3367)
Posted 16 May 2016 by Ben Segal
Post:
It seems the same job is run numerous times, one after the other.
Exact same parameters, exact same log length.
Up to 12 times.

Is that deliberate??

Prepare Rivet parameters ...
analysesNames=CDF_2000_S4155203


Same across even different tasks.

We are looking at this. It is probably Condor rerunning the job when it terminates with a nonzero exit code, and a looping job would do that.
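
For context, this is roughly the kind of HTCondor submit-file policy that produces such reruns (an illustrative fragment only, not the actual submit file we use; the executable name is just a placeholder):

# Illustrative HTCondor submit-description fragment (placeholder names).
# Keep the job in the queue, i.e. let Condor rerun it, unless it exited
# cleanly with exit code 0.
executable     = runRivet.sh
on_exit_remove = (ExitBySignal == False) && (ExitCode == 0)
queue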
24) Message boards : Theory Application : MCPlots Integration (Message 3337)
Posted 13 May 2016 by Ben Segal
Post:
The jobs returned today have not been added to MCPlots.
MCPlots JobIDs: 30260785, 30260991, 30260549 and 11 others.

http://mcplots-dev.cern.ch/production.php?view=user&userid=-38

Edit: Someone somehow fixed it. Suddenly 17 jobs appeared.

Yes, the job outputs are still being checked manually before being copied into MCPlots, and this explains the delay you have seen. This step will soon be eliminated as we think the whole chain is working well now.
25) Message boards : Theory Application : Suspend/Resume Theory (Message 3316)
Posted 12 May 2016 by Ben Segal
Post:
We are working on the current problem with uploading results.

Problem now solved - system should be working well again.
26) Message boards : News : Project Configuration Update 2 (Message 3315)
Posted 12 May 2016 by Ben Segal
Post:
Yes, why has only one started? If I download Atlas@home, all 4 tasks start. Is this related to the app_config.xml file which was downloaded?
Tullio

Yes, that is the reason.
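
For reference, the limit comes from a per-app max_concurrent entry of this kind in app_config.xml (a hedged illustration; the file the project actually ships may differ, and the app name below is a placeholder):

<app_config>
  <app>
    <!-- placeholder app name: take the real one from client_state.xml -->
    <name>Theory</name>
    <!-- run at most one task of this app at a time -->
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>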
27) Message boards : Theory Application : Suspend/Resume Theory (Message 3311)
Posted 11 May 2016 by Ben Segal
Post:
We are working on the current problem with uploading results.
28) Message boards : Theory Application : Open Issues (Message 2980)
Posted 23 Apr 2016 by Ben Segal
Post:
I have allocated 2 cores to the task.
The load average is very high (15-min average sometimes up to 1.81).
I can only speculate how high it might be with just one core.
(Maybe I'll try that next.)


I have tested it with only one core.
Load average, as expected, very high (1.85-1.93 15-min load average).

This is with agile-runmc app on two tasks.

The apps are still nowhere to be found in the logs.

Theory apps are dual-threaded, but the second thread is only used for graphics generation and uses less than half a CPU. In the past (back in the days of cernvmwrapper) we tried allocating 2 cores per task but discontinued it as it wasted half a CPU on average.

The numerical value of "load average" in any case doesn't map exactly to the number of CPUs loaded, so don't worry too much about it.
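
If you want to put the number in perspective, a quick sketch (Linux host assumed) that compares the load averages with the number of logical CPUs:

import os

# Compare the 1/5/15-minute load averages with the logical CPU count.
# A 15-min load of ~1.9 means roughly two runnable threads on average,
# not that every core is saturated.
load1, load5, load15 = os.getloadavg()
ncpu = os.cpu_count()
print("load: %.2f %.2f %.2f on %d CPUs" % (load1, load5, load15, ncpu))
print("15-min load per CPU: %.2f" % (load15 / ncpu))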
29) Message boards : Theory Application : Task not starting, shutting down (Message 2801)
Posted 16 Apr 2016 by Ben Segal
Post:
Jobs are being put in the queue again now...


Seems we are out of jobs again.
Tasks are cycling.

It's the weekend and this is still under dev so please relax. It's too sunny here to work (:-))..
30) Message boards : Theory Application : Task not starting, shutting down (Message 2773)
Posted 15 Apr 2016 by Ben Segal
Post:
Yes, this is the new procedure we invoke when there are no jobs in the queue and hence no VM activity. It's to avoid the situation we still have on the production T4T system where a task runs (even up to 24 hours) but no real work is being done by the VM. We are working to produce a user-level error message when this happens.

Jobs are being put in the queue again now...
31) Message boards : Theory Application : The Theory Application (Message 2690)
Posted 12 Apr 2016 by Ben Segal
Post:
The VM shutting itself down seems to be solved. Very well!

The Sherpa jobs don't run well, but my first Pythia6 and Pythia8 ones do.

===> [runRivet] Tue Apr 12 16:31:15 CEST 2016 [boinc ppbar uemb-hard 1800 - - pythia6 6.428 391 100000 188]

===> [runRivet] Tue Apr 12 16:52:55 CEST 2016 [boinc ppbar uemb-hard 1800 15 - pythia8 8.186 tune-4c 100000 188]

Yes CP (and Ray), the first "real" T4T jobs have been submitted and the web logs are also working. The next step is to feed the results back into MCPlots, which Leonardo is doing currently, so you will start to get MCPlots stats updates for the work you do.

A lot of progress today!

So Rasputin, you can also restart testing now...
32) Message boards : Theory Application : The Theory Application (Message 2674)
Posted 12 Apr 2016 by Ben Segal
Post:
Would you please inform us when an updated version is fed into the system?
This way, we would not unnecessarily waste time with tasks that will not work.

OK, will do! By the way, the current test jobs are identical and fail after 1-2 minutes. This is intentional to test the system setup and recovery features.

Thanks for all your help!
33) Message boards : Theory Application : The Theory Application (Message 2672)
Posted 12 Apr 2016 by Ben Segal
Post:
To Theory-dev testers:

These are very early days, so please bear with us. Here is some information:

1. The VM is 64 bits (yes!)

2. The job feeding system is now Condor based (like CMS) instead of CoPilot.
(This means that job initiation, suspend/resume timeouts, and other things may be less robust for now than you are used to with CoPilot).

3. The VM screens and Web logs are being worked on so don't complain about them yet.

The idea is to standardise the whole series of CERN VM based apps as much as possible…

Ben, Laurence, Leonardo and team
34) Message boards : News : No new jobs (Message 2629)
Posted 10 Apr 2016 by Ben Segal
Post:
That appears to have been fixed. I sent Laurence et al. an email early this morning -- I'd had a late night...

Actually some CMS tasks are going out on vLHCathome, but we have a server configuration problem over there which we will look at after the weekend. For example I get only CMS tasks and no Theory Simulation tasks myself...

Patience please!
35) Message boards : Number crunching : Expect errors eventually (Message 1504)
Posted 4 Dec 2015 by Ben Segal
Post:
... I don't expect it to be implemented anytime soon, perhaps not even before I retire. :-( Which might be April 2017.

Hope you'll still be able to read that small print when the day comes… I can (just!) and my day came in 2002....

Ben
36) Message boards : News : Agent Broken (Message 726)
Posted 20 Aug 2015 by Ben Segal
Post:
I just did a fresh install of BOINC, added CMS-dev and it worked. For those who are experiencing issues I would suggest aborting the current task.

I rebooted my VM, which had not worked yesterday, and today I successfully ran a 200-record job and it staged out correct results. I didn't suspend anything while it ran.

I will do some investigations shortly on the suspend/resume situation. But so far I can say:

1. When the job was running it had 5 open TCP connections to lcggwms02.gridpp.ral.ac.uk (all on port 9619) and 1 to port 9818 (a way to reproduce this count is sketched after this list).

2. The web logs such as cmsRun-stdout.log and _condor_stdout were written up to the end of the job and the stageout, but did not get renewed when the next job started. But maybe I hadn't waited long enough for the next job to begin; so far it's hard to know what's going on due to poor logging - see next point:

3. You badly need a live VM console showing at least the cmsRun-stdout file, plus some files showing the job handling. In general the present VM consoles aren't optimal choices IMHO…
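
(A way to reproduce the connection count in point 1, for anyone checking their own VM: a sketch assuming Python 3 with the third-party psutil module installed; the port list is just the ones mentioned above.)

import psutil  # third-party: pip install psutil

# List TCP connections towards the Condor-related ports mentioned above.
for c in psutil.net_connections(kind="tcp"):
    if c.raddr and c.raddr.port in (9618, 9619, 9818):
        print(c.laddr, "->", c.raddr, c.status, "pid", c.pid)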

Ben
37) Message boards : Number crunching : Please do a project reset. (Message 625)
Posted 17 Aug 2015 by Ben Segal
Post:
The development team have advised that all volunteers should do a project reset. They feel that this may cure the problems that Ben and Yeti have seen lately:

Please can you ask all volunteers to do a project reset. Ben had an issue
for a while and it was solved by a project reset. I think it was due to the
matchmaking and the disk drive size. Although we updated the app, if they
download the old image (which didn't change) and it was initially set up
with 10 GB, the app update may not automatically result in the disk expanding
to 20 GB.

Sorry Ivan, the project reset did indeed allow me to get going again after the Condor introduction, but the problem I now have with suspend/resume blocking progress has appeared since that reset.

Ben
38) Message boards : Number crunching : Suspended WUs do not crunch anymore if re-enabled (Message 616)
Posted 17 Aug 2015 by Ben Segal
Post:
Okay, now I'm back in the office and still nothing works.

The VM is going through an endless loop; I see several tasks coming up in the list, e.g.
cvmfs2
Condor_master
rcu_sched
python
...

Okay, any more suggestions?

I confirm Yeti's problem. I began crunching a job this morning at CERN and my http://localhost:52628/logs/cmsRun-stdout.log got to:

Beginning CMSSW wrapper script
slc6_amd64_gcc491 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Untarring /home/boinc/CMSRun/glide_mD4lxg/execute/dir_7900/sandbox.tar.gz
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py
Begin processing the 1st record. Run 1, Event 332001, LumiSection 3321 at 17-Aug-2015 12:19:56.705 CEST
Begin processing the 2nd record. Run 1, Event 332002, LumiSection 3321 at 17-Aug-2015 12:20:15.149 CEST
Begin processing the 3rd record. Run 1, Event 332003, LumiSection 3321 at 17-Aug-2015 12:20:23.110 CEST
Begin processing the 4th record. Run 1, Event 332004, LumiSection 3321 at 17-Aug-2015 12:20:23.361 CEST
Begin processing the 5th record. Run 1, Event 332005, LumiSection 3321 at 17-Aug-2015 12:20:25.690 CEST
Begin processing the 6th record. Run 1, Event 332006, LumiSection 3321 at 17-Aug-2015 12:20:36.888 CEST
Begin processing the 7th record. Run 1, Event 332007, LumiSection 3321 at 17-Aug-2015 12:20:40.238 CEST
Begin processing the 8th record. Run 1, Event 332008, LumiSection 3321 at 17-Aug-2015 12:20:58.656 CEST
Begin processing the 9th record. Run 1, Event 332009, LumiSection 3321 at 17-Aug-2015 12:20:59.812 CEST
Begin processing the 10th record. Run 1, Event 332010, LumiSection 3321 at 17-Aug-2015 12:21:18.956 CEST
Begin processing the 11th record. Run 1, Event 332011, LumiSection 3321 at 17-Aug-2015 12:21:30.309 CEST
Begin processing the 12th record. Run 1, Event 332012, LumiSection 3321 at 17-Aug-2015 12:21:33.413 CEST
Begin processing the 13th record. Run 1, Event 332013, LumiSection 3321 at 17-Aug-2015 12:21:34.948 CEST
Begin processing the 14th record. Run 1, Event 332014, LumiSection 3321 at 17-Aug-2015 12:21:40.721 CEST
Begin processing the 15th record. Run 1, Event 332015, LumiSection 3321 at 17-Aug-2015 12:21:43.875 CEST
Begin processing the 16th record. Run 1, Event 332016, LumiSection 3321 at 17-Aug-2015 12:21:54.981 CEST
Begin processing the 17th record. Run 1, Event 332017, LumiSection 3321 at 17-Aug-2015 12:22:00.220 CEST
Begin processing the 18th record. Run 1, Event 332018, LumiSection 3321 at 17-Aug-2015 12:22:02.181 CEST
Begin processing the 19th record. Run 1, Event 332019, LumiSection 3321 at 17-Aug-2015 12:22:02.278 CEST
Begin processing the 20th record. Run 1, Event 332020, LumiSection 3321 at 17-Aug-2015 12:22:02.283 CEST
Begin processing the 21st record. Run 1, Event 332021, LumiSection 3321 at 17-Aug-2015 12:22:30.825 CEST
Begin processing the 22nd record. Run 1, Event 332022, LumiSection 3321 at 17-Aug-2015 12:22:36.779 CEST
Begin processing the 23rd record. Run 1, Event 332023, LumiSection 3321 at 17-Aug-2015 12:22:38.556 CEST
Begin processing the 24th record. Run 1, Event 332024, LumiSection 3321 at 17-Aug-2015 12:22:39.363 CEST
Begin processing the 25th record. Run 1, Event 332025, LumiSection 3321 at 17-Aug-2015 12:22:46.413 CEST
Begin processing the 26th record. Run 1, Event 332026, LumiSection 3321 at 17-Aug-2015 12:23:18.093 CEST
Begin processing the 27th record. Run 1, Event 332027, LumiSection 3321 at 17-Aug-2015 12:23:22.806 CEST
Begin processing the 28th record. Run 1, Event 332028, LumiSection 3321 at 17-Aug-2015 12:23:23.711 CEST

Then I suspended the task with the BOINC manager and went to lunch. No change of PC location, I'm still at CERN. After lunch I did a BOINC task resume and got:

Begin processing the 29th record. Run 1, Event 332029, LumiSection 3321 at 17-Aug-2015 13:44:39.424 CEST

It's hanging there now, but still using 100% CPU.

What is it doing? This looks pretty strange to me…
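
(One way to confirm it is really spinning rather than progressing: watch the cmsRun process's CPU time for a minute while the log stays unchanged. A sketch, assuming Python 3 with the third-party psutil module inside the VM:)

import time
import psutil  # third-party: pip install psutil

# Report how much CPU time each cmsRun process accumulates over one minute.
for p in psutil.process_iter(["name"]):
    if p.info["name"] == "cmsRun":
        before = p.cpu_times().user
        time.sleep(60)
        after = p.cpu_times().user
        print("cmsRun pid %d used %.1f s CPU in the last minute" % (p.pid, after - before))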

Ben
39) Message boards : Number crunching : New testers, please post here (Message 419)
Posted 29 May 2015 by Ben Segal
Post:
Hi, I am a fresh and new user ;-)

Downloaded the first WU and it has already started to crunch.

I looked inside the VM (as learned from vLHC) and it is saying something like: Begin processing the 6th record. Run 1, Event 906, ...

So, I think all is running as it should?

Cheers, Yeti

Yep, right on!

Ben
40) Message boards : News : Urgent Update for Windows Users (Message 406)
Posted 26 May 2015 by Ben Segal
Post:
...
Snapshots can be a problem and we got rid of those at vLHC with the Databridge.
...

Just FYI, we got rid of snapshots with a vboxwrapper upgrade on vLHC and here for all WUs, nothing to do with DataBridge.

Ben

