21)
Message boards :
Theory Application :
Disruption on Wednesday
(Message 3531)
Posted 2 Jun 2016 by Ben Segal Post: The firewall is open but we are still having connectivity issues. It seems to be working internally but not at home. You are right… I'm getting from home:
06/02/16 10:18:47 Changing activity: Benchmarking -> Idle
06/02/16 10:20:41 attempt to connect to <128.142.141.53:9618> failed: Connection timed out (connect errno = 110). Will keep trying for 300 total seconds (172 to go). |
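The timed-out connection in that log is a plain TCP connect to the Condor port (9618). A quick way for a volunteer to check whether that port is reachable from their own network is a simple TCP probe; this is a minimal sketch using only the Python standard library (the `can_connect` helper name is mine, not part of any project tooling):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the full TCP handshake, so a firewall
        # that silently drops packets shows up as a timeout (like errno 110).
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example probe against the endpoint from the log above:
# can_connect("128.142.141.53", 9618)
```

A `False` here from home but `True` from inside CERN would reproduce exactly the internal-vs-external split described in the post.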
22)
Message boards :
Theory Application :
Errors in log
(Message 3434)
Posted 20 May 2016 by Ben Segal Post: Don't worry, it's just simulating LHCb data for MCPlots. |
23)
Message boards :
Theory Application :
Endless Theory job
(Message 3367)
Posted 16 May 2016 by Ben Segal Post: It seems the same job is run numerous times, one after the other. We are looking at this. It is probably Condor rerunning the job when it terminates with a nonzero exit code, and looping would do that. |
24)
Message boards :
Theory Application :
MCPlots Integration
(Message 3337)
Posted 13 May 2016 by Ben Segal Post: The jobs returned today are not added to MCPlots. Yes, the job outputs are still being checked manually before being copied into MCPlots, and this explains the delay you have seen. This step will soon be eliminated as we think the whole chain is working well now. |
25)
Message boards :
Theory Application :
Suspend/Resume Theory
(Message 3316)
Posted 12 May 2016 by Ben Segal Post: We are working on the current problem with uploading results. Problem now solved - system should be working well again. |
26)
Message boards :
News :
Project Configuration Update 2
(Message 3315)
Posted 12 May 2016 by Ben Segal Post: Yes, but why has only one started? If I download Atlas@home, all 4 tasks start. Is this related to the app_config.xml file which was downloaded? Yes, that is the reason. |
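For context: the BOINC client reads an optional `app_config.xml` from the project directory, and its `max_concurrent` element caps how many tasks of a given application run at once, which is the behaviour described above. A hypothetical fragment (the app name `Theory` and the limit of 1 are illustrative only, not the actual file the project shipped):

```xml
<app_config>
  <app>
    <!-- Application name as it appears in client_state.xml -->
    <name>Theory</name>
    <!-- Run at most this many tasks of the app at once;
         a value of 1 reproduces the "only one started" effect. -->
    <max_concurrent>1</max_concurrent>
  </app>
</app_config>
```

Raising or removing `max_concurrent` and choosing "Read config files" in the BOINC Manager lets the other downloaded tasks start without restarting the client.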
27)
Message boards :
Theory Application :
Suspend/Resume Theory
(Message 3311)
Posted 11 May 2016 by Ben Segal Post: We are working on the current problem with uploading results. |
28)
Message boards :
Theory Application :
Open Issues
(Message 2980)
Posted 23 Apr 2016 by Ben Segal Post: I have allocated 2 cores to the task. Theory apps are dual-threaded, but the second thread is only used for graphics generation and uses less than half a CPU. In the past (back in the days of cernvmwrapper) we tried allocating 2 cores per task but discontinued it as it wasted half a CPU on average. The numerical value of "load average" in any case doesn't map exactly to the number of CPUs loaded, so don't worry too much about it. |
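The point about load average can be checked directly: on Unix systems it is the average number of runnable processes over a time window, not a count of busy CPUs, so it rarely maps one-to-one onto cores. A small illustration using only the Python standard library (Unix-only; the variable names are mine):

```python
import os

# os.getloadavg() returns the 1-, 5- and 15-minute load averages:
# the average number of runnable processes, NOT a count of busy CPUs.
# A 2-core host running one dual-threaded Theory task whose second
# (graphics) thread is mostly idle will typically show a load nearer
# 1 than 2.
one_min, five_min, fifteen_min = os.getloadavg()
cpus = os.cpu_count()
print(f"1-min load average: {one_min:.2f} across {cpus} CPUs "
      f"({one_min / cpus:.2f} per CPU)")
```

A per-CPU figure above 1.0 means processes are queueing for CPU time; fractional values just reflect partially busy threads like the one described above.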
29)
Message boards :
Theory Application :
Task not starting, shutting down
(Message 2801)
Posted 16 Apr 2016 by Ben Segal Post: Jobs are being put in the queue again now... It's the weekend and this is still under dev so please relax. It's too sunny here to work (:-)).. |
30)
Message boards :
Theory Application :
Task not starting, shutting down
(Message 2773)
Posted 15 Apr 2016 by Ben Segal Post: Yes, this is the new procedure we invoke when there are no jobs in the queue and hence no VM activity. It's to avoid the situation we still have on the production T4T system where a task runs (even up to 24 hours) but no real work is being done by the VM. We are working to produce a user-level error message when this happens. Jobs are being put in the queue again now... |
31)
Message boards :
Theory Application :
The Theory Application
(Message 2690)
Posted 12 Apr 2016 by Ben Segal Post: The VM shutting itself down seems to be solved. Very well! Yes CP (and Ray), the first "real" T4T jobs have been submitted and the web logs are also working. The next step is to feed the results back into MCPlots, which Leonardo is doing currently, so you will start to get MCPlots stats updates for the work you do. A lot of progress today! So Rasputin, you can also restart testing now... |
32)
Message boards :
Theory Application :
The Theory Application
(Message 2674)
Posted 12 Apr 2016 by Ben Segal Post: Would you please inform us when an updated version is fed into the system? OK, will do! By the way, the current test jobs are identical and fail after 1-2 minutes. This is intentional to test the system setup and recovery features. Thanks for all your help! |
33)
Message boards :
Theory Application :
The Theory Application
(Message 2672)
Posted 12 Apr 2016 by Ben Segal Post: To Theory-dev testers: This is very early days so please bear with us. Here is some information:
1. The VM is 64 bits (yes!)
2. The job feeding system is now Condor based (like CMS) instead of CoPilot. (This means that job initiation, suspend/resume timeouts, and other things may be less robust for now than you are used to with CoPilot.)
3. The VM screens and Web logs are being worked on so don't complain about them yet.
The idea is to standardise the whole series of CERN VM based apps as much as possible… Ben, Laurence, Leonardo and team |
34)
Message boards :
News :
No new jobs
(Message 2629)
Posted 10 Apr 2016 by Ben Segal Post: That appears to have been fixed. I sent Laurence et al. an email early this morning -- I'd had a late night... Actually some CMS tasks are going out on vLHCathome, but we have a server configuration problem over there which we will look at after the weekend. For example I get only CMS tasks and no Theory Simulation tasks myself... Patience please! |
35)
Message boards :
Number crunching :
Expect errors eventually
(Message 1504)
Posted 4 Dec 2015 by Ben Segal Post: ... I don't expect it to be implemented anytime soon, perhaps not even before I retire. :-( Which might be April 2017. Hope you'll still be able to read that small print when the day comes… I can (just!) and my day came in 2002.... Ben |
36)
Message boards :
News :
Agent Broken
(Message 726)
Posted 20 Aug 2015 by Ben Segal Post: I just did a fresh install of BOINC, added CMS-dev and it worked. For those who are experiencing issues I would suggest aborting the current task. I rebooted my VM which had not worked yesterday and today I successfully ran a 200-record job and it staged out correct results. I didn't suspend anything while it ran. I will do some investigations shortly on the suspend/resume situation. But so far I can say:
1. When the job was running it had 5 open TCP connections to lcggwms02.gridpp.ral.ac.uk (all on port 9619) and 1 to port 9818.
2. The web logs such as cmsRun-stdout.log and _condor_stdout were written to the end of the job and the stageout, but did not get renewed when the next job started. Maybe I hadn't waited long enough for the next job to begin, though; so far it's hard to know what's going on due to poor logging - see next point:
3. You badly need a live VM console showing at least the cmsRun-stdout file, plus some files showing the job handling. In general the present VM consoles aren't optimal choices IMHO… Ben |
37)
Message boards :
Number crunching :
Please do a project reset.
(Message 625)
Posted 17 Aug 2015 by Ben Segal Post: The development team have advised that all volunteers should do a project reset. They feel that this may cure the problems that Ben and Yeti have seen lately: Sorry Ivan, the project reset did indeed allow me to get going again after the Condor introduction, but the problem I now have with suspend/resume blocking progress is since that reset. Ben |
38)
Message boards :
Number crunching :
Suspended WUs do not crunch anymore if re-enabled
(Message 616)
Posted 17 Aug 2015 by Ben Segal Post: Okay, now I'm back in the office and still nothing works. I confirm Yeti's problem. I began crunching a job this morning at CERN and my http://localhost:52628/logs/cmsRun-stdout.log got to:
Beginning CMSSW wrapper script
slc6_amd64_gcc491 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Untarring /home/boinc/CMSRun/glide_mD4lxg/execute/dir_7900/sandbox.tar.gz
Completed SCRAM project
Executing CMSSW cmsRun -j FrameworkJobReport.xml PSet.py
Begin processing the 1st record. Run 1, Event 332001, LumiSection 3321 at 17-Aug-2015 12:19:56.705 CEST
Begin processing the 2nd record. Run 1, Event 332002, LumiSection 3321 at 17-Aug-2015 12:20:15.149 CEST
Begin processing the 3rd record. Run 1, Event 332003, LumiSection 3321 at 17-Aug-2015 12:20:23.110 CEST
Begin processing the 4th record. Run 1, Event 332004, LumiSection 3321 at 17-Aug-2015 12:20:23.361 CEST
Begin processing the 5th record. Run 1, Event 332005, LumiSection 3321 at 17-Aug-2015 12:20:25.690 CEST
Begin processing the 6th record. Run 1, Event 332006, LumiSection 3321 at 17-Aug-2015 12:20:36.888 CEST
Begin processing the 7th record. Run 1, Event 332007, LumiSection 3321 at 17-Aug-2015 12:20:40.238 CEST
Begin processing the 8th record. Run 1, Event 332008, LumiSection 3321 at 17-Aug-2015 12:20:58.656 CEST
Begin processing the 9th record. Run 1, Event 332009, LumiSection 3321 at 17-Aug-2015 12:20:59.812 CEST
Begin processing the 10th record. Run 1, Event 332010, LumiSection 3321 at 17-Aug-2015 12:21:18.956 CEST
Begin processing the 11th record. Run 1, Event 332011, LumiSection 3321 at 17-Aug-2015 12:21:30.309 CEST
Begin processing the 12th record. Run 1, Event 332012, LumiSection 3321 at 17-Aug-2015 12:21:33.413 CEST
Begin processing the 13th record. Run 1, Event 332013, LumiSection 3321 at 17-Aug-2015 12:21:34.948 CEST
Begin processing the 14th record. Run 1, Event 332014, LumiSection 3321 at 17-Aug-2015 12:21:40.721 CEST
Begin processing the 15th record. Run 1, Event 332015, LumiSection 3321 at 17-Aug-2015 12:21:43.875 CEST
Begin processing the 16th record. Run 1, Event 332016, LumiSection 3321 at 17-Aug-2015 12:21:54.981 CEST
Begin processing the 17th record. Run 1, Event 332017, LumiSection 3321 at 17-Aug-2015 12:22:00.220 CEST
Begin processing the 18th record. Run 1, Event 332018, LumiSection 3321 at 17-Aug-2015 12:22:02.181 CEST
Begin processing the 19th record. Run 1, Event 332019, LumiSection 3321 at 17-Aug-2015 12:22:02.278 CEST
Begin processing the 20th record. Run 1, Event 332020, LumiSection 3321 at 17-Aug-2015 12:22:02.283 CEST
Begin processing the 21st record. Run 1, Event 332021, LumiSection 3321 at 17-Aug-2015 12:22:30.825 CEST
Begin processing the 22nd record. Run 1, Event 332022, LumiSection 3321 at 17-Aug-2015 12:22:36.779 CEST
Begin processing the 23rd record. Run 1, Event 332023, LumiSection 3321 at 17-Aug-2015 12:22:38.556 CEST
Begin processing the 24th record. Run 1, Event 332024, LumiSection 3321 at 17-Aug-2015 12:22:39.363 CEST
Begin processing the 25th record. Run 1, Event 332025, LumiSection 3321 at 17-Aug-2015 12:22:46.413 CEST
Begin processing the 26th record. Run 1, Event 332026, LumiSection 3321 at 17-Aug-2015 12:23:18.093 CEST
Begin processing the 27th record. Run 1, Event 332027, LumiSection 3321 at 17-Aug-2015 12:23:22.806 CEST
Begin processing the 28th record. Run 1, Event 332028, LumiSection 3321 at 17-Aug-2015 12:23:23.711 CEST
Then I suspended the task with the BOINC manager and went to lunch. No change of PC location, I'm still at CERN. After lunch I did a BOINC task resume and got:
Begin processing the 29th record. Run 1, Event 332029, LumiSection 3321 at 17-Aug-2015 13:44:39.424 CEST
It's hanging there now, but still using 100% CPU. What is it doing? This looks pretty strange to me… Ben |
39)
Message boards :
Number crunching :
New testers, please post here
(Message 419)
Posted 29 May 2015 by Ben Segal Post: Hi, I am a fresh and new user ;-) Yep, right on! Ben |
40)
Message boards :
News :
Urgent Update for Windows Users
(Message 406)
Posted 26 May 2015 by Ben Segal Post: ... Just FYI, we got rid of snapshots with a vboxwrapper upgrade on vLHC and here for all WU's, nothing to do with DataBridge. Ben |
©2024 CERN