Thread 'Suspended WUs do not crunch anymore if re-enabled'

Author	Message
Yeti Joined: 29 May 15 Posts: 163 Credit: 3,560,947 RAC: 8,505	Message 602 - Posted: 16 Aug 2015, 16:26:21 UTC
On my laptop I have to suspend BOINC and its tasks twice a day (leaving home for the office and vice versa).
If I start from scratch, all works fine. CMS downloads and crunches jobs, and everything is as it should be.
When I then suspend BOINC (or the CMS WU) and later re-enable it, the VM will not fetch (or get) any work. I have tested and seen this several times now.
I tried to shut down the VM with "shutdown with ACPI" (or similar) and could see the VM shutting itself down. It ends with a kernel panic. Restarting it with BOINC doesn't change anything.
I aborted the running WU and BOINC downloaded the next one, but this also didn't change anything. This only changes if I reset the whole CMS project.
I am keeping the laptop in this state so you can ask me for more details or tell me what you want me to do. Or we can set up a remote session and you can check for yourself what is happening.

Phil Joined: 9 Apr 15 Posts: 57 Credit: 230,221 RAC: 0	Message 603 - Posted: 16 Aug 2015, 18:28:46 UTC - in response to Message 602.
I could understand this happening if you change the IP address, but it ought to start a new BOINC job okay?

Yeti Joined: 29 May 15 Posts: 163 Credit: 3,560,947 RAC: 8,505	Message 611 - Posted: 17 Aug 2015, 10:35:45 UTC - in response to Message 603. Last modified: 17 Aug 2015, 10:40:07 UTC
Okay, now I'm back in the office and still nothing works. The VM is cycling through an endless loop; I see several tasks coming up in the list, e.g. cvmfs2, Condor_master, rcu_sched, python ...
Okay, any more suggestions?

Ben Segal Volunteer moderator Volunteer developer Volunteer tester Joined: 12 Sep 14 Posts: 65 Credit: 544 RAC: 0	Message 616 - Posted: 17 Aug 2015, 12:02:44 UTC - in response to Message 611.
"Okay, now I'm back in the office and still nothing works. The VM is cycling through an endless loop; I see several tasks coming up in the list, e.g. cvmfs2, Condor_master, rcu_sched, python ... Okay, any more suggestions?"
I confirm Yeti's problem. I began crunching a job this morning at CERN and my http://localhost:52628/logs/cmsRun-stdout.log got to:
Beginning CMSSW wrapper script
slc6_amd64_gcc491 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Untarring /home/boinc/CMSRun/glide_mD4lxg/execute/dir_7900/sandbox.tar.gz
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py
Begin processing the 1st record. Run 1, Event 332001, LumiSection 3321 at 17-Aug-2015 12:19:56.705 CEST
Begin processing the 2nd record. Run 1, Event 332002, LumiSection 3321 at 17-Aug-2015 12:20:15.149 CEST
Begin processing the 3rd record. Run 1, Event 332003, LumiSection 3321 at 17-Aug-2015 12:20:23.110 CEST
Begin processing the 4th record. Run 1, Event 332004, LumiSection 3321 at 17-Aug-2015 12:20:23.361 CEST
Begin processing the 5th record. Run 1, Event 332005, LumiSection 3321 at 17-Aug-2015 12:20:25.690 CEST
Begin processing the 6th record. Run 1, Event 332006, LumiSection 3321 at 17-Aug-2015 12:20:36.888 CEST
Begin processing the 7th record. Run 1, Event 332007, LumiSection 3321 at 17-Aug-2015 12:20:40.238 CEST
Begin processing the 8th record. Run 1, Event 332008, LumiSection 3321 at 17-Aug-2015 12:20:58.656 CEST
Begin processing the 9th record. Run 1, Event 332009, LumiSection 3321 at 17-Aug-2015 12:20:59.812 CEST
Begin processing the 10th record. Run 1, Event 332010, LumiSection 3321 at 17-Aug-2015 12:21:18.956 CEST
Begin processing the 11th record. Run 1, Event 332011, LumiSection 3321 at 17-Aug-2015 12:21:30.309 CEST
Begin processing the 12th record. Run 1, Event 332012, LumiSection 3321 at 17-Aug-2015 12:21:33.413 CEST
Begin processing the 13th record. Run 1, Event 332013, LumiSection 3321 at 17-Aug-2015 12:21:34.948 CEST
Begin processing the 14th record. Run 1, Event 332014, LumiSection 3321 at 17-Aug-2015 12:21:40.721 CEST
Begin processing the 15th record. Run 1, Event 332015, LumiSection 3321 at 17-Aug-2015 12:21:43.875 CEST
Begin processing the 16th record. Run 1, Event 332016, LumiSection 3321 at 17-Aug-2015 12:21:54.981 CEST
Begin processing the 17th record. Run 1, Event 332017, LumiSection 3321 at 17-Aug-2015 12:22:00.220 CEST
Begin processing the 18th record. Run 1, Event 332018, LumiSection 3321 at 17-Aug-2015 12:22:02.181 CEST
Begin processing the 19th record. Run 1, Event 332019, LumiSection 3321 at 17-Aug-2015 12:22:02.278 CEST
Begin processing the 20th record. Run 1, Event 332020, LumiSection 3321 at 17-Aug-2015 12:22:02.283 CEST
Begin processing the 21st record. Run 1, Event 332021, LumiSection 3321 at 17-Aug-2015 12:22:30.825 CEST
Begin processing the 22nd record. Run 1, Event 332022, LumiSection 3321 at 17-Aug-2015 12:22:36.779 CEST
Begin processing the 23rd record. Run 1, Event 332023, LumiSection 3321 at 17-Aug-2015 12:22:38.556 CEST
Begin processing the 24th record. Run 1, Event 332024, LumiSection 3321 at 17-Aug-2015 12:22:39.363 CEST
Begin processing the 25th record. Run 1, Event 332025, LumiSection 3321 at 17-Aug-2015 12:22:46.413 CEST
Begin processing the 26th record. Run 1, Event 332026, LumiSection 3321 at 17-Aug-2015 12:23:18.093 CEST
Begin processing the 27th record. Run 1, Event 332027, LumiSection 3321 at 17-Aug-2015 12:23:22.806 CEST
Begin processing the 28th record. Run 1, Event 332028, LumiSection 3321 at 17-Aug-2015 12:23:23.711 CEST
Then I suspended the task with the BOINC manager and went to lunch. No change of PC location, I'm still at CERN. After lunch I did a BOINC task resume and got:
Begin processing the 29th record. Run 1, Event 332029, LumiSection 3321 at 17-Aug-2015 13:44:39.424 CEST
It's hanging there now, but still using 100% CPU. What is it doing? This looks pretty strange to me…
Ben
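The pause in Ben's log can also be found mechanically by comparing the timestamps of consecutive records. Below is a minimal Python sketch, assuming the log lines have exactly the form quoted in his post. The function names and the 60-second threshold are illustrative choices, not anything provided by the project, and the time zone label (CEST) is simply ignored.

```python
import re
from datetime import datetime

# Matches lines of the form quoted in the post, e.g.
# "Begin processing the 29th record. Run 1, Event 332029, LumiSection 3321
#  at 17-Aug-2015 13:44:39.424 CEST"
RECORD_RE = re.compile(
    r"Begin processing the (\d+)(?:st|nd|rd|th) record\. "
    r"Run \d+, Event (\d+), LumiSection \d+ at "
    r"(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{3})"
)


def parse_records(text):
    """Return a list of (record_number, event_number, timestamp) tuples."""
    records = []
    for match in RECORD_RE.finditer(text):
        # %b expects English month abbreviations (default C locale).
        stamp = datetime.strptime(match.group(3), "%d-%b-%Y %H:%M:%S.%f")
        records.append((int(match.group(1)), int(match.group(2)), stamp))
    return records


def find_gaps(records, threshold_seconds=60.0):
    """Return (previous_record, next_record, gap_seconds) for each long pause."""
    gaps = []
    for (n1, _, t1), (n2, _, t2) in zip(records, records[1:]):
        gap = (t2 - t1).total_seconds()
        if gap > threshold_seconds:
            gaps.append((n1, n2, gap))
    return gaps


# The last three records from the post above.
sample = (
    "Begin processing the 27th record. Run 1, Event 332027, LumiSection 3321 "
    "at 17-Aug-2015 12:23:22.806 CEST\n"
    "Begin processing the 28th record. Run 1, Event 332028, LumiSection 3321 "
    "at 17-Aug-2015 12:23:23.711 CEST\n"
    "Begin processing the 29th record. Run 1, Event 332029, LumiSection 3321 "
    "at 17-Aug-2015 13:44:39.424 CEST\n"
)

records = parse_records(sample)
for previous, following, seconds in find_gaps(records):
    print(f"pause of {seconds:.0f} s between record {previous} and record {following}")
print("last record reached:", records[-1][0])
```

The long gap here is just the lunch-time suspension. The actual symptom is that no record follows the 29th, which would show up as the last timestamp falling further and further behind the current time.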

Yeti Joined: 29 May 15 Posts: 163 Credit: 3,560,947 RAC: 8,505	Message 627 - Posted: 17 Aug 2015, 14:32:01 UTC
Okay, I found out a little bit more. Below you see a graphic of PCs connecting to 130.246.180.119 + 120.
At around 15:53 I suspended the laptop and its endless loop, and now the contacts have stopped. So, in the endless loop, the laptop is downloading something.

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 28	Message 638 - Posted: 17 Aug 2015, 17:28:44 UTC - in response to Message 627.
Yes, we're chasing this but no handle on it just yet. In the meantime we advise against suspending tasks in the middle of a job (i.e. when cmsRun is showing on the ALT+F3 console). Unfortunately the jobs I sent the other day are a bit longer than previously -- they will take about two hours or so, depending on your processor. Sorry about that, but this is what the development phase is for.
Actually, some of the Condor jobs have been running considerably longer than that; I'll see if I can kill off the current set and send some shorter jobs.
OK, shorter jobs now in the queue.

Yeti Joined: 29 May 15 Posts: 163 Credit: 3,560,947 RAC: 8,505	Message 639 - Posted: 17 Aug 2015, 17:51:18 UTC - in response to Message 638.
"In the meantime we advise against suspending tasks in the middle of a job (i.e. when cmsRun is showing on the ALT+F3 console)."
There is a big problem: some of my desktops are rebooted once every day, and CMS loses the VM console; I can see nothing on it and no ALT+Fx works. I don't know whether the machine is doing well or not after a reboot.
I have rebooted my laptop twice; now I see it is doing something, but I cannot access it. And there is no new entry in cmsRun-stdout.log; it stays at the last entry from the reboot :-(

Yeti Joined: 29 May 15 Posts: 163 Credit: 3,560,947 RAC: 8,505	Message 640 - Posted: 17 Aug 2015, 18:23:41 UTC
Of 4 machines that got rebooted, 2 have lost the connection to the VM console and 2 are running as if they hadn't been rebooted.
The only good thing: it is enough to cancel the running BOINC WU; after BOINC has started the next VM, all seems to work fine again.

Yeti Joined: 29 May 15 Posts: 163 Credit: 3,560,947 RAC: 8,505	Message 641 - Posted: 17 Aug 2015, 18:49:25 UTC - in response to Message 640.
"Of 4 machines that got rebooted, 2 have lost the connection to the VM console and 2 are running as if they hadn't been rebooted. The only good thing: it is enough to cancel the running BOINC WU; after BOINC has started the next VM, all seems to work fine again."
Don't know if this is relevant, but the two that survived the reboot are on 4.3.12 (!), and the two that didn't come up well are on 4.3.28 and 4.3.30.

Crystal Pellet Volunteer tester Joined: 13 Feb 15 Posts: 1279 Credit: 1,047,416 RAC: 59	Message 644 - Posted: 17 Aug 2015, 21:05:54 UTC Last modified: 17 Aug 2015, 22:00:38 UTC
Found suspicious lines in some logs. I did not dig further into it, but it seems that the current cmsRun is using >97% CPU while cmsRun-stdout.log stopped about 2 hours ago, after processing the 89th record.
Log _condor_stdout ends with:
./CMSRunAnalysis.sh: line 162: 7739 Killed python CMSRunAnalysis.py -r "`pwd`" "$@"
+ jobrc=137
+ set +x
== The job had an exit code of 137
date: write error: Broken pipe
and in the log MasterLog, 2 hours ago:
08/17/15 18:59:21 (pid:7561) condor_write(): Socket closed when trying to write 2806 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10, errno=104 Connection reset by peer
08/17/15 18:59:21 (pid:7561) Buf::write(): condor_write() failed
I'll reboot the VM and see whether it really starts running again.
Edit: After the reboot cmsRun is running again, and this time ALT+F4 shows the output of the log named runGlideinout and ALT+F5 shows the output of cmsRun-stdout.log, whereas most of the time those keystrokes give blank output.
Edit2: First CMS job returned and 2nd started.
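The exit code 137 in the _condor_stdout excerpt fits the "Killed" message next to it: shells conventionally report 128 plus the signal number when a child process is terminated by a signal, and 137 - 128 = 9, which is SIGKILL on Linux. A small Python sketch of that decoding follows; the helper name is made up for illustration.

```python
import signal


def describe_exit_code(code):
    """Explain a shell-style exit code using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            # Signal names are platform-specific; on Linux, 9 is SIGKILL.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    if code == 0:
        return "exited successfully"
    return f"exited with status {code}"


# The value reported as jobrc in the log excerpt.
print(describe_exit_code(137))
```

So the job did not crash on its own account; something outside the process sent it a kill signal.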

ivan Volunteer moderator Project administrator Project developer Project tester Project scientist Joined: 20 Jan 15 Posts: 1156 Credit: 8,453,729 RAC: 28	Message 645 - Posted: 17 Aug 2015, 22:06:52 UTC - in response to Message 644.
Sorry, CP, that was probably me killing the too-large jobs that I'd submitted the other day so that I could send smaller ones. The rl.ac.uk machine is the server that manages our jobs as Condor tasks.

m Volunteer tester Joined: 20 Mar 15 Posts: 243 Credit: 901,716 RAC: 0	Message 653 - Posted: 18 Aug 2015, 10:58:09 UTC - in response to Message 641.
"Don't know if this is relevant, but the two that survived the reboot are on 4.3.12 (!), and the two that didn't come up well are on 4.3.28 and 4.3.30."
Similar problems here. All hosts are normally shut down and powered off during the day. It doesn't always happen. VBox is 4.3.26 on some and 4.3.26r98988 on others. I don't know what the difference is, nor how come I haven't got the same VB on all.
I haven't checked whether both Linux and Windows are affected or whether particular hosts are affected most. Most of them don't have much RAM, so maybe that is a complication here. Unfortunately it will be tomorrow afternoon before I can have a good daytime look.
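On the two version strings: 4.3.26r98988 looks like the same 4.3.26 release with a build revision appended, which is the form that VBoxManage --version prints. Here is a sketch in Python, assuming that "rNNNNN" suffix convention, of splitting such strings so that hosts can be compared by release. The function name is hypothetical.

```python
import re

# Release number, optionally followed by "r" and a build revision.
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)(?:r(\d+))?$")


def parse_vbox_version(text):
    """Split '4.3.26r98988' into ((4, 3, 26), 98988); revision is None if absent."""
    match = VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised version string: {text!r}")
    release = tuple(int(part) for part in match.group(1, 2, 3))
    revision = int(match.group(4)) if match.group(4) else None
    return release, revision


# Version strings mentioned in this thread.
reported = ["4.3.12", "4.3.28", "4.3.30", "4.3.26", "4.3.26r98988"]
for text in reported:
    release, revision = parse_vbox_version(text)
    print(text, "->", release, revision)
```

Comparing the release tuples treats 4.3.26 and 4.3.26r98988 as the same release, so under this assumption the hosts in question would not differ in VirtualBox version.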

Yeti Joined: 29 May 15 Posts: 163 Credit: 3,560,947 RAC: 8,505	Message 658 - Posted: 18 Aug 2015, 13:59:16 UTC
Okay, as requested in another thread I reset my CMS project on the laptop and indeed, now it works much better!
Switching from office to home and back didn't cause a problem; if I only "Exit" BOINC it seems to be fine (okay, I lose the crunching power of the current run), but the VM stays functional.
So, I will not use "suspend" until we are told that it should work now.
Crunchers who have participated since before Monday this week should consider resetting their project.

Development for LHC@home