Message boards : Number crunching : Suspended WUs do not crunch anymore if re-enabled
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 602 - Posted: 16 Aug 2015, 16:26:21 UTC

On my laptop I have to suspend BOINC and its tasks twice a day (leaving home for the office and vice versa).

If I start from scratch, everything works fine. CMS downloads and crunches jobs, all as it should.

When I then suspend BOINC (or the CMS WU) and later re-enable it, the VM will not fetch (or get) any work. I have tested and seen this several times now.

I tried to shut down the VM with "shutdown with ACPI" (or similar) and could see the VM shutting itself down. It ends with a kernel panic.

Restarting it with BOINC doesn't change anything.

I aborted the running WU and BOINC downloaded the next one, but this didn't change anything either.

This only changes if I reset the whole CMS project.

I am keeping the laptop in this state so you can ask me for more details or tell me what you want me to do. Or we can set up a remote session and you can check for yourself what is happening.
ID: 602 · Rating: 0
Phil

Joined: 9 Apr 15
Posts: 57
Credit: 230,221
RAC: 0
Message 603 - Posted: 16 Aug 2015, 18:28:46 UTC - in response to Message 602.  

I could understand this happening if you changed the IP address, but it ought to start a new BOINC job okay?
ID: 603 · Rating: 0
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 611 - Posted: 17 Aug 2015, 10:35:45 UTC - in response to Message 603.  
Last modified: 17 Aug 2015, 10:40:07 UTC

Okay, now I'm back in the office and still nothing works.

The VM is cycling through an endless loop; I see several tasks coming up in the list, e.g.
cvmfs2
Condor_master
rcu_sched
python
...

Okay, any more suggestions?
ID: 611 · Rating: 0
Ben Segal
Volunteer moderator
Volunteer developer
Volunteer tester

Joined: 12 Sep 14
Posts: 65
Credit: 544
RAC: 0
Message 616 - Posted: 17 Aug 2015, 12:02:44 UTC - in response to Message 611.  

> Okay, now I'm back in the office and still nothing works.
>
> The VM is cycling through an endless loop; I see several tasks coming up in the list, e.g.
> cvmfs2
> Condor_master
> rcu_sched
> python
> ...
>
> Okay, any more suggestions?

I confirm Yeti's problem. I began crunching a job this morning at CERN and my http://localhost:52628/logs/cmsRun-stdout.log got to:

Beginning CMSSW wrapper script
slc6_amd64_gcc491 scramv1 CMSSW
Performing SCRAM setup...
Completed SCRAM setup
Retrieving SCRAM project...
Untarring /home/boinc/CMSRun/glide_mD4lxg/execute/dir_7900/sandbox.tar.gz
Completed SCRAM project
Executing CMSSW
cmsRun -j FrameworkJobReport.xml PSet.py
Begin processing the 1st record. Run 1, Event 332001, LumiSection 3321 at 17-Aug-2015 12:19:56.705 CEST
Begin processing the 2nd record. Run 1, Event 332002, LumiSection 3321 at 17-Aug-2015 12:20:15.149 CEST
Begin processing the 3rd record. Run 1, Event 332003, LumiSection 3321 at 17-Aug-2015 12:20:23.110 CEST
Begin processing the 4th record. Run 1, Event 332004, LumiSection 3321 at 17-Aug-2015 12:20:23.361 CEST
Begin processing the 5th record. Run 1, Event 332005, LumiSection 3321 at 17-Aug-2015 12:20:25.690 CEST
Begin processing the 6th record. Run 1, Event 332006, LumiSection 3321 at 17-Aug-2015 12:20:36.888 CEST
Begin processing the 7th record. Run 1, Event 332007, LumiSection 3321 at 17-Aug-2015 12:20:40.238 CEST
Begin processing the 8th record. Run 1, Event 332008, LumiSection 3321 at 17-Aug-2015 12:20:58.656 CEST
Begin processing the 9th record. Run 1, Event 332009, LumiSection 3321 at 17-Aug-2015 12:20:59.812 CEST
Begin processing the 10th record. Run 1, Event 332010, LumiSection 3321 at 17-Aug-2015 12:21:18.956 CEST
Begin processing the 11th record. Run 1, Event 332011, LumiSection 3321 at 17-Aug-2015 12:21:30.309 CEST
Begin processing the 12th record. Run 1, Event 332012, LumiSection 3321 at 17-Aug-2015 12:21:33.413 CEST
Begin processing the 13th record. Run 1, Event 332013, LumiSection 3321 at 17-Aug-2015 12:21:34.948 CEST
Begin processing the 14th record. Run 1, Event 332014, LumiSection 3321 at 17-Aug-2015 12:21:40.721 CEST
Begin processing the 15th record. Run 1, Event 332015, LumiSection 3321 at 17-Aug-2015 12:21:43.875 CEST
Begin processing the 16th record. Run 1, Event 332016, LumiSection 3321 at 17-Aug-2015 12:21:54.981 CEST
Begin processing the 17th record. Run 1, Event 332017, LumiSection 3321 at 17-Aug-2015 12:22:00.220 CEST
Begin processing the 18th record. Run 1, Event 332018, LumiSection 3321 at 17-Aug-2015 12:22:02.181 CEST
Begin processing the 19th record. Run 1, Event 332019, LumiSection 3321 at 17-Aug-2015 12:22:02.278 CEST
Begin processing the 20th record. Run 1, Event 332020, LumiSection 3321 at 17-Aug-2015 12:22:02.283 CEST
Begin processing the 21st record. Run 1, Event 332021, LumiSection 3321 at 17-Aug-2015 12:22:30.825 CEST
Begin processing the 22nd record. Run 1, Event 332022, LumiSection 3321 at 17-Aug-2015 12:22:36.779 CEST
Begin processing the 23rd record. Run 1, Event 332023, LumiSection 3321 at 17-Aug-2015 12:22:38.556 CEST
Begin processing the 24th record. Run 1, Event 332024, LumiSection 3321 at 17-Aug-2015 12:22:39.363 CEST
Begin processing the 25th record. Run 1, Event 332025, LumiSection 3321 at 17-Aug-2015 12:22:46.413 CEST
Begin processing the 26th record. Run 1, Event 332026, LumiSection 3321 at 17-Aug-2015 12:23:18.093 CEST
Begin processing the 27th record. Run 1, Event 332027, LumiSection 3321 at 17-Aug-2015 12:23:22.806 CEST
Begin processing the 28th record. Run 1, Event 332028, LumiSection 3321 at 17-Aug-2015 12:23:23.711 CEST

Then I suspended the task with the BOINC manager and went to lunch. No change of PC location, I'm still at CERN. After lunch I did a BOINC task resume and got:

Begin processing the 29th record. Run 1, Event 332029, LumiSection 3321 at 17-Aug-2015 13:44:39.424 CEST

It's hanging there now, but still using 100% CPU.
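
A stall like this can be measured from the log itself. What follows is only a rough Python sketch, not part of the project's tooling: it assumes the "Begin processing the Nth record ... at <timestamp> CEST" line format shown above, treats the timestamps as naive local times (the time-zone label is ignored), and the names `last_record` and `seconds_since_last_record` are hypothetical helpers.

```python
import re
from datetime import datetime

# Matches lines such as:
# Begin processing the 29th record. Run 1, Event 332029, LumiSection 3321 at 17-Aug-2015 13:44:39.424 CEST
RECORD_RE = re.compile(
    r"Begin processing the (\d+)(?:st|nd|rd|th) record\..* at "
    r"(\d{2}-[A-Za-z]{3}-\d{4} \d{2}:\d{2}:\d{2}\.\d{1,6})"
)

# Two lines copied from the log excerpt above, used as sample input.
SAMPLE = [
    "Begin processing the 28th record. Run 1, Event 332028, LumiSection 3321 at 17-Aug-2015 12:23:23.711 CEST",
    "Begin processing the 29th record. Run 1, Event 332029, LumiSection 3321 at 17-Aug-2015 13:44:39.424 CEST",
]


def last_record(lines):
    """Return (record_number, timestamp) for the last record line, or None."""
    last = None
    for line in lines:
        match = RECORD_RE.search(line)
        if match:
            # %b expects English month abbreviations (default C locale).
            stamp = datetime.strptime(match.group(2), "%d-%b-%Y %H:%M:%S.%f")
            last = (int(match.group(1)), stamp)
    return last


def seconds_since_last_record(lines, now):
    """Seconds between the last logged record and `now`; None if no record was found."""
    found = last_record(lines)
    if found is None:
        return None
    return (now - found[1]).total_seconds()


if __name__ == "__main__":
    # Assumed "current time" for illustration only.
    print(seconds_since_last_record(SAMPLE, datetime(2015, 8, 17, 14, 0, 0)))
```

If the value keeps growing while the process still shows 100% CPU, the job is busy but not advancing through events, which matches the symptom described here.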

What is it doing? This looks pretty strange to me…

Ben
ID: 616 · Rating: 0
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 627 - Posted: 17 Aug 2015, 14:32:01 UTC

Okay, I found out a little bit more.

Below you see a graphic of PCs connecting to 130.246.180.119 + 120

At around 15:53 I suspended the laptop and its endless loop, and now the connections have stopped.

So, in the endless loop, the laptop is downloading something

ID: 627 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,944,473
RAC: 3,018
Message 638 - Posted: 17 Aug 2015, 17:28:44 UTC - in response to Message 627.  

Yes, we're chasing this but no handle on it just yet. In the meantime we advise against suspending tasks in the middle of a job (i.e. when cmsRun is showing on the ALT+F3 console). Unfortunately the jobs I sent the other day are a bit longer than previously -- they will take about two hours or so, depending on your processor. Sorry about that, but this is what the development phase is for.
Actually, some of the Condor jobs have been running considerably longer than that; I'll see if I can kill off the current set and send some shorter jobs.

OK, shorter jobs now in the queue.
ID: 638 · Rating: 0
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 639 - Posted: 17 Aug 2015, 17:51:18 UTC - in response to Message 638.  

> In the meantime we advise against suspending tasks in the middle of a job (i.e. when cmsRun is showing on the ALT+F3 console).

There is a big problem: some of my desktops are rebooted once every day.

And CMS loses the VM console; I can see nothing on it and no ALT+Fx works. I don't know whether the machine is doing well or not after a reboot.

I have rebooted my laptop twice; now I see it is doing something, but I cannot access it. And there is no new entry in cmsRun-stdout.log; it stays at the last entry from the reboot :-(
ID: 639 · Rating: 0
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 640 - Posted: 17 Aug 2015, 18:23:41 UTC

Of 4 machines that got rebooted, 2 have lost the connection to the VM console and 2 are running as if they hadn't been rebooted.

The only good thing: it is enough to cancel the running BOINC WU; after BOINC has initiated the next VM, all seems to work fine again.
ID: 640 · Rating: 0
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 641 - Posted: 17 Aug 2015, 18:49:25 UTC - in response to Message 640.  

> Of 4 machines that got rebooted, 2 have lost the connection to the VM console and 2 are running as if they hadn't been rebooted.
>
> The only good thing: it is enough to cancel the running BOINC WU; after BOINC has initiated the next VM, all seems to work fine again.


Don't know if this is relevant, but the two that survived the reboot are on 4.3.12 (!); the two that didn't come up well are on 4.3.28 and 4.3.30.
ID: 641 · Rating: 0
Crystal Pellet
Volunteer tester

Joined: 13 Feb 15
Posts: 1185
Credit: 849,977
RAC: 1,232
Message 644 - Posted: 17 Aug 2015, 21:05:54 UTC
Last modified: 17 Aug 2015, 22:00:38 UTC

Found suspicious lines in some logs. I did not dig further into it,
but it seems that the current cmsRun is using >97% CPU, while cmsRun-stdout.log stopped about 2 hours ago after processing the 89th record.

Log _condor_stdout ends with:
./CMSRunAnalysis.sh: line 162: 7739 Killed python CMSRunAnalysis.py -r "`pwd`" "$@"
+ jobrc=137
+ set +x
== The job had an exit code of 137
date: write error: Broken pipe


and in the log MasterLog, 2 hours ago:
08/17/15 18:59:21 (pid:7561) condor_write(): Socket closed when trying to write 2806 bytes to collector lcggwms02.gridpp.rl.ac.uk:9619, fd is 10, errno=104 Connection reset by peer
08/17/15 18:59:21 (pid:7561) Buf::write(): condor_write() failed
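
For reading the exit code in _condor_stdout: by shell convention, a status above 128 usually means the process was ended by a signal, with the signal number being the status minus 128. So 137 would correspond to signal 9 (SIGKILL), which fits the "Killed" in that log line. The small Python sketch below only illustrates this arithmetic; `describe_exit_code` is a hypothetical helper, and the signal names it prints come from the platform it runs on (assumed here to be Linux).

```python
import signal


def describe_exit_code(code):
    """Interpret a shell-style exit status (values above 128 mean 128 + signal number)."""
    if code > 128:
        signum = code - 128
        try:
            # Look up the symbolic name known to this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"terminated by signal {signum} ({name})"
    return f"exited with status {code}"


if __name__ == "__main__":
    print(describe_exit_code(137))
```

This says only that something sent the process a kill signal; it does not say who sent it or why.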


I'll reboot the VM and see whether real running is restored.

Edit: After the reboot cmsRun is running again, and this time ALT+F4 shows the output of the log named runGlideinout
and ALT+F5 shows the output of cmsRun-stdout.log, whereas most of the time those keystrokes give blank output.
Edit2: First CMS job returned and 2nd started.
ID: 644 · Rating: 0
ivan
Volunteer moderator
Project administrator
Project developer
Project tester
Project scientist
Joined: 20 Jan 15
Posts: 1129
Credit: 7,944,473
RAC: 3,018
Message 645 - Posted: 17 Aug 2015, 22:06:52 UTC - in response to Message 644.  

Sorry, CP, that was probably me killing the too-large jobs that I'd submitted the other day so that I could send smaller ones. The rl.ac.uk machine is the server that manages our jobs as Condor tasks.
ID: 645 · Rating: 0
m
Volunteer tester

Joined: 20 Mar 15
Posts: 243
Credit: 886,442
RAC: 90
Message 653 - Posted: 18 Aug 2015, 10:58:09 UTC - in response to Message 641.  

> Don't know if this is relevant, but the two that have survived the reboot are 4.3.12 (!), the two that didn't come up well are 4.3.28 and 4.3.30


Similar problems here. All hosts are normally shut down & powered off during the day. Doesn't always happen. VBox is 4.3.26 on some and 4.3.26r98988 on others. Don't know what the difference is nor how come I've not got the same VB on all. I haven't checked to see if both Linux and Win are affected or if particular hosts are affected most. Most of them don't have much RAM so maybe that is a complication here. Unfortunately it will be tomorrow afternoon before I can have a good daytime look.
ID: 653 · Rating: 0
Yeti
Joined: 29 May 15
Posts: 147
Credit: 2,842,484
RAC: 0
Message 658 - Posted: 18 Aug 2015, 13:59:16 UTC

Okay, as requested in another thread I reset my CMS project on the laptop and indeed, now it works much better!

Switching from office to home and back didn't cause a problem; if I only "Exit" BOINC it seems to be fine (okay, I lose the crunching progress of the current run), but the VM stays functional.

So, I will not use "suspend" until we are told that it should work now.

Crunchers who participated earlier than Monday of this week should consider resetting their project.
ID: 658 · Rating: 0

©2024 CERN