[SERVER-16958] Hang on shutdown with WiredTiger
Created: 20/Jan/15  Updated: 21/Apr/15  Resolved: 21/Apr/15

Status: Closed
Project: Core Server
Component/s: Storage, WiredTiger
Affects Version/s: 2.8.0-rc5
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Daniel Pasette (Inactive)
Assignee: Michael Cahill (Inactive)
Resolution: Cannot Reproduce
Votes: 0
Labels: None

Attachments: 733threads.html (HTML File), Screenshot 2015-02-12 15.53.36.png (PNG File), gdb.txt.1 (File), hang.txt (Text File), hang2.txt (Text File)
Operating System: ALL

 Description   

The user tried to shut the server down cleanly with:

use admin
db.shutdownServer()

Stack trace can be found in internal dropbox here: https://dropbox.10gen.com/cailin/2015-01-20-15-57/gdb.txt



 Comments   
Comment by Daniel Pasette (Inactive) [ 14/Feb/15 ]

It's in master and 3.0. An easy way to look at what's been merged in can be found here:
src/third_party/wiredtiger/NEWS.MONGODB

Comment by Asya Kamsky [ 13/Feb/15 ]

michael.cahill
I believe my stack trace shows the same __wt_txn_checkpoint frame, which is indicative of eviction during a checkpoint.

Thread 728 (Thread 0x7f41e6f0d700 (LWP 19977)):
#0  memset () at ../sysdeps/x86_64/memset.S:94
#1  0x00000000013181d5 in __wt_realloc ()
#2  0x0000000001325bab in ?? ()
#3  0x000000000132b29c in ?? ()
#4  0x0000000001330de1 in ?? ()
#5  0x000000000133173a in __wt_reconcile ()
#6  0x0000000001302252 in ?? ()
#7  0x0000000001302a56 in __wt_evict ()
#8  0x000000000130044d in __wt_evict_page ()
#9  0x00000000012be5d2 in __wt_page_in_func ()
#10 0x00000000012d0717 in __wt_tree_walk ()
#11 0x00000000012ca6b0 in __wt_cache_op ()
#12 0x000000000135168a in __wt_txn_checkpoint ()
#13 0x0000000001345bc6 in ?? ()
#14 0x00000000012e0bcc in ?? ()
#15 0x00007f41ea9d6182 in start_thread (arg=0x7f41e6f0d700) at pthread_create.c:312
#16 0x00007f41e9ad6fbd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

I will test the latest - it's in master, yes? dan@10gen.com, is this fix also in the 3.0 branch?

Comment by Daniel Pasette (Inactive) [ 12/Feb/15 ]

Attaching backtraces for hang.txt and hang2.txt.

These were caught by trying to ^C the server during a heavy insert workload running without journaling against 3.0.0-rc8.
I see michael.cahill included a possibly related fix this morning. I would like to confirm this if possible.

commit 04ec3d021d2f8b08b69d3ea5d0f243f468c71f2e
Author: Michael Cahill <michael.cahill@wiredtiger.com>
Date:   Thu Feb 12 13:00:49 2015 +1100
 
    Move server thread waits to the beginning of their loops: check that we're still running before waiting.  This makes more sense to me, but also fixes a prob
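
For readers unfamiliar with the pattern the commit message describes, here is a minimal generic sketch in Python. It is not WiredTiger's C code, and the class and method names (ServerThread, request_stop, passes) are hypothetical. It only illustrates the idea of checking the "still running" flag at the top of the loop, before waiting, so that a stop request issued before the thread reaches its wait is not slept through.

```python
import threading


class ServerThread:
    """Hypothetical background worker; illustrative only, not WiredTiger code."""

    def __init__(self, interval=0.05):
        self._cond = threading.Condition()
        self._running = True
        self._interval = interval
        self.passes = 0  # number of work passes completed
        self._thread = threading.Thread(target=self._loop, daemon=True)

    def _loop(self):
        with self._cond:
            while True:
                # Check the flag first, under the lock: a stop request made
                # before this point is seen here rather than after a wait.
                if not self._running:
                    break
                # wait() releases the lock while blocked and reacquires it
                # when notified or when the timeout expires.
                self._cond.wait(timeout=self._interval)
                if self._running:
                    self.passes += 1  # stand-in for one unit of real work

    def start(self):
        self._thread.start()

    def request_stop(self):
        # Clear the flag and wake the worker while holding the same lock.
        with self._cond:
            self._running = False
            self._cond.notify_all()

    def join(self, timeout=None):
        # Returns True if the worker thread has exited.
        self._thread.join(timeout)
        return not self._thread.is_alive()
```

In this sketch, calling request_stop() before start() makes the worker exit on its first flag check. If the wait came first, the same sequence would block for the full interval before the flag was examined.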

Comment by Asya Kamsky [ 12/Feb/15 ]

Out of 733 threads, 590 are in TicketHolder::waitForTicket()

Comment by Eric Milkie [ 20/Jan/15 ]

I'm hoping this issue is already fixed in rc6.

Comment by Eric Milkie [ 20/Jan/15 ]

All the threads are blocked on a condvar in WiredTiger (or on a dblock), except for one thread that is currently running in __wt_evict_lru_page(). Was the system busy at the time of the hang?
