[SERVER-21619] sys-perf: WT crash during core_workloads_WT execution Created: 22/Nov/15  Updated: 03/Mar/16  Resolved: 30/Nov/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: 3.2.0-rc5

Type: Bug Priority: Critical - P2
Reporter: Rui Zhang (Inactive) Assignee: Alexander Gorrod
Resolution: Done Votes: 0
Labels: sys-perf-reg
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by WT-1973 MongoDB changes for WiredTiger 2.7.0 Closed
is depended on by SERVER-21686 WiredTiger changes for 3.2.0-rc5 Closed
Related
related to SERVER-21635 No coredump on Linux when starting wi... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

There were some random crashes of mongod during core_workloads_WT test in system-perf. observations:

  • this happens in all non-sharded setups (standalone, 1-node replSet, 3-node replSet) randomly
  • Seems that it always happens during insert_ttl test
  • There was no core file generated, I am trying to figure out why. I did apply "ulimit -c unlimited" during start up mongod.
  • Manual run of insert_ttl does not re-create the crash yet, maybe need to run the whole suite?
  • The earliest SHA with this issue is 3f598f1edc (test report)

Stack trace from the mongod log file,

 [2015/11/21 17:50:03.017] 2015-11-21T22:46:58.450+0000 I NETWORK  [conn437] end connection 10.2.0.98:51172 (7 connections now open)
 [2015/11/21 17:50:03.017] 2015-11-21T22:46:59.372+0000 F -        [thread1] Invalid access at address: 0xc8
 [2015/11/21 17:50:03.017] 2015-11-21T22:46:59.379+0000 F -        [thread1] Got signal: 11 (Segmentation fault).
 [2015/11/21 17:50:03.017]  0x12c99f2 0x12c8929 0x12c8ca8 0x7ff85a737130 0x19f9699 0x19f9937 0x19fce8f 0x19c91ad 0x19c68a4 0x19c6b5a 0x7ff85a72fdf3 0x7ff85a45d1ad
 [2015/11/21 17:50:03.017] ----- BEGIN BACKTRACE -----
 [2015/11/21 17:50:03.017] {"backtrace":[{"b":"400000","o":"EC99F2"},{"b":"400000","o":"EC8929"},{"b":"400000","o":"EC8CA8"},{"b":"7FF85A728000","o":"F130"},{"b":"400000","o":"15F9699"},{"b":"400000","o":"15F9937"},{"b":"400000","o":"15FCE8F"},{"b":"400000","o":"15C91AD"},{"b":"400000","o":"15C68A4"},{"b":"400000","o":"15C6B5A"},{"b":"7FF85A728000","o":"7DF3"},{"b":"7FF85A367000","o":"F61AD"}],"processInfo":{ "mongodbVersion" : "3.2.0-rc3-95-g6f2a7e6", "gitVersion" : "6f2a7e6cfb69e186ee2d5ca8653dda5bf0633ef7", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.14.35-28.38.amzn1.x86_64", "version" : "#1 SMP Wed Mar 11 22:50:37 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFC129D4000", "elfType" : 3 }, { "b" : "7FF85B364000", "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "b" : "7FF85B160000", "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "b" : "7FF85AE5C000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3 }, { "b" : "7FF85AB5A000", "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "b" : "7FF85A944000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7FF85A728000", "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "b" : "7FF85A367000", "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "b" : "7FF85B56C000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
 [2015/11/21 17:50:03.017]  mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x12c99f2]
 [2015/11/21 17:50:03.017]  mongod(+0xEC8929) [0x12c8929]
 [2015/11/21 17:50:03.017]  mongod(+0xEC8CA8) [0x12c8ca8]
 [2015/11/21 17:50:03.017]  libpthread.so.0(+0xF130) [0x7ff85a737130]
 [2015/11/21 17:50:03.017]  mongod(+0x15F9699) [0x19f9699]
 [2015/11/21 17:50:03.017]  mongod(+0x15F9937) [0x19f9937]
 [2015/11/21 17:50:03.017]  mongod(__wt_reconcile+0x27F) [0x19fce8f]
 [2015/11/21 17:50:03.018]  mongod(__wt_evict+0x28D) [0x19c91ad]
 [2015/11/21 17:50:03.018]  mongod(+0x15C68A4) [0x19c68a4]
 [2015/11/21 17:50:03.018]  mongod(+0x15C6B5A) [0x19c6b5a]
 [2015/11/21 17:50:03.018]  libpthread.so.0(+0x7DF3) [0x7ff85a72fdf3]
 [2015/11/21 17:50:03.018]  libc.so.6(clone+0x6D) [0x7ff85a45d1ad]
 [2015/11/21 17:50:03.018] -----  END BACKTRACE  -----

decode

[ec2-user@ip-10-2-0-98 t]$ addr2line -e ./mongodb/bin/mongod 0x12c99f2 0x12c8929 0x12c8ca8 0x7ff85a737130 0x19f9699 0x19f9937 0x19fce8f 0x19c91ad 0x19c68a4 0x19c6b5a 0x7ff85a72fdf3 0x7ff85a45d1ad
/srv/10gen/mci-exec/mci/src/src/mongo/util/stacktrace_posix.cpp:172
/srv/10gen/mci-exec/mci/src/src/mongo/util/signal_handlers_synchronous.cpp:180
/srv/10gen/mci-exec/mci/src/src/mongo/util/signal_handlers_synchronous.cpp:275
??:0
/srv/10gen/mci-exec/mci/src/src/third_party/wiredtiger/src/reconcile/rec_write.c:1980
/srv/10gen/mci-exec/mci/src/src/third_party/wiredtiger/src/reconcile/rec_write.c:4572
/srv/10gen/mci-exec/mci/src/src/third_party/wiredtiger/src/reconcile/rec_write.c:412
/srv/10gen/mci-exec/mci/src/src/third_party/wiredtiger/src/evict/evict_page.c:480
/srv/10gen/mci-exec/mci/src/src/third_party/wiredtiger/src/evict/evict_lru.c:1467
/srv/10gen/mci-exec/mci/src/src/third_party/wiredtiger/src/evict/evict_lru.c:818
??:0
??:0

more details here
https://evergreen.mongodb.com/task_log_raw/sys_perf_linux_1_node_replSet_core_workloads_WT_6f2a7e6cfb69e186ee2d5ca8653dda5bf0633ef7_15_11_20_23_29_14/0?type=T#L482 and link to mongod.tar.gz (https://s3.amazonaws.com/mciuploads/dsi/sys_perf_6f2a7e6cfb69e186ee2d5ca8653dda5bf0633ef7/6f2a7e6cfb69e186ee2d5ca8653dda5bf0633ef7/mongod-sys_perf_6f2a7e6cfb69e186ee2d5ca8653dda5bf0633ef7.tar.gz ) which was not stripped.

and few other crashes:



 Comments   
Comment by Githook User [ 27/Jan/16 ]

Author:

{u'name': u'Ramon Fernandez', u'email': u'ramon@mongodb.com'}

Message: Import wiredtiger-wiredtiger-2.7.0-505-g7fea169.tar.gz from wiredtiger branch mongodb-3.4

ref: 44463c5..7fea169

WT-2355 Fix minor scratch buffer usage in logging.
WT-2348 xargs -P isn't portable
WT-2347 Java: schema format edge cases
WT-2344 OS X compiler warning
WT-2342 Enhance wtperf to support background create and drop operations
WT-2340 Add logging guarantee assertions, whitespace
WT-2339 format post-rebalance verify failure (stress run #11586)
WT-2338 Disable using pre-allocated log files when backup cursor is open
WT-2335 NULL pointer crash in config_check_search with invalid configuration string
WT-2333 Add a flag so drop doesn't block
WT-2332 Bug in logging write-no-sync mode
WT-2331 Checking of search() result for reference cursors before join()
WT-2328 schema drop does direct unlink, it should use a block manager interface.
WT-2326 Change WTPERF to use new memory allocation functions instead of the standard
WT-2321 WT-2321: race between eviction and worker threads on the eviction queue
WT-2320 Only check copyright when cutting releases
WT-2316 stress test failure: WT_CURSOR.prev out-of-order returns
WT-2314 page-swap error handling is inconsistent
WT-2313 sweep-server: conn_dhandle.c, 610: dhandle != conn->cache->evict_file_next
WT-2312 re-creating a deleted column-store page can corrupt the in-memory tree
WT-2308 custom extractor for ref_cursors in join cursor
WT-2305 Fix coverity scan issues on 23/12/2015
WT-2296 New log algorithm needs improving for sync/flush settings
WT-2295 WT_SESSION.create does a full-scan of the main table
WT-2287 WT_SESSION.rebalance
WT-2275 broken DB after application crash
WT-2267 Improve wtperf throttling implementation to provide steady load
WT-2247 variable-length column-store in-memory page splits
WT-2242 WiredTiger treats dead trees the same as other trees in eviction
WT-2142 Connection cleanup in Python tests
WT-2073 metadata cleanups
WT-1801 Add a directory sync after rollback of a WT_SESSION::rename operation
WT-1517 schema format edge cases
SERVER-22064 Coverity analysis defect 77699: Unchecked return value
SERVER-21619 sys-perf: WT crash during core_workloads_WT execution
Branch: master
https://github.com/mongodb/mongo/commit/90118b147a6943b19dc929862a11071538db1438

Comment by Githook User [ 11/Jan/16 ]

Author:

{u'username': u'keithbostic', u'name': u'Keith Bostic', u'email': u'keith@wiredtiger.com'}

Message: SERVER-22064: Coverity, function return value not checked for error

Coverity analysis defect 77699: Unchecked return value, problem
introduced in SERVER-21619 change, commit 354c031.

Instead of calling __wt_evict_page_clean_update() when discarding pages,
call __wt_ref_out() directly, __wt_evict_page_clean_update() doesn't do
any useful additional work.

This allows __wt_evict_page_clean_update() to be static in evict_page.c,
rename to __evict_page_clean_update().
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/ff24c1f861383f015196accf825c15f1441d16af

Comment by Githook User [ 30/Nov/15 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: Import wiredtiger-wiredtiger-mongodb-3.2-rc4-41-g8326df6.tar.gz from wiredtiger branch mongodb-3.2

ref: b65381f..8326df6

4c49948 WT-2244 Trigger in-memory splits sooner.
9f2e4f3 WT-2248 WT_SESSION.close is updating WT_CONNECTION_IMPL.default_session.
a6da10e SERVER-21553 Enable fast-path truncate after splits.
39dfd21 WT-2243 Don't keep transaction IDs pinned for reading from checkpoints.
4e1844c WT-2230 multi-split error path.
cace179 WT-2228 avoid unnecessary raw-compression calls.
890ee34 SERVER-21619 Don't do internal page splits after a tree is marked DEAD.
6c7338f WT-2241 Use a lock to protect transaction ID allocation.
978c237 WT-2234 Coverity analysis warnings.
Branch: master
https://github.com/mongodb/mongo/commit/e7181b542b25981db42f74cdaee4e7fc323d3e9d

Comment by Githook User [ 26/Nov/15 ]

Author:

{u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'}

Message: Merge pull request #2336 from wiredtiger/server-21619-dont-split-dead-tree

SERVER-21619 Don't do internal page splits after a tree is marked DEAD.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/890ee3447449fc72d5247035334f28c9f50bb100

Comment by Githook User [ 26/Nov/15 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: SERVER-21619 Revert an assertion change.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/9ecf70c9c09129b71f7721754a69226e2c4b73d2

Comment by Githook User [ 26/Nov/15 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: SERVER-21619 Push down where we mark pages clean so we don't have to repeat that logic.

Switch a few boolean values from 0/1 to false/true.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/354c0314cdf249195f4299f9e4be499dbf5cce05

Comment by Githook User [ 26/Nov/15 ]

Author:

{u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexg@wiredtiger.com'}

Message: SERVER-21619 Don't do internal page splits after a tree is marked DEAD.

It leads to problems where eviction attempts to write back to a
file after the block manager is already closed.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/d55a5b1a037f7a840a070d95327b877c5d79008a

Comment by David Hows [ 24/Nov/15 ]

The following steering variables for the insert_ttl.js test seem to let me reproduce fairly reliably.

Short version, more runs @ 16 threads and runtime down to 2 minutes from 3.

var db_name = "ttlTest";                                                                                                                           
var expTime = 0; // 0 seconds expireAfterSeconds                                                                                                   
var monitorSleep = 5; // 5 second ttlMonitorSleepSecs                                                                                              
var batchSize = 1000;                                                                                                                              
var docSize = 1024;                                                                                                                                
var testTime = 120;                                                                                                                                
var threads = [16, 16, 16, 16, 16, 16, 16, 16, 16, 16, 16];  

Comment by David Hows [ 24/Nov/15 ]

Background as to what we have found thus far.

The segfault is because we are trying to evict a page from a tree that has already been closed.

Following the drop during the start of the next test we see the error. It seems that we need to have the test running inserts and TTL deletes for long enough to get the table big enough, and then drop it to cause the fault.

Will attach a modified version of the test shortly that should have better luck in reproduction.

Comment by David Hows [ 24/Nov/15 ]

Okay, just brute force'd it myself on the tip of master and inside of GDB.

(gdb) bt
#0  0x0000000001bb09a3 in __rec_split_init (session=session@entry=0x2df1800, r=r@entry=0x6249e40, page=page@entry=0x8579340, recno=recno@entry=0, max=4096)
    at src/third_party/wiredtiger/src/reconcile/rec_write.c:1980
#1  0x0000000001bb0c77 in __rec_row_int (session=session@entry=0x2df1800, r=r@entry=0x6249e40, page=page@entry=0x8579340)
    at src/third_party/wiredtiger/src/reconcile/rec_write.c:4579
#2  0x0000000001bb384e in __wt_reconcile (session=session@entry=0x2df1800, ref=ref@entry=0x7e1e310, salvage=salvage@entry=0x0, flags=1)
    at src/third_party/wiredtiger/src/reconcile/rec_write.c:412
#3  0x0000000001b80bed in __evict_review (closing=false, inmem_splitp=<synthetic pointer>, ref=0x7e1e310, session=0x2df1800) at src/third_party/wiredtiger/src/evict/evict_page.c:481
#4  __wt_evict (session=session@entry=0x2df1800, ref=ref@entry=0x7e1e310, closing=closing@entry=false) at src/third_party/wiredtiger/src/evict/evict_page.c:79
#5  0x0000000001b7e0a6 in __evict_page (session=session@entry=0x2df1800, is_server=is_server@entry=false) at src/third_party/wiredtiger/src/evict/evict_lru.c:1467
#6  0x0000000001b7e2ba in __evict_lru_pages (is_server=<optimized out>, session=<optimized out>) at src/third_party/wiredtiger/src/evict/evict_lru.c:818
#7  __evict_worker (arg=0x2b18500) at src/third_party/wiredtiger/src/evict/evict_lru.c:434
#8  0x00007ffff6f234a4 in start_thread () from /usr/lib/libpthread.so.0
#9  0x00007ffff6c6113d in clone () from /usr/lib/libc.so.6

First quick and dirty finding is that the btree's block manager is currently 0, so thats the cause of the segfault.

Question now is, why is there no block manager? More to follow.

Comment by David Hows [ 24/Nov/15 ]

rui.zhang,

Was there any particular trick to reproducing or were you just running the perf test over and over?

Can you share your method that finally got a repro going as well as the githash of the binary in question?

I'd like to do some diving this end and don't want to spend my time re-inventing the wheel.

Thanks!

Comment by Michael Cahill (Inactive) [ 24/Nov/15 ]

daveh86, can you please take a look at this one: let me know if you can reproduce it and if so, what information you can gather?

Comment by Michael Cahill (Inactive) [ 23/Nov/15 ]

rui.zhang, what we're seeing here is WiredTiger attempting to evict a dirty page and write a block into a data file, but WT's "block manager" structure is NULL. I haven't seen this elsewhere.

The code at the crash site hasn't changed recently, but two things occur to me as possibilities:

  1. if a collection is dropped, it stays in the system temporarily with pages in cache after the block manager structure is freed. However, the tree should be marked "dead" in that case so eviction just discards pages from cache without writing.
  2. a race during shutdown, where eviction is running as WiredTiger discards trees from cache.

Can you please post details about how to run the test? Even if it doesn't repro every time, we can leave it running in a loop and see what happens when it fails.

Comment by Rui Zhang (Inactive) [ 22/Nov/15 ]

Feel free to take this if the details are enough. Will work on a simple/reliable repro and try to get the core file.

Comment by Rui Zhang (Inactive) [ 22/Nov/15 ]

The range of SHA this may introduced, 3f598f1edc is the first show this issue here

3f598f1edc Kaloian Manassiev - SERVER-21542 Optimize spinlock on non-windows systems This reverts commit 30458a9c6160b01925a8bdd768deb98ed8da8cb0.
 
30458a9c61 - Kaloian Manassiev
Revert "SERVER-21542 Optimize spinlock on non-windows systems" This reverts commit 18600ed909f739ab1bc3032c2a230aeac1abe1c2.
 
11/19/15 6:42 AM
a0771ea5ec - Ramon Fernandez
Import wiredtiger-wiredtiger-mongodb-3.2.0-rc3-177-g9d375e3.tar.gz from wiredtiger branch mongodb-3.2 ref: d9ec1ff..9d375e3 16c0a1a WT-1315 Fix some leaks with join cursors. 59857f9 WT-2222 Add statistics for named snapshots. 4368d39 WT-1315 Cursor join implementation a72ddb7 WT-2218 Add truncate stats fb9cebe WT-2224 Track which deleted refs are discarded by a split. e2f1130 WT-2220 Split WT_TIMEDIFF macro into unit specific macros. be412b5 WT-2182 when internal pages grow large enough, split them into their parents ce8c091 WT-2219 Enhancements to in-memory testing 347d922 WT-2220 time_t cleanup. 08c0fcd WT-2217 change WT_CURSOR.insert to clear "set" key/value on return d1b5e7f WT-2135 Fix log_only setting for backup cursor. Fix initialization. 78bd4ac WT-2210 raw compression fails if row-store recovery precedes column-store recovery c1b2634 WT-2182 fixes for splitting up the tree. 0a1ee34 WT-2199 Fix transaction sync inconsistency. ee31bb2 WT-2182 Simplify the split deepen logic. c360d53 WT-2212 Add a "use_environment" config to "wiredtiger_open" 3f132a4 WT-2182 detect internal page split races.
 
11/19/15 6:39 AM
042d8fa2d2 - Mathias Stearn
SERVER-21542 replace non-standard pthread_yield with standardized sched_yield
 
11/19/15 6:34 AM
24fb61592a - Randolph Tan
SERVER-21340 Add basic test for sharding server status
 
11/18/15 7:47 PM
864111bec7 - David Hows
SERVER-20996 - Avoid advancing cursors multiple times
 
11/18/15 7:35 PM
1944e28410 - David Hows
SERVER-21078 - Fix race conditions by using the session cache more within the wtkv engine code
 
11/18/15 4:18 PM
e2eff50dba - David Hows
SERVER-20877 - fix linting errors
 
11/18/15 3:26 PM
68dfaba3e8 - Kaloian Manassiev
SERVER-21489 enableSharding should return error if sharding is already enabled
 
11/18/15 3:26 PM
f42a080cc0 - Kaloian Manassiev
SERVER-21533 Lock manager should not starve requests This change makes the lock manager granting code skip granting compatible requests if the first entry on the queue is a conflict.
 
11/18/15 3:26 PM
c5e3d38ac9 - David Hows
SERVER-20877 - Allow getSession to specify if a WiredTiger txn is needed

Generated at Thu Feb 08 03:57:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.