ISSUE SUMMARY
A bug in the WiredTiger algorithm for page management may cause the server to abort with a message like the following:
2015-02-21T07:08:08.397-0800 E STORAGE [WiredTigerRecordStoreThread for local.oplog.$main] WiredTiger (22) [1424531288:395397][1514015:0x7f903a7ff700], session.truncate: the start cursor position is after the stop cursor position: Invalid argument
|
2015-02-21T07:08:08.399-0800 I - [WiredTigerRecordStoreThread for local.oplog.$main] Invariant failure: ret resulted in status BadValue 22: Invalid argument at src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp 518
|
USER IMPACT
The mongod process aborts after the invariant failure and needs to be restarted. This can only impact capped collections.
WORKAROUNDS
N/A
AFFECTED VERSIONS
MongoDB 3.0.0 running with the WiredTiger storage engine.
FIX VERSION
The fix is included in the 3.0.1 production release.
RESOLUTION DETAILS
WiredTiger no longer incorrectly positions the start cursor during the truncate operation.
Original description
I get this from iibench-mongodb with 10 insert threads, 1 query thread and zlib after about 300M docs have been inserted. It did not occur with snappy. Prior to this crash there were many stalls (slow inserts) and that was also only an issue with zlib. This appears to be different than the "session.truncate" crash that occurred prior to rc9 (see https://jira.mongodb.org/browse/SERVER-17141)
log has this content:
2015-02-21T07:08:08.397-0800 E STORAGE [WiredTigerRecordStoreThread for local.oplog.$main] WiredTiger (22) [1424531288:395397][1514015:0x7f903a7ff700], session.truncate: the start cursor position is after the stop cursor position: Invalid argument
|
2015-02-21T07:08:08.399-0800 I - [WiredTigerRecordStoreThread for local.oplog.$main] Invariant failure: ret resulted in status BadValue 22: Invalid argument at src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp 518
|
2015-02-21T07:08:08.409-0800 I CONTROL [WiredTigerRecordStoreThread for local.oplog.$main]
|
0x117b00b 0x1114881 0x10f7fb8 0xf267df 0xf2cbbf 0x10f941e 0x11c8358 0x7f904931e7c9 0x7f90490578ad 0
|
----- BEGIN BACKTRACE -----
|
{"backtrace":[{"b":"400000","o":"D7B00B"},{"b":"400000","o":"D14881"},{"b":"400000","o":"CF7FB8"},{"b":"400000","o":"B267DF"},{"b":"400000","o":"B2CBBF"},{"b":"400000","o":"CF941E"},{"b":"400000","o":"DC8358"},{"b":"7F9049317000","o":"77C9"},{"b":"7F9048F4B000","o":"10C8AD"},{"b":"0","o":"0"}],"processInfo":{ "mongodbVersion" : "3.0.0-rc9", "gitVersion" : "nogitversion", "uname" : { "sysname" : "Linux", "release" : "3.10.53-43_fbk14_02627_gb5c286c", "version" : "#1 SMP Mon Oct 20 15:06:57 PDT 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "9D8BCA7928778A629A6BD7DD44CF150933BC2E78" }, { "b" : "7FFFA18FE000", "path" : "linux-vdso.so.1", "elfType" : 3, "buildId" : "50103D521098AC1A8DA637FE9F84CCFE48B62B25" }, { "b" : "7F9049F87000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/librt.so.1", "elfType" : 3, "buildId" : "36E417C497D7D99E323324B54709FFF95E21F209" }, { "b" : "7F9049D83000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libdl.so.2", "elfType" : 3, "buildId" : "7258590522FF8998FF3DC9CD4800354FF634B049" }, { "b" : "7F9049A52000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libstdc++.so.6", "elfType" : 3, "buildId" : "AADD96ABFCFED89F9C3343339F009D973AC5822F" }, { "b" : "7F904974B000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libm.so.6", "elfType" : 3, "buildId" : "5180B0D6214EFC6BD4024BCC46459868BB9FF312" }, { "b" : "7F9049534000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libgcc_s.so.1", "elfType" : 3, "buildId" : "CAE88F8F821F6A585CC52F17AA185621534F442C" }, { "b" : "7F9049317000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libpthread.so.0", "elfType" : 3, "buildId" : "F3CF4B6D065A6FCF55D752C1B7FBF951C041B871" }, { "b" : "7F9048F4B000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libc.so.6", "elfType" : 3, "buildId" : "C645843801A68480D80BF9D0FD4A20182CBD1D9C" }, { "b" : "7F904A190000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/ld.so", "elfType" : 3, "buildId" : "FAF6C327CBADE1769B09F51830B64480E9FFDC82" } ] }}
|
mongod.30rc9.jem.norox(_ZN5mongo15printStackTraceERSo+0x2B) [0x117b00b]
|
mongod.30rc9.jem.norox(_ZN5mongo10logContextEPKc+0xE1) [0x1114881]
|
mongod.30rc9.jem.norox(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0x108) [0x10f7fb8]
|
mongod.30rc9.jem.norox(_ZN5mongo21WiredTigerRecordStore27cappedDeleteAsNeeded_inlockEPNS_16OperationContextERKNS_8RecordIdE+0x8EF) [0xf267df]
|
mongod.30rc9.jem.norox(+0xB2CBBF) [0xf2cbbf]
|
mongod.30rc9.jem.norox(_ZN5mongo13BackgroundJob7jobBodyEv+0x11E) [0x10f941e]
|
mongod.30rc9.jem.norox(+0xDC8358) [0x11c8358]
|
libpthread.so.0(+0x77C9) [0x7f904931e7c9]
|
libc.so.6(clone+0x6D) [0x7f90490578ad]
|
??? [0]
|
----- END BACKTRACE -----
|
2015-02-21T07:08:08.409-0800 I - [WiredTigerRecordStoreThread for local.oplog.$main]
|
|
***aborting after invariant() failure
|
|
|
Note: we believe the fix in https://github.com/wiredtiger/wiredtiger/pull/1731 is correct. It has been merged into WiredTiger develop and will go through MongoDB testing shortly.
|
|
Possible fix linked here, testing now.
|
|
michael.cahill, I've gotten a look at the tree. The one oddity I see is the path to the incorrect search key is via a leaf page referenced from an internal page that otherwise only references internal pages. So, it looks like this:
0x3cc98b200: row-store internal
|
disk (nil), entries 100, dirty, write generation=1
|
...
|
K {1425008395, 1420} <<< internal
|
memory 0x147218aa80 [NoAddr]
|
K {1425008395, 5883} <<< internal
|
memory 0x147218aad0 [NoAddr]
|
K {1425008396, 1919} <<< internal
|
memory 0x147218ab20 [NoAddr]
|
K {1425008399, 581} <<< internal
|
memory 0x147218ab70 [NoAddr]
|
K {1425008400, 1478} <<< internal
|
memory 0xd90eae7b0 [2934874112-2934878208, 4096, 4092605218]
|
K {1425008401, 8096} <<< internal
|
memory 0x147218ac10 [NoAddr]
|
K {1425008423, 1497} <<< internal: CORRECT "FIRST" KEY
|
memory 0x147218ac60 [NoAddr]
|
K {1425008628, 3139} <<< leaf: INCORRECT "FIRST" KEY
|
memory 0x55d90ba40 [1708761088-1708793856, 32768, 2920185554]
|
It's also interesting the write-generation of the internal page shared by both the correct/incorrect search keys is 1.
|
|
I've now seen this failure on a system with the SERVER-17141 fix, so I don't believe they're related.
|
|
michael.cahill, here's the cursor pattern when we reach the failure:
6119733380620749582, 6119733380620752036 <<< start, stop cursor keys
|
6119733380620752037, 6119733380620754371
|
6119733380620754372, 6119733380620756227
|
6119733380620756228, 6119733384915715571
|
6119733384915715572, 6119733384915716610
|
6119733384915716611, 6119733384915716910
|
6119733384915716911, 6119733384915716931
|
6119733384915716932, 6119733384915717296
|
6119733384915719796, 6119733384915717920 <<< start, stop cursor keys that failed order test
|
|
6119733384915717297 <<< first record key in the collection at time of failure
|
6119739174531629688 <<< last record key
|
So, the failing start cursor is valid for the collection, but for some reason has skipped about 2,000 records forward from the beginning?
|
|
Another crash at 300M inserts
2015-02-25T00:28:41.013-0800 E STORAGE [WiredTigerRecordStoreThread for local.oplog.$main] WiredTiger (22) [1424852921:11118][3107994:0x7f9bde7ff700], session.truncate: the start cursor position is after the stop cursor position: Invalid argument
|
2015-02-25T00:28:41.014-0800 I - [WiredTigerRecordStoreThread for local.oplog.$main] Invariant failure: ret resulted in status BadValue 22: Invalid argument at src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp 518
|
2015-02-25T00:28:41.024-0800 I CONTROL [WiredTigerRecordStoreThread for local.oplog.$main]
|
0x117b00b 0x1114881 0x10f7fb8 0xf267df 0xf2cbbf 0x10f941e 0x11c8358 0x7f9bed9cb7c9 0x7f9bed7048ad 0
|
----- BEGIN BACKTRACE -----
|
{"backtrace":[{"b":"400000","o":"D7B00B"},{"b":"400000","o":"D14881"},{"b":"400000","o":"CF7FB8"},{"b":"400000","o":"B267DF"},{"b":"400000","o":"B2CBBF"},{"b":"400000","o":"CF941E"},{"b":"400000","o":"DC8358"},{"b":"7F9BED9C4000","o":"77C9"},{"b":"7F9BED5F8000","o":"10C8AD"},{"b":"0","o":"0"}],"processInfo":{ "mongodbVersion" : "3.0.0-rc9", "gitVersion" : "nogitversion", "uname" : { "sysname" : "Linux", "release" : "3.10.53-43_fbk14_02627_gb5c286c", "version" : "#1 SMP Mon Oct 20 15:06:57 PDT 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "BD10D7FB7B275429775254B4DACB57EA6DF6BD40" }, { "b" : "7FFFAE6B2000", "path" : "linux-vdso.so.1", "elfType" : 3, "buildId" : "50103D521098AC1A8DA637FE9F84CCFE48B62B25" }, { "b" : "7F9BEE634000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/librt.so.1", "elfType" : 3, "buildId" : "36E417C497D7D99E323324B54709FFF95E21F209" }, { "b" : "7F9BEE430000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libdl.so.2", "elfType" : 3, "buildId" : "7258590522FF8998FF3DC9CD4800354FF634B049" }, { "b" : "7F9BEE0FF000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libstdc++.so.6", "elfType" : 3, "buildId" : "AADD96ABFCFED89F9C3343339F009D973AC5822F" }, { "b" : "7F9BEDDF8000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libm.so.6", "elfType" : 3, "buildId" : "5180B0D6214EFC6BD4024BCC46459868BB9FF312" }, { "b" : "7F9BEDBE1000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libgcc_s.so.1", "elfType" : 3, "buildId" : "CAE88F8F821F6A585CC52F17AA185621534F442C" }, { "b" : "7F9BED9C4000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libpthread.so.0", "elfType" : 3, "buildId" : "F3CF4B6D065A6FCF55D752C1B7FBF951C041B871" }, { "b" : "7F9BED5F8000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libc.so.6", "elfType" : 3, "buildId" : "C645843801A68480D80BF9D0FD4A20182CBD1D9C" }, { "b" : "7F9BEE83D000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/ld.so", "elfType" : 3, "buildId" : "FAF6C327CBADE1769B09F51830B64480E9FFDC82" } ] }}
|
mongod.30rc9.jem.norox.dbg(_ZN5mongo15printStackTraceERSo+0x2B) [0x117b00b]
|
mongod.30rc9.jem.norox.dbg(_ZN5mongo10logContextEPKc+0xE1) [0x1114881]
|
mongod.30rc9.jem.norox.dbg(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0x108) [0x10f7fb8]
|
mongod.30rc9.jem.norox.dbg(_ZN5mongo21WiredTigerRecordStore27cappedDeleteAsNeeded_inlockEPNS_16OperationContextERKNS_8RecordIdE+0x8EF) [0xf267df]
|
mongod.30rc9.jem.norox.dbg(+0xB2CBBF) [0xf2cbbf]
|
mongod.30rc9.jem.norox.dbg(_ZN5mongo13BackgroundJob7jobBodyEv+0x11E) [0x10f941e]
|
mongod.30rc9.jem.norox.dbg(+0xDC8358) [0x11c8358]
|
libpthread.so.0(+0x77C9) [0x7f9bed9cb7c9]
|
libc.so.6(clone+0x6D) [0x7f9bed7048ad]
|
??? [0]
|
----- END BACKTRACE -----
|
2015-02-25T00:28:41.025-0800 I - [WiredTigerRecordStoreThread for local.oplog.$main]
|
|
***aborting after invariant() failure
|
|
|
Crash after 454M inserts. AFAIK my binary is rc9 plus a patch from Keith but I don't see the new error text here. Will make clean then rebuild and retest.
2015-02-24T13:04:27.779-0800 E STORAGE [WiredTigerRecordStoreThread for local.oplog.$main] WiredTiger (22) [1424811867:776941][1398729:0x7f9bb97ff700], session.truncate: the start cursor position is after the stop cursor position: Invalid argument
|
2015-02-24T13:04:27.780-0800 I - [WiredTigerRecordStoreThread for local.oplog.$main] Invariant failure: ret resulted in status BadValue 22: Invalid argument at src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp 518
|
2015-02-24T13:04:27.791-0800 I CONTROL [WiredTigerRecordStoreThread for local.oplog.$main]
|
0x117b00b 0x1114881 0x10f7fb8 0xf267df 0xf2cbbf 0x10f941e 0x11c8358 0x7f9bc87697c9 0x7f9bc84a28ad 0
|
----- BEGIN BACKTRACE -----
|
{"backtrace":[{"b":"400000","o":"D7B00B"},{"b":"400000","o":"D14881"},{"b":"400000","o":"CF7FB8"},{"b":"400000","o":"B267DF"},{"b":"400000","o":"B2CBBF"},{"b":"400000","o":"CF941E"},{"b":"400000","o":"DC8358"},{"b":"7F9BC8762000","o":"77C9"},{"b":"7F9BC8396000","o":"10C8AD"},{"b":"0","o":"0"}],"processInfo":{ "mongodbVersion" : "3.0.0-rc9", "gitVersion" : "nogitversion", "uname" : { "sysname" : "Linux", "release" : "3.10.53-43_fbk14_02627_gb5c286c", "version" : "#1 SMP Mon Oct 20 15:06:57 PDT 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "BD10D7FB7B275429775254B4DACB57EA6DF6BD40" }, { "b" : "7FFF406FE000", "path" : "linux-vdso.so.1", "elfType" : 3, "buildId" : "50103D521098AC1A8DA637FE9F84CCFE48B62B25" }, { "b" : "7F9BC93D2000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/librt.so.1", "elfType" : 3, "buildId" : "36E417C497D7D99E323324B54709FFF95E21F209" }, { "b" : "7F9BC91CE000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libdl.so.2", "elfType" : 3, "buildId" : "7258590522FF8998FF3DC9CD4800354FF634B049" }, { "b" : "7F9BC8E9D000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libstdc++.so.6", "elfType" : 3, "buildId" : "AADD96ABFCFED89F9C3343339F009D973AC5822F" }, { "b" : "7F9BC8B96000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libm.so.6", "elfType" : 3, "buildId" : "5180B0D6214EFC6BD4024BCC46459868BB9FF312" }, { "b" : "7F9BC897F000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libgcc_s.so.1", "elfType" : 3, "buildId" : "CAE88F8F821F6A585CC52F17AA185621534F442C" }, { "b" : "7F9BC8762000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libpthread.so.0", "elfType" : 3, "buildId" : "F3CF4B6D065A6FCF55D752C1B7FBF951C041B871" }, { "b" : "7F9BC8396000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/libc.so.6", "elfType" : 3, "buildId" : "C645843801A68480D80BF9D0FD4A20182CBD1D9C" }, { "b" : "7F9BC95DB000", "path" : "/usr/local/fbcode/gcc-4.9-glibc-2.20/lib/ld.so", "elfType" : 3, "buildId" : "FAF6C327CBADE1769B09F51830B64480E9FFDC82" } ] }}
|
mongod.30rc9.jem.norox.dbg(_ZN5mongo15printStackTraceERSo+0x2B) [0x117b00b]
|
mongod.30rc9.jem.norox.dbg(_ZN5mongo10logContextEPKc+0xE1) [0x1114881]
|
mongod.30rc9.jem.norox.dbg(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0x108) [0x10f7fb8]
|
mongod.30rc9.jem.norox.dbg(_ZN5mongo21WiredTigerRecordStore27cappedDeleteAsNeeded_inlockEPNS_16OperationContextERKNS_8RecordIdE+0x8EF) [0xf267df]
|
mongod.30rc9.jem.norox.dbg(+0xB2CBBF) [0xf2cbbf]
|
mongod.30rc9.jem.norox.dbg(_ZN5mongo13BackgroundJob7jobBodyEv+0x11E) [0x10f941e]
|
mongod.30rc9.jem.norox.dbg(+0xDC8358) [0x11c8358]
|
libpthread.so.0(+0x77C9) [0x7f9bc87697c9]
|
libc.so.6(clone+0x6D) [0x7f9bc84a28ad]
|
??? [0]
|
----- END BACKTRACE -----
|
2015-02-24T13:04:27.791-0800 I - [WiredTigerRecordStoreThread for local.oplog.$main]
|
|
***aborting after invariant() failure
|
|
|
Saw this one again (after 300M inserts). The failing cursor start/stop keys are:
start cursor {#e8T#ecf#1c#00#00#16Q}
|
stop cursor {#e8T#ecf#1c#00#00#13#1c}
|
Note: this was in a system without the SERVER-17141 fix.
|
|
FYI, while running these tests the insert rate is about 2X higher (10k vs 5k per second) with 51200 MB oplog vs 2000 MB oplog.
|
|
I have 4 hosts running with oplogSizeMB 2000 and one without it set that gets
creating replication oplog of size: 51200MB..
The crash has only occurred once and that was on the host without it set.
|
|
Please send me the patch
|
|
I am unable to reproduce this one so far (2 runs of 45M inserts), I'm leaving the second one running.
Mark, if you can reproduce it, I can offer a patch that should help us to diagnose the problem?
|
|
Repeated the test with gdb attached, no crash. Repeating again on 3 hosts.
|
|
Thank you for the suggestion. My years of filing bugs for MySQL trained me to use Google search which works great for bugs.mysql.com
|
|
mcallaghan@fb.com, the JIRA search for "start cursor position is after the stop cursor position" does list SERVER-17340. Hope that helps.
|
|
Can we do something so that google search like this would find the 17340?
"site:jira.mongodb.org start cursor position is after the stop cursor position"
|
|
This looks the same as https://jira.mongodb.org/browse/SERVER-17340.
|
|
reproduction test
|
Generated at Thu Feb 08 03:44:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.