[SERVER-22617] SnapshotThread hits invariant due to reading oplog entries out of order Created: 08/Feb/16  Updated: 15/Nov/21  Resolved: 24/Feb/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.2.4, 3.3.3

Type: Bug Priority: Critical - P2
Reporter: Ricardo Hilsenrath Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 0
Labels: code-and-test
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on WT-2397 Cursor traversal from end of the tree... Closed
is depended on by SERVER-22629 Default --enableMajorityReadConcern t... Closed
Duplicate
is duplicated by SERVER-22610 Invariant failure in replication_coor... Closed
is duplicated by SERVER-22612 Invariant failure newCommittedSnapsho... Closed
is duplicated by SERVER-22778 Invariant failure _uncommittedSnapsho... Closed
is duplicated by SERVER-22779 Invariant failure newCommittedSnapsho... Closed
Related
related to SERVER-20707 Add a new option --enableMajorityRead... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Sprint: Integration 10 (02/22/16)
Participants:

 Description   

This happens occasionally in our production environment.

2016-02-08T16:21:02.045-0500 I -        [SnapshotThread] Invariant failure newCommittedSnapshot.opTime >= _currentCommittedSnapshot->opTime src/mongo/db/repl/replication_coordinator_impl.cpp 3228
2016-02-08T16:21:02.045-0500 I -        [SnapshotThread] 
 
***aborting after invariant() failure
 
 
2016-02-08T16:21:02.050-0500 F -        [SnapshotThread] Got signal: 6 (Aborted).
 
 0x12eaa52 0x12e9bb9 0x12ea3c2 0x7fdb551a3100 0x7fdb54e085f7 0x7fdb54e09ce8 0x127541b 0xecb495 0xecfa16 0xe90a4b 0x7fdb5597d220 0x7fdb5519bdc5 0x7fdb54ec91cd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"EEAA52"},{"b":"400000","o":"EE9BB9"},{"b":"400000","o":"EEA3C2"},{"b":"7FDB55194000","o":"F100"},{"b":"7FDB54DD3000","o":"355F7"},{"b":"7FDB54DD3000","o":"36CE8"},{"b":"400000","o":"E7541B"},{"b":"400000","o":"ACB495"},{"b":"400000","o":"ACFA16"},{"b":"400000","o":"A90A4B"},{"b":"7FDB558C8000","o":"B5220"},{"b":"7FDB55194000","o":"7DC5"},{"b":"7FDB54DD3000","o":"F61CD"}],"processInfo":{ "mongodbVersion" : "3.2.1", "gitVersion" : "a14d55980c2cdc565d4704a7e3ad37e4e535c1b2", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.10.0-327.el7.x86_64", "version" : "#1 SMP Thu Oct 29 17:29:29 EDT 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "82DE04175D9E4EDD5FB2C7106D67B02C808D491C" }, { "b" : "7FFD3EACB000", "elfType" : 3, "buildId" : "17A121B1F7BBB010F54735FFDE3347B27B33884D" }, { "b" : "7FDB563C4000", "path" : "/lib64/libssl.so.10", "elfType" : 3, "buildId" : "8DB4545998776514159031B754BB67F7F396F83A" }, { "b" : "7FDB55FDD000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "038F79F7C3F6E60C29184B8E70D0B1E62525D64D" }, { "b" : "7FDB55DD5000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "A1D9E0B471D827008C36FA72BAB34BE08FE54B33" }, { "b" : "7FDB55BD1000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "5958E57738366BCC217783F3CD4C836437F7C45F" }, { "b" : "7FDB558C8000", "path" : "/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "74AAD82C77F6D211BD9E4DA27BF4F62A0D938C4B" }, { "b" : "7FDB555C6000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "02C4E38A8145201D9C574499CF75132551835CEB" }, { "b" : "7FDB553B0000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "97D5E2F5739B715C3A0EC9F95F7336E232346CA8" }, { "b" : "7FDB55194000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "FA15B7D2CA650B34E6A0C9AD999BA6625AEC4068" }, { "b" : "7FDB54DD3000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "B0A1DFA62C6AF7AA62487E3C260DC4B9C24D8BF8" }, { "b" : "7FDB56631000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "CEB78DAE1EE5B4C544047DC26F88A8E4586A34D2" }, { "b" : "7FDB54B87000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "8AB5682155DE13D0916B984306B4E044E216B2EB" }, { "b" : "7FDB548A2000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "BE8968836D439581B2816CE3827642FCF4B8BF4A" }, { "b" : "7FDB5469E000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "B25574847B066A26CD593C8101DF6779898FF2C2" }, { "b" : "7FDB5446C000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "F5784ED7E64118BAFE898DBF178DC9E37CBDA4AA" }, { "b" : "7FDB54256000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "FC37913FB197B822BCDBF3697D061E248698CEC1" }, { "b" : "7FDB54047000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "4BBED12CFDC9647C8771A4B897E0D5A4F217ED7C" }, { "b" : "7FDB53E43000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "8CA73C16CFEB9A8B5660015B9223B09F87041CAD" }, { "b" : "7FDB53C29000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "D08CF135D143704DA93E5F025AE6AE6943838F03" }, { "b" : "7FDB53A04000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "5062031216B995004A297D555D834C0109F7598C" }, { "b" : "7FDB537A3000", "path" : "/lib64/libpcre.so.1", "elfType" : 3, "buildId" : "8E3819A80BF876382A6F0CB2A08F82F1742EE8DB" }, { "b" : "7FDB5357E000", "path" : "/lib64/liblzma.so.5", 
"elfType" : 3, "buildId" : "61D7D46225E85F144221E1424B87FBF3CB2B9D3F" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x12eaa52]
 mongod(+0xEE9BB9) [0x12e9bb9]
 mongod(+0xEEA3C2) [0x12ea3c2]
 libpthread.so.0(+0xF100) [0x7fdb551a3100]
 libc.so.6(gsignal+0x37) [0x7fdb54e085f7]
 libc.so.6(abort+0x148) [0x7fdb54e09ce8]
 mongod(_ZN5mongo15invariantFailedEPKcS1_j+0xCB) [0x127541b]
 mongod(+0xACB495) [0xecb495]
 mongod(_ZN5mongo4repl26ReplicationCoordinatorImpl16onSnapshotCreateENS0_6OpTimeENS_12SnapshotNameE+0xC6) [0xecfa16]
 mongod(_ZN5mongo4repl14SnapshotThread3runEv+0x4FB) [0xe90a4b]
 libstdc++.so.6(+0xB5220) [0x7fdb5597d220]
 libpthread.so.0(+0x7DC5) [0x7fdb5519bdc5]
 libc.so.6(clone+0x6D) [0x7fdb54ec91cd]
-----  END BACKTRACE  -----
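
To illustrate what the assert above checks, here is a minimal hypothetical C++ sketch (not the actual MongoDB source; OpTime, invariant, and ReplCoordSketch are simplified stand-ins). The snapshot thread names each new storage-engine snapshot with the opTime of the newest oplog entry it read, and onSnapshotCreate requires committed snapshot opTimes to never move backwards; if the storage-engine cursor hands back oplog entries out of order (the cursor-traversal bug tracked in WT-2397, which this ticket depends on), a later snapshot can be named with an older opTime and the invariant aborts the process.

// Hypothetical, simplified illustration -- not the actual MongoDB source.
#include <cstdio>
#include <cstdlib>

struct OpTime {
    long long ts;  // simplified: a single logical timestamp
    bool operator>=(const OpTime& other) const { return ts >= other.ts; }
};

#define invariant(expr)                                            \
    do {                                                           \
        if (!(expr)) {                                             \
            std::fprintf(stderr, "Invariant failure %s\n", #expr); \
            std::abort();                                          \
        }                                                          \
    } while (0)

struct ReplCoordSketch {
    OpTime currentCommittedSnapshot{0};

    // Called by the snapshot thread after it creates a storage-engine
    // snapshot named with the opTime of the newest oplog entry it saw.
    void onSnapshotCreate(OpTime newCommittedSnapshot) {
        // Committed snapshot opTimes must advance monotonically.
        invariant(newCommittedSnapshot >= currentCommittedSnapshot);
        currentCommittedSnapshot = newCommittedSnapshot;
    }
};

int main() {
    ReplCoordSketch coord;
    coord.onSnapshotCreate({100});
    // If the cursor walking the end of the oplog returns an older entry
    // (the WT-2397 symptom), the next snapshot is named with a smaller
    // opTime and the invariant aborts, matching the log above.
    coord.onSnapshotCreate({95});  // 95 >= 100 is false -> abort
}

Under that reading, the fix belongs in the storage layer (WT-2397) rather than in the invariant itself: once cursor traversal from the end of the oplog returns entries in order, the opTime passed to onSnapshotCreate only moves forward.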

Steps to reproduce from the original ticket

I'm not sure how to reproduce it.
I have a replica set with 4 nodes running MongoDB version 3.2.1 on AWS:
i2.8xlarge - Linux - database on ephemeral storage
i2.4xlarge - Windows - database on ephemeral storage
r2.2xlarge - Windows - database on EBS
t2.small - Windows - arbiter
I will be happy to provide any other details you might need.



 Comments   
Comment by Hoyt Ren [ 25/Feb/16 ]

Hello Youdong,

This hasn't happened again in my cluster this month, so I guess it's not that frequent. I simply restart the server when it happens.

I will wait for 3.2.4 and update as soon as possible once it is available.

Comment by Zhang Youdong [ 25/Feb/16 ]

I got your point. But this case happened again today in our environment (just inserts, updates, queries, and some commands). When will 3.2.4 be released? Or is there something I can do to avoid this problem in the meantime?

Comment by Adam Midvidy [ 25/Feb/16 ]

First off, a disclaimer: I strongly recommend against cherry-picking random commits and then running the resulting binaries in production. MongoDB, Inc. only provides support for the builds that we distribute - we put extensive effort into validating each release against our considerable regression suite - none of which applies to picking random commits and applying them to the tree at some point in time.

That said, it is possible, with some effort, to apply only the WT commit you mentioned. The reason I didn't suggest it initially is that it won't apply cleanly, because the paths in that commit differ from those used in the WiredTiger sources vendored in the MongoDB repository. If you'd like to verify that the aforementioned fix works in your test or staging environment, that is a reasonable idea.

Comment by Zhang Youdong [ 25/Feb/16 ]

Thanks for your reply, Adam Midvidy.
May I just merge https://github.com/wiredtiger/wiredtiger/commit/d3893dbcbff0b0b03f6d0b83eb6651f1810cc2c6 into MongoDB 3.2.3 to solve this issue?

Comment by Adam Midvidy [ 25/Feb/16 ]

zyd_com, the commit is https://github.com/mongodb/mongo/commit/f77630a9e971cae1f921292ea31d9d40a4b096b8. Since the issue was in WiredTiger, the true fix was tracked in WT-2397. The commit I linked includes the associated change for WT-2397 as well as other fixes.

Comment by Zhang Youdong [ 25/Feb/16 ]

I encountered this problem in our production environment yesterday (using 3.2.3). I saw that this problem is marked as RESOLVED, but I cannot find the commit. Could you provide the commit so I can merge it and test?

Comment by Hoyt Ren [ 16/Feb/16 ]

I will continue in this ticket.

Here is my situation:

First, I made a mistake: after re-checking the other parts of the logs, I found that one of my crashed servers is a primary (auto-switched) and the other is a secondary (different replica set and host).

When I set up the replica set, I didn't know about --enableMajorityReadConcern, so it should be at its default value.

The replica set is a brand new 3.2.0 deployment with data imported by mongoimport. The servers have been running since 2016-01-04. All nodes use the WiredTiger engine.

Comment by Ramon Fernandez Marina [ 15/Feb/16 ]

ricardo_fanatee, this is to let you know that I've moved this ticket back to the SERVER project. All the private information you uploaded for this ticket is in a private area and will remain private, but it's very useful for the original description to be visible to search engines so other users may find this ticket if they're affected by the same bug.

We're investigating this invariant failure at high priority, stay tuned for updates.

Regards,
Ramón.

Comment by Ramon Fernandez Marina [ 12/Feb/16 ]

Thanks ricardo_fanatee, we're looking at the data you uploaded and working on a reproducer locally. We'll post updates to this ticket as we have them.

Comment by Ricardo Hilsenrath [ 09/Feb/16 ]

diagnostic.data uploaded

Comment by Ramon Fernandez Marina [ 09/Feb/16 ]

Inside your dbpath you'll find a directory named diagnostic.data; that's the data Scott was talking about. The easiest way forward is for you to zip/tar this directory up and upload it.

Comment by Ricardo Hilsenrath [ 09/Feb/16 ]

Scott

I have two small logs (the primary server that crashed, and the secondary that has priority 0), but the other secondary server that took the primary role when our "main server" went down has a 150MB log file (I'm uploading them to S3).

Where can I find the file diagnostic.data?

We made a lot of changes to the replica set in December and early January, but we've been running on a stable configuration (didn't change a single setting on any server) for the past 2-3 weeks, and suddenly the server crashed.

All members are running the WiredTiger engine, and all logs are from yesterday.

(I'm uploading the logs to S3)

Comment by Ramon Fernandez Marina [ 09/Feb/16 ]

ricardo_fanatee, this ticket is now in a private project as you requested. Whenever you have the information requested above by Scott you can upload it here privately. If any of the files are too big you can use this upload portal instead (also private).

Thanks,
Ramón.

Comment by Ricardo Hilsenrath [ 09/Feb/16 ]

Hi Scott,

I will be glad to share more information about the servers and logs, and as you suggested, I think it'll be better if you move this issue to a private project.

I'll prepare all the logs you asked for and I'll upload them as soon as possible.

Thanks,
Ricardo

Comment by Scott Hernandez (Inactive) [ 09/Feb/16 ]

Hi Ricardo,

We would like to diagnose this issue if possible.

Can you upload the rest of the logs from before the crash, for each occasion? It would also be useful to include logs from the other members at the same time. Also, please include the "diagnostic.data" directory with the logs. (Please let us know if you would like to provide the data privately, and if so we can move this issue from the public SERVER project to a private one.)

Which node is this from? Is it always on the same member, or does it change? Are all the members running with the same storage engine, and is the data new or from a previous version?
