[SERVER-38047] Mongo 3.4.17 crash (WiredTiger error) Created: 09/Nov/18  Updated: 30/Nov/18  Resolved: 30/Nov/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.17
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tzach Yarimi Assignee: Danny Hatcher (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 16.04.5 LTS
AWS i3.4xlarge


Participants:

 Description   

Log:

2018-11-08T22:22:49.397+0000 E STORAGE [thread2] WiredTiger error (22) [1541715769:397467][110545:0x7f3a43f80700], file:impl_condeco_group_l_30329/collection-8059533--3828014881901953997.wt, WT_SESSION.checkpoint: (null): merge range 18665472-18694144 overlaps with existing range 18673664-18677760: Invalid argument
2018-11-08T22:22:49.397+0000 E STORAGE [thread2] WiredTiger error (-31804) [1541715769:397565][110545:0x7f3a43f80700], file:impl_condeco_group_l_30329/collection-8059533--3828014881901953997.wt, WT_SESSION.checkpoint: the process must exit and restart: WT_PANIC: WiredTiger library panic
2018-11-08T22:22:49.398+0000 I - [thread2] Fatal Assertion 28558 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 365
2018-11-08T22:22:49.398+0000 I - [thread2]
***aborting after fassert() failure
2018-11-08T22:22:49.422+0000 I - [WTJournalFlusher] Fatal Assertion 28559 at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp 64
2018-11-08T22:22:49.422+0000 I - [WTJournalFlusher]
***aborting after fassert() failure
2018-11-08T22:22:49.485+0000 F - [thread2] Got signal: 6 (Aborted).
 0x55b6fb8c02e1 0x55b6fb8bf4f9 0x55b6fb8bf9dd 0x7f3a49366390 0x7f3a48fc0428 0x7f3a48fc202a 0x55b6fab52d9b 0x55b6fb5c41b6 0x55b6fab5d4f0 0x55b6fab5d70c 0x55b6fab5d96e 0x55b6fc1cd29f 0x55b6fc1ce35e 0x55b6fc1cbd57 0x55b6fc26e679 0x55b6fc26e984 0x55b6fc2bcf0d 0x55b6fc2bd319 0x55b6fc2aa051 0x55b6fc221c9d 0x7f3a4935c6ba 0x7f3a4909241d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"55B6FA334000","o":"158C2E1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"55B6FA334000","o":"158B4F9"},{"b":"55B6FA334000","o":"158B9DD"},{"b":"7F3A49355000","o":"11390"},{"b":"7F3A48F8B000","o":"35428","s":"gsignal"},{"b":"7F3A48F8B000","o":"3702A","s":"abort"},{"b":"55B6FA334000","o":"81ED9B","s":"_ZN5mongo32fassertFailedNoTraceWithLocationEiPKcj"},{"b":"55B6FA334000","o":"12901B6"},{"b":"55B6FA334000","o":"8294F0","s":"__wt_eventv"},{"b":"55B6FA334000","o":"82970C","s":"__wt_err"},{"b":"55B6FA334000","o":"82996E","s":"__wt_panic"},{"b":"55B6FA334000","o":"1E9929F"},{"b":"55B6FA334000","o":"1E9A35E","s":"__wt_block_extlist_merge"},{"b":"55B6FA334000","o":"1E97D57","s":"__wt_block_checkpoint_resolve"},{"b":"55B6FA334000","o":"1F3A679"},{"b":"55B6FA334000","o":"1F3A984","s":"__wt_meta_track_off"},{"b":"55B6FA334000","o":"1F88F0D"},{"b":"55B6FA334000","o":"1F89319","s":"__wt_txn_checkpoint"},{"b":"55B6FA334000","o":"1F76051"},{"b":"55B6FA334000","o":"1EEDC9D"},{"b":"7F3A49355000","o":"76BA"},{"b":"7F3A48F8B000","o":"10741D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.4.17", "gitVersion" : "7c14a47868643bb691a507a92fe25541f998eca4", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.4.0-1065-aws", "version" : "#75-Ubuntu SMP Fri Aug 10 11:14:32 UTC 2018", "machine" : "x86_64" }, "somap" : [ { "b" : "55B6FA334000", "elfType" : 3, "buildId" : "4666D43DAC0EEFE6CF01C4D660FEEDECE3E6E375" }, { "b" : "7FFE2557B000", "elfType" : 3, "buildId" : "98B173804DDFA7204D8EC8829DB1D865B54DCD24" }, { "b" : "7F3A4A2E1000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "513282AC7EB386E2C0133FD9E1B6B8A0F38B047D" }, { "b" : "7F3A49E9D000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "250E875F74377DFC74DE48BF80CCB237BB4EFF1D" }, { "b" : "7F3A49C95000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "89C34D7A182387D76D5CDA1F7718F5D58824DFB3" }, { "b" : "7F3A49A91000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "8CC8D0D119B142D839800BFF71FB71E73AEA7BD4" }, { "b" : "7F3A49788000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "DFB85DE42DAFFD09640C8FE377D572DE3E168920" }, { "b" : "7F3A49572000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "68220AE2C65D65C1B6AAA12FA6765A6EC2F5F434" }, { "b" : "7F3A49355000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "CE17E023542265FC11D9BC8F534BB4F070493D30" }, { "b" : "7F3A48F8B000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "B5381A457906D279073822A5CEB24C4BFEF94DDB" }, { "b" : "7F3A4A54A000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "5D7B6259552275A3C17BD4C3FD05F5A6BF40CAA5" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x55b6fb8c02e1]
 mongod(+0x158B4F9) [0x55b6fb8bf4f9]
 mongod(+0x158B9DD) [0x55b6fb8bf9dd]
 libpthread.so.0(+0x11390) [0x7f3a49366390]
 libc.so.6(gsignal+0x38) [0x7f3a48fc0428]
 libc.so.6(abort+0x16A) [0x7f3a48fc202a]
 mongod(_ZN5mongo32fassertFailedNoTraceWithLocationEiPKcj+0x0) [0x55b6fab52d9b]
 mongod(+0x12901B6) [0x55b6fb5c41b6]
 mongod(__wt_eventv+0x3D7) [0x55b6fab5d4f0]
 mongod(__wt_err+0x9D) [0x55b6fab5d70c]
 mongod(__wt_panic+0x2E) [0x55b6fab5d96e]
 mongod(+0x1E9929F) [0x55b6fc1cd29f]
 mongod(__wt_block_extlist_merge+0x7E) [0x55b6fc1ce35e]
 mongod(__wt_block_checkpoint_resolve+0x57) [0x55b6fc1cbd57]
 mongod(+0x1F3A679) [0x55b6fc26e679]
 mongod(__wt_meta_track_off+0x154) [0x55b6fc26e984]
 mongod(+0x1F88F0D) [0x55b6fc2bcf0d]
 mongod(__wt_txn_checkpoint+0xD9) [0x55b6fc2bd319]
 mongod(+0x1F76051) [0x55b6fc2aa051]
 mongod(+0x1EEDC9D) [0x55b6fc221c9d]
 libpthread.so.0(+0x76BA) [0x7f3a4935c6ba]
 libc.so.6(clone+0x6D) [0x7f3a4909241d]
----- END BACKTRACE -----



 Comments   
Comment by Danny Hatcher (Inactive) [ 30/Nov/18 ]

Hello Tzach,

As I have not heard back from you and this appears to have been resolved in WT-2897, I will now close this ticket.

Thank you,

Danny

Comment by Danny Hatcher (Inactive) [ 15/Nov/18 ]

Hello Tzach,

Unfortunately, due to the nature of the issue it may be difficult to prove beyond all doubt that you will never encounter it again. The safest approach would be to perform a rolling initial sync across your cluster to ensure that none of the data files can still carry the problem. That being said, if your other nodes have been running on 3.4.x for a significant period of time without failing, then you will most likely be fine with the snapshot.
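While rolling through the nodes, one way to watch each member's replication state from the command line is via rs.status() in the mongo shell (a sketch; it assumes you can reach a replica-set member on the default port, and the field names match the 3.4 shell):

```shell
# Print each member's name, state, and last-applied optime timestamp.
# Run against any reachable member of the replica set.
mongo --quiet --eval '
  rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  optimeDate: " + m.optimeDate);
  });
'
```

A node that has finished its initial sync should report SECONDARY (or PRIMARY) with an optimeDate close to the other members'.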

Would it be possible to bring your replica set back up to full strength using the snapshot and then schedule some maintenance time over the next few weeks to initial sync the nodes one at a time?

Thank you,

Danny

Comment by Tzach Yarimi [ 15/Nov/18 ]

Thanks Danny,

We did create the failing node from snapshot, however we don't have a way of knowing which of our existing nodes is "healthy", since they were all created a long time ago on Mongo 3.2.8 and were upgraded to 3.4.X.

Is there a check we can run to validate that a node is healthy?

Comment by Danny Hatcher (Inactive) [ 14/Nov/18 ]

Hello Tzach,

If you are only experiencing the problem on one node and the rest of the nodes in the replica set are fine, you should be able to use your normal instance creation procedure as long as the snapshot is taken from a healthy node.

Thank you,

Danny

Comment by Tzach Yarimi [ 11/Nov/18 ]

Hi Danny,

Yes, this instance was created on 3.2.8, then upgraded to 3.2.13, then 3.4.X.

In our use case, doing an initial sync requires a long downtime, as the DB is 2 TB and we are write-heavy.

Usually when we need a new instance, we create one from an AWS EBS snapshot. I guess that this won't fix the issue as the data files are not cleared, correct?

Is there a different solution that will not require an initial sync?

Thanks,

Tzach

Comment by Danny Hatcher (Inactive) [ 09/Nov/18 ]

Hello Tzach,

This looks similar to the issue fixed in WT-2897, but that was resolved in 3.4.0. Was this database ever on a version before 3.2.10?

If you have healthy replica set nodes, I recommend clearing the data files and performing an initial sync.
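For a single secondary, that procedure typically looks like the following (a sketch only; the service name and dbPath are the Ubuntu package defaults and are assumptions here, so confirm your actual dbPath in /etc/mongod.conf first, and never run this on the primary):

```shell
# On the affected SECONDARY only.
sudo systemctl stop mongod

# Move the old data directory aside rather than deleting it outright,
# so a copy survives until the resync is confirmed good.
sudo mv /var/lib/mongodb /var/lib/mongodb.bak
sudo install -d -o mongodb -g mongodb /var/lib/mongodb

# On restart, the empty dbPath causes the node to rejoin the replica
# set and perform a full initial sync from another member.
sudo systemctl start mongod
```

Once the node reports SECONDARY again and replication lag has caught up, the backup directory can be removed.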

Thank you,

Danny

Comment by Tzach Yarimi [ 09/Nov/18 ]

Restarting the mongo service didn't help - it kept crashing repeatedly.

Generated at Thu Feb 08 04:47:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.