[SERVER-60371] Fatal assertion - msgid 34437 - DuplicateKey Created: 30/Sep/21  Updated: 18/Oct/21  Resolved: 18/Oct/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Rob Colella Assignee: Eric Sedor
Resolution: Duplicate Votes: 0
Labels: Bug
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

AWS EC2 Linux2 Graviton


Issue Links:
Duplicate
duplicates WT-7984 Fix a bug that could cause a checkpoi... Closed
duplicates WT-7995 Fix the global visibility so that it ... Closed
Related
related to WT-7984 Fix a bug that could cause a checkpoi... Closed
related to WT-7995 Fix the global visibility so that it ... Closed
Sprint: Storage - Ra 2021-10-04
Participants:
Story Points: 8

 Description   

We were previously running version 4.4.8 in a three-node replica set and experienced the DuplicateKey bug, which crashed mongod and required a resync of a secondary node.

Issue tracked here: https://jira.mongodb.org/browse/WT-7984

 

We upgraded all nodes to version 4.4.9, which was supposed to resolve this; however, we just hit the same bug again.

{"t":{"$date":"2021-09-30T03:32:53.024+00:00"},"s":"F",  "c":"-",        "id":23095,   "ctx":"OplogApplier-0","msg":"Fatal assertion","attr":{"msgid":34437,"error":"DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: \"bb62e2b7dd714f89b9b3abceba139f23\" } }: E11000 duplicate key error collection: Session.Session index: _id_ dup key: { _id: \"bb62e2b7dd714f89b9b3abceba139f23\" }","file":"src/mongo/db/repl/oplog_applier_impl.cpp","line":510}}
{"t":{"$date":"2021-09-30T03:32:53.024+00:00"},"s":"F",  "c":"-",        "id":23096,   "ctx":"OplogApplier-0","msg":"\n\n***aborting after fassert() failure\n\n"}



 Comments   
Comment by Eric Sedor [ 04/Oct/21 ]

rob.colella@kibocommerce.com, thanks for the update.

A complete initial sync is the remediation we recommend. It does not technically have to be performed off of a primary node, but it does have to be performed off of an un-impacted node. You may be able to imagine other options that will work for you, but we consider this the most straightforward, complete, and least disruptive option.
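
For illustration only, here is a minimal mongo shell sketch of watching a member that is re-syncing (the usual approach is to stop mongod on the impacted member, empty its dbPath, and restart it so it performs an automatic initial sync from a healthy sync source; the commands below only observe progress and assume you are connected to a member of the replica set):

// Illustrative only: a member performing initial sync reports state STARTUP2.
rs.status().members.forEach(function (m) { print(m.name, m.stateStr); });
// On 4.4, replSetGetStatus can also report initial sync progress when run on the syncing member.
printjson(db.adminCommand({ replSetGetStatus: 1, initialSync: 1 }).initialSyncStatus);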

Comment by Rob Colella [ 04/Oct/21 ]

Hi Eric, just wanted to give you an update and let you know we are currently working through running a validate against all nodes in all clusters.
I have seen a few errors in some of the smaller collections.
Can you please confirm the only fix would be a complete resync of the node from the primary?

Comment by Eric Sedor [ 01/Oct/21 ]

Hi rob.colella@kibocommerce.com;

There are two issues in 4.4.8 (and earlier) which could persist data inconsistencies in ways that could later cause duplicate key failures after upgrading to 4.4.9. They are WT-7984 (which you saw) and WT-7995. In particular, WT-7995 affects 4.4.2-4.4.8.

Upgrading to 4.4.9 will remove the risk of persisting further inconsistencies, but it does not by itself correct the impact of these bugs. Inconsistencies that were already created could be lying in wait to trigger duplicate key issues when the circumstances allow.

What we'd like to confirm is that all nodes have been remediated of all possible impact of the bugs from earlier versions. That's why it's important to perform validate on all collections of all nodes after upgrading to 4.4.9, and to confirm all collections on all nodes have passed validate on 4.4.9.
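
As a rough illustration (a sketch, not an official procedure), a validate pass over every collection on a node could be scripted in the mongo shell along these lines; any { valid: false } result would indicate an inconsistency that needs remediation:

// Illustrative sketch: run against each replica set member after upgrading to 4.4.9.
db.getMongo().getDBNames().forEach(function (dbName) {
  var d = db.getSiblingDB(dbName);
  d.getCollectionNames().forEach(function (collName) {
    var res = d.runCommand({ validate: collName });
    print(dbName + "." + collName + ": valid=" + res.valid);
  });
});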

It sounds like you're starting that process, so I'll look forward to hearing back from you. Please let me know if I can clarify further.

Sincerely,
Eric

Comment by Chenhao Qu [ 01/Oct/21 ]

rob.colella@kibocommerce.com Thanks for reporting this issue and providing us with the details. We have converted it into a MongoDB server ticket, and since it is the end of our Friday here in Sydney, the triage team will take it over and provide further assistance if needed.

Comment by Rob Colella [ 01/Oct/21 ]

The nodes that crashed this morning were running version 4.4.9.
Yes, they crashed with the same duplicate key error.
I will see about running a validate command and let you know if anything is amiss.

Comment by Chenhao Qu [ 01/Oct/21 ]

rob.colella@kibocommerce.com Thanks for the response. Can you run a validate and let us know? What was the version of the nodes that crashed this morning? Did they crash because of the duplicate key error? We believe a node will not crash until the affected data is accessed, so it is not strange that it ran for several hours and then crashed.

Comment by Rob Colella [ 01/Oct/21 ]

Also, what determines whether the data is affected or not?
According to the release notes, all previous versions of 4.4.x seem to have this bug.
So if we had stayed on 4.4.4, it sounds like we would still have potentially seen this issue, correct?

Comment by Rob Colella [ 01/Oct/21 ]

Will running a validate command show whether there is an issue or not?
We had two nodes crash again this morning in a different cluster and were forced to do a complete resync on them.
If the cluster was running fine on 4.4.8, and after we upgraded to 4.4.9 it was still operational for several hours before hitting the bug, why did it run fine for those few hours on 4.4.9 before crashing out?
All nodes were in sync and up to date, so it's not as if there would have been bad data in the oplog.

Comment by Chenhao Qu [ 01/Oct/21 ]

rob.colella@kibocommerce.com Have you encountered any more duplicate key issues after the first occurrence? Our theory is that after you upgraded to 4.4.8, your data was affected by the reported bug, so you could still hit the duplicate key error after upgrading to 4.4.9. If that's the case, it should have been running fine on 4.4.9 since then.

Comment by Chenhao Qu [ 30/Sep/21 ]

rob.colella@kibocommerce.com Thanks for the information. It is very helpful for narrowing down the problem.

Comment by Rob Colella [ 30/Sep/21 ]

This was during normal operation.
Nodes were first upgraded from 4.4.4 to 4.4.8 last night.
The upgrade went fine and the replica set was healthy.
After a few hours, one of the nodes in a different cluster than this one fell over with the duplicate key bug.
We did a resync of that node to get it back into a healthy state.
Checked the MongoDB site for release notes, saw a reference to the bug, and immediately upgraded all nodes in the replica set to 4.4.9.

All nodes had been running fine on v4.4.9 since last night with no issues.
A few hours ago we received an alert about the mongod service being down, and upon investigation saw the duplicate key error in the log snippet above.
Confirmed the packages on the node are 4.4.9 and resynced the node once again to get the cluster into a healthy state.

 

Installed packages are as follows:

mongodb-database-tools.aarch64        100.3.1-1                      @mongodb-org-4.4-tools
mongodb-org.aarch64                   4.4.9-1.amzn2                  @mongodb-org-4.4-alt
mongodb-org-database-tools-extra.aarch64
                                      4.4.9-1.amzn2                  @mongodb-org-4.4-alt
mongodb-org-mongos.aarch64            4.4.9-1.amzn2                  @mongodb-org-4.4-alt
mongodb-org-server.aarch64            4.4.9-1.amzn2                  @mongodb-org-4.4-alt
mongodb-org-shell.aarch64             4.4.9-1.amzn2                  @mongodb-org-4.4-alt
mongodb-org-tools.aarch64             4.4.9-1.amzn2                  @mongodb-org-4.4-alt

Comment by Chenhao Qu [ 30/Sep/21 ]

rob.colella@kibocommerce.com Thanks for your quick response. We are not aware of any other customer hitting the same issue on 4.4.9. Can you describe what you had done and the state of the database before you hit this issue? I know you upgraded all the nodes to 4.4.9. Did this happen at the startup of one node during the upgrade process or during normal operations?

Comment by Rob Colella [ 30/Sep/21 ]

Unfortunately I was only able to grab the crash snippet of the logs before a resync, as this is a critical system that needed to be up ASAP.
Has anyone else reported this same bug on v4.4.9?

 

{"t":{"$date":"2021-09-30T03:32:53.024+00:00"},"s":"F",  "c":"REPL",     "id":21238,   "ctx":"ReplWriterWorker-1681","msg":"Writer worker caught exception","attr":{"error":"DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: \"bb62e2b7dd714f89b9b3abceba139f23\" } }: E11000 duplicate key error collection: xxxx.xxxx index: _id_ dup key: { _id: \"bb62e2b7dd714f89b9b3abceba139f23\" }","oplogEntry":{"lsid":{"id":{"$uuid":"059308d7-554a-46c0-a43a-333db2f8bc65"},"uid":{"$binary":{"base64":"0c1WQWCzMsnjOjzs6EcGcrtsXelc//fJF3oFnYwJqB4=","subType":"0"}}},"xxxx":xxxx,"op":"i","ns":"xxxx.xxxx","ui":{"$uuid":"5672efe5-dfa4-4bec-8661-b0b9c2da83e0"},"o":{"_id":"bb62e2b7dd714f89b9b3abceba139f23","Data":{"xxxx@xxxx":{"xxxx":"xxxx"}},"LastModifiedDate":{"$date":"2021-09-30T03:32:53.023Z"},"xxxx":xxxx,"xxxx":xxxx,"UserId":"xxxx"},"ts":{"$timestamp":{"t":1632972773,"i":1}},"t":8,"v":2,"wall":{"$date":"2021-09-30T03:32:53.023Z"},"stmtId":0,"prevOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1}}}}
{"t":{"$date":"2021-09-30T03:32:53.024+00:00"},"s":"F",  "c":"REPL",     "id":21235,   "ctx":"OplogApplier-0","msg":"Failed to apply batch of operations","attr":{"numOperationsInBatch":1,"firstOperation":{"lsid":{"id":{"$uuid":"059308d7-554a-46c0-a43a-333db2f8bc65"},"uid":{"$binary":{"base64":"0c1WQWCzMsnjOjzs6EcGcrtsXelc//fJF3oFnYwJqB4=","subType":"0"}}},"xxxx":xxxx,"op":"i","ns":"xxxx.xxxx","ui":{"$uuid":"5672efe5-dfa4-4bec-8661-b0b9c2da83e0"},"o":{"_id":"bb62e2b7dd714f89b9b3abceba139f23","Data":{"xxxx@xxxx":{"xxxx":"xxxx"}},"LastModifiedDate":{"$date":"2021-09-30T03:32:53.023Z"},"xxxx":xxxx,"xxxx":xxxx,"xxxx":"6b4bfc83ad2a421caef2481a10f0e0d7"},"ts":{"$timestamp":{"t":1632972773,"i":1}},"t":8,"v":2,"wall":{"$date":"2021-09-30T03:32:53.023Z"},"stmtId":0,"prevOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1}},"lastOperation":{"lsid":{"id":{"$uuid":"059308d7-554a-46c0-a43a-333db2f8bc65"},"uid":{"$binary":{"base64":"0c1WQWCzMsnjOjzs6EcGcrtsXelc//fJF3oFnYwJqB4=","subType":"0"}}},"xxxx":xxxx,"op":"i","ns":"xxxx.xxxx","ui":{"$uuid":"5672efe5-dfa4-4bec-8661-b0b9c2da83e0"},"o":{"_id":"bb62e2b7dd714f89b9b3abceba139f23","Data":{"xxxx@xxxx":{"xxxx":"xxxx"}},"LastModifiedDate":{"$date":"2021-09-30T03:32:53.023Z"},"xxxx":xxxx,"xxxx":xxxx,"xxxx":"6b4bfc83ad2a421caef2481a10f0e0d7"},"ts":{"$timestamp":{"t":1632972773,"i":1}},"t":8,"v":2,"wall":{"$date":"2021-09-30T03:32:53.023Z"},"stmtId":0,"prevOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1}},"failedWriterThread":10,"error":"DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: \"bb62e2b7dd714f89b9b3abceba139f23\" } }: E11000 duplicate key error collection: xxxx.xxxx index: _id_ dup key: { _id: \"bb62e2b7dd714f89b9b3abceba139f23\" }"}}
{"t":{"$date":"2021-09-30T03:32:53.024+00:00"},"s":"F",  "c":"-",        "id":23095,   "ctx":"OplogApplier-0","msg":"Fatal assertion","attr":{"msgid":34437,"error":"DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: \"bb62e2b7dd714f89b9b3abceba139f23\" } }: E11000 duplicate key error collection: xxxx.xxxx index: _id_ dup key: { _id: \"bb62e2b7dd714f89b9b3abceba139f23\" }","file":"src/mongo/db/repl/oplog_applier_impl.cpp","line":510}}
{"t":{"$date":"2021-09-30T03:32:53.024+00:00"},"s":"F",  "c":"-",        "id":23096,   "ctx":"OplogApplier-0","msg":"\n\n***aborting after fassert() failure\n\n"}

Comment by Chenhao Qu [ 30/Sep/21 ]

rob.colella@kibocommerce.com Sorry to hear that you encountered this issue again. Can you share with us more information to diagnose the problem, e.g. any core dumps, more logs, and if possible, the data files?
