[SERVER-59403] Upgrade from 4.4.4 to 4.4.8, secondary crash indicates _id duplicate key Created: 17/Aug/21 Updated: 07/Sep/22 Resolved: 18/Oct/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | jing xu | Assignee: | Eric Sedor |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Operating System: | ALL |
| Participants: | |
| Description |
|
My cluster is 4.4.8 with three shards. A secondary crashed with:

,"s":"F", "c":"REPL", "id":21238, "ctx":"ReplWriterWorker-0","msg":"Writer worker caught exception","attr":{"error":"DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: \"0be47ed91b91474f8a96429fa7d3cfec\" }}: E11000 duplicate key error collection: customerService.TrackerDetail index: _id_ dup key: { _id: \"0be47ed91b91474f8a96429fa7d3cfec\" }" |
| Comments |
| Comment by Eric Sedor [ 18/Oct/21 ] |
|
601290552@qq.com I'm going to close this ticket as a duplicate of |
| Comment by Eric Sedor [ 04/Oct/21 ] |
|
Hi 601290552@qq.com, I wanted to clarify: it is possible for the issues in those tickets to manifest this way. So we recommend upgrading to 4.4.9 and performing the remediation steps suggested in those tickets. Does that make sense? |
| Comment by jing xu [ 25/Sep/21 ] |
|
Hi Eric: my 4.4.8 secondary can start up normally, but after running for a few hours it indicates |
| Comment by Eric Sedor [ 23/Sep/21 ] |
|
Hi 601290552@qq.com, I'm very sorry for the delay. I believe you should upgrade to 4.4.9 (just released), as it is very likely, given the occurrence of duplicate key errors, that you've been impacted by either. Eric |
| Comment by jing xu [ 20/Aug/21 ] |
|
Hi Eric: |
| Comment by jing xu [ 18/Aug/21 ] |
|
Hi Eric:

> Then, can you elaborate on the timeline of what has occurred here?

The shard2 secondary is the affected node (srvdb303.yto.cloud):

,"s":"I", "c":"CONTROL", "id":23138, "ctx":"conn362334","msg":"Shutting down","attr":{"exitCode":0}}
,"s":"I", "c":"CONTROL", "id":20698, "ctx":"main","msg":"***** SERVER RESTARTED *****"}
,"s":"F", "c":"REPL", "id":21238, "ctx":"ReplWriterWorker-1777","msg":"Writer worker caught exception", , keyValue: { _id: \"a148c2bd1c274559b674ef3eddb46d01\" } }: ", , , }},"o2": {"no":"xxxx","_id":"a148c2bd1c274559b674ef3eddb46d01"},
{"t": {"$date":"2021-08-18T10:07:19.822+08:00"},"s":"F", "c":"-", "id":23095, "ctx":"OplogApplier-0","msg":"Fatal assertion","attr":{"msgid":34437,"error":"DuplicateKey{ keyPattern: { _id: 1 }, }: E11000 duplicate key error collection: customerService.TrackerDetail index: _id_ dup key: { _id: \"a148c2bd1c274559b674ef3eddb46d01\" }","file":"src/mongo/db/repl/oplog_applier_impl.cpp","line":510}}
,"s":"F", "c":"-", "id":23096, "ctx":"OplogApplier-0","msg":"\n\n***aborting after fassert() failure\n\n"}

1. The logs for the affected nodes, including before, leading up to, and after the first sign of corruption:

,"s":"I", "c":"CONTROL", "id":23138, "ctx":"conn362334","msg":"Shutting down","attr":{"exitCode":0}}
,"s":"I", "c":"CONTROL", "id":20698, "ctx":"main","msg":"***** SERVER RESTARTED *****"}
,"s":"F", "c":"REPL", "id":21238, "ctx":"ReplWriterWorker-1777","msg":"Writer worker caught exception",

3. The output of validate() on each affected node:

"advice" : "A corrupt namespace has been detected. See http://dochub.mongodb.org/core/data-recovery for recovery steps.",

I have uploaded three files to the support uploader location.
curl -X POST https://upload.box.com/api/2.0/files/content \
  -H 'Authorization: Bearer 1!uKuATr9f9QobvE9AHsQ_XR44g2Igm3uZzp0s_N91uYhMJ1sGZDLfivIg6zjWUjwF352nq59XPNE0eZyt53AE5fXpHUmfFxV5DeJQ4HItqE_1rMu9QbN6xIiwKhYo_caHj_xME3IB5iESmpg0V8X0KX5A94deYuyKGtZuViQQcmMahbsFm3r5FqapqwW6MfjUWZdmNAjmRpkmDBybnJ7PrN1mOcar7OUMXE0p1toQQ-7BWvUUtK2VzH9flsBoqA8E2Z_kvMxE_aUZhDrB4TImv1IGpzC6pGrCtjK8y-Mu6Fh7N89ozs2cJFGx6FryzuWIyDizzh3bp1ufE6PEm_ieOwLNgYHYRChNabeHyBwUEAY6xu5WEuZiOdJhUDHMW2DWAf-TTGqdFbpkR489Q1FnAy-oOEY.' \
  -H 'Content-Type: multipart/form-data' \
  -F attributes='{"name": "diagnostic.data.tar", "parent": {"id": "143586341049"}}' \
  -F file=@diagnostic.data.tar > /dev/null |
| Comment by jing xu [ 18/Aug/21 ] |
|
Hi Eric: |
| Comment by jing xu [ 18/Aug/21 ] |
|
Hi Eric: |
| Comment by Eric Sedor [ 17/Aug/21 ] |
|
Hi 601290552@qq.com, This error message leads us to suspect logical corruption. Please make a complete copy of each affected node's $dbpath directory to safeguard so that you can work off of the current $dbpath. Our ability to determine the source of this corruption depends greatly on your ability to provide:
Would you please archive (tar or zip) the validate output, mongod.log files, and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Then, can you elaborate on the timeline of what has occurred here? The ideal resolution will likely be to perform a clean resync from an unaffected node. Thank you, |
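The archiving step requested above can be sketched roughly as follows. This is a demo under assumptions: the placeholder files created here stand in for the real mongod.log, validate() output, and $dbpath/diagnostic.data directory on an affected node, and the filenames are illustrative, not MongoDB-mandated.

```shell
# Demo sketch: stand-in files mimic what you'd collect from an affected node.
# On a real node, skip this block and point tar at the actual mongod.log,
# the saved validate() output, and $dbpath/diagnostic.data instead.
mkdir -p demo/dbpath/diagnostic.data
echo '{ "valid" : false }'    > demo/validate-output.json
echo 'mongod log lines'       > demo/mongod.log
echo 'FTDC metrics (binary)'  > demo/dbpath/diagnostic.data/metrics.interim

# The packaging step itself: one archive per node, ready for the uploader.
tar -czf diagnostic.data.tar.gz -C demo \
    validate-output.json mongod.log dbpath/diagnostic.data

tar -tzf diagnostic.data.tar.gz   # lists the bundled items
```

On a real node the resulting archive would then be uploaded with the curl command shown in the earlier comment.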
| Comment by jing xu [ 17/Aug/21 ] |
|
I checked from the primary: |
| Comment by jing xu [ 17/Aug/21 ] |
|
Another secondary, on shard3, crashed again:

,"s":"F", "c":"REPL", "id":21235, "ctx":"initandlisten","msg":"Failed to apply batch of operations","attr":{"numOperationsInBatch":1013,"firstOperation":{"lsid":{"id": {"$uuid":"e0556e1e-9897-41c1-a4dc-6afa6d9e50db"},"uid":{"$binary": {"base64":"kafFGlFnFmnb4qY6Zo1wZcN0z6AuZ9x3brpDqqhzH/U=","subType":"0"}}},"txnNumber":2339,"op":"u","ns":"returnMonitor.exp_want_new_monitor","ui": {"$uuid":"0814628c-47c0-4210-a9d4-1c7c0a4161ee"},"o":{"$v":1,"$set":{"backFlag":1,"currentOp":751,"currentTime": {"$date":"2021-08-17T17:43:19.000Z"},"dealStatus":19,"dealTime": {"$date":"2021-08-17T17:43:19.207Z"},"opList.7":162922219900751,"orgBranch":"579038"}},"o2":{"_id":{"$oid":"6118aaf405f34e22a4f681a4"}},"ts":{"$timestamp":{"t":1629222199,"i":6}},"t":6,"v":2,"wall": {"$date":"2021-08-17T17:43:19.212Z"},"stmtId":0,"prevOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1}},"lastOperation":{"lsid":{"id": {"$uuid":"783598dd-63df-4489-9505-1ab1923be921"},"uid":{"$binary": {"base64":"kafFGlFnFmnb4qY6Zo1wZcN0z6AuZ9x3brpDqqhzH/U=","subType":"0"}}},"txnNumber":3065,"op":"u","ns":"returnMonitor.exp_want_new_monitor","ui": {"$uuid":"0814628c-47c0-4210-a9d4-1c7c0a4161ee"},"o":{"$v":1,"$set":{"currentOp":171,"currentOrg":"270902","currentTime": {"$date":"2021-08-17T17:44:43.000Z"},"nextOrg":"712001","opList.5":162922228300171,"warnLevel":20}},"o2":{"_id":{"$oid":"6119bda405f34e22a4f78351"}},"ts":{"$timestamp":{"t":1629222294,"i":2}},"t":6,"v":2,"wall": {"$date":"2021-08-17T17:44:54.404Z"},"stmtId":0,"prevOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1}},"failedWriterThread":12,"error":"DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: ObjectId('611732d899faf22fbb509820') }}: E11000 duplicate key error collection: returnMonitor.exp_want_new_monitor index: _id_ dup key: { _id: ObjectId('611732d899faf22fbb509820') }"}}

When I start the node standalone and run a find, the database does not contain this record. |
| Comment by jing xu [ 17/Aug/21 ] |
|
When I re-initiated the secondary, the log indicated:

,"s":"I", "c":"NETWORK", "id":22990, "ctx":"conn245","msg":"DBException handling request, closing client connection","attr":{"error":"NotWritablePrimary: Not-primary error while processing 'find' operation on 'returnMonitor' database via fire-and-forget command execution."}} |