[SERVER-59403] upgrade from 4.4.4 to 4.4.8, secondary crash indicates _id duplicate key Created: 17/Aug/21  Updated: 07/Sep/22  Resolved: 18/Oct/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: jing xu Assignee: Eric Sedor
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
Related
related to WT-7995 Fix the global visibility so that it ... Closed
Operating System: ALL
Participants:

 Description   

My cluster has three shards and was upgraded from 4.4.4 to 4.4.8. The cluster was fine after the upgrade, but about one hour later a secondary crashed with an _id duplicate key error:
{"t":

{"$date":"2021-08-17T18:54:43.182+08:00"}

,"s":"F", "c":"REPL", "id":21238, "ctx":"ReplWriterWorker-0","msg":"Writer worker caught exception","attr":{"error":"DuplicateKey{ keyPattern:

{ _id: 1 }

, keyValue:

{ _id: \"0be47ed91b91474f8a96429fa7d3cfec\" }

}: E11000 duplicate key error collection: customerService.TrackerDetail index: id dup key:

{ _id: \"0be47ed91b91474f8a96429fa7d3cfec\" }

"



 Comments   
Comment by Eric Sedor [ 18/Oct/21 ]

601290552@qq.com I'm going to close this ticket as a duplicate of WT-7995. Please comment there if you end up needing clarification on the remediation steps we provided.

Comment by Eric Sedor [ 04/Oct/21 ]

Hi 601290552@qq.com, I wanted to clarify: it is possible for WT-7995 or WT-7984 to persist inconsistencies that cause a crash later, not just on startup.

So we recommend upgrading to 4.4.9 and performing the remediation steps suggested in those tickets. Does that make sense?
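
A minimal way to sanity-check a member after moving it to 4.4.9, sketched in the mongo shell; it only reuses the customerService.TrackerDetail namespace already named in this ticket and is not a substitute for the remediation steps in WT-7995/WT-7984:

// Hedged sketch: after restarting a member on the new binary, confirm the version
// and re-run a full validate on the collection that reported the duplicate key.
var res = db.getSiblingDB("customerService").runCommand({ validate: "TrackerDetail", full: true });
printjson({
    mongodVersion: db.version(),   // expect "4.4.9" after the upgrade
    valid: res.valid,              // false means the inconsistency is still present
    errors: res.errors
});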

Comment by jing xu [ 25/Sep/21 ]

Hi Eric:
I read those two issues about the duplicate key. They say the trigger is a db crash or unclean shutdown,
but I shut the db down normally, not with an abort.

Regarding secondaries, the tickets say:
"The bug can cause a Duplicate Key error on startup and prevent the node from starting."

My 4.4.8 secondary starts up normally; it had been running for a few hours when it reported the
Duplicate Key error, so the error did not happen at startup.

Comment by Eric Sedor [ 23/Sep/21 ]

Hi 601290552@qq.com,

I'm very sorry for the delay. I believe you should upgrade to 4.4.9 (just released), as given the occurrence of duplicate key errors it is very likely that you've been impacted by either WT-7995 or WT-7984, which are fixed in that version.

Eric

Comment by jing xu [ 20/Aug/21 ]

Hi Eric:
Have you made any progress on it? Thanks.

Comment by jing xu [ 18/Aug/21 ]

Hi Eric:
My cluster was upgraded from 4.4.4 to 4.4.8. The cluster has three shards,
and every shard has three members (PSS). I upgraded in this order: config servers first, then shards, then mongos. I will upload the shard2 data.
Steps used to upgrade shard2:
First I upgraded all the secondary members: I shut each instance down normally, then started it with the 4.4.8 mongod binary.
To upgrade the primary I executed rs.stepDown(600), because it has the highest priority. After the primary became a secondary, I shut it down normally
and then started it with the 4.4.8 binary. After the upgrade was complete, the system ran for more than 8 hours.
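
For reference, the per-node commands for that sequence would look roughly like the following in the mongo shell (a sketch of the steps described above, not a transcript of what was actually run):

// On each secondary in turn: clean shutdown, then restart the node on the 4.4.8 binary.
db.adminCommand({ shutdown: 1 })

// On the primary (which has the highest priority): step it down first, then shut it
// down cleanly and restart it on the 4.4.8 binary.
rs.stepDown(600)                 // keep this node ineligible for re-election for 600s
db.adminCommand({ shutdown: 1 })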

Then, can you elaborate on the timeline of what has occurred here?
Yes, here it is.

The affected node is the shard2 secondary (srvdb303.yto.cloud):
1. Shutdown using 4.4.4, time 2021-08-17T16:12:36.840+08:00:
{"t":{"$date":"2021-08-17T16:12:36.840+08:00"},"s":"I", "c":"CONTROL", "id":23138, "ctx":"conn362334","msg":"Shutting down","attr":{"exitCode":0}}
2. Startup using 4.4.8, time 2021-08-17T16:12:44.132+08:00:
{"t":{"$date":"2021-08-17T16:12:44.132+08:00"},"s":"I", "c":"CONTROL", "id":20698, "ctx":"main","msg":"***** SERVER RESTARTED *****"}
3. Crash using 4.4.8, time 2021-08-18T10:07:19.822+08:00
[From the following log: the duplicated _id comes from an op:u entry, i.e. an update led to this duplicate.]
{"t":{"$date":"2021-08-18T10:07:19.821+08:00"},"s":"F", "c":"REPL", "id":21238, "ctx":"ReplWriterWorker-1777","msg":"Writer worker caught exception","attr":{"error":"DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: \"a148c2bd1c274559b674ef3eddb46d01\" } }: E11000 duplicate key error collection: customerService.TrackerDetail index: _id_ dup key: { _id: \"a148c2bd1c274559b674ef3eddb46d01\" }","oplogEntry":{"op":"u","ns":"customerService.TrackerDetail","ui":{"$uuid":"6345c1d9-8e32-4d82-a2bb-b6c6313f17c1"},"o":{"$v":1,"$set":{"opTime":{"$date":"2021-08-18T02:07:19.000Z"},"opUt":{"$date":"2021-08-18T02:07:19.824Z"}}},"o2":{"no":"xxxx","_id":"a148c2bd1c274559b674ef3eddb46d01"},"ts":{"$timestamp":{"t":1629252439,"i":252}},"t":8,"v":2,"wall":{"$date":"2021-08-18T02:07:19.819Z"}}}}
{"t":{"$date":"2021-08-18T10:07:19.822+08:00"},"s":"F", "c":"-", "id":23095, "ctx":"OplogApplier-0","msg":"Fatal assertion","attr":{"msgid":34437,"error":"DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: \"a148c2bd1c274559b674ef3eddb46d01\" } }: E11000 duplicate key error collection: customerService.TrackerDetail index: _id_ dup key: { _id: \"a148c2bd1c274559b674ef3eddb46d01\" }","file":"src/mongo/db/repl/oplog_applier_impl.cpp","line":510}}
{"t":{"$date":"2021-08-18T10:07:19.822+08:00"},"s":"F", "c":"-", "id":23096, "ctx":"OplogApplier-0","msg":"\n\n***aborting after fassert() failure\n\n"}

1. The logs for the affected nodes, including before, leading up to, and after the first sign of corruption.
2. a clear timeline of activities (with timestamps and timezones) including server restarts, replica set changes, version changes, etc.
Both of these are covered by the shutdown/restart/crash timeline and logs above.

3. the output of validate() on each affected node
I need to start the node as a standalone in order to run validate:

db.runCommand({ validate: "TrackerDetail", full: true })

"advice" : "A corrupt namespace has been detected. See http://dochub.mongodb.org/core/data-recovery for recovery steps.",

Why can the data become corrupt when I shut the node down normally?
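
For completeness, the same full validation can be run over every collection in the database from the standalone mongo shell, along these lines (a sketch; it assumes only the customerService database named above):

// Hedged sketch: validate every collection in customerService while the node is
// started as a standalone, and print the ones that come back invalid.
var cs = db.getSiblingDB("customerService");
cs.getCollectionNames().forEach(function (name) {
    var res = cs.runCommand({ validate: name, full: true });
    if (!res.valid) {
        print("INVALID: " + name);
        printjson({ errors: res.errors, advice: res.advice });
    }
});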

I have uploaded three files to the support uploader location.

curl -X POST https://upload.box.com/api/2.0/files/content -H 'Authorization: Bearer 1!uKuATr9f9QobvE9AHsQ_XR44g2Igm3uZzp0s_N91uYhMJ1sGZDLfivIg6zjWUjwF352nq59XPNE0eZyt53AE5fXpHUmfFxV5DeJQ4HItqE_1rMu9QbN6xIiwKhYo_caHj_xME3IB5iESmpg0V8X0KX5A94deYuyKGtZuViQQcmMahbsFm3r5FqapqwW6MfjUWZdmNAjmRpkmDBybnJ7PrN1mOcar7OUMXE0p1toQQ-7BWvUUtK2VzH9flsBoqA8E2Z_kvMxE_aUZhDrB4TImv1IGpzC6pGrCtjK8y-Mu6Fh7N89ozs2cJFGx6FryzuWIyDizzh3bp1ufE6PEm_ieOwLNgYHYRChNabeHyBwUEAY6xu5WEuZiOdJhUDHMW2DWAf-TTGqdFbpkR489Q1FnAy-oOEY.' -H 'Content-Type: multipart/form-data' -F attributes='{"name": "diagnostic.data.tar", "parent": {"id": "143586341049"}}' -F file=@diagnostic.data.tar > /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 192M 100 1171 100 192M 44 7503k 0:00:26 0:00:26 -::- 0

curl -X POST https://upload.box.com/api/2.0/files/content \
> -H 'Authorization: Bearer 1!uKuATr9f9QobvE9AHsQ_XR44g2Igm3uZzp0s_N91uYhMJ1sGZDLfivIg6zjWUjwF352nq59XPNE0eZyt53AE5fXpHUmfFxV5DeJQ4HItqE_1rMu9QbN6xIiwKhYo_caHj_xME3IB5iESmpg0V8X0KX5A94deYuyKGtZuViQQcmMahbsFm3r5FqapqwW6MfjUWZdmNAjmRpkmDBybnJ7PrN1mOcar7OUMXE0p1toQQ-7BWvUUtK2VzH9flsBoqA8E2Z_kvMxE_aUZhDrB4TImv1IGpzC6pGrCtjK8y-Mu6Fh7N89ozs2cJFGx6FryzuWIyDizzh3bp1ufE6PEm_ieOwLNgYHYRChNabeHyBwUEAY6xu5WEuZiOdJhUDHMW2DWAf-TTGqdFbpkR489Q1FnAy-oOEY.' \
> -H 'Content-Type: multipart/form-data' \
> -F attributes='{"name": "shard2.log", "parent": {"id": "143586341049"}}' \
> -F file=@shard2.log > /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 84.9M 100 1161 100 84.9M 12 922k 0:01:36 0:01:34 0:00:02 0

curl -X POST https://upload.box.com/api/2.0/files/content \
> -H 'Authorization: Bearer 1!uKuATr9f9QobvE9AHsQ_XR44g2Igm3uZzp0s_N91uYhMJ1sGZDLfivIg6zjWUjwF352nq59XPNE0eZyt53AE5fXpHUmfFxV5DeJQ4HItqE_1rMu9QbN6xIiwKhYo_caHj_xME3IB5iESmpg0V8X0KX5A94deYuyKGtZuViQQcmMahbsFm3r5FqapqwW6MfjUWZdmNAjmRpkmDBybnJ7PrN1mOcar7OUMXE0p1toQQ-7BWvUUtK2VzH9flsBoqA8E2Z_kvMxE_aUZhDrB4TImv1IGpzC6pGrCtjK8y-Mu6Fh7N89ozs2cJFGx6FryzuWIyDizzh3bp1ufE6PEm_ieOwLNgYHYRChNabeHyBwUEAY6xu5WEuZiOdJhUDHMW2DWAf-TTGqdFbpkR489Q1FnAy-oOEY.' \
> -H 'Content-Type: multipart/form-data' \
> -F attributes='{"name": "validate.txt", "parent": {"id": "143586341049"}}' \
> -F file=@validate.txt > /dev/null
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 37433 100 1160 100 36273 519 16253 0:00:02 0:00:02 -::- 16258

Comment by jing xu [ 18/Aug/21 ]

Hi Eric:
One by one, all the secondary instances started reporting duplicate key errors and going down, so I guess it cannot be logical corruption that hits every secondary while leaving only the primary untouched. Fortunately the instances are small; otherwise the resyncs would never complete, only the primary would stay alive, and it would eventually become read-only.

Comment by jing xu [ 18/Aug/21 ]

Hi Eric:
I will sort through the process for you. It needs some time.

Comment by Eric Sedor [ 17/Aug/21 ]

Hi 601290552@qq.com,

This error message leads us to suspect logical corruption. Please make a complete copy of each affected node's $dbpath directory to safeguard so that you can work off of the current $dbpath.

Our ability to determine the source of this corruption depends greatly on your ability to provide:

  1. The logs for the affected nodes, including before, leading up to, and after the first sign of corruption.
  2. a clear timeline of activities (with timestamps and timezones) including server restarts, replica set changes, version changes, etc.
  3. the output of validate() on each affected node

Would you please archive (tar or zip) the validate output, mongod.log files, and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Then, can you elaborate on the timeline of what has occurred here?

The ideal resolution will likely be to perform a clean resync from an unaffected node.

Thank you,
Eric
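
Regarding the clean resync suggested above, a rough outline (an illustration of the usual initial-sync approach, not instructions specific to this ticket): cleanly shut down the affected secondary, move its old dbpath aside as the safeguard copy, restart it with an empty dbpath so it performs an initial sync from an unaffected member, and watch it catch up, e.g. from the mongo shell:

// Hedged sketch: watch the resyncing member move through STARTUP2 -> SECONDARY
// and compare member optimes to gauge how far behind it still is.
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  " + tojson(m.optimeDate));
});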

Comment by jing xu [ 17/Aug/21 ]

I checked from the primary:
this record was created at "createTime" : ISODate("2021-08-14T03:03:00.557Z"),
and I find "currentTime" : ISODate("2021-08-17T17:40:25Z"),
which is the same as the secondary's crash time, so it was probably updated on the primary at that moment.
But I cannot find this record on the secondary, so why does it report a duplicate key?
"createTime" : ISODate("2021-08-14T03:03:00.557Z"),
"rBranch" : "769902",
"rRegion" : "755920",
"_class" : "cn.yto.wdgj.rw.pojo.mongodo.WantedMonitorNewDO",
"contractOrg" : null,
"currentOrg" : "769902",
"currentTime" : ISODate("2021-08-17T17:40:25Z"),

Comment by jing xu [ 17/Aug/21 ]

Another secondary, on shard3, crashed again:
{"t":

{"$date":"2021-08-18T03:40:38.987+08:00"}

,"s":"F", "c":"REPL", "id":21235, "ctx":"initandlisten","msg":"Failed to apply batch of operations","attr":{"numOperationsInBatch":1013,"firstOperation":{"lsid":{"id":

{"$uuid":"e0556e1e-9897-41c1-a4dc-6afa6d9e50db"}

,"uid":{"$binary":

{"base64":"kafFGlFnFmnb4qY6Zo1wZcN0z6AuZ9x3brpDqqhzH/U=","subType":"0"}

}},"txnNumber":2339,"op":"u","ns":"returnMonitor.exp_want_new_monitor","ui":

{"$uuid":"0814628c-47c0-4210-a9d4-1c7c0a4161ee"}

,"o":{"$v":1,"$set":{"backFlag":1,"currentOp":751,"currentTime":

{"$date":"2021-08-17T17:43:19.000Z"}

,"dealStatus":19,"dealTime":

{"$date":"2021-08-17T17:43:19.207Z"}

,"opList.7":162922219900751,"orgBranch":"579038"}},"o2":{"_id":{"$oid":"6118aaf405f34e22a4f681a4"}},"ts":{"$timestamp":{"t":1629222199,"i":6}},"t":6,"v":2,"wall":

{"$date":"2021-08-17T17:43:19.212Z"}

,"stmtId":0,"prevOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1}},"lastOperation":{"lsid":{"id":

{"$uuid":"783598dd-63df-4489-9505-1ab1923be921"}

,"uid":{"$binary":

{"base64":"kafFGlFnFmnb4qY6Zo1wZcN0z6AuZ9x3brpDqqhzH/U=","subType":"0"}

}},"txnNumber":3065,"op":"u","ns":"returnMonitor.exp_want_new_monitor","ui":

{"$uuid":"0814628c-47c0-4210-a9d4-1c7c0a4161ee"}

,"o":{"$v":1,"$set":{"currentOp":171,"currentOrg":"270902","currentTime":

{"$date":"2021-08-17T17:44:43.000Z"}

,"nextOrg":"712001","opList.5":162922228300171,"warnLevel":20}},"o2":{"_id":{"$oid":"6119bda405f34e22a4f78351"}},"ts":{"$timestamp":{"t":1629222294,"i":2}},"t":6,"v":2,"wall":

{"$date":"2021-08-17T17:44:54.404Z"}

,"stmtId":0,"prevOpTime":{"ts":{"$timestamp":{"t":0,"i":0}},"t":-1}},"failedWriterThread":12,"error":"DuplicateKey{ keyPattern:

{ _id: 1 }

, keyValue:

{ _id: ObjectId('611732d899faf22fbb509820') }

}: E11000 duplicate key error collection: returnMonitor.exp_want_new_monitor index: id dup key:

{ _id: ObjectId('611732d899faf22fbb509820') }

"}}

When I start the node as a standalone and run a find, the database cannot find this record:

> db.exp_want_new_monitor.find({_id: ObjectId('611732d899faf22fbb509820')})
> db.exp_want_new_monitor.count();
17806952
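
A missing document together with a duplicate key error during oplog application would be consistent with an index/collection inconsistency: the _id index may still contain the key even though the document is gone from the collection. One rough way to surface that from the standalone shell is to compare an index-driven count with a full collection scan (a sketch; on a 17.8M-document collection both scans take a while):

// Hedged sketch: if the _id index and the collection disagree, these two counts differ.
var coll = db.getSiblingDB("returnMonitor").exp_want_new_monitor;
var viaIdIndex  = coll.find().hint({ _id: 1 }).itcount();      // walks the _id index
var viaCollScan = coll.find().hint({ $natural: 1 }).itcount(); // walks the collection
print("via _id index: " + viaIdIndex + ", via collection scan: " + viaCollScan);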

Comment by jing xu [ 17/Aug/21 ]

When I bring the secondary back up, the log indicates:
{"t":{"$date":"2021-08-17T19:14:52.076+08:00"},"s":"I", "c":"NETWORK", "id":22990, "ctx":"conn245","msg":"DBException handling request, closing client connection","attr":{"error":"NotWritablePrimary: Not-primary error while processing 'find' operation on 'returnMonitor' database via fire-and-forget command execution."}}
