[SERVER-42734] when start shard and the error:DuplicateKey: E11000 duplicate key error collection: config.cache.chunks Created: 09/Aug/19  Updated: 19/Sep/19  Resolved: 19/Sep/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Chen Jian Assignee: Siyuan Zhou
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File shard.log    
Operating System: ALL
Sprint: Repl 2019-08-26, Repl 2019-09-09, Repl 2019-09-23
Participants:

 Description   

At first, one of my primary shards ran out of disk space because of duplicate key error log messages.

Now one of the primary shards is down and cannot start again.



 Comments   
Comment by Kelsey Schubert [ 19/Sep/19 ]

Hi chenjian@tmxmall.com,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Kelsey

Comment by Siyuan Zhou [ 13/Aug/19 ]

Hi chenjian@tmxmall.com, sorry to hear about the failure.

We need more data to investigate the root cause of this issue. Since you mentioned the issue happened before the restart and crash, could you please post the log from before the crash? We also need the contents of the "config" database and the oplog on the crashed node. We need this data as it is before you run the recovery procedure below. You can dump the data with mongodump, compress it, and upload it to this ticket.

  1. Start the crashed node, sh.tmxmall.mongodb06:62801, without --replSet, so that it starts in standalone mode and does not join the replset.
  2. mongodump --host sh.tmxmall.mongodb06 --port 62801 --db local --archive=local.gz --gzip
  3. mongodump --host sh.tmxmall.mongodb06 --port 62801 --db config --archive=config.gz --gzip
  4. Upload local.gz and config.gz as attachments to this ticket.
  5. Upload the log of sh.tmxmall.mongodb06:62801 from before the crash.
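
Before dumping, it may help to confirm that the node really came up in standalone mode. The following is only a sketch, assuming the legacy mongo shell's db.isMaster() helper; isStandalone is a hypothetical name and not part of the shell. It relies on replica set members reporting a setName field, which a standalone mongod omits.

```javascript
// Hypothetical helper (not a shell built-in): decide from an isMaster
// response whether the node is running as a standalone mongod.
function isStandalone(isMasterResponse) {
  // Replica set members report the set name in `setName`;
  // a standalone mongod omits that field and reports ismaster: true.
  return isMasterResponse.setName === undefined &&
         isMasterResponse.ismaster === true;
}

// In a mongo shell connected to sh.tmxmall.mongodb06:62801, one could run:
//   isStandalone(db.isMaster())
```

If this returns false, the node was probably started with --replSet and should be restarted without it before running mongodump.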

To recover from the failure, you need to remove all the documents of the collection in question. This is safe because the collection is a cache used by sharding and will be recreated by sharding.

  1. Start the crashed node, sh.tmxmall.mongodb06:62801, without --replSet, so that it starts in standalone mode and does not join the replset.
  2. Remove all documents in config.cache.chunks.pr_tmxbase.toffs_6.chunks on the node by running

    use config
    db.getCollection("cache.chunks.pr_tmxbase.toffs_6.chunks").remove({})
    

    Please note this is different from dropping a collection, which we cannot do right now.

  3. Restart the node sh.tmxmall.mongodb06:62801 with --replSet <your replset name>. Startup recovery should succeed and get past the failure point we saw in your log, and the node should rejoin the replset as a secondary. At this point all nodes in the replset are running, but the data on sh.tmxmall.mongodb06:62801 is inconsistent with that on the primary.
  4. We need to fix the inconsistency between the primary and the crashed node. Log on to the primary and drop the collection.

    use config
    db.getCollection("cache.chunks.pr_tmxbase.toffs_6.chunks").drop()


  5. After the inconsistent collection is dropped on the primary, the crashed node will replicate the drop command, and the data should then be consistent. You can run dbhash on the config database to verify that. The crashed node has now recovered to a normal secondary state.
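
The dbhash check in the last step can be sketched as follows. This is an illustration under assumptions, not an official procedure: diffDbHash is a hypothetical helper, and it only assumes that the dbHash command result contains a collections document mapping each collection name to its hash. Hashes taken while writes are still in flight may differ temporarily.

```javascript
// Hypothetical helper: return the names of collections whose hashes
// differ between two dbHash results (or that exist on only one node).
function diffDbHash(primaryResult, secondaryResult) {
  const a = primaryResult.collections || {};
  const b = secondaryResult.collections || {};
  // Union of collection names seen on either node.
  const names = new Set(Object.keys(a).concat(Object.keys(b)));
  return Array.from(names).filter((name) => a[name] !== b[name]).sort();
}

// On each node, in the mongo shell, the input could be obtained with:
//   db.getSiblingDB("config").runCommand({ dbHash: 1 })
// An empty array from diffDbHash means the two results agree.
```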

During the whole procedure, the crashed node cannot become primary, so please make sure you run the commands while the replset is stable and has a primary other than the crashed node. After the procedure, the crashed node should be a normal secondary again and can stand for election.
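
One way to check that precondition is to inspect the output of rs.status() before each step. The sketch below assumes the members array of that output, where each member has a name ("host:port") and a stateStr; primaryIsOtherThan is a hypothetical helper name.

```javascript
// Hypothetical helper: true only if the replset currently has a primary
// and that primary is not the crashed node.
function primaryIsOtherThan(status, crashedHost) {
  const members = status.members || [];
  const primary = members.find((m) => m.stateStr === "PRIMARY");
  return primary !== undefined && primary.name !== crashedHost;
}

// In a mongo shell connected to any running member of the replset:
//   primaryIsOtherThan(rs.status(), "sh.tmxmall.mongodb06:62801")
// The host string is assumed to match the member name in the replset config.
```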

Comment by Chen Jian [ 13/Aug/19 ]

Hi, can you tell me how to restore this node first?

Comment by Chen Jian [ 09/Aug/19 ]

mongos> db.collections.find({_id: 'pr_tmxbase.toffs_6.chunks'});
{ "_id" : "pr_tmxbase.toffs_6.chunks", "lastmodEpoch" : ObjectId("5d297bc1c10ec7ebdbf0fabf"), "lastmod" : ISODate("1970-02-19T17:02:47.299Z"), "dropped" : false, "key" : { "files_id" : "hashed" }, "unique" : false, "uuid" : UUID("9dcdff1c-4f4e-4457-ad9b-e55633e376b0") }

Comment by Kaloian Manassiev [ 09/Aug/19 ]

chenjian@tmxmall.com, what is the shard key of the pr_tmxbase.toffs_6.chunks collection? Can you please run this query against the cluster and include the output:

use config;
db.collections.find({_id: 'pr_tmxbase.toffs_6.chunks'});
