[SERVER-36427] Sharding cluster unable to sync and mongodump fails. Created: 03/Aug/18  Updated: 15/Sep/18  Resolved: 21/Aug/18

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Prasad Surase Assignee: Nick Brewer
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 1.png     PNG File 2.png     PNG File 3.png    
Operating System: ALL
Participants:

 Description   

I have a 13-server MongoDB cluster consisting of 1 query router, 3 config servers running as a replica set, and 3 shards, each a replica set (primary, secondary, and arbiter). It's installed on AWS EC2 R-series instances. Monit is used to restart the mongod service in case it exceeds 95% memory usage.
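For reference, a Monit process check of the kind described above might look like the following. This is a hypothetical sketch, not taken from my deployment; the pidfile path and start/stop commands are assumptions and would need to match your installation.

```
# Hypothetical Monit check: restart mongod when its memory usage exceeds 95%.
# Pidfile path and systemctl commands are assumptions.
check process mongod with pidfile /var/run/mongodb/mongod.pid
  start program = "/usr/bin/systemctl start mongod"
  stop program  = "/usr/bin/systemctl stop mongod"
  if memory usage > 95% then restart
```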

My Shard3 primary failed and the Shard3 secondary became primary (as expected). The problem is that the Shard3 primary mongod process isn't able to restart, logging the following:

Initializing full-time diagnostic data capture with directory '/data_storage/data/diagnostic.data'
2018-08-03T03:59:12.144+0000 I REPL [initandlisten] Rollback ID is 210
2018-08-03T03:59:12.145+0000 I REPL [initandlisten] Starting recovery oplog application at the appliedThrough: { ts: Timestamp(1533190391, 15335), t: 454 }
2018-08-03T03:59:12.145+0000 I REPL [initandlisten] Replaying stored operations from { : Timestamp(1533190391, 15335) } (exclusive) to { : Timestamp(1533190418, 1) } (inclusive).
2018-08-03T03:59:12.145+0000 F REPL [initandlisten] Oplog entry at { : Timestamp(1533190391, 15335) } is missing; actual entry found is { : Timestamp(1533190393, 1) }
2018-08-03T03:59:12.145+0000 F - [initandlisten] Fatal Assertion 40292 at src/mongo/db/repl/replication_recovery.cpp 218
2018-08-03T03:59:12.145+0000 F - [initandlisten]

***aborting after fassert() failure

I tried to take a mongodump on the query router and it failed too (the same command had succeeded for earlier dumps).

I have attached screenshots from the Shard3 primary for your reference.



 Comments   
Comment by Nick Brewer [ 21/Aug/18 ]

The fix for this behavior has been released in MongoDB 3.6.7.

-Nick

Comment by Nick Brewer [ 03/Aug/18 ]

prasadsurase I believe you're running into a known issue that is detailed here: SERVER-34895

A fix to prevent this behavior is in MongoDB 4.0, but it hasn't been backported to 3.6 yet. In the meantime, you'll need to perform an initial sync to restore the primary.
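For anyone hitting the same assertion, a manual initial sync generally means wiping the broken member's data directory and letting it resync from the current primary. A sketch of the procedure, assuming a systemd-managed mongod and the dbPath shown in the log above (the service name and backup path are assumptions):

```
# Ops sketch — verify paths/service names against your own deployment first.
sudo systemctl stop mongod                       # ensure the broken node is down
mv /data_storage/data /data_storage/data.bak     # keep the old files until the sync succeeds
mkdir /data_storage/data
sudo systemctl start mongod                      # node rejoins the set and performs an initial sync
# Monitor progress from another member, e.g. with rs.status() in the mongo shell.
```

Once the member reports SECONDARY in rs.status(), the backed-up directory can be removed.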

I'll update this ticket once the fix is introduced in 3.6.

-Nick

Generated at Thu Feb 08 04:43:04 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.