[SERVER-61168] Mongo Secondary Oplog Keeps Getting Too Far Behind Primary Created: 01/Nov/21  Updated: 04/Nov/21  Resolved: 04/Nov/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Neil Allen Assignee: Edwin Zhou
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

We have a replica set consisting of a primary, a secondary, and an arbiter. Our databases total 20TB+, and we had sized our oplog so that it covered a 16-hour window the first time our secondary fell too far behind. After resyncing the secondary and extending the oplog to a 54-hour window, we went another three weeks before we hit this issue again over the weekend.
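
For reference, a minimal way to confirm the configured oplog size and the time window it currently covers is to run the following in mongosh on each data-bearing node:

// Prints the configured oplog size, the space in use, and
// "log length start to end", i.e. the time span the oplog currently covers.
rs.printReplicationInfo()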

What specific things should we look at as the likely culprit here? From the documentation, it looks like disk I/O or network issues could potentially cause this, but I'm not seeing any indication of either so far. I just want to cover all of our bases before digging into those any further.
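
A minimal sketch of the kind of checks that can surface replication lag from mongosh, offered as a starting point rather than a definitive diagnosis:

// On the primary: how far each secondary's last applied write lags behind.
rs.printSecondaryReplicationInfo()

// Member states and optimes, useful for spotting a lagging or stale member.
rs.status().members.forEach(m =>
  printjson({ name: m.name, state: m.stateStr, optimeDate: m.optimeDate }))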

Would sharding the database help with issues like this? Our database growth keeps climbing, and we definitely need to start evaluating whether sharding can solve this problem.
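
If you do evaluate sharding, the basic steps from mongosh against a sharded cluster are sketched below; the database name, collection name, and shard key are hypothetical, and choosing a shard key that matches your write and query patterns is the part that actually matters:

// Enable sharding for a database, then shard one of its collections on a chosen key.
sh.enableSharding("mydb")
sh.shardCollection("mydb.events", { customerId: 1, ts: 1 })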

I can upload logs too if necessary. Thanks



 Comments   
Comment by Edwin Zhou [ 04/Nov/21 ]

Hi ncallen@cobaltiron.com,

After inspecting the log files and FTDC, I can see that the secondary node goes into ROLLBACK but is unable to catch up because the primary's oplog is overflowing before a common point can be found.

{"t":{"$date":"2021-10-30T06:46:33.183Z"},"s":"I","c":"ROLLBACK","id":21681,"ctx":"BackgroundSync","msg":"Starting rollback","attr":{"syncSource":"cspdb04ip.us2.cobaltiron.com:27017"}}
{"t":{"$date":"2021-10-30T06:46:33.422Z"},"s":"I","c":"ROLLBACK","id":21682,"ctx":"BackgroundSync","msg":"Finding the Common Point"}
{"t":{"$date":"2021-10-30T11:15:06.992Z"},"s":"W","c":"ROLLBACK","id":21728,"ctx":"BackgroundSync","msg":"Rollback cannot complete at this time (retrying later)","attr":{"error":"CappedPositionLost: CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(7024107236637019329)","myLastAppliedOpTime":{"ts":{"_timestamp":{"t":1635411709,"i":1973}},"t":18},"minValid":{"ts":{"_timestamp":{"t":1635411709,"i":1973}},"t":18}}}
{"t":{"$date":"2021-10-30T11:15:06.992Z"},"s":"F","c":"ROLLBACK","id":4655800,"ctx":"BackgroundSync","msg":"Index builds stopped prior to rollback cannot be restarted by subsequent rollback attempts"}

You may be able to find guidance in our docs to help avoid replica set rollbacks.
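
One common recommendation in that guidance is to acknowledge important writes with a "majority" write concern, since majority-acknowledged writes cannot be rolled back. A hedged sketch with a hypothetical database, collection, and document; note that in a primary-secondary-arbiter set a majority write needs both data-bearing members available:

// The write is acknowledged only after it has replicated to a majority
// of data-bearing, voting members.
db.getSiblingDB("inventory").orders.insertOne(
  { sku: "abc123", qty: 1 },
  { writeConcern: { w: "majority", wtimeout: 5000 } })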

The SERVER project is reserved for bug reports, so if you'd like to troubleshoot this further, we'd encourage you to start by asking for help on the MongoDB Developer Community Forums.

If the discussion there leads you to suspect a bug in the MongoDB server, then we'd want to investigate it as a possible bug here in the SERVER project.

Best,
Edwin

Comment by Neil Allen [ 02/Nov/21 ]

Logs have been uploaded to the support uploader as cspdb03ip-2021-10-30.log.bz2 (the mongod log from the day of the crash), along with the diagnostic data as cspdb03ip-diagnostic-data.tar.gz.

Comment by Edwin Zhou [ 02/Nov/21 ]

Hi ncallen@cobaltiron.com,

Thank you for your report. Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Best,
Edwin
