Details
-
Task
-
Resolution: Won't Fix
-
Major - P3
-
None
-
None
-
Probably on the oplog resize page or on prod notes
Description
The replication oplog window should cover normal maintenance/downtime windows (to avoid a full resync), and also the time needed to restore a replica set member: either the time needed for an initial sync, or the age of the last backup plus the time to restore it, so that the restored member does not become stale.
To estimate the size, look at the oplog rate (the oplog GB/h graph in MMS, or the db.printReplicationInfo() output) at its maximum, i.e. during the load peak.
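As a rough illustration of turning shell output into an oplog rate, here is a minimal Python sketch. It assumes you have read the used oplog size in MB and the time span between the first and last oplog entries in seconds (for example from the usedMB and timeDiff fields of db.getReplicationInfo()); the function name and sample numbers are hypothetical. Note that this gives only the average rate over the current window, so the peak rate still has to come from a graph or from sampling during peak load.

```python
def average_oplog_rate_gb_per_hour(used_mb: float, window_seconds: float) -> float:
    """Average oplog write rate in GB/h over the current oplog window.

    used_mb        -- oplog space currently in use, in MB (assumed input)
    window_seconds -- time between first and last oplog entry, in seconds
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    used_gb = used_mb / 1024          # MB -> GB
    window_hours = window_seconds / 3600  # seconds -> hours
    return used_gb / window_hours


# Hypothetical sample: 73728 MB (72 GB) of oplog spanning 24 hours.
rate = average_oplog_rate_gb_per_hour(73728, 86400)
print(f"average oplog rate: {rate:.2f} GB/h")
```

Because the average smooths out spikes, treat the result as a lower bound for sizing purposes.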
The default oplog size should never be used without consideration.
For the sake of example, let's say my system has the following timings:
- usual maintenance window: 24 hours
- time to restore a backup: 3 hours
- backup interval: 24 hours
- time needed for an initial sync: 6 hours
- oplog rate at peak time: 3 GB/h
Here I would need to cover at least 24 hours for the usual maintenance window (which also covers the 6 hours needed for an initial sync). However, if I were to restore a backup that is up to 24 hours old, and the restore process takes 3 hours, I would need at least 24 + 3 = 27 hours in the oplog to allow the restored replica set member to catch up.
It's practical to round this up to 36 hours, which is 108 GB at the 3 GB/h rate. Further, it makes sense to account for sudden load spikes and some future growth of the oplog rate, so I would use 200 GB instead.
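The arithmetic of this example can be sketched in a few lines of Python. This is only an illustration of the reasoning above; the function names are hypothetical, and the 36-hour rounding and the final 200 GB figure remain judgment calls rather than computed values.

```python
def required_oplog_window_hours(maintenance_h: float,
                                initial_sync_h: float,
                                backup_interval_h: float,
                                restore_h: float) -> float:
    """Minimum oplog window: the longest scenario a member must survive."""
    # A restored member is behind by the backup's age plus the restore time.
    restore_from_backup_h = backup_interval_h + restore_h
    return max(maintenance_h, initial_sync_h, restore_from_backup_h)


def oplog_size_gb(window_h: float, peak_rate_gb_per_h: float) -> float:
    """Oplog size needed to hold window_h hours of writes at the peak rate."""
    return window_h * peak_rate_gb_per_h


# Timings from the example above.
minimum_h = required_oplog_window_hours(
    maintenance_h=24, initial_sync_h=6, backup_interval_h=24, restore_h=3
)
rounded_h = 36  # rounded up by hand for practicality
size_gb = oplog_size_gb(rounded_h, peak_rate_gb_per_h=3)

print(f"minimum window: {minimum_h} h")
print(f"size for {rounded_h} h at peak rate: {size_gb} GB")
```

Any headroom for load spikes and growth is then added on top of the computed size.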