[SERVER-20328] Allow secondary reads while applying oplog entries Created: 09/Sep/15 Updated: 15/Nov/21 Resolved: 13/Apr/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Storage |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Scott Hernandez (Inactive) | Assignee: | Louis Williams |
| Resolution: | Duplicate | Votes: | 22 |
| Labels: | None | | |
| Remaining Estimate: | Not Specified | | |
| Time Spent: | Not Specified | | |
| Original Estimate: | Not Specified | | |
| Issue Links: | |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
Currently, replication applies batches of oplog entries while holding a lock, to ensure that no reads can consume data that is not in the same causal order as on the primary. Instead of locking and blocking readers, we can serve reads from a snapshot of the last consistent replication state and essentially hide all writes until we reach a new consistent state. In addition to decoupling replication writes from readers (both users and other replicas), this also allows storage engines to optimize the writing of replicated data to improve performance and reduce I/O, since all replicated data is transient and disposable until committed. |
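A minimal sketch of the idea in the description above: apply a whole batch of oplog entries to a private copy of the data and only publish it atomically once the batch is complete, so readers always see the last consistent state and never block on batch application. This is a toy in-memory model, not MongoDB internals; all names are hypothetical.

```python
import threading
from copy import deepcopy


class SnapshotStore:
    """Toy model: readers see the last published consistent snapshot."""

    def __init__(self):
        self._published = {}           # last consistent state, visible to readers
        self._lock = threading.Lock()  # protects only the publish step

    def read(self, key):
        # Readers never block while a batch is being applied; they read from
        # the last published (consistent) snapshot.
        return self._published.get(key)

    def apply_batch(self, oplog_entries):
        # Apply the whole batch to a private copy; none of these writes are
        # visible to readers until the batch is complete.
        pending = deepcopy(self._published)
        for op, key, value in oplog_entries:
            if op in ("i", "u"):       # insert / update
                pending[key] = value
            elif op == "d":            # delete
                pending.pop(key, None)
        # Atomically publish the new consistent state.
        with self._lock:
            self._published = pending


if __name__ == "__main__":
    store = SnapshotStore()
    store.apply_batch([("i", "a", 1), ("i", "b", 2)])
    print(store.read("a"))  # 1: only fully applied batches are visible
```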
| Comments |
| Comment by Louis Williams [ 13/Apr/18 ] |
|
Completed by |
| Comment by Zhang Youdong [ 17/May/17 ] |
|
@deyukong @Eric We also encountered this issue, as mentioned in SERVER-24661, and it is linked to this Read From Snapshots issue, which has not been making progress. WT-3181 would be really helpful for solving this problem. I want to know the release plan: will it be included in MongoDB 3.6? |
| Comment by deyukong [ 16/May/17 ] |
|
Yes, you're right. |
| Comment by Eric Milkie [ 16/May/17 ] |
|
If a majority of secondaries are stale, you are already in dangerous territory, as writes are not propagating to a majority of nodes (and writes done with a write concern of w:majority will time out). In such a situation, secondary nodes can accumulate snapshot data until cache memory is full and then start spilling snapshot data to disk. The solution is to use w:majority writes, or to set up monitoring to avoid running out of disk space. |
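As a usage note on the w:majority advice above, here is a hedged PyMongo sketch; the connection string, database, and collection names are placeholders, and the timeout value is arbitrary.

```python
from pymongo import MongoClient, WriteConcern
from pymongo.errors import WTimeoutError

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
orders = client.shop.get_collection(
    "orders",
    write_concern=WriteConcern(w="majority", wtimeout=5000),
)

try:
    orders.insert_one({"sku": "abc", "qty": 1})
except WTimeoutError:
    # A majority of nodes did not acknowledge the write in time, which is the
    # "dangerous territory" described in the comment above.
    print("write not majority-committed; check replica set health")
```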
| Comment by deyukong [ 16/May/17 ] |
|
Quoted: in the current readMajority implementation, only 1000 snapshots will be held. If you try to keep all snapshots since the committed point in the snapshot manager, there is a chance that a secondary will accumulate a large amount of data on disk when a majority of secondaries are stale but one is not. |
| Comment by Eric Milkie [ 16/May/17 ] |
|
Once the timestamp project is fully completed, you will be able to do a point-in-time read on a node at any optime between now and the majority commit point with snapshot isolation. In essence, we will be making snapshots automatically for every op in the oplog. The data enabling these snapshots will be cleaned up as the commit point moves forward; in most cases it will not be possible to begin a read at a point in time prior to the current commit point (but in-flight reads will pin their snapshot until they are finished). |
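For reference, a client can already ask a secondary for only majority-committed data via read concern "majority", which is the commit point the comment above refers to. A hedged PyMongo sketch; connection string and names are placeholders.

```python
from pymongo import MongoClient, ReadPreference
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
orders = client.shop.get_collection(
    "orders",
    read_preference=ReadPreference.SECONDARY,
    read_concern=ReadConcern("majority"),
)

# This read returns only majority-committed data, so it does not observe
# oplog entries from a batch that has not yet reached the commit point.
for doc in orders.find({"sku": "abc"}):
    print(doc)
```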
| Comment by deyukong [ 16/May/17 ] |
|
Hi, @Eric Milkie |
| Comment by Eric Milkie [ 16/May/17 ] |
|
Hi Deyu Kong, |
| Comment by deyukong [ 15/May/17 ] |
|
I'm on the Tencent Cloud MongoDB team. In our real-world deployments, a great many customers are troubled by this problem. There must be some points that I've missed, and this needs further discussion; I hope the MongoDB engineers can participate. |