[SERVER-20328] Allow secondary reads while applying oplog entries Created: 09/Sep/15  Updated: 15/Nov/21  Resolved: 13/Apr/18

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Scott Hernandez (Inactive) Assignee: Louis Williams
Resolution: Duplicate Votes: 22
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-34192 Secondary reads during batch applicat... Closed
is duplicated by SERVER-21858 A high throughput update workload in ... Closed
is duplicated by SERVER-24661 Secondary block reader a very long ti... Closed
Related
related to WT-2649 Some way to indicate valid points in ... Closed
related to SERVER-21862 Use record store directly to read fro... Closed
is related to SERVER-25168 Foreground index build blocks all R/W... Closed
is related to SERVER-5729 Special-case concurrency model for op... Closed
is related to SERVER-6883 index creation on secondaries need no... Closed
is related to SERVER-29123 Why is ParallelBatchWriterMode used w... Closed
is related to SERVER-31359 when large inserts into mongo, lots ... Closed
Participants:
Case:

 Description   

Currently, replication applies batches of oplog entries while holding a lock, to ensure that no reads can consume data that is not in the same causal order as on the primary.

Instead of locking out readers, we could serve reads from a snapshot taken at the last consistent replication state and essentially hide all writes until we reach a new consistent state. In addition to preventing replication writes from affecting readers (both users and other replicas), this also allows storage engines to optimize the writing of replicated data to improve performance and reduce I/O, since all replicated data is transient and disposable until committed.
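To make the idea concrete, here is a toy, self-contained Python sketch (not MongoDB code; all names are made up) of serving readers from the snapshot published at the last batch boundary while the applier mutates a separate live view:

{code:python}
import copy
import threading

class SnapshottedStore:
    """Toy key-value store: readers see the last batch boundary, never a half-applied batch."""

    def __init__(self):
        self._live = {}                 # version mutated by the oplog applier
        self._stable = {}               # snapshot published at the last batch boundary
        self._lock = threading.Lock()   # protects only the snapshot pointer swap

    def apply_batch(self, ops):
        # Writes go to the live version; concurrent readers keep using the stable snapshot.
        for key, value in ops:
            self._live[key] = value
        # Publish a new consistent snapshot only after the whole batch is applied.
        with self._lock:
            self._stable = copy.copy(self._live)

    def read(self, key):
        # Readers never wait on the applier beyond this cheap pointer read.
        with self._lock:
            snapshot = self._stable
        return snapshot.get(key)

store = SnapshottedStore()
store.apply_batch([("a", 1), ("b", 2)])
print(store.read("a"))   # -> 1, served from the published snapshot
{code}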



 Comments   
Comment by Louis Williams [ 13/Apr/18 ]

Completed by SERVER-34192

Comment by Zhang Youdong [ 17/May/17 ]

@deyukong @Eric

We also encountered this issue, as mentioned in SERVER-24661, and it is linked to this Read From Snapshots issue, which has made no progress for a long time.

WT-3181 would be really helpful for solving this problem. I'd like to know the release plan: will it be included in MongoDB 3.6?

Comment by deyukong [ 16/May/17 ]

Yes, you're right.
If customers use readMajority and writeMajority, the problem will not happen. But I think this deserves more consideration, since customers may not be willing to use read/write majority for various reasons.
Monitoring disk usage can also address the problem; it is really a tradeoff.
The work on WT-3181 is great and much appreciated.

Comment by Eric Milkie [ 16/May/17 ]

If a majority of secondaries are stale, you are already in dangerous territory, as writes are not propagating to a majority of nodes (and writes done with a write concern of w:majority will time out). In such a situation, the secondary nodes will be able to accumulate snapshot data until cache memory is full and then start spilling the snapshot data to disk. The solution is to use w:majority writes, or to set up monitoring to avoid running out of disk space.
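For reference, one way this advice maps to a driver, assuming pymongo (host, database, and collection names below are placeholders):

{code:python}
from pymongo import MongoClient
from pymongo.read_concern import ReadConcern
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
coll = client.get_database("app").get_collection(
    "events",
    # Writes are acknowledged only once a majority of nodes have them,
    # and time out (wtimeout) if a majority is unreachable.
    write_concern=WriteConcern(w="majority", wtimeout=5000),
    # Reads return only majority-committed data.
    read_concern=ReadConcern("majority"),
)
coll.insert_one({"k": 1})
{code}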

Comment by deyukong [ 16/May/17 ]

Quoting:
"The data enabling these snapshots will be cleaned up as the commit point moves forward"

In the current readMajority implementation, only 1,000 snapshots are held. If you try to keep every snapshot since the committed point in the snapshot manager, there is a chance that a secondary will accumulate a large amount of data on disk when a majority of secondaries are stale but one is not.
The commit point is unpredictable, so I would rather create named snapshots manually to avoid that accumulation on a healthy secondary.

Comment by Eric Milkie [ 16/May/17 ]

Once the timestamp project is fully completed, you will be able to do a point-in-time read on a node at any optime between now and the majority commit point with snapshot isolation. In essence, we will be making snapshots automatically for every op in the oplog. The data enabling these snapshots will be cleaned up as the commit point moves forward; in most cases it will not be possible to begin a read at a point in time prior to the current commit point (but in-flight reads will pin their snapshot until they are finished).

Comment by deyukong [ 16/May/17 ]

Hi @Eric Milkie,
That's really a great feature: with it, each oplog entry will map exactly to a version of a key-value pair in WiredTiger. But I think it will be hard to handle metadata changes such as renameCollection or similar operations. I'm not sure whether renameCollection changes the WiredTiger-related idents; I will review that point later.
You said, quoted below:
"
With this new feature, we'll be able to allow reads on secondaries for both read concern level majority and read concern level local without blocking while a batch is being applied
"
Certainly it will be more convenient to do this with your new feature, but I don't think it is a sine qua non. It may not be convenient to create and release a named snapshot before and after each batchApplyOplog, but that approach can accomplish the task.
Or possibly WT-3181 is one milestone on the roadmap toward snapshot reads on secondaries?

Comment by Eric Milkie [ 16/May/17 ]

Hi Deyu Kong,
We are currently working on a project WT-3181 to label all writes with timestamps, so that point-in-time reads will be possible without needing to create named snapshots ahead of time. With this new feature, we'll be able to allow reads on secondaries for both read concern level majority and read concern level local without blocking while a batch is being applied.
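For reference, a secondary read at read concern level local might be issued from a driver as in the pymongo sketch below (host, database, and collection names are placeholders); the point of the feature is that such a read would no longer block while a batch is being applied:

{code:python}
from pymongo import MongoClient, ReadPreference
from pymongo.read_concern import ReadConcern

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")
coll = client.get_database("app").get_collection(
    "events",
    read_preference=ReadPreference.SECONDARY_PREFERRED,  # route the query to a secondary
    read_concern=ReadConcern("local"),                    # or ReadConcern("majority")
)
doc = coll.find_one({"k": 1})
{code}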

Comment by deyukong [ 15/May/17 ]

I'm on the Tencent Cloud MongoDB team, and in our real-world experience many customers are troubled by this problem.
Take one of our customers, the most famous shared-bicycle service provider in China (the name is withheld for confidentiality). There are two periods each day, when people go to work in the morning and get off work in the afternoon, during which TPS can be four times the normal level. Although we route reads to the secondaries and use the primary only to serve writes, we still see slow queries spike dramatically, perhaps hundreds of times the normal rate.
After some debugging, we finally found that the PBWM lock is the main reason. When the primary writes heavily, the secondaries spend a lot of time acquiring the global intent lock. In our case all data is cached in memory, primary writes/updates peak at 10,000-20,000 per second, and each secondary mostly spends 50-200 ms applying an oplog batch (we know the batch-apply time because we added logging around it). The time to apply a batch is very unstable: sometimes less than 10 ms, sometimes greater than 500 ms.
The unstable performance makes the latency of reading from a secondary unpredictable; even a simple key-value read can take a very long time, which is intolerable for a database. We think the most important property of a database is stability, then performance, and only then the variety of features.
We hope the official MongoDB engineers will focus on this problem and lay out a roadmap for solving it.
As a member of the community, I have some ideas to share about how to solve it:
1) The primary must not read from a snapshot, because doing so could break read/write consistency.
2) Secondaries take a snapshot before or after each BatchApplyOplogs (I have not tested WiredTiger's snapshot performance, but I know that every read and write in MongoRocks uses a snapshot; if frequent snapshots are a problem, the tradeoff between snapshot frequency and read latency should be considered).
3) Reads can be classified into two kinds: reads issued by the mongod process itself and reads issued by clients (including other mongod/mongos nodes).
4) Add a flag to the RecoveryUnit to read from a snapshot, similar to what was done to implement readMajority; set the flag at the start of every client read so that it reads from the latest snapshot.
5) The internal reads inside the mongod process that I know of are:
5.1) fetching the latest oplog entry in the cluster heartbeat for Raft-style replication; this should also read from the latest snapshot, otherwise an oplog entry produced by a partially applied batch may be observed.
5.2) anywhere else?
6) I've sketched the main framework of reading from a snapshot on secondaries, which can be found here: snapshot_read. It mainly implements the points listed above (a toy illustration of ideas 2, 4, and 5.1 follows at the end of this comment).

There must be some points that I've missed and that need further discussion; I hope the official MongoDB engineers can participate. By the way, I ran into some questions while sketching out the framework, listed at the link below. Would someone be so kind as to answer them?
Thanks: https://groups.google.com/forum/#!topic/mongodb-dev/uVd6or43TO8
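As a toy illustration of ideas 2, 4, and 5.1 (pure Python, all names hypothetical; this is not MongoDB's actual RecoveryUnit API): the secondary publishes a snapshot at the batch boundary, and a per-operation flag decides whether a read sees that snapshot or the live, partially applied data.

{code:python}
from dataclasses import dataclass, field

@dataclass
class ToyRecoveryUnit:
    # Hypothetical analogue of the per-operation flag from idea (4).
    read_from_stable_snapshot: bool = True

@dataclass
class ToySecondary:
    live: dict = field(default_factory=dict)     # view the oplog applier mutates
    stable: dict = field(default_factory=dict)   # snapshot taken at the last batch boundary

    def begin_batch(self):
        # Idea (2): take the snapshot right before applying a batch.
        self.stable = dict(self.live)

    def read(self, key, ru):
        # Client reads and heartbeat reads (idea 5.1) keep the flag set and see the
        # stable snapshot; only the applier itself reads the live view.
        view = self.stable if ru.read_from_stable_snapshot else self.live
        return view.get(key)

node = ToySecondary()
node.live["x"] = 1
node.begin_batch()
node.live["x"] = 2                                 # write from a half-applied batch
print(node.read("x", ToyRecoveryUnit()))           # -> 1, the in-flight batch is hidden
print(node.read("x", ToyRecoveryUnit(False)))      # -> 2, the applier's own live view
{code}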
