[SERVER-37416] OplogReplay query might need to scan extra oplog due to oplog visibility rules Created: 01/Oct/18 Updated: 27/Oct/23 Resolved: 08/Nov/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 3.6.8, 4.0.2, 4.1.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Tess Avitabile (Inactive) | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Storage Execution |
| Operating System: | ALL |
| Steps To Reproduce: | When the collection has not been read from, we use a collection scan. When the collection has been read from, we use the oplog start hack and only examine 1 document. (A hedged sketch of this repro shape follows below.) |
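A minimal sketch of the two scenarios above, assuming a replica set; the collection name `t`, the inserted document, and the explain calls are illustrative assumptions, not taken from the ticket:

```javascript
// Hedged sketch; names and values are illustrative, not from the ticket.
var t = db.getSiblingDB("test").t;
t.insert({x: 1}); // produce a fresh oplog entry
var lastTs = db.getSiblingDB("local").oplog.rs
                 .find().sort({$natural: -1}).limit(1).next().ts;

// Scenario 1: query the oplog immediately, before the collection has been
// read from. totalDocsExamined can cover the whole oplog (collection scan).
printjson(db.getSiblingDB("local").runCommand({
    explain: {find: "oplog.rs", filter: {ts: {$gte: lastTs}}, oplogReplay: true},
    verbosity: "executionStats"
}));

// Scenario 2: after an intervening read, the oplog start hack applies and
// totalDocsExamined drops to 1.
t.find().itcount();
printjson(db.getSiblingDB("local").runCommand({
    explain: {find: "oplog.rs", filter: {ts: {$gte: lastTs}}, oplogReplay: true},
    verbosity: "executionStats"
}));
```

Comparing totalDocsExamined across the two explain outputs distinguishes the full scan from the bounded scan.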
| Description |
An OplogReplay query may perform a full collection scan if the collection has not been read from, but utilize the oplog start hack if the collection has been read from. This issue was introduced in 3.6 (it does not exist in 3.4). However, it does not appear to have been introduced by
| Comments |
| Comment by Eric Milkie [ 12/Nov/18 ] |
After a bit of investigation, and with the help of Tess and David, I discovered some new insights:
| Comment by Tess Avitabile (Inactive) [ 08/Nov/18 ] |
milkie, what is the issue with the query in the example? We construct a query like that here.
| Comment by Eric Milkie [ 08/Nov/18 ] |
David's analysis is correct. In particular, I don't think queries with this strange performance profile could be crafted by any external users, and internal users who craft such queries are doing something wrong, as I don't see any use case for them.
| Comment by David Storch [ 07/Nov/18 ] |
I thought about this a bit more, and I'm reopening the ticket. Although there's no correctness issue here, it seems possible that this could present a performance problem: this behavior could force us to scan more oplog than necessary for oplogReplay queries. I'm guessing this doesn't matter much in practice, since I would expect our oplog visibility rules to only hide a small handful of recent writes.

For example, suppose you have 1000 entries in the oplog, the last 10 of which are hidden, and assume the $gte predicate on the ts field in the oplogReplay query should give us the 5 most recent oplog entries. In theory we could start scanning from entry 995; this behavior, however, may cause us to unnecessarily scan from 990 instead. Sending to the NYC Storage Team for triage.
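To make the arithmetic concrete, here is a toy calculation of the example above (the numbers come from the comment; the variable names are inventions for this sketch):

```javascript
// Toy model of the example above; runs in the mongo shell or any JS runtime.
var totalEntries  = 1000; // entries currently in the oplog
var hiddenEntries = 10;   // newest entries hidden by oplog visibility rules
var matched       = 5;    // entries the $gte bound should return (996..1000)

// Ideal start point: the newest entry below the bound, i.e. entry 995.
var idealStart = totalEntries - matched;
// Actual start point: the newest *visible* entry, i.e. entry 990.
var actualStart = totalEntries - hiddenEntries;

print("extra oplog entries scanned: " + (idealStart - actualStart)); // 5
```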
| Comment by Tess Avitabile (Inactive) [ 07/Nov/18 ] |
Wow, that makes sense!
| Comment by David Storch [ 07/Nov/18 ] |
A few observations.

First, this appears to have nothing to do with the intervening query. If I replace t.find().itcount() with sleep(100), this behavior still reproduces.

Second, the reported behavior is due to the storage subsystem, not the query layer. For both queries, the code in get_executor.cpp correctly attempts to use the "oplog start hack" in order to avoid scanning the whole capped collection. For the first query, however, our call into oplogStartHack() returns the smallest possible RecordId, which instructs the query layer to scan from the beginning of the collection. Specifically, we're hitting the following branch in WiredTigerRecordStore::oplogStartHack():

I'm pretty sure this happens due to oplog visibility rules. The oplogStartHack() doesn't see the writes to the oplog, so when attempting to find the first key in the oplog in the range [Timestamp(1000, 10), Timestamp(1000, 10)] it finds nothing. As a result, it returns a null record id. The CollectionScan stage is then configured to wait for oplog visibility and scan from the beginning of the collection. If you wait a little between issuing the writes and issuing the read, the writes become visible and the oplogReplay query is optimized in the way you would expect.

tess.avitabile, I'm closing this ticket as "Works as Designed" since I don't think this is necessarily a bug. milkie, if you have a moment, it would be helpful if you could confirm the above (and perhaps help explain how our oplog visibility rules work!).
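A hedged sketch of the "wait a little" observation at the end of this comment, assuming a replica set; the collection name, the write, and the wait duration are illustrative assumptions:

```javascript
// Hedged sketch; names and values are illustrative, not from the ticket.
var t = db.getSiblingDB("test").t;
t.insert({x: 1}); // this write's oplog entry may not be visible yet
var lastTs = db.getSiblingDB("local").oplog.rs
                 .find().sort({$natural: -1}).limit(1).next().ts;

sleep(1000); // give the oplog visibility point time to advance past the write

// With the entry visible, oplogStartHack() can find a start record and the
// scan is bounded, instead of falling back to the beginning of the oplog.
printjson(db.getSiblingDB("local").runCommand({
    explain: {find: "oplog.rs", filter: {ts: {$gte: lastTs}}, oplogReplay: true},
    verbosity: "executionStats"
}));
```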
| Comment by Charlie Swanson [ 26/Oct/18 ] |
david.storch to investigate in the coming sprint.