[SERVER-37416] OplogReplay query might need to scan extra oplog due to oplog visibility rules Created: 01/Oct/18  Updated: 27/Oct/23  Resolved: 08/Nov/18

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 3.6.8, 4.0.2, 4.1.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tess Avitabile (Inactive) Assignee: Backlog - Storage Execution Team
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Storage Execution
Operating System: ALL
Steps To Reproduce:

When the collection has not been read from, we use a collection scan:

(function() {
    load("jstests/libs/analyze_plan.js");
 
    let t = db.getSiblingDB("local").oplog.jstests_query_oplogreplay;
    assert.commandWorked(t.getDB().createCollection(t.getName(), {capped: true, size: 16 * 1024}));
 
    for (let i = 1; i <= 10; i++) {
        assert.writeOK(t.insert({_id: i, ts: Timestamp(1000, i)}));
    }
 
    // With no prior read from the collection, a full scan examines all 10 documents.
    let res =
        assert.commandWorked(t.find({ts: {$gte: Timestamp(1000, 10), $lte: Timestamp(1000, 10)}})
                                 .addOption(DBQuery.Option.oplogReplay)
                                 .explain("executionStats"));
    assert.eq(res.executionStats.totalDocsExamined, 10, tojson(res));
}());

When the collection has been read from, we use the oplog start hack and only examine 1 document:

(function() {
    load("jstests/libs/analyze_plan.js");
 
    let t = db.getSiblingDB("local").oplog.jstests_query_oplogreplay;
    assert.commandWorked(t.getDB().createCollection(t.getName(), {capped: true, size: 16 * 1024}));
 
    for (let i = 1; i <= 10; i++) {
        assert.writeOK(t.insert({_id: i, ts: Timestamp(1000, i)}));
    }
    // Read from the collection once before issuing the oplogReplay query.
    t.find().itcount();
    let res =
        assert.commandWorked(t.find({ts: {$gte: Timestamp(1000, 10), $lte: Timestamp(1000, 10)}})
                                 .addOption(DBQuery.Option.oplogReplay)
                                 .explain("executionStats"));
    assert.eq(res.executionStats.totalDocsExamined, 1, tojson(res));
}());

Participants:

Description

An OplogReplay query may perform a full collection scan if the collection has not been read from, but uses the oplog start hack if the collection has been read from. This issue was introduced in 3.6 (it does not exist in 3.4). However, it does not appear to have been introduced by SERVER-29843, since the issue does not reproduce at that point in the history.



Comments
Comment by Eric Milkie [ 12/Nov/18 ]

After a bit of investigation, and with the help of Tess and David, I discovered some new insights:
1. This collection-scan behavior can only happen when the oplog is empty; if there are any durable entries in the oplog, the OplogStartHack will always return a valid starting point and will not trigger a collection scan. In particular, the test in this ticket's Description starts out with an empty oplog.
2. In light of this, the query for the required optime from the sync source resolver that Tess pointed out above is correctly coded, as we will never present such a query to nodes with empty oplogs. There is no issue with the query, and it will never hit a collection scan on the oplog (unless the optime presented is truly not in the oplog and is earlier than any existing entry; this is normal behavior as well).

Comment by Tess Avitabile (Inactive) [ 08/Nov/18 ]

milkie, what is the issue with the query in the example? We construct a query like that here.

Comment by Eric Milkie [ 08/Nov/18 ]

David's analysis is correct. In particular, I don't think any external user could craft the queries that show this strange performance, and any internal caller that crafts such a query is doing something wrong, as I don't see any use case for it.

Comment by David Storch [ 07/Nov/18 ]

I thought about this a bit more, and I'm reopening the ticket. Although there's no correctness issue here, it seems possible that this could present a performance problem. Specifically, it seems like this behavior could cause us to have to scan more oplog than necessary for oplogReplay queries. I'm guessing this doesn't matter much in practice, since I would expect our oplog visibility rules to only hide a small handful of recent writes. For example, suppose you have 1000 entries in the oplog, the last 10 of which are hidden. Furthermore, let's assume that the $gte predicate on the ts field in the oplogReplay query should give us the 5 most recent oplog entries. In theory we could start scanning from entry 995. However, this behavior may cause us to unnecessarily scan from 990 instead. Sending to the NYC Storage Team for triage.
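
To make the scenario concrete, here is a minimal shell sketch of the same effect, scaled down (the collection name, batch sizes, and sleep duration below are illustrative assumptions, not from the ticket): the first batch of writes is given time to become visible, and the query runs immediately after a second batch, so the scan is expected to start at the newest visible entry rather than at the requested timestamp.

(function() {
    // Hypothetical collection name; capped and oplog-like, as in the reproduction above.
    let t = db.getSiblingDB("local").oplog.jstests_query_oplogreplay_visibility;
    assert.commandWorked(t.getDB().createCollection(t.getName(), {capped: true, size: 16 * 1024}));

    // First batch: give these entries time to become durable and visible.
    for (let i = 1; i <= 5; i++) {
        assert.writeOK(t.insert({_id: i, ts: Timestamp(1000, i)}));
    }
    sleep(1000);

    // Second batch: query immediately afterwards, while these entries may
    // still be hidden by the oplog visibility rules.
    for (let i = 6; i <= 10; i++) {
        assert.writeOK(t.insert({_id: i, ts: Timestamp(1000, i)}));
    }

    let res =
        assert.commandWorked(t.find({ts: {$gte: Timestamp(1000, 10), $lte: Timestamp(1000, 10)}})
                                 .addOption(DBQuery.Option.oplogReplay)
                                 .explain("executionStats"));
    // Expected (not asserted, since visibility timing is racy): the scan starts
    // at the newest visible entry (ts 1000:5) rather than at 1000:10, so
    // totalDocsExamined is around 6 rather than 1.
    print("totalDocsExamined: " + res.executionStats.totalDocsExamined);
}());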

Comment by Tess Avitabile (Inactive) [ 07/Nov/18 ]

Wow, that makes sense!

Comment by David Storch [ 07/Nov/18 ]

A few observations. First, this appears to have nothing to do with the intervening query. If I replace t.find().itcount() with sleep(100), this behavior still reproduces.

Second, the reported behavior is due to the storage subsystem, not due to the query layer. For both queries, the code in get_executor.cpp correctly attempts to use the "oplog start hack" in order to avoid scanning the whole capped collection. For the first query, however, our call into oplogStartHack() is returning the smallest possible RecordId, which instructs the query layer to scan from the beginning of the collection. Specifically, we're hitting the following branch in WiredTigerRecordStore::oplogStartHack():

https://github.com/mongodb/mongo/blob/675c01a6f151c892fe683fc6ff05496b3bed10bf/src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp#L1636

I'm pretty sure this happens due to oplog visibility rules. The oplogStartHack() doesn't see the writes to the oplog, so when attempting to find the first key in the oplog in the range [Timestamp(1000, 10), Timestamp(1000, 10)] it finds nothing. As a result, it returns a null record id. The CollectionScan stage is then configured to wait for oplog visibility and scan from the beginning of the collection. If you wait a little between issuing the writes and issuing the read, the writes become visible and the oplogReplay query is optimized in the way you would expect.
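
For reference, a minimal sketch of that experiment, reusing the setup from the Steps To Reproduce but with sleep(100) in place of the intervening read (the exact wait needed is an assumption; a slow machine may need longer):

(function() {
    let t = db.getSiblingDB("local").oplog.jstests_query_oplogreplay;
    assert.commandWorked(t.getDB().createCollection(t.getName(), {capped: true, size: 16 * 1024}));

    for (let i = 1; i <= 10; i++) {
        assert.writeOK(t.insert({_id: i, ts: Timestamp(1000, i)}));
    }

    // Wait briefly so the writes become visible, instead of reading from the collection.
    sleep(100);

    let res =
        assert.commandWorked(t.find({ts: {$gte: Timestamp(1000, 10), $lte: Timestamp(1000, 10)}})
                                 .addOption(DBQuery.Option.oplogReplay)
                                 .explain("executionStats"));
    // With the writes visible, oplogStartHack() returns a valid starting record
    // and only one document should be examined.
    assert.eq(res.executionStats.totalDocsExamined, 1, tojson(res));
}());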

tess.avitabile I'm closing this ticket as "Works as Designed" since I don't think this is necessarily a bug. milkie, if you have a moment, it would be helpful if you could confirm the above (and perhaps help explain how our oplog visibility rules work!).

Comment by Charlie Swanson [ 26/Oct/18 ]

david.storch to investigate in the coming sprint.
