[SERVER-41320] Oplog getmores can return 0 entries Created: 25/May/19  Updated: 29/Oct/23  Resolved: 21/Sep/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.4.0

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Backlog - Replication Team
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File mongod-4.1.13-w2-plot.png     PNG File mongod-4.1.13-w2-replsetupdatepos-plot.png     PNG File mongod-master-fix-w2-ftdc.png     PNG File mongod-master-fix-w2-plot.png    
Issue Links:
Depends
Related
Assigned Teams:
Replication
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Oplog getmores can return with 0 entries if there is nothing new in the oplog, rather than waiting for more entries to appear.

Under some circumstances this can increase the latency of w:majority or w:2 operations, because the operation must wait for the round trip of the getmore that returned 0 entries, as well as the round trip for its own oplog entry, before receiving acknowledgement that the operation has been committed on the secondaries. This impacts, for example, single-threaded clients issuing repeated writes in a loop as fast as possible.



 Comments   
Comment by Bruce Lucas (Inactive) [ 21/Sep/20 ]

A quick test on a 3-node replica set doing 5 inserts per second shows the primary on 4.2.8 receiving about 20 getmores per second, whereas on 4.4.0 it is only receiving about 10 per second, so I think this ticket can be closed.

Comment by Steven Vannelli [ 24/Aug/20 ]

bruce.lucas we believe that this is resolved since we completed the Exhaust Cursors for Oplog Fetching project. Would it be easy for you to test to see if that is the case?

Comment by Asya Kamsky [ 20/Aug/20 ]

Flagging for scheduling in light of it being unblocked.

Comment by Asya Kamsky [ 20/Aug/20 ]

Now that the exhaust project is complete, this should be unblocked, right?

Comment by Bernard Gorman [ 13/Jun/19 ]

judah.schvimer: no, I'm not proposing this as the definitive fix - I removed the block simply to confirm that this was the source of the double-getMore behaviour.

schwerin: I think you're right that replSetUpdatePosition is responsible for the premature getMore returns; on an otherwise idle replica set, a single write results in the next getMore returning 1 result to the Secondary, not an empty batch. Below is the original double-getMore graph, overlaid with replSetUpdatePosition events as vertical lines. Each of the empty getMores coincides with the arrival of a new replSetUpdatePosition command. The problem appears to be that the Secondary is issuing a getMore with a lastKnownCommittedOpTime of T, followed almost immediately by a replSetUpdatePosition of T+N that causes its own getMore to be cut short.

Comment by Andy Schwerin [ 13/Jun/19 ]

judah.schvimer, I don't think Bernard is proposing to get rid of the block. Rather, his experiment demonstrates that this block is the cause of the behavior. I don't think we want to depend on `replSetHeartbeat` to propagate commit point, or really do anything other than estimate ping times to hosts for purposes of making the spanning tree.

An interesting question is whether his analysis is wholly correct. If we have an entirely idle replica set and we do a single write, will the first getMore on the oplog after that write return an empty batch or not? If not, then perhaps a `replSetUpdatePosition` command reliably comes in advancing the commit point in this workload.

If we switch to using exhaust cursors for transmitting oplog entries to secondaries, we might cut out half a network roundtrip, and maybe an entire one, without changing this block of code at all.

Comment by Judah Schvimer [ 13/Jun/19 ]

bernard.gorman, are you suggesting we actually remove the highlighted code? Basically remove the entire shouldWaitForInserts() function and just call _shouldListenForInserts()?

My concern with this change is whether it means we propagate the commit point more slowly. We now also propagate it via heartbeats, so I don't think this will have a significant impact; however, we only send heartbeats every 2 seconds.

Quoting bernard.gorman's analysis: "But it appears that the write we have just performed locally has caused the Primary's lastKnownCommittedOpTime to move forward before the corresponding entry is visible in the oplog."

I'm not sure how this is possible if both nodes are voting nodes (since the insert would have to be replicated to the secondary), and if they're not and there is only one voting node then this seems like an unusual edge case. bernard.gorman, can you please clarify what's happening here?

siyuan.zhou, what are your thoughts on this proposed change?

Comment by Bernard Gorman [ 13/Jun/19 ]

First of all, it's important to note that in the general case, oplog getMores do wait for a maximum of 5 seconds before returning. This wait time does not show up in the logs, because in SERVER-29304 we deliberately changed our logging so that time spent idling for results is not included in the total execution time of the getMore. So a getMore which waits for its entire 5-second duration without observing any new oplog entries will report its duration as 0ms, as will a getMore which arrives while there is a small number of entries in the oplog (since an awaitData cursor stops waiting as soon as it has seen at least one result).

This isn't ideal for diagnostics, since it means that you have to look at the timestamps between getMores, as well as their respective nreturned values, in order to figure out what's going on. Note that the total getMore duration, including await time, is recorded in the FTDC data.

Below is a graph of w:2 inserts and oplog getMores on a 2-node replica set running 4.1.13, where this behaviour is still occurring. Dark blue circles are inserts, light blue circles are getMores which returned early with 0 results, pink circles are getMores which returned as expected with 1 result:

As bruce.lucas pointed out in HELP-9958, each insert causes an "extra" getMore which returns without picking up the insert operation from the oplog. Each of these is then followed by another getMore which does pick up the insert, allowing the w:2 write to return successfully to the client.

The cause of this double-getMore behaviour is this block of code, which was introduced in SERVER-29128 and modified slightly by SERVER-35239:

plan_executor_impl.cpp

405 bool PlanExecutorImpl::_shouldWaitForInserts() {
406     // If this is an awaitData-respecting operation and we have time left and we're not interrupted,
407     // we should wait for inserts.
408     if (_shouldListenForInserts()) {
409         // We expect awaitData cursors to be yielding.
410         invariant(_yieldPolicy->canReleaseLocksDuringExecution());
411
412         // For operations with a last committed opTime, we should not wait if the replication
413         // coordinator's lastCommittedOpTime has progressed past the client's lastCommittedOpTime.
414         // In that case, we will return early so that we can inform the client of the new
415         // lastCommittedOpTime immediately.
416         if (!clientsLastKnownCommittedOpTime(_opCtx).isNull()) {
417             auto replCoord = repl::ReplicationCoordinator::get(_opCtx);
418             return clientsLastKnownCommittedOpTime(_opCtx) >= replCoord->getLastCommittedOpTime();
419         }
420         return true;
421     }
422     return false;
423 }

This is called from PlanExecutorImpl::_getNextImpl in the case where we hit EOF, i.e. there are no oplog entries available. Each getMore includes a lastKnownCommittedOpTime field, so we enter the if statement on line 416. But it appears that the write we have just performed locally has caused the Primary's lastKnownCommittedOpTime to move forward before the corresponding entry is visible in the oplog. We therefore always return false on line 418, causing the getMore to stop waiting and return EOF to the Secondary immediately.

Removing this block of code has the following effect on the w:2 test. Notice that there are no longer any light-blue nreturned:0 points on the graph, only pink nreturned:1 points.

From the FTDC data, we can also see that there is a 1:1 ratio of getMores to inserts, rather than the 2:1 ratio which bruce.lucas observed in HELP-9958. Every getMore now waits until the insert's oplog entry is visible, and then immediately returns it to the Secondary without any extraneous round-trips.

Passing this over to the Replication team for triage. cc tess.avitabile judah.schvimer

Comment by Charlie Swanson [ 30/May/19 ]

bernard.gorman your next sprint seems pretty free so I'm giving you this to take a look. It looks like anton.korshunov might also be pretty free after finishing up $merge tickets.

Generated at Thu Feb 08 04:57:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.