[SERVER-41291] Oldest timestamp not always advanced with --enableMajorityReadConcern=false, on secondary nodes Created: 23/May/19  Updated: 29/Oct/23  Resolved: 24/May/19

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 4.0.5, 4.0.9
Fix Version/s: 4.0.10

Type: Bug Priority: Critical - P2
Reporter: Bruce Lucas (Inactive) Assignee: Eric Milkie
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File oldest.png    
Issue Links:
Related
is related to SERVER-33846 Alternative for setting oplog read ti... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution Team 2019-06-03

 Description   

Run the following against a replica set started with --enableMajorityReadConcern=false. It generates bursts of updates, with the bursts spaced 15 seconds apart.

function update() {
    // Seed the collection with 1000 documents.
    for (var i = 0; i < 1000; i++)
        db.c.insert({x: 0})
    // Issue bursts of updateMany calls, pausing 15 seconds between bursts.
    for (var x = 0; ; x++) {
        db.c.updateMany({}, {$set: {x: x}})
        if (x % 20 == 1)
            sleep(15000)
    }
}
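
For reference, a minimal way to drive the reproducer (the connection string and replica set name below are placeholders, not from the original report): start the replica set members with --enableMajorityReadConcern=false, connect a mongo shell to the primary, paste the function above, and run it.

// Hypothetical invocation; adjust host names and the replica set name.
//   mongo "mongodb://host1:27017,host2:27017/test?replicaSet=rs0"
// Then, in the shell:
update()   // runs until interrupted; watch FTDC on a secondary while it runs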

FTDC shows the following (see the attached oldest.png):

During the bursts the secondary is applying a couple hundred oplog batches per second, but the oldest timestamp is not advanced with every batch; in fact it may not advance for minutes at a time. This can cause a large amount of data to be pinned in the WiredTiger cache.
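
One way to watch the pinned data accumulate while the workload runs (a sketch only; the statistic names are taken from the WiredTiger cache section of serverStatus and may differ slightly between versions) is to poll the secondary's cache statistics from a mongo shell:

// Poll dirty-cache statistics on a secondary every 5 seconds.
// "tracked dirty bytes in the cache" grows while old content stays pinned.
while (true) {
    var cache = db.serverStatus().wiredTiger.cache;
    print(new Date().toISOString(),
          "dirty:", cache["tracked dirty bytes in the cache"],
          "total:", cache["bytes currently in the cache"]);
    sleep(5000);
}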



 Comments   
Comment by Githook User [ 24/May/19 ]

Author: Eric Milkie <milkie@10gen.com> (milkie)

Message: SERVER-41291 ensure advancing of oldest timestamp in enableMajorityReadConcern=false mode on secondary nodes
Branch: v4.0
https://github.com/mongodb/mongo/commit/c389e7f69f637f7a1ac3cc9fae843b635f20b766

Comment by Eric Milkie [ 24/May/19 ]

SERVER-37897 effectively changed the code here so that the bug is no longer present in 4.1.

Comment by Eric Milkie [ 24/May/19 ]

The problem appears to be a race between the setting of the oplog read timestamp in oplogDiskLocRegister and the setting done by the WTOplogJournalThread (the oplog visibility thread).
The code in that thread's loop assumes it is the only thing moving the oplog read timestamp forward.
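
For intuition only (a generic illustration of the lost-update pattern, not the actual server code; variable names are made up): if two writers each read the current value, compute a new timestamp, and then publish it, an advance can be silently overwritten when the steps interleave.

// Hypothetical sketch of a lost-update race between two writers.
var oplogReadTimestamp = 100;

// Writer A (think oplogDiskLocRegister) reads the current value...
var aRead = oplogReadTimestamp;
// ...then writer B (think the oplog visibility thread) reads and publishes an advance.
var bRead = oplogReadTimestamp;
oplogReadTimestamp = Math.max(bRead, 150);   // now 150
// Writer A finally publishes its result, computed from the stale read.
oplogReadTimestamp = Math.max(aRead, 120);   // back to 120; the advance to 150 is lost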

Comment by Bruce Lucas (Inactive) [ 23/May/19 ]

By the way, a note on running the repro: the chart above is from a 2-node replica set. The behavior also reproduces on a 3-node replica set, but it appears to be more complicated there: one secondary may exhibit the behavior more than the other, and the two did not always exhibit it at the same time. So from an analysis standpoint it is probably simpler to start with a 2-node replica set.
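
For completeness, a 2-node test replica set can be initiated along these lines (hosts, ports, and set name are placeholders; both mongod processes are assumed to have been started with --enableMajorityReadConcern=false):

// Hypothetical 2-node configuration for local testing; adjust hosts/ports.
rs.initiate({
    _id: "rs0",
    members: [
        {_id: 0, host: "localhost:27017"},
        {_id: 1, host: "localhost:27018"}
    ]
})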

Comment by Bruce Lucas (Inactive) [ 23/May/19 ]

This does not seem to reproduce on 4.1.11 or 3.6.12 (with this particular reproducer).
