[SERVER-32606] Tailing oplog on secondary fails with CappedPositionLost Created: 09/Jan/18  Updated: 30/Oct/23  Resolved: 18/Jan/18

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 3.6.0, 3.6.1
Fix Version/s: 3.6.3, 3.7.2

Type: Bug Priority: Critical - P2
Reporter: Bruce Lucas (Inactive) Assignee: Geert Bosch
Resolution: Fixed Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: repro.js, server32606_debug.patch
Issue Links:
Backports
Duplicate
is duplicated by SERVER-32493 PlanExecutor killed: CappedPositionLo... Closed
Related
related to SERVER-33042 Add test coverage for tailing oplog o... Closed
is related to SERVER-32883 Enhanced FSM testing for reading from... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v3.6
Sprint: Storage 2018-01-29
Participants:

 Description   

function repro() {
 
    uri="mongodb://localhost:27017/test?replicaSet=rs"
 
    # tail the oplog on the secondary
    mongo $uri --eval '
        var ol = db.getSiblingDB("local").oplog.rs
        var c = ol.find({}).readPref("secondaryPreferred").tailable({isAwaitData: true})
        var i = 0
        while (true) {
            c.forEach(function (o) {
                print(i++, o.wall)
            })
        }
    ' &
 
    sleep 5
 
    # generate some traffic
    mongo $uri --eval '
        while (true)
            db.c.insert({})
    '
}

Fails reliably with

2018-01-09T15:56:56.703-0500 E QUERY    [thread1] Error: getMore command failed: {
	"ok" : 0,
	"errmsg" : "PlanExecutor killed: CappedPositionLost: CollectionScan died due to failure to restore tailable cursor position. Last seen record id: RecordId(6509157867780571265)",
	"code" : 175,
	"codeName" : "QueryPlanKilled",
	"operationTime" : Timestamp(1515531416, 129),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1515531416, 131),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	}
} :

Does not fail on the primary.



 Comments   
Comment by Bruce Lucas (Inactive) [ 26/Jan/18 ]

My recollection is that the repro script that I provided failed reliably and quickly. I don't know if this is due to a difference in environment or a difference in the repro. One difference I do see in the repros is that mine started tailing the oplog before starting to generate traffic; maybe this makes it fail faster?

Comment by Geert Bosch [ 25/Jan/18 ]

The repro script is neither suitable nor easily adaptable for inclusion in Evergreen. The time until failure is quite variable, and reproducing the failure requires an amount of write traffic that would be excessive for automated testing. There are also often spurious failures due to oplog rollover. Finally, because the failure is non-deterministic, the test would be likely to silently stop exercising its actual objective.

Given the amount of work needed to write a suitable test for this, I opened SERVER-32883 for a more holistic approach.

Comment by Spencer Brody (Inactive) [ 24/Jan/18 ]

geert.bosch, this ticket has a JavaScript repro attached to it, yet the fix went in with no test coverage. I think this ticket should be re-opened until a test is added.

Comment by Githook User [ 18/Jan/18 ]

Author:

{'name': 'Geert Bosch', 'email': 'geert@mongodb.com', 'username': 'GeertBosch'}

Message: SERVER-32606 Oplog read time may only decrease during rollback

(cherry picked from commit 3020767b4a20fca53d4b58b440085933c8eb7c3e)
Branch: v3.6
https://github.com/mongodb/mongo/commit/32f9d722024d2b4821895482b1ebe48b1d6d71bc

Comment by Githook User [ 18/Jan/18 ]

Author:

{'name': 'Geert Bosch', 'email': 'geert@mongodb.com', 'username': 'GeertBosch'}

Message: SERVER-32606 Oplog read time may only decrease during rollback
Branch: master
https://github.com/mongodb/mongo/commit/3020767b4a20fca53d4b58b440085933c8eb7c3e

Comment by Eric Milkie [ 17/Jan/18 ]

Indeed, read concern majority will prevent this from happening because it uses a different method to choose the read timestamp for each transaction.
Chained replication is affected by this issue, and may result in slower replication to chained nodes.

Comment by Charlie Swanson [ 17/Jan/18 ]

bruce.lucas The reproduction script(s) will not reproduce the issue when the tailable cursor is replaced with a change stream. pasette points out that this is likely because a change stream uses read concern "majority", and so may not be subject to this problem. I'm also trying to reproduce with 3 nodes in the replica set, but no luck so far. Can anyone confirm whether read concern "majority" would prevent this from happening? Or does it perhaps just make it less likely?

I'm not familiar enough with chained replication to answer the second part of your question, sorry!

Comment by Eric Milkie [ 17/Jan/18 ]

I believe Michael's quick fix is on the right track but there is a subtle problem with it that we still need to iron out. Geert and I are going to devise a solution.

Comment by Bruce Lucas (Inactive) [ 17/Jan/18 ]

Does this impact chained replication and change streams?

Comment by Davis Ford [ 17/Jan/18 ]

Thank you Mr. Cahill. I would love the quick fix pushed in the next minor release (hint).

I'll take a patch, too.

Comment by Michael Cahill (Inactive) [ 17/Jan/18 ]

I think I see what's going on here.

WiredTiger has a query that returns the "all_committed" timestamp. MongoDB queries that periodically and uses the last returned value for oplog reads. The all_committed timestamp is calculated as:

min(commit_timestamp(T) forall active transactions T,
    max(commit_timestamp(T) forall committed transactions T))

(Approximately, if there are no active transactions, it gives the largest commit timestamp seen. If there are active transactions, it gives the smallest commit timestamp in flight).

On primaries, the commit timestamp of each transaction is set under a lock, and commit timestamps are not reused, so this calculation will always move forward. That's what we had in mind when the interface was designed.

On secondaries, that is not the case: commit timestamps are set both for the oplog writes and again, on separate transactions, when applying the oplog. Since WT doesn't distinguish between those two types of transactions in this calculation, the calculated "all_committed" timestamp can go backwards: the oplog is written in a batch, then the appliers start reusing timestamps from the beginning of that batch.
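
To make the mechanism concrete, here is a minimal standalone sketch (not WiredTiger's actual code; the function and variable names are illustrative) of the calculation as quoted above, showing how the value regresses once the appliers begin transactions at timestamps from the start of the batch:

#include <algorithm>
#include <cstdint>
#include <iostream>
#include <set>

// Simplified model of the "all_committed" calculation quoted above: the minimum
// over the commit timestamps of active transactions and the maximum commit
// timestamp seen so far; with no active transactions it is just the maximum.
uint64_t allCommitted(const std::multiset<uint64_t>& activeCommitTs, uint64_t maxCommittedTs) {
    if (activeCommitTs.empty())
        return maxCommittedTs;
    return std::min(*activeCommitTs.begin(), maxCommittedTs);
}

int main() {
    std::multiset<uint64_t> active;
    uint64_t maxCommitted = 0;

    // The secondary writes an oplog batch at timestamps 10..100 and commits it.
    maxCommitted = 100;
    std::cout << "after oplog batch write: " << allCommitted(active, maxCommitted) << "\n";  // 100

    // The appliers then start separate transactions that reuse timestamps from
    // the beginning of the batch while those transactions are still active.
    active.insert(10);
    active.insert(11);
    std::cout << "while applying the batch: " << allCommitted(active, maxCommitted) << "\n";  // 10 - went backwards
}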

A quick fix is to have WiredTigerOplogManager::_setOplogReadTimestamp only update _oplogReadTimestamp if the new all_committed timestamp is larger than the last one. A more robust fix would be to have secondaries set their _oplogReadTimestamp explicitly as each batch of oplog writes completes, and to not rely on WT's all_committed calculation on secondaries.

Quick fix, for the record:

 void WiredTigerOplogManager::_setOplogReadTimestamp(WithLock, uint64_t newTimestamp) {
+    if (newTimestamp < _oplogReadTimestamp.load()) {
+        LOG(0) << "avoiding moving oplog read timestamp backwards from: " << _oplogReadTimestamp.load()
+               << " to: " << newTimestamp;
+        return;
+    }
+
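
As a self-contained illustration of the same idea (this is not the actual WiredTigerOplogManager code, just a hypothetical advance-only timestamp), an update that would move the value backwards is simply ignored:

#include <atomic>
#include <cstdint>
#include <iostream>

// Hypothetical "advance-only" visibility timestamp: updates that would move the
// value backwards are ignored, mirroring the guard in the quick fix above.
class MonotonicReadTimestamp {
public:
    // Returns true if the timestamp advanced, false if the update was ignored.
    bool advanceTo(uint64_t newTimestamp) {
        uint64_t current = _value.load();
        while (newTimestamp > current) {
            if (_value.compare_exchange_weak(current, newTimestamp))
                return true;
            // On failure, 'current' is refreshed and the condition is re-checked.
        }
        return false;
    }

    uint64_t get() const {
        return _value.load();
    }

private:
    std::atomic<uint64_t> _value{0};
};

int main() {
    MonotonicReadTimestamp readTs;
    readTs.advanceTo(100);                // visibility reaches 100
    readTs.advanceTo(95);                 // a regressed all_committed sample is ignored
    std::cout << readTs.get() << "\n";    // prints 100
}

The actual quick fix above runs with a lock held, so a plain comparison suffices; the compare-and-swap loop here just makes the sketch usable without one.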

Comment by Daniel Gottlieb (Inactive) [ 17/Jan/18 ]

Now that I've normalized the timestamps, here's another run:

d20001| 2018-01-16T21:59:09.784-0500 I STORAGE  [conn1] Beginning oplog transaction. Size: 31 ReadSize: 32 TS: Timestamp(1516157949, 1756)
d20001| 2018-01-16T21:59:09.784-0500 I REPL     [rsSync] Start batch ts: Timestamp(1516157949, 1774)
d20001| 2018-01-16T21:59:09.785-0500 I REPL     [rsSync] Finish batch ts: Timestamp(1516157949, 1781)
d20001| 2018-01-16T21:59:09.785-0500 I REPL     [rsSync] Successfully finished. TS: { ts: Timestamp(1516157949, 1781), t: 1 }
d20001| 2018-01-16T21:59:09.785-0500 I STORAGE  [WTOplogJournalThread] setting new oplogReadTimestamp: Timestamp(1516157949, 1751)
d20001| 2018-01-16T21:59:09.785-0500 I STORAGE  [conn1] Beginning oplog transaction. Size: 31 ReadSize: 32 TS: Timestamp(1516157949, 1751)
d20001| 2018-01-16T21:59:09.785-0500 I STORAGE  [conn1] Seek did not find a record. Id: Timestamp(1516157949, 1756)

I'm also attaching my server patch that adds some debugging. It's applied on top of 25f1ad63b3f24e56df51d811e7722a1cb2dd7433.

Comment by Daniel Gottlieb (Inactive) [ 17/Jan/18 ]

I'm seeing the following interleaving between oplog visibility and batch application:

  1. Batch application completes a batch, say from times 10 -> 100
  2. The oplog visibility is set to 100
  3. Oplog entry 100 is returned to the client
  4. Batch application starts on range 101 -> 200
  5. Oplog visibility is set back to 95. I can't yet explain this. This could be WiredTiger... or the primary could be feeding results to the secondary out of order (adding an invariant).
  6. Batch application completes
  7. The client comes back with a getMore. Its transaction is started at read timestamp 95.
  8. The seek fails: read timestamp 95 is not sufficient to seek to oplog entry 100.
  9. The log line here is printed on the same client connection. I'm not sure what this is. The "wait for" and "latest visible" values are 125 and 101.
  10. Visibility is bumped to 125, unblocking the reader
  11. A transaction is opened at 125. The client receives the error for not finding entry 100.
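
A toy model of the sequence above (purely illustrative; the 95/100 values come from the steps, not from a real oplog): a record written at timestamp 100 is not visible to a snapshot opened at read timestamp 95, so the cursor cannot restore its position:

#include <cstdint>
#include <iostream>
#include <map>

// Toy "oplog" keyed by timestamp; in the real oplog the RecordId is likewise
// derived from the entry's Timestamp.
using Oplog = std::map<uint64_t, const char*>;

// A snapshot opened at readTimestamp only sees records committed at or before it.
bool seekVisible(const Oplog& oplog, uint64_t readTimestamp, uint64_t recordId) {
    return oplog.count(recordId) && recordId <= readTimestamp;
}

int main() {
    Oplog oplog;
    for (uint64_t ts = 10; ts <= 100; ++ts)
        oplog[ts] = "entry";              // steps 1-3: batch 10..100 applied, entry 100 returned to the client

    uint64_t lastSeenRecordId = 100;      // the tailable cursor's saved position
    uint64_t readTimestamp = 95;          // steps 5 and 7: visibility regressed, getMore reads at 95

    if (!seekVisible(oplog, readTimestamp, lastSeenRecordId)) {
        // Step 8: the seek fails, which surfaces to the client as
        // "CappedPositionLost: ... failure to restore tailable cursor position".
        std::cout << "seek to record " << lastSeenRecordId
                  << " failed at read timestamp " << readTimestamp << "\n";
    }
}
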
Comment by Daniel Gottlieb (Inactive) [ 17/Jan/18 ]

Note, I had more luck reproducing with the following diff:

dgottlieb@chimichurri ~/xgen/mongo[v3.6]$ diff -u server32606_repro.js server32606_repro_mine.js 
--- server32606_repro.js	2018-01-16 20:44:40.499167637 -0500
+++ server32606_repro_mine.js	2018-01-16 20:44:37.011137184 -0500
@@ -4,6 +4,8 @@
     const rst = new ReplSetTest({name: "repro", nodes: 2});
     rst.startSet();
     rst.initiate();
+    rst.getPrimary().getDB("starter").c.insert({});
+    rst.awaitReplication();
 
     sleep(5);
     // generate some traffic

Comment by Daniel Gottlieb (Inactive) [ 17/Jan/18 ]

charlie.swanson's hunch is in the right ballpark. It seems the oplog read timestamp goes backwards right before the failure (the "Size" values can be ignored):

d20001| 2018-01-16T20:32:30.035-0500 I STORAGE  [conn1] Beginning oplog transaction. Size: 31 ReadSize: 32 TS: 6511826476990464049
d20001| 2018-01-16T20:32:30.035-0500 I STORAGE  [conn1] Beginning oplog transaction. Size: 31 ReadSize: 32 TS: 6511826476990464043
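
These raw TS values appear to be the read timestamps printed as a single 64-bit integer, with the seconds in the high 32 bits and the increment in the low 32 bits (the same packing the oplog uses for RecordIds). A quick way to convert them back to Timestamp(secs, inc) form, assuming that packing:

#include <cstdint>
#include <iostream>

// Split a packed 64-bit value into its Timestamp(seconds, increment) parts,
// assuming seconds in the high 32 bits and increment in the low 32 bits.
void printTimestamp(uint64_t packed) {
    uint32_t secs = static_cast<uint32_t>(packed >> 32);
    uint32_t inc = static_cast<uint32_t>(packed & 0xffffffffu);
    std::cout << packed << " -> Timestamp(" << secs << ", " << inc << ")\n";
}

int main() {
    // The two raw values from the log lines above; their seconds parts land
    // close to the 20:32:30 wall-clock time of those log lines.
    printTimestamp(6511826476990464049ULL);
    printTimestamp(6511826476990464043ULL);
}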

Not a full diagnosis, but for steady-state replication in pv1 this is definitely an invariant violation.

The read visibility going backwards might be a red herring. We only pin the oldest timestamp (and thus visibility) when read concern majority is disabled.

Because the oplog manager thread (which updates visibility) does not acquire the PB lock, it can query WT after a later write in a batch has completed but before some earlier one has started. I was thinking that if a batch hit an error, it could leave the visibility in an incorrectly advanced state while also having released the PB lock. I think that might still be an issue, but I don't believe it's what we're seeing here, since I'm not seeing any errors in oplog application.

I'm still suspicious the oplog read timestamp going backwards is correlated. When I look at all the changes to the read timestamp, it only goes backwards once, right around the time of failure. The later value that's getting temporarily "reverted" is typically (always?) the timestamp that's showing up in the "lost position" error message.

When I enable pinning the oldest timestamp for secondary oplog application, the failure goes away, until the secondary falls off the back of the primary's oplog and crashes. On my machine that failure case took ~45 seconds, while this ticket's repro typically fails in ~20-30 seconds.

Comment by Charlie Swanson [ 16/Jan/18 ]

Results from investigation so far:

I've created the attached repro.js, which reproduces the failure most of the time. I've also added some logging, and found that this is the state of the cursor when we get into this error case:

  • It is pretty much up to date with the oplog. That is, it seems to be "keeping up".
  • It fails to resume its cursor in the record store - this call fails in the collection scan.
  • When that call fails, it's looking for a recordId that is in between the ones returned by these two queries:

    const oplog = db.getSiblingDB("local").oplog.rs;
    let oldestRecordId = oplog.find({}, {recordId: {$meta: "recordId"}})
                             .sort({$natural: 1})
                             .limit(1)
                             .toArray()[0]
                             .recordId;
    let newestRecordId = oplog.find({}, {recordId: {$meta: "recordId"}})
                             .sort({$natural: -1})
                             .limit(1)
                             .toArray()[0]
                             .recordId;
    

    For example, in one failure we failed to find the record with ID 6511757774693598360, when the oldest was 6511757748923795141 and the newest was 6511757770398631158.

  • It looks like the oplog is indeed "rolling over" around the time of the failure, since the timestamp of the oldest entry is stable for a period and then different after the error is thrown.

I'm going to keep investigating.

Comment by Charlie Swanson [ 16/Jan/18 ]

Looking into it. The repro works for me on 3.6; it didn't work on 3.4, but I only tried once. I've converted it into a jstest that sets up the replica set for me, and I'm now diagnosing and adding logging. I'll post updates when I have them.
