[SERVER-32606] Tailing oplog on secondary fails with CappedPositionLost Created: 09/Jan/18 Updated: 30/Oct/23 Resolved: 18/Jan/18

| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 3.6.0, 3.6.1 |
| Fix Version/s: | 3.6.3, 3.7.2 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Geert Bosch |
| Resolution: | Fixed | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v3.6 |
| Sprint: | Storage 2018-01-29 |
| Participants: | |
| Description |
Tailing the oplog on a secondary fails reliably with CappedPositionLost.

Does not fail on the primary.
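For context, a minimal sketch of what tailing the oplog on a secondary looks like, here using the mongocxx C++ driver. The host/port, single-host direct connection, and read preference are illustrative assumptions; this is not the repro script attached to the ticket.

```cpp
// Illustrative sketch (not the attached repro): tail the oplog of a secondary
// with a tailable, awaitData cursor. Host/port are hypothetical.
#include <iostream>

#include <bsoncxx/json.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/options/find.hpp>
#include <mongocxx/uri.hpp>

int main() {
    mongocxx::instance inst{};

    // Single-host URI (no replicaSet=...) so the driver talks to this node
    // directly; secondaryPreferred lets the secondary serve the read.
    mongocxx::client client{
        mongocxx::uri{"mongodb://localhost:27018/?readPreference=secondaryPreferred"}};
    auto oplog = client["local"]["oplog.rs"];

    mongocxx::options::find opts;
    opts.cursor_type(mongocxx::cursor::type::k_tailable_await);

    // Iterate the tailable cursor on the capped oplog collection. On the
    // affected versions, under sustained write traffic, the getMore on a
    // secondary eventually fails with CappedPositionLost, which the driver
    // surfaces as an exception.
    auto cursor = oplog.find({}, opts);
    for (auto&& entry : cursor) {
        std::cout << bsoncxx::to_json(entry) << "\n";
    }
}
```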
| Comments |
| Comment by Bruce Lucas (Inactive) [ 26/Jan/18 ] |

My recollection is that the repro script that I provided failed reliably and quickly. I don't know if this is due to a difference in environment or a difference in the repro. One difference I do see in the repros is that mine started tailing the oplog before starting to generate traffic; maybe this makes it fail faster?
| Comment by Geert Bosch [ 25/Jan/18 ] |

The repro script is neither suitable nor easily adaptable for inclusion in Evergreen. The time until failure is quite variable, and reproducing requires an amount of write traffic that would be excessive for automated testing. Similarly, there are often spurious failures due to oplog rollover. Finally, because failure is non-deterministic, the test could silently stop exercising its actual objective. Given the amount of work needed to write a suitable test for this, I opened a follow-up ticket for that work.
| Comment by Spencer Brody (Inactive) [ 24/Jan/18 ] |

geert.bosch, this ticket has a JavaScript repro attached to it, yet the fix went in with no test coverage. I think this ticket should be re-opened until a test is added.
| Comment by Githook User [ 18/Jan/18 ] |

Author: Geert Bosch <geert@mongodb.com> (GeertBosch)
Message: (cherry picked from commit 3020767b4a20fca53d4b58b440085933c8eb7c3e)
| Comment by Githook User [ 18/Jan/18 ] |

Author: Geert Bosch <geert@mongodb.com> (GeertBosch)
Message:
| Comment by Eric Milkie [ 17/Jan/18 ] |

Indeed, read concern majority will prevent this from happening because it uses a different method to choose the read timestamp for each transaction.
| Comment by Charlie Swanson [ 17/Jan/18 ] |

bruce.lucas The reproduction script(s) will not reproduce the issue when the tailable cursor is replaced with a change stream. pasette points out that this is likely because a change stream will use read concern "majority", and so may not be subject to this problem. I'm trying to reproduce with 3 nodes in the replica set, but no luck so far. Can anyone confirm whether read concern "majority" would prevent this from happening, or whether it just makes it less likely? I'm not familiar enough with chained replication to answer the second part of your question, sorry!
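For comparison, a sketch of the change-stream variant that did not reproduce the failure, again with the mongocxx driver. The URI, database, and collection names are hypothetical; the point is only that watch() goes through the $changeStream aggregation path, which reads majority-committed data rather than tailing the oplog with a plain tailable cursor.

```cpp
// Illustrative sketch: the change-stream variant of the repro, which did not
// hit CappedPositionLost. URI and namespace are hypothetical.
#include <iostream>

#include <bsoncxx/json.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>

int main() {
    mongocxx::instance inst{};
    mongocxx::client client{
        mongocxx::uri{"mongodb://localhost:27017/?replicaSet=rs0"}};

    // watch() opens a change stream, which reads with read concern "majority"
    // internally instead of using a raw tailable cursor on local.oplog.rs.
    auto stream = client["test"]["c"].watch();
    while (true) {
        for (const auto& event : stream) {
            std::cout << bsoncxx::to_json(event) << "\n";
        }
    }
}
```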
| Comment by Eric Milkie [ 17/Jan/18 ] |

I believe Michael's quick fix is on the right track, but there is a subtle problem with it that we still need to iron out. Geert and I are going to devise a solution.
| Comment by Bruce Lucas (Inactive) [ 17/Jan/18 ] |

Does this impact chained replication and change streams?
| Comment by Davis Ford [ 17/Jan/18 ] |

Thank you Mr. Cahill. I would love the quick fix pushed in the next minor release (hint). I'll take a patch, too.
| Comment by Michael Cahill (Inactive) [ 17/Jan/18 ] |

I think I see what's going on here. WiredTiger has a query that returns the "all_committed" timestamp, and MongoDB queries it periodically and uses the last returned value for oplog reads. Approximately, if there are no active transactions, all_committed is the largest commit timestamp seen; if there are active transactions, it is the smallest commit timestamp in flight.

On primaries, the commit timestamp of each transaction is set under a lock, and commit timestamps are not reused, so this calculation always moves forward. That's what we had in mind when the interface was designed. On secondaries, that is not the case: commit timestamps are set both for oplog writes and again, on separate transactions, when applying the oplog. Since WiredTiger doesn't distinguish between those two types of transactions in this calculation, the calculated "all_committed" timestamp can go backwards: the oplog is written in a batch, then appliers start reusing timestamps from the beginning of the batch.

A quick fix is to have WiredTigerOplogManager::_setOplogReadTimestamp only update _oplogReadTimestamp if the new all_committed timestamp is larger than the last one. A more robust fix would be to have secondaries set their _oplogReadTimestamp explicitly as each batch of oplog writes completes, and not rely on WT's all_committed calculation on secondaries.

Quick fix, for the record:
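A rough sketch of the monotonic guard described above, written as simplified standalone code rather than the actual WiredTigerOplogManager patch:

```cpp
// Sketch only: a monotonic guard so the published oplog read timestamp never
// moves backwards, even if the queried all_committed value does. This is a
// simplified stand-in, not the actual server change.
#include <atomic>
#include <cstdint>

class OplogVisibilitySketch {
public:
    // Called periodically with WiredTiger's latest all_committed timestamp.
    void setOplogReadTimestamp(std::uint64_t allCommitted) {
        std::uint64_t current = _oplogReadTimestamp.load();
        // Only advance; ignore values that would move visibility backwards
        // (as can happen on secondaries while a batch is being applied).
        while (allCommitted > current &&
               !_oplogReadTimestamp.compare_exchange_weak(current, allCommitted)) {
        }
    }

    std::uint64_t getOplogReadTimestamp() const {
        return _oplogReadTimestamp.load();
    }

private:
    std::atomic<std::uint64_t> _oplogReadTimestamp{0};
};
```

The more robust alternative mentioned above would instead have secondaries publish this timestamp explicitly at batch boundaries rather than relying on the all_committed query at all.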
| Comment by Daniel Gottlieb (Inactive) [ 17/Jan/18 ] |

Now that I've normalized the timestamps, here's another run:

I'm also attaching my server patch that adds some debugging. It's applied on top of 25f1ad63b3f24e56df51d811e7722a1cb2dd7433.
| Comment by Daniel Gottlieb (Inactive) [ 17/Jan/18 ] |

I'm seeing the following between the oplog visibility and batch application:
| Comment by Daniel Gottlieb (Inactive) [ 17/Jan/18 ] |

Note, I had more luck reproducing with the following diff:
| Comment by Daniel Gottlieb (Inactive) [ 17/Jan/18 ] |

charlie.swanson's hunch is in the right ballpark. It seems the oplog read timestamp goes backwards right before the failure (the "size" values can be ignored):

The read visibility going backwards might be a red herring. We only pin the oldest timestamp (and thus visibility) when read concern majority is disabled. Because the oplog manager thread (which updates visibility) does not acquire the PB lock, it can query WT after a later write in a batch has completed but before some earlier one has started. I was thinking that if a batch had an error, it could leave visibility in an incorrectly advanced state and also have released the PB lock. I think that still might be an issue, but I don't believe it's what we're seeing here; I'm not seeing an error in oplog application.

I'm still suspicious that the oplog read timestamp going backwards is correlated. When I look at all the changes to the read timestamp, it only goes backwards once, right around the time of failure. The later value that gets temporarily "reverted" is typically (always?) the timestamp that shows up in the "lost position" error message. When I enable pinning the oldest timestamp for secondary oplog application, the failure goes away, until the secondary falls off the back of the primary's oplog and crashes. On my machine that failure case was taking ~45 seconds, while this ticket's repro typically fails in ~20-30 seconds.
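As a rough illustration of why the value can regress on a secondary, following the all_committed approximation Michael Cahill describes above; this is a toy model, not WiredTiger code:

```cpp
// Toy model of the "all_committed" approximation described above:
//   no active transactions -> largest commit timestamp seen so far
//   active transactions    -> smallest commit timestamp currently in flight
// On a secondary, appliers reuse timestamps from the start of the batch, so
// the value can move backwards after the oplog writes have already pushed it up.
#include <cstdint>
#include <iostream>
#include <set>

struct AllCommittedModel {
    std::uint64_t maxCommitted = 0;
    std::multiset<std::uint64_t> inFlight;  // commit timestamps of active txns

    void begin(std::uint64_t ts) { inFlight.insert(ts); }
    void commit(std::uint64_t ts) {
        inFlight.erase(inFlight.find(ts));
        if (ts > maxCommitted) maxCommitted = ts;
    }
    std::uint64_t allCommitted() const {
        return inFlight.empty() ? maxCommitted : *inFlight.begin();
    }
};

int main() {
    AllCommittedModel model;

    // Oplog writes for a batch at timestamps 10..12 commit first.
    for (std::uint64_t ts : {10, 11, 12}) {
        model.begin(ts);
        model.commit(ts);
    }
    std::cout << "after oplog writes: " << model.allCommitted() << "\n";  // 12

    // An applier transaction reuses timestamp 10 from the start of the batch.
    model.begin(10);
    std::cout << "while applying:     " << model.allCommitted() << "\n";  // 10
}
```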
| Comment by Charlie Swanson [ 16/Jan/18 ] |

Results from investigation so far: I've created the attached repro.js

I'm going to keep investigating.
| Comment by Charlie Swanson [ 16/Jan/18 ] |

Looking into it. The repro works for me on 3.6 and didn't on 3.4, but I only tried once. I've converted it into a jstest that sets up the replica set for me; now I'm diagnosing, adding logging, etc. I'll post any updates when I have them.