[SERVER-34606] Test (and possibly fix) behavior around majority commit point and oplog truncation Created: 23/Apr/18  Updated: 29/Oct/23  Resolved: 22/Jun/18

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: None
Fix Version/s: 4.0.3, 4.1.1

Type: Improvement Priority: Major - P3
Reporter: Ian Whalen (Inactive) Assignee: Maria van Keulen
Resolution: Fixed Votes: 0
Labels: SWNA, nyc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to SERVER-35747 Check supportsRecoverToStableTimestam... Closed
is related to TOOLS-1993 TOOLS qa-tests failing on server unst... Closed
is related to TOOLS-2027 mongostat qa-tests failing on server ... Closed
is related to SERVER-29213 Have KVWiredTigerEngine implement Sto... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.0
Sprint: Storage NYC 2018-06-04, Storage NYC 2018-06-18, Storage NYC 2018-07-02
Participants:

 Description   

The first step is to add a JS test for the behavior when the replication majority commit point prevents the oplog from being truncated.

Depending on what that test turns up, there might be follow-on work to improve that behavior.



 Comments   
Comment by Githook User [ 07/Sep/18 ]

Author: Maria van Keulen <maria@mongodb.com> (mvankeulen94)

Message: SERVER-34606 Early return from majority commit point oplog truncation

(cherry picked from commit c1803e01a3827072b7dcd962a864c62a426824b6)

SERVER-35747 Don't check for timestamps on non timestamp supported SEs

(cherry picked from commit b7ff5816f4d9d468b1875013384e7e51184628a0)
Branch: v4.0
https://github.com/mongodb/mongo/commit/263055e76e41fc4c65dc3fbcaa59ec3a7eedbdcc

Comment by Githook User [ 21/Jun/18 ]

Author: Maria van Keulen <maria@mongodb.com> (mvankeulen94)

Message: SERVER-34606 Early return from majority commit point oplog truncation
Branch: master
https://github.com/mongodb/mongo/commit/c1803e01a3827072b7dcd962a864c62a426824b6

Comment by Alexander Gorrod [ 14/May/18 ]

"Does the counter for failed truncates count the number of busy spins or just the fact that it failed and then we started spinning?"

I think it should count the number of busy spins.

"May be a moot point if we eliminate this behavior."

I would hope so - i.e. we can update the algorithm to not busy spin.
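To make the two counting options concrete, here is a minimal Python sketch rather than server code. `TruncateStats`, `reclaim_pass`, `can_truncate`, and `max_spins` are hypothetical names invented for illustration; they are not serverStatus fields or MongoDB functions. One counter records each reclaim pass that was blocked, the other records each individual retry, and `max_spins=0` models the non-spinning (early return) variant.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TruncateStats:
    """Hypothetical counters; not actual serverStatus fields."""
    blocked_attempts: int = 0  # reclaim passes that could not truncate
    retry_spins: int = 0       # individual retries made within blocked passes


def reclaim_pass(can_truncate: Callable[[], bool],
                 stats: TruncateStats,
                 max_spins: int = 0) -> bool:
    """Try to truncate once, then retry up to max_spins times.

    can_truncate stands in for the check against the majority commit
    point. max_spins=0 models returning early instead of spinning.
    """
    if can_truncate():
        return True
    stats.blocked_attempts += 1
    for _ in range(max_spins):
        stats.retry_spins += 1
        if can_truncate():
            return True
    return False


if __name__ == "__main__":
    stats = TruncateStats()
    reclaim_pass(lambda: False, stats, max_spins=5)  # spinning variant
    reclaim_pass(lambda: False, stats)               # early-return variant
    print(stats)
```

With an early return the two counters coincide in meaning, since a blocked pass performs no retries, which is one way to read the "moot point" remark.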

Comment by Bruce Lucas (Inactive) [ 11/May/18 ]

alexander.gorrod a counter for failed truncates makes sense.

Does it also make sense to have a counter for "stopped oplog reclaim happening if it would remove content that is older than the majority commit point."?

Does the counter for failed truncates count the number of busy spins or just the fact that it failed and then we started spinning? May be a moot point if we eliminate this behavior.

Comment by Ian Whalen (Inactive) [ 11/May/18 ]

This line of code gets executed far more than once for every insert that attempts to truncate the oplog, and we expect this execution count to go down drastically if we fix this ticket.

Comment by Alexander Gorrod [ 09/May/18 ]

bruce.lucas I believe you can construct the scenario where 1 is happening based on the oplog size growing in excess of the oplog maxSize and the transaction range of timestamps currently pinned being large and growing.

Regarding 2 - I don't believe there is any tracking. It would make sense to add a server status entry for failed oplog truncate attempts.

Comment by Bruce Lucas (Inactive) [ 09/May/18 ]

alexander.gorrod, do we have ftdc metrics that tell us whether your items 1 and 2 are occurring?

Comment by Alexander Gorrod [ 09/May/18 ]

For additional context, there was a change made as part of SERVER-29213 that stopped oplog reclaim happening if it would remove content that is older than the majority commit point.

That is a change with user-visible consequences, as some internal testing has uncovered. There are two potential behavior differences now:
1) The oplog may grow above the configured maximum size when either the oplog is small or the majority commit point falls behind.
2) There is a utility thread that reclaims space from the oplog - that thread will now potentially enter a busy spin attempting to reclaim space from the oplog. Doing so may introduce performance issues.

The goal of this ticket is to characterize the user-visible changes, and to add a test to automated testing which tests the new behavior and ensures it is reasonable (yet to be defined).
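The first consequence can be illustrated with a toy model. This is a sketch under stated assumptions, not the server's implementation: `ToyOplog`, `max_entries`, and `reclaim` are hypothetical names, the oplog is modeled as a queue of integer timestamps, and the cap counts entries rather than bytes. It also assumes that only entries at or before the majority commit point may be removed, which is the reading consistent with the oplog growing when the commit point falls behind; the ticket does not spell out the exact comparison.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyOplog:
    """Toy in-memory oplog; hypothetical, not server code."""
    max_entries: int
    entries: deque = field(default_factory=deque)  # timestamps, oldest first

    def insert(self, ts: int) -> None:
        self.entries.append(ts)

    def reclaim(self, majority_commit_point: int) -> int:
        """Drop oldest entries while over the cap; return how many were removed.

        Assumption: entries newer than the majority commit point must be
        kept, so the loop stops (instead of retrying) when it reaches one.
        """
        removed = 0
        while len(self.entries) > self.max_entries:
            if self.entries[0] > majority_commit_point:
                break  # blocked by the commit point: return early, no spin
            self.entries.popleft()
            removed += 1
        return removed


if __name__ == "__main__":
    oplog = ToyOplog(max_entries=3)
    for ts in range(1, 8):
        oplog.insert(ts)
    # The commit point lags at 2, so the oplog stays above its cap.
    print(oplog.reclaim(majority_commit_point=2), len(oplog.entries))
    # Once the commit point advances, reclaim brings it back to the cap.
    print(oplog.reclaim(majority_commit_point=7), len(oplog.entries))
```

In this model the oplog stays over its cap for as long as the commit point lags, and a reclaim call that is blocked simply returns, leaving the retry to the next call.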
