[SERVER-35941] Don't maintain full stable optime candidate list on secondaries in PV0
Created: 02/Jul/18  Updated: 29/Oct/23  Resolved: 22/Aug/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.6.5
Fix Version/s: 3.6.8

Type: Bug
Priority: Major - P3
Reporter: William Schultz (Inactive)
Assignee: Tess Avitabile (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-35642 We're seeing increasing memory usage ... Closed
Related
related to SERVER-27124 Disallow readConcern:majority reads o... Closed
related to SERVER-32637 Ensure that upgrading to 3.6 when on ... Closed
is related to SERVER-42243 Ban the combination of PV0 and enable... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2018-08-27
Participants:
Case:

 Description   

In PV0, we do not propagate the necessary information to secondaries for them to advance their commit point. We still, however, keep around a list of stable optime candidates that we add to every time we update our lastApplied optime. Normally, we can purge timestamps from this list that are earlier than the stable timestamp (which is the latest timestamp in this list earlier than the commit point), since we don't need them any more. If the commit point never moves on secondaries, though, this list will never get purged, and it will grow without bound.
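To illustrate the growth described above, here is a minimal toy model, not the server's actual implementation. Timestamps are plain integers, and the class and method names (`StableCandidates`, `on_last_applied`, `on_commit_point`) are hypothetical. It also assumes a candidate equal to the commit point is eligible to become the stable timestamp.

```python
import bisect


class StableCandidates:
    """Toy model of the stable optime candidate list (hypothetical names)."""

    def __init__(self):
        self.candidates = []       # sorted list of candidate timestamps
        self.commit_point = None   # None until a commit point is learned

    def on_last_applied(self, ts):
        # Every lastApplied update adds a candidate.
        bisect.insort(self.candidates, ts)

    def stable_timestamp(self):
        # Latest candidate not later than the commit point
        # (assumption: equality counts as eligible).
        if self.commit_point is None:
            return None
        idx = bisect.bisect_right(self.candidates, self.commit_point)
        return self.candidates[idx - 1] if idx else None

    def on_commit_point(self, ts):
        # Advancing the commit point lets us purge candidates
        # that are earlier than the new stable timestamp.
        self.commit_point = ts
        stable = self.stable_timestamp()
        if stable is not None:
            idx = bisect.bisect_left(self.candidates, stable)
            del self.candidates[:idx]


def simulate(num_writes, commit_point_moves):
    """Apply num_writes writes; optionally advance the commit point after each."""
    node = StableCandidates()
    for ts in range(1, num_writes + 1):
        node.on_last_applied(ts)
        if commit_point_moves:
            node.on_commit_point(ts - 1)  # commit point lags by one write
    return node


if __name__ == "__main__":
    print(len(simulate(100, commit_point_moves=True).candidates))
    print(len(simulate(100, commit_point_moves=False).candidates))
```

When the commit point moves, the list stays at a small constant size. When it never moves, as on a PV0 secondary, the list holds one entry per applied write.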

To avoid keeping around an unbounded amount of storage engine history in PV0, we already manually advance the storage engine's stable timestamp to whatever the lastApplied timestamp is. We should do something similar for the stable optime candidate list. For PV0 secondaries it should likely be sufficient to keep no stable optime candidates in this list. When we become a primary, we can then start adding optimes to this list, and purging them appropriately since we advance the commit point as a primary. Upon protocol version upgrade from PV0 => PV1, we will start without any stable optime candidates. We can then start adding optime candidates when we start applying writes in PV1, and set a new stable timestamp as soon as we learn of a commit point later than one of our candidates.
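The proposed policy can be sketched as follows. This is only an illustration of the rule described in the paragraph above, under the assumption that the decision depends on nothing but protocol version and primary/secondary state. The `Node` class and its methods are hypothetical and do not correspond to the real replication coordinator code.

```python
class Node:
    """Toy replica set member that applies the proposed candidate policy."""

    def __init__(self, protocol_version, is_primary=False):
        self.protocol_version = protocol_version  # 0 or 1
        self.is_primary = is_primary
        self.candidates = []  # stable optime candidates (int timestamps)

    def tracks_candidates(self):
        # PV0 secondaries keep no candidates, because their commit point
        # never moves and the list could never be purged.
        return self.protocol_version == 1 or self.is_primary

    def on_last_applied(self, ts):
        if self.tracks_candidates():
            self.candidates.append(ts)

    def become_primary(self):
        # A primary advances the commit point itself, so it can purge.
        self.is_primary = True

    def upgrade_to_pv1(self):
        # After a PV0 => PV1 upgrade we start with whatever is in the list,
        # which for a PV0 secondary is nothing.
        self.protocol_version = 1


if __name__ == "__main__":
    secondary = Node(protocol_version=0)
    for ts in range(1, 51):
        secondary.on_last_applied(ts)
    print(secondary.candidates)   # no candidates kept while a PV0 secondary

    secondary.upgrade_to_pv1()
    secondary.on_last_applied(51)
    print(secondary.candidates)   # candidates accumulate again in PV1
```

The sketch leaves out purging and stable timestamp selection, which would continue to work as before once candidates are being added.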

This only applies to 3.6, which is where we first added the "stable optime candidate" list. Protocol version 0 is banned in versions >= 4.0 as of SERVER-33759.



 Comments   
Comment by Githook User [ 22/Aug/18 ]

Author:

{'name': 'Tess Avitabile', 'email': 'tess.avitabile@mongodb.com', 'username': 'tessavitabile'}

Message: SERVER-35941 Don't maintain full stable optime candidate list on secondaries in PV0
Branch: v3.6
https://github.com/mongodb/mongo/commit/7db635ad05cef58ebe6a4a8ec4418cb1c97951a5

Comment by Alyson Cabral (Inactive) [ 07/Aug/18 ]

I agree that we should do it. I also didn't realize this would impact everyone with the PV0/3.6 combination, but it's not super urgent. PV0 is deprecated in 3.6 and the clear fix is to upgrade to PV1.

Comment by Tess Avitabile (Inactive) [ 07/Aug/18 ]

Thanks for calling attention back to this. I had missed the fact that this will always happen in a 3.6 PV0 set. If we do this work, then users would still need to upgrade minor versions to address the problem, which might not be easier than upgrading protocol version. But at least then future upgrades to 3.6 would have the fix, so it's probably a good idea to do this work. alyson.cabral, what do you think?

Comment by William Schultz (Inactive) [ 06/Aug/18 ]

tess.avitabile Just to double check, you are ok with the resolution of this ticket as "won't fix"? spencer pointed out that it effectively makes 3.6 + PV0 unusable, since this bug causes memory usage to grow without bound on PV0 secondaries.

Comment by Gregory McKeon (Inactive) [ 05/Jul/18 ]

The solution is to upgrade from PV0 to PV1, which is required in 3.6 to upgrade to 4.0.
