[SERVER-36494] Prevent oplog truncation of oplog entries needed for startup recovery Created: 07/Aug/18  Updated: 29/Oct/23  Resolved: 08/Apr/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.1.10

Type: Task Priority: Major - P3
Reporter: Judah Schvimer Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: prepare_durability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-36811 Provide a mechanism for replication t... Closed
depends on SERVER-39679 Add callback to replication when stor... Closed
is depended on by SERVER-40457 test that secondaries roll over their... Closed
Related
related to SERVER-39627 Audit uses of TransactionParticipant:... Closed
related to SERVER-36772 Ensure oplog cannot be truncated due ... Closed
related to SERVER-38165 Test that prepared transactions work ... Closed
related to SERVER-39680 Maintain the oldest active transactio... Closed
related to SERVER-39989 Use a config.transactions find comman... Closed
related to SERVER-30460 Grow the oplog when the replication c... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2018-11-19, Repl 2019-02-11, Repl 2019-02-25, Repl 2019-03-11, Repl 2019-03-25, Service Arch 2019-04-22, Repl 2019-04-08
Participants:
Linked BF Score: 0

 Comments   
Comment by Nate Brennand [ 11/Aug/21 ]

No worries! Appreciate you getting back to me regardless.

Comment by A. Jesse Jiryu Davis [ 11/Aug/21 ]

I'm sorry Nate, I've utterly forgotten the details of this patch and I can't guess whether it's backportable.

Comment by Nate Brennand [ 11/Aug/21 ]

Hi Jesse,

It’s been a while since this commit went out, but we were curious: was the change to the replSetReconfig locking made possible by this change, or by others in the 4.1 release?
Or was the old locking simply a long-standing setting that was overly strict?

We have diagnosed a latency increase that we believe is caused by replSetReconfig's need for a global X lock, and we are evaluating whether we can backport the locking-change portion of your patch to our v3.6 fork.
https://github.com/mongodb/mongo/commit/f8f872e029ba3b1f32d8499c912756d48dc1a03b#diff-d435e02c922fe2ccfa8a05ab1a09a93a6519babc71b15472bca3c9eb7e6e2b4fR386


Comment by A. Jesse Jiryu Davis [ 08/Apr/19 ]

Had to revert due to a merge mistake, fixed it and pushed again.

Comment by Githook User [ 08/Apr/19 ]

Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis)

Message: SERVER-36494 Test that active txn entries aren't truncated

Add tests for initial sync, recovery, and the inMemory storage engine.

Also, avoid taking a global X lock in replSetReconfig, we only need IX.
Branch: master
https://github.com/mongodb/mongo/commit/4cd74465dc857148e897654d31195226ef665e70

Comment by Githook User [ 08/Apr/19 ]

Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis)

Message: Revert "SERVER-36494 Test that active txn entries aren't truncated"

This reverts commit f8f872e029ba3b1f32d8499c912756d48dc1a03b.
Branch: master
https://github.com/mongodb/mongo/commit/02a87ee5b1942d24e1d7a20502c79d36218929fe

Comment by A. Jesse Jiryu Davis [ 08/Apr/19 ]

My new test recovery_preserves_active_txns.js fails on master; reverting and investigating.

Comment by Githook User [ 08/Apr/19 ]

Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis)

Message: SERVER-36494 Test that active txn entries aren't truncated

Add tests for initial sync, recovery, and the inMemory storage engine.

Also, avoid taking a global X lock in replSetReconfig, we only need IX.
Branch: master
https://github.com/mongodb/mongo/commit/f8f872e029ba3b1f32d8499c912756d48dc1a03b

Comment by A. Jesse Jiryu Davis [ 03/Apr/19 ]

Acknowledged.

Comment by Judah Schvimer [ 03/Apr/19 ]

jesse, I've filed SERVER-40457 for the remainder of the work. PTAL to confirm it's complete, and feel free to close this once your CR is pushed.

Comment by A. Jesse Jiryu Davis [ 01/Apr/19 ]

After my next CR, the remaining work is to test, in general, that secondaries roll over their oplogs when they exceed oplogSize. We could add an assert at the bottom of initial_sync_oplog_rollover.js that checks that the secondary has deleted the first oplog entry by the time the test ends.
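For illustration, a minimal sketch of that kind of assert, assuming (hypothetically) that the test holds a connection to the synced node in a variable named secondary and recorded the timestamp of its first oplog entry earlier as firstOplogEntryTs; this is not the actual test code.

    // Sketch only; `secondary` and `firstOplogEntryTs` are hypothetical names
    // assumed to have been set up earlier in the test.
    var localDB = secondary.getDB("local");
    assert.soon(function() {
        // Oldest entry currently in the secondary's oplog.
        var oldest = localDB.oplog.rs.find().sort({$natural: 1}).limit(1).next();
        // After enough writes to exceed oplogSize, the original first entry
        // should have been truncated, so the oldest remaining entry must be
        // newer than the one recorded at the start of the test.
        return bsonWoCompare({ts: oldest.ts}, {ts: firstOplogEntryTs}) > 0;
    }, "secondary never truncated its first oplog entry");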

Comment by Judah Schvimer [ 27/Mar/19 ]

jesse, can you please comment on whether there is further work (or further investigation to determine whether further work is necessary) needed to ensure that the necessary oplog entries are not truncated...

  1. during initial sync?
  2. during (not just after) replication recovery?

Also, to confirm: can in-memory nodes still truncate their oplogs in the absence of transactions? I just want to make sure we haven't broken that behavior. I'll move the rest of the in-memory discussion to SERVER-38165. CC daniel.gottlieb
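As a rough, self-contained sketch of how one might verify that, the node count, oplog size, and document sizes below are made up for illustration and this is not code from the ticket:

    // Start a single-node replica set on the inMemory storage engine with a
    // deliberately tiny (1 MB) oplog.
    var rst = new ReplSetTest(
        {nodes: 1, oplogSize: 1, nodeOptions: {storageEngine: "inMemory"}});
    rst.startSet();
    rst.initiate();

    var primary = rst.getPrimary();
    var firstEntry =
        primary.getDB("local").oplog.rs.find().sort({$natural: 1}).limit(1).next();

    // Write more data than the oplog can hold, outside of any transaction.
    var bigString = "x".repeat(1024 * 1024);
    for (var i = 0; i < 10; i++) {
        assert.writeOK(primary.getDB("test").coll.insert({_id: i, pad: bigString}));
    }

    // The oldest remaining oplog entry should eventually be newer than the
    // entry observed before the writes, i.e. the oplog rolled over.
    assert.soon(function() {
        var oldest =
            primary.getDB("local").oplog.rs.find().sort({$natural: 1}).limit(1).next();
        return bsonWoCompare({ts: oldest.ts}, {ts: firstEntry.ts}) > 0;
    }, "inMemory node did not truncate its oplog");

    rst.stopSet();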

Comment by Gregory McKeon (Inactive) [ 25/Mar/19 ]

judah.schvimer, assigning back to you and unassigning from Jesse.

Comment by Judah Schvimer [ 20/Mar/19 ]

Another open question is whether this works for in-memory nodes. They currently don't support transactions, but SERVER-38165 is investigating what we need to do there. I don't think this ticket needs to ensure that in-memory nodes keep sufficient oplog for recovery, but if that doesn't already fall out of this ticket, we should file a new ticket to make it work.

Comment by Judah Schvimer [ 08/Mar/19 ]

We also need to make sure the above mechanism prevents us from truncating incorrectly during replication recovery. (I think this will be the case with no extra work.)

Comment by Judah Schvimer [ 07/Mar/19 ]

jesse, does this plan currently prevent oplog entries required for recovery from being truncated during initial sync?
