[SERVER-33517] Take the oplog collection lock when updating the FCV document Created: 27/Feb/18  Updated: 28/Feb/18  Resolved: 28/Feb/18

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dianna Hohensee (Inactive) Assignee: Dianna Hohensee (Inactive)
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-33552 Halt replication on last-stable secon... Closed
Operating System: ALL
Sprint: Storage NYC 2018-03-12
Participants:
Linked BF Score: 11

 Description   

This is necessary in order to prevent last-stable binary secondaries from seeing the FCV document change in the oplog collection BEFORE the primary bumps its internal wire version min and closes incoming connections with last-stable binaries. That would cause the last-stable binary secondary to crash.

For example, if a v3.6 secondary obtains the 4.0 FCV value from the v4.0 binary primary's oplog, it will invariant and crash on seeing a BadValue: it only accepts 3.4 and 3.6 FCV field values.

See the linked BF for further elaboration on HOW the secondary manages to acquire the oplog entry.



 Comments   
Comment by Dianna Hohensee (Inactive) [ 28/Feb/18 ]

Closing this as Won't Fix. Created SERVER-33552 to modify the test so that it stops failing.

Comment by Andy Schwerin [ 28/Feb/18 ]

Yeah, I guess what I want to indicate is that we don't have to have perfect
behavior here. While we'd prefer the last-stable node to stay up, it's not
worth adding a lot of risk or complexity to the server. As such, it would
also be reasonable to change the tests to not fail in these scenarios.

On Tue, Feb 27, 2018, 8:18 PM Dianna Hohensee (JIRA) <jira@mongodb.org>

Comment by Dianna Hohensee (Inactive) [ 28/Feb/18 ]

I also perhaps didn't state it clearly in the description that it is a race whether the old binary secondary will crash, and rarely happens: the test covering this scenario has only failed twice.

Comment by Dianna Hohensee (Inactive) [ 28/Feb/18 ]

schwerin, I can't tell from your comment whether you are for or against the change proposed by this ticket.

Replication decided against crashing last-stable binary replica set members on latest binary featureCompatibilityVersion upgrade. A concern was potentially wiping out all your secondaries, and destroying availability unnecessarily – they could still service reads, if unable to communicate with the primary. I think this was their reasoning, anyway.

Such a framework was originally requested in the 4.0 upgrade/downgrade scope, but was later removed.

Comment by Andy Schwerin [ 28/Feb/18 ]

Is this a case we need to cover? A secondary running an old binary version crashed when it saw the new fcv during an incorrectly exexuted upgrade. Seems acceptable.

Comment by Eric Milkie [ 27/Feb/18 ]

You cannot take a lower level lock without acquiring everything above it: that is the lock hierarchy rule. You can take the local db lock in IX.
Taking multiple db locks might result in a deadlock; you might have to use lock timeouts in order to avoid a full deadlock.

Comment by Dianna Hohensee (Inactive) [ 27/Feb/18 ]

Just adding the "local.oplog.rs" collection lock in X, I believe. No need to take the "local" database lock in X.

I need to block any waiting secondaries from reading the oplog entries until after onCommit changes run and close the last-stable binary connections.

Comment by Dianna Hohensee (Inactive) [ 27/Feb/18 ]

Remove the stop replication code from jstests/multiVersion/feature_compatibility_version_lagging_secondary.js along with this commit, as it will no longer be necessary to protect the secondary from crashing.

Comment by Eric Milkie [ 27/Feb/18 ]

Updating the FCV document currently requires the following locks:
Global lock in IX
"admin" db lock in IX
"admin.system.version" collection lock in IX

You are proposing to also add the following:
"local" db lock in X (?)
"local.oplog.rs" collection lock in ... X?

Is this correct?

Generated at Thu Feb 08 04:33:39 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.