[SERVER-33517] Take the oplog collection lock when updating the FCV document Created: 27/Feb/18 Updated: 28/Feb/18 Resolved: 28/Feb/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Dianna Hohensee (Inactive) | Assignee: | Dianna Hohensee (Inactive) |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Operating System: | ALL | ||||||||||||
| Sprint: | Storage NYC 2018-03-12 | ||||||||||||
| Participants: | |||||||||||||
| Linked BF Score: | 11 | ||||||||||||
| Description |
|
This is necessary in order to prevent last-stable binary secondaries from seeing the FCV document change in the oplog collection BEFORE the primary bumps its internal wire version min and closes incoming connections with last-stable binaries. That would cause the last-stable binary secondary to crash. For example, if a v3.6 secondary obtains the 4.0 FCV value from the v4.0 binary primary's oplog, it will invariant and crash on seeing a BadValue: it only accepts 3.4 and 3.6 FCV field values. See the linked BF for further elaboration on HOW the secondary manages to acquire the oplog entry. |
| Comments |
| Comment by Dianna Hohensee (Inactive) [ 28/Feb/18 ] |
|
Closing this as Won't Fix. Created |
| Comment by Andy Schwerin [ 28/Feb/18 ] |
|
Yeah, I guess what I want to indicate is that we don't have to have perfect On Tue, Feb 27, 2018, 8:18 PM Dianna Hohensee (JIRA) <jira@mongodb.org> |
| Comment by Dianna Hohensee (Inactive) [ 28/Feb/18 ] |
|
I also perhaps didn't state it clearly in the description that it is a race whether the old binary secondary will crash, and rarely happens: the test covering this scenario has only failed twice. |
| Comment by Dianna Hohensee (Inactive) [ 28/Feb/18 ] |
|
schwerin, I can't tell from your comment whether you are for or against the change proposed by this ticket. Replication decided against crashing last-stable binary replica set members on latest binary featureCompatibilityVersion upgrade. A concern was potentially wiping out all your secondaries, and destroying availability unnecessarily – they could still service reads, if unable to communicate with the primary. I think this was their reasoning, anyway. Such a framework was originally requested in the 4.0 upgrade/downgrade scope, but was later removed. |
| Comment by Andy Schwerin [ 28/Feb/18 ] |
|
Is this a case we need to cover? A secondary running an old binary version crashed when it saw the new fcv during an incorrectly exexuted upgrade. Seems acceptable. |
| Comment by Eric Milkie [ 27/Feb/18 ] |
|
You cannot take a lower level lock without acquiring everything above it: that is the lock hierarchy rule. You can take the local db lock in IX. |
| Comment by Dianna Hohensee (Inactive) [ 27/Feb/18 ] |
|
Just adding the "local.oplog.rs" collection lock in X, I believe. No need to take the "local" database lock in X. I need to block any waiting secondaries from reading the oplog entries until after onCommit changes run and close the last-stable binary connections. |
| Comment by Dianna Hohensee (Inactive) [ 27/Feb/18 ] |
|
Remove the stop replication code from jstests/multiVersion/feature_compatibility_version_lagging_secondary.js along with this commit, as it will no longer be necessary to protect the secondary from crashing. |
| Comment by Eric Milkie [ 27/Feb/18 ] |
|
Updating the FCV document currently requires the following locks: You are proposing to also add the following: Is this correct? |