[SERVER-47184] replSetReconfig command should check if the node is primary before no-op write Created: 30/Mar/20 Updated: 11/May/20 Resolved: 11/May/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Pavithra Vetriselvan | Assignee: | Siyuan Zhou |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Operating System: | ALL |
| Backport Requested: | v4.0, v3.6 |
| Sprint: | Repl 2020-05-04, Repl 2020-05-18 |
| Participants: |
| Description |
|
The replSetReconfig command does a no-op write, but does not check that the node is still primary before doing so. Because the command only takes a lock while writing the config document, the primary can step down and transition to secondary before the no-op write happens. We end up calling onInternalOpMessage, which passes in an empty namespace. Because the namespace is empty, the primary check in _logOpsInner is skipped, so the reconfig no-op write can occur on a secondary. This is tracked separately from |
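The race described above can be illustrated with a small toy model. This is a minimal sketch and not the server's code: `Node`, `log_op`, `reconfig_noop_unsafe`, `reconfig_noop_checked`, and `simulate` are hypothetical names, and the only behavior taken from the description is that the primary check is skipped when the namespace is empty. The sketch is single-threaded; in a real server the primary check and the write would presumably have to happen under the same lock for the check to be meaningful.

```python
from dataclasses import dataclass, field
from typing import List


class NotPrimaryError(Exception):
    """Raised when a write is attempted on a node that is not primary."""


@dataclass
class Node:
    """Hypothetical stand-in for a replica set member."""
    is_primary: bool = True
    oplog: List[dict] = field(default_factory=list)

    def log_op(self, namespace: str, op: dict) -> None:
        # Toy version of the guard: the primary check only runs when a
        # namespace is supplied (assumption taken from the description).
        if namespace and not self.is_primary:
            raise NotPrimaryError(f"cannot write to {namespace}")
        self.oplog.append(op)

    def reconfig_noop_unsafe(self) -> None:
        # The internal no-op message carries an empty namespace, so the
        # guard in log_op never fires.
        self.log_op("", {"op": "n", "msg": "reconfig"})

    def reconfig_noop_checked(self) -> None:
        # Explicit primary check before the no-op write.
        if not self.is_primary:
            raise NotPrimaryError("stepped down before reconfig no-op")
        self.log_op("", {"op": "n", "msg": "reconfig"})


def simulate(checked: bool) -> int:
    """Step down between storing the config and the no-op write.

    Returns the number of oplog entries written while secondary.
    """
    node = Node()
    # The config document would be stored under the lock here (omitted).
    node.is_primary = False  # stepdown lands in the unlocked window
    try:
        if checked:
            node.reconfig_noop_checked()
        else:
            node.reconfig_noop_unsafe()
    except NotPrimaryError:
        pass
    return len(node.oplog)
```

Calling `simulate(False)` leaves one no-op entry in the secondary's oplog, while `simulate(True)` rejects the write and leaves the oplog empty.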
| Comments |
| Comment by Tess Avitabile (Inactive) [ 11/May/20 ] |
|
Thank you, sounds good to me. |
| Comment by Siyuan Zhou [ 08/May/20 ] |
|
This bug only occurs when a primary steps down after accepting a reconfig but before writing the no-op. That window, which includes writing the config locally, is narrow. Since both reconfig and stepdown are rare, their combination is extremely rare. When this happens, oplog application will complain that the oplog is out of order and fassert. That's how we observed this issue in build failures. Since this hasn't been reported anywhere even though it exists in all earlier versions, I'm inclined to close this as Won't Fix. CC tess.avitabile. On 4.4 and master, this has been fixed as part of the Safe Reconfig project in |
| Comment by Pavithra Vetriselvan [ 30/Mar/20 ] |
|
Note that this also exists on 4.0 and 3.6. |