[SERVER-44953] Secondaries should restart index builds when a commitIndexBuild oplog entry is processed but no index build is active
Created: 04/Dec/19 | Updated: 29/Oct/23 | Resolved: 22/Jan/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.3.3 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Louis Williams | Assignee: | Louis Williams |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | Using this failpoint:
This test will fail:
|
| Sprint: | Execution Team 2020-01-13, Execution Team 2020-01-27 |
| Participants: |
| Linked BF Score: | 17 |
| Description |
|
The sequence is as follows:
See this patch build. |
| Comments |
| Comment by William Schultz (Inactive) [ 31/Jan/20 ] | |
|
louis.williams It looks like this ticket added NamespaceNotFound as an acceptable error to ignore during oplog application of "commitIndexBuild" and "startIndexBuild" entries. It seems we also need to do so for "abortIndexBuild". I ran into a failure in a patch build where we crash when applying "abortIndexBuild" in initial sync due to a NamespaceNotFound error thrown here. Here is a repro of the case (initsync_fuzzer-f628-1580268652754-5.js
| |
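The error handling discussed in this comment can be sketched roughly as follows. This is a toy Python model, not server code: `NamespaceNotFound`, `IGNORE_NAMESPACE_NOT_FOUND`, and `apply_command` are hypothetical names invented for illustration, and including "abortIndexBuild" in the set reflects the suggestion made in the comment rather than a confirmed change.

```python
class NamespaceNotFound(Exception):
    """Hypothetical stand-in for the server's NamespaceNotFound error."""


# Assumption: commands for which a missing namespace is tolerated during
# oplog application. "abortIndexBuild" is the addition the comment proposes.
IGNORE_NAMESPACE_NOT_FOUND = {"startIndexBuild", "commitIndexBuild", "abortIndexBuild"}


def apply_command(name, apply_fn):
    """Return True if applied, False if skipped because the namespace is gone."""
    try:
        apply_fn()
        return True
    except NamespaceNotFound:
        if name in IGNORE_NAMESPACE_NOT_FOUND:
            # The collection was dropped later in the oplog, so there is
            # nothing left to build, commit, or abort.
            return False
        raise  # Any other command still treats this as a real failure.
```

With this shape, a dropped collection makes an "abortIndexBuild" entry a no-op instead of a crash, while unrelated commands keep failing loudly.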
| Comment by Githook User [ 18/Jan/20 ] | |
|
Author: Louis Williams (username: louiswilliams, email: louis.williams@mongodb.com)
Message: Additionally, only abort an index build after a user interrupt if we are still primary. During | |
| Comment by Louis Williams [ 09/Jan/20 ] | |
|
To fix this, when we receive user interruptions, only abort index builds if we are still primary after reacquiring locks. Otherwise, let the new primary finish the index build. | |
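The decision described above might look something like the following minimal sketch. It is a Python illustration only: `Action`, `handle_user_interrupt`, and the `is_primary` callback are hypothetical names, and a plain `threading.Lock` stands in for whatever locks the server actually reacquires.

```python
import threading
from enum import Enum
from typing import Callable


class Action(Enum):
    ABORT = "abort"
    LEAVE_RUNNING = "leave running for the new primary"


def handle_user_interrupt(lock: threading.Lock, is_primary: Callable[[], bool]) -> Action:
    # Check the replication state only after the lock is reacquired, because
    # a stepdown may have happened while the lock was not held.
    with lock:
        return Action.ABORT if is_primary() else Action.LEAVE_RUNNING
```

The point of the ordering is that the primary check happens under the lock, so the answer cannot go stale between the check and the abort decision.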
| Comment by Louis Williams [ 07/Jan/20 ] | |
|
To address my previous concern: I do not consider this a serious issue because this race condition is inherently hard to hit. I am relatively confident that this is not an area where we want to focus on performance. This also goes for | |
| Comment by Louis Williams [ 06/Dec/19 ] | |
|
I think the best way to solve this problem is to introduce a safe fallthrough on the secondary: if a "commitIndexBuild" oplog entry is received and no index build is in progress, start the index build and block. This should be a reasonable fallback in any future case where a stepdown causes an index build to be cleaned up incorrectly.

Update: With the introduction of a commitQuorum, we should consider the case where this scenario happens to a majority of secondaries. If a majority of nodes lose knowledge of an index build, the primary may hang waiting for a majority of nodes to commit. Another option would be to modify the consensus protocol to allow a failing secondary to signal the new primary to abort the index build. |
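The proposed fallthrough can be modeled with a small Python sketch. This is an illustrative toy, not the server implementation: `SecondaryIndexBuilds` and its fields are hypothetical, and "blocking" is reduced to running the restarted build to completion inline.

```python
from dataclasses import dataclass, field


@dataclass
class SecondaryIndexBuilds:
    """Toy model of a secondary's index-build state (not MongoDB server code)."""

    active: set = field(default_factory=set)       # build IDs currently in progress
    committed: list = field(default_factory=list)  # build IDs committed, in order
    restarted: list = field(default_factory=list)  # builds restarted by the fallback

    def apply_start(self, build_id: str) -> None:
        self.active.add(build_id)

    def apply_commit(self, build_id: str) -> None:
        if build_id not in self.active:
            # Fallback: the build was lost (for example, cleaned up during a
            # stepdown), so start it again before committing.
            self.restarted.append(build_id)
            self.active.add(build_id)
        # In this toy model, blocking until done is just finishing inline.
        self.active.remove(build_id)
        self.committed.append(build_id)
```

A commit for an unknown build is therefore never dropped or treated as fatal; it costs a full rebuild on that node, which matches the view in the thread that this path is rare and not performance-sensitive.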