[SERVER-50869] Background sync may erroneously set applied-through during step-up Created: 10/Sep/20 Updated: 29/Oct/23 Resolved: 07/Oct/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 4.2.8 |
| Fix Version/s: | 4.9.0, 4.4.2, 4.2.12 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matthew Russotto | Assignee: | Samyukta Lanka |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.4, v4.2 |
| Sprint: | Repl 2020-10-05, Repl 2020-10-19 |
| Participants: | |
| Description |
|
The bgsync _producer() method runs in a loop until stop() is called asynchronously. If stop() is called after this critical section has run (as it would be during step-up), and the primary then clears the applied-through time (as it normally does) before the producer reaches its check of applied-through, the producer will re-set the applied-through time to the last applied optime. This state persists until the next time the node becomes secondary and applies a batch. If the node restarts during that window, it will hit an invariant and need to be re-synced. To fix this, we need to hold the mutex and confirm the producer is still running while checking whether applied-through is clear and setting it. |
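The check-then-set race described above can be sketched in a few lines of C++. This is a minimal in-memory model, not the server code: `ProducerSketch`, `OpTime`, `clearAppliedThrough()` and `setAppliedThroughIfClear()` are hypothetical names invented for illustration, and the on-disk applied-through document is modelled as a plain `std::optional`. It shows only the idea of doing the "still running?" check and the "is it clear?" check under one lock, and ignores any lock-ordering concerns around reading the real document.

```cpp
#include <cstdint>
#include <mutex>
#include <optional>

// Hypothetical stand-in for an optime; not the server's OpTime type.
using OpTime = std::uint64_t;

// Hypothetical model of the producer; none of these names come from the server.
class ProducerSketch {
public:
    enum class State { kRunning, kStopped };

    // Called asynchronously, for example during step-up.
    void stop() {
        std::lock_guard<std::mutex> lk(_mutex);
        _state = State::kStopped;
    }

    // Models the new primary clearing applied-through.
    void clearAppliedThrough() {
        std::lock_guard<std::mutex> lk(_mutex);
        _appliedThrough.reset();
    }

    // The running check and the "is it clear?" check happen under one lock,
    // so a stop() cannot slip in between the check and the write.
    bool setAppliedThroughIfClear(OpTime lastApplied) {
        std::lock_guard<std::mutex> lk(_mutex);
        if (_state != State::kRunning) return false;   // stopped: do nothing
        if (_appliedThrough.has_value()) return false; // already set
        _appliedThrough = lastApplied;
        return true;
    }

    std::optional<OpTime> appliedThrough() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _appliedThrough;
    }

private:
    mutable std::mutex _mutex;
    State _state = State::kRunning;
    std::optional<OpTime> _appliedThrough;
};
```

In this model, once `stop()` has returned, a later `clearAppliedThrough()` can no longer be undone by the producer, because `setAppliedThroughIfClear()` refuses to write when the state is not running.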
| Comments |
| Comment by Githook User [ 24/Nov/20 ] |
|
Author: Samy Lanka <samy.lanka@mongodb.com> (lankas) Message: (cherry picked from commit 4fbb1f65e2ad712f6b4e761d3145191e744b1e7d) |
| Comment by Githook User [ 23/Oct/20 ] |
|
Author: Samy Lanka <samy.lanka@mongodb.com> (lankas) Message: (cherry picked from commit 4fbb1f65e2ad712f6b4e761d3145191e744b1e7d) |
| Comment by Githook User [ 07/Oct/20 ] |
|
Author: Samy Lanka <samy.lanka@mongodb.com> (lankas) Message: |
| Comment by Samyukta Lanka [ 02/Oct/20 ] |
|
Holding the mutex while reading the appliedThrough document can cause deadlocks because of the order in which we acquire locks. For example, a node transitioning from RECOVERING to SECONDARY already holds the RSTL in X mode when it tries to take the bgsync mutex to clear the node's sync source. Meanwhile, the bgsync thread can be holding that mutex while waiting for the collection lock in order to read the appliedThrough document, which also requires the RSTL in a conflicting mode. Each thread is then waiting on a lock the other holds, which is a deadlock. |
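The wait-for cycle in this comment can be made concrete with a small sketch. It is an illustration only: it models each thread as a (lock held, lock wanted) pair of names and checks whether those pairs form a cycle. `WaitEdge`, `reaches` and `hasLockOrderCycle` are hypothetical helpers, the lock names are just strings, and lock modes are ignored (the collection lock is reduced to the RSTL acquisition it implies).

```cpp
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Each pair is (lock already held, lock being waited for). Hypothetical type.
using WaitEdge = std::pair<std::string, std::string>;

// Depth-first search: true if `target` is reachable from `from`.
static bool reaches(const std::map<std::string, std::set<std::string>>& graph,
                    const std::string& from,
                    const std::string& target,
                    std::set<std::string>& seen) {
    if (from == target) return true;
    if (!seen.insert(from).second) return false;  // already visited
    auto it = graph.find(from);
    if (it == graph.end()) return false;
    for (const auto& next : it->second) {
        if (reaches(graph, next, target, seen)) return true;
    }
    return false;
}

// True if the held -> wanted edges contain a cycle, i.e. a potential deadlock.
bool hasLockOrderCycle(const std::vector<WaitEdge>& edges) {
    std::map<std::string, std::set<std::string>> graph;
    for (const auto& e : edges) graph[e.first].insert(e.second);
    for (const auto& e : edges) {
        std::set<std::string> seen;
        // A cycle exists if the wanted lock leads back to the held lock.
        if (reaches(graph, e.second, e.first, seen)) return true;
    }
    return false;
}
```

Feeding it the two edges from the scenario, `{"RSTL", "bgsyncMutex"}` for the state-transition thread and `{"bgsyncMutex", "RSTL"}` for the bgsync thread, reports a cycle. Dropping the second edge, so that the bgsync thread never waits on the RSTL while holding its mutex, removes it.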
| Comment by Siyuan Zhou [ 17/Sep/20 ] |
|
samy.lanka, it sounds like the proposed solution holds the mutex while reading the applied-through document on disk, which seems error-prone. When working on this, could you please confirm that's fine or figure out an alternative solution? |