[SERVER-54638] replSetReconfigure will cause some operations to hang indefinitely Created: 19/Feb/21 Updated: 29/Oct/23 Resolved: 04/Apr/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency, Replication |
| Affects Version/s: | 4.0.0, 4.0.1, 3.6.20, 3.6.22 |
| Fix Version/s: | 3.6.24 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | James MacMahon | Assignee: | Vishnu Kaushik |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | I've pushed the scripts I used to reproduce this issue here: https://github.com/jmpesp/mongo_3.6.20_concurrency_bug_repro. All that is required is to perform a replSetReconfig during an operation of any type (the linked repo uses a simple UPDATE). |
| Sprint: | Repl 2021-03-22, Repl 2021-04-05 |
| Participants: |
| Description |
|
The following conditions create a situation where an operation hangs indefinitely:

1. During a replSetReconfig, the variable _currentCommittedSnapshot is set to boost::none in ReplicationCoordinatorImpl::_dropAllSnapshots_inlock.

The bug occurs when the thread performing the replSetReconfig removes a ThreadWaiter from the waiter list because _doneWaitingForReplication_inlock returns true, and that same thread then sets _currentCommittedSnapshot to boost::none. As a result, the operation thread's own call to _doneWaitingForReplication_inlock returns false. By that point the ThreadWaiter has already been removed from the list of waiters, so it can never be signaled again.

I've detailed this problem in a post on my company's engineering blog: https://engineering.vena.io/2021/02/19/what-to-do-when-mongo-3-6-wont-return-your-calls/

I'm opening this against 3.6.20 because the support policy (https://www.mongodb.com/support-policy) shows support extending until April 2021.

I've tested the following patch, and it seems to fix it:
But I also believe that cherry-picking https://github.com/mongodb/mongo/commit/fe1b92cee5c133e82845ffbd31b25ab5b66084d3 would fix this issue as well (note that I haven't tested this). I was able to reproduce this issue on versions 4.0.0 and 4.0.1, but not on 4.0.2, and that commit lands between 4.0.1 and 4.0.2. |
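The sequence of events described above can be modelled with a small single-threaded C++ sketch. It is neither the patch mentioned in this ticket nor MongoDB source code. `Waiter`, `Coordinator`, `wakeReadyWaiters`, and `simulateReconfigRace` are hypothetical stand-ins named only to mirror the identifiers in the description. The interleaving is replayed step by step rather than with real threads, and it assumes the operation thread re-checks its condition after being woken.

```cpp
#include <algorithm>
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-in for a ThreadWaiter; not the MongoDB type.
struct Waiter {
    long long opTime;       // optime the operation is waiting to see committed
    bool signaled = false;  // set once the waiter has been notified
};

// Hypothetical, heavily simplified stand-in for ReplicationCoordinatorImpl.
struct Coordinator {
    std::optional<long long> currentCommittedSnapshot;
    std::vector<Waiter*> waiters;

    // Analogue of _doneWaitingForReplication_inlock: satisfied only if a
    // committed snapshot exists and has reached the waiter's optime.
    bool doneWaiting(const Waiter& w) const {
        return currentCommittedSnapshot && *currentCommittedSnapshot >= w.opTime;
    }

    // Signal and remove every waiter whose condition currently holds.
    void wakeReadyWaiters() {
        for (auto it = waiters.begin(); it != waiters.end();) {
            if (doneWaiting(**it)) {
                (*it)->signaled = true;
                it = waiters.erase(it);
            } else {
                ++it;
            }
        }
    }

    // Analogue of _dropAllSnapshots_inlock: clears the committed snapshot.
    void dropAllSnapshots() { currentCommittedSnapshot.reset(); }
};

// Replays the interleaving from the description. Returns true if the
// operation ends up both unsatisfied and unregistered, i.e. stuck.
bool simulateReconfigRace() {
    Coordinator coord;
    coord.currentCommittedSnapshot = 10;

    Waiter op{5};
    coord.waiters.push_back(&op);

    // Reconfig thread: the waiter looks done, so it is signaled and removed...
    coord.wakeReadyWaiters();
    // ...and then the same thread drops the committed snapshot.
    coord.dropAllSnapshots();

    // Operation thread: wakes up and re-checks its own condition.
    bool done = coord.doneWaiting(op);
    bool stillRegistered =
        std::find(coord.waiters.begin(), coord.waiters.end(), &op) != coord.waiters.end();

    // Not done and no longer in the waiter list: nothing can wake it again.
    return op.signaled && !done && !stillRegistered;
}
```

Calling `simulateReconfigRace()` returns `true` in this model: the waiter was signaled once, its condition no longer holds, and it is absent from the waiter list, which corresponds to the indefinite hang reported here.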
| Comments |
| Comment by Vishnu Kaushik [ 04/Apr/21 ] |
|
Thank you for the bug report! The issue has been resolved by backporting this commit. |