[SERVER-56054] Change minThreads value for replication writer thread pool to 0 Created: 12/Apr/21 Updated: 29/Oct/23 Resolved: 29/Apr/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.2.15, 4.4.7, 5.0.0-rc0, 4.0.26 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Andy Schwerin | Assignee: | Lingzhi Deng |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.9, v4.4, v4.2, v4.0, v3.6 |
| Sprint: | Repl 2021-05-03 |
| Case: | (copied to CRM) |
| Linked BF Score: | 5 |
| Description |
|
So far, this behavior has only manifested on systems with glibc versions susceptible to this glibc pthread condition variable bug. While I have not been able to build a minimal reproduction using our ThreadPool type, the scenario proven to exist in this blog post about using TLA+ to model glibc condition variables is perfectly analogous to how replication uses thread pools. In this scenario, a signal delivery that is lost due to the glibc bug leads to incomplete work being left in the thread pool, and no threads waking up to perform the work.

Fortunately, a low-risk workaround exists for this bug as it manifests in the replication system's use of ThreadPool. By setting minThreads to 0 instead of its current value, which is equal to maxThreads, we ensure that all waits performed by worker threads eventually wake up due to expiration of the idle thread timeout.

The task in this ticket is to change the value of minThreads in the writer thread pool used by replication to 0. This will not eliminate all possible failures due to the glibc bug, but it will eliminate the only one we've seen in practice until such time as the bug in glibc is corrected. |
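The reasoning behind the workaround can be illustrated with a toy model. The Python sketch below is not MongoDB's ThreadPool: `ToyPool`, `schedule`, `drop_signal`, and `task_runs_despite_lost_signal` are hypothetical names, and the lost signal is simulated by skipping `notify()` rather than by triggering the glibc bug. An `idle_timeout` of `None` stands in for an untimed wait (which the description implies is what happens when minThreads equals maxThreads), and a finite value stands in for a wait bounded by the idle thread timeout (minThreads set to 0). As a further simplification, the toy worker rechecks the queue after a timeout instead of exiting.

```python
import threading
from collections import deque


class ToyPool:
    """Single-worker toy pool; a stand-in, not MongoDB's ThreadPool."""

    def __init__(self, idle_timeout):
        # None  -> untimed wait (models minThreads == maxThreads)
        # float -> wait bounded by an idle timeout (models minThreads == 0)
        self._idle_timeout = idle_timeout
        self._cond = threading.Condition()
        self._pending = deque()
        self._parked = threading.Event()
        threading.Thread(target=self._worker, daemon=True).start()

    def schedule(self, task, drop_signal=False):
        # Demo only: wait until the worker is parked so the dropped
        # notification really is the only wakeup it could have received.
        self._parked.wait()
        with self._cond:
            self._pending.append(task)
            if not drop_signal:
                self._cond.notify()

    def _worker(self):
        while True:
            with self._cond:
                while not self._pending:
                    self._parked.set()
                    # A timed wait returns on its own, so the loop
                    # condition is rechecked even without a notify().
                    self._cond.wait(self._idle_timeout)
                task = self._pending.popleft()
            task()


def task_runs_despite_lost_signal(idle_timeout, observe_for=1.0):
    """Schedule one task with its wakeup dropped; report whether it ran."""
    pool = ToyPool(idle_timeout)
    ran = threading.Event()
    pool.schedule(ran.set, drop_signal=True)
    return ran.wait(observe_for)


if __name__ == "__main__":
    print("untimed wait:", task_runs_despite_lost_signal(None, observe_for=0.3))
    print("timed wait:  ", task_runs_despite_lost_signal(0.05))
```

In this model the untimed worker never notices the queued task, while the timed worker picks it up once its wait expires. This mirrors the claim that bounding every wait by the idle timeout turns a permanent stall into a delay.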
| Comments |
| Comment by venkataramans rama [ 19/Aug/21 ] |
|
Hi Team,
I see the fix is applied in 4.0.26, but the relevant section of the 4.0 documentation has not been updated. Could you please update it so we can confidently enable this parameter in 4.0.26? Thanks, Venkataraman |
| Comment by Githook User [ 24/Jun/21 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}
Message: (cherry picked from commit 136fa52193c342038b3fa35152fa1ed3dee4ee87) |
| Comment by Githook User [ 09/Jun/21 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}
Message: (cherry picked from commit 136fa52193c342038b3fa35152fa1ed3dee4ee87) |
| Comment by Githook User [ 19/May/21 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}
Message: (cherry picked from commit 136fa52193c342038b3fa35152fa1ed3dee4ee87) |
| Comment by Andy Schwerin [ 03/May/21 ] |
|
veramasu@hcl.com, to be thorough, I checked with the head of our release management team. We will not be applying this change to the 3.6 branch because, while it is expected to be quite safe, we have very little tolerance for risk on branches that are past their supported life. As I understand it, we are unlikely to ever release 3.6.24 except to deal with certain critical security issues. |
| Comment by venkataramans rama [ 30/Apr/21 ] |
|
Thanks, Andy, for the reply; I understand the support policy. I see that version 3.6.24 already has some fixes committed internally: https://jira.mongodb.org/browse/SERVER-56126?jql=fixVersion%20%3D%203.6.24. It would be of great help if we could get one last patch on the 3.6 train with this fix. I request that you consider this. |
| Comment by Githook User [ 28/Apr/21 ] |
|
Author: {'name': 'Lingzhi Deng', 'email': 'lingzhi.deng@mongodb.com', 'username': 'ldennis'}
Message: |
| Comment by Andy Schwerin [ 27/Apr/21 ] |
|
veramasu@hcl.com, I'm afraid that per our support policy, MongoDB 3.6 is not supported past the end of this month, and there are no more scheduled releases of that product branch. I am hoping to get this fix applied to the 4.0 (BACKPORT-8822), 4.2 (BACKPORT-8821) and 4.4 (BACKPORT-8820) branches once it is tested and committed on the main development branch. |
| Comment by venkataramans rama [ 27/Apr/21 ] |
|
Thank you, Andy, for the proactive fix from MongoDB. Could you please also backport this to 3.6 and provide the image? |
| Comment by Andy Schwerin [ 12/Apr/21 ] |
|
If we implement this workaround, I believe we could also add logging after this line. We would log the length of the _pendingTasks queue at this point if the thread woke up after the deadline expired and _pendingTasks was not empty. This would sometimes log when no signal was lost, but given the length of the maxIdleThreadAge value (30 seconds by default), I believe such false positive logging events would be rare. |
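The logging check proposed in this comment can be sketched roughly as follows. This is a Python illustration, not the server's C++ code: `wait_and_check`, `pending_tasks`, `max_idle_age`, and the logger name are hypothetical, and a timed-out `Condition.wait` (return value `False`) is used as a proxy for "woke up after the deadline expired".

```python
import logging
import threading
from collections import deque

log = logging.getLogger("toy_pool")  # hypothetical logger name


def wait_and_check(cond, pending_tasks, max_idle_age):
    """Do one timed wait; the caller must already hold `cond`.

    Returns True when the wait timed out and work was nevertheless queued,
    which is the condition the comment proposes to log.
    """
    notified = cond.wait(max_idle_age)  # False means the timeout expired
    suspected = (not notified) and len(pending_tasks) > 0
    if suspected:
        log.warning(
            "idle wait expired with %d pending task(s); possible lost wakeup",
            len(pending_tasks),
        )
    return suspected


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    cond = threading.Condition()
    with cond:
        # Simulate a task that was queued without any notify() reaching us.
        wait_and_check(cond, deque(["op"]), max_idle_age=0.01)
```

As the comment notes, a check like this can also fire when no signal was lost, for example if a task is queued just as the timeout expires. With a long idle timeout such coincidences should be uncommon.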