[SERVER-32034] Replica Set primary becomes unresponsive with adaptive Service Executor Created: 19/Nov/17 Updated: 30/Oct/23 Resolved: 05/Jan/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking, Stability |
| Affects Version/s: | 3.6.0-rc4 |
| Fix Version/s: | 3.6.4, 3.7.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Josef Ahmad | Assignee: | Jonathan Reams |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | load.java, logs, diagnostic data, application output |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v3.6 |
| Steps To Reproduce: | |
| Sprint: | Platforms 2017-12-18, Platforms 2018-01-01, Platforms 2018-01-15 |
| Participants: | |
| Linked BF Score: | 0 |
| Description |
Reproduced on version r3.6.0-rc4-41-ge608b8b349. On a three-node replica set, I generated traffic via the attached load.java. (Note: the original load.java comes from
The above runs successfully with the default (synchronous) settings. For the test I used four AWS EC2 m4.4xlarge instances: one per replica set member, with the fourth running the traffic generator. The attached snapshot shows that the update workload starts at point A, after which performance degrades as the number of connections increases. From the log of the secondary that transitions to primary:
Attached: logs, diagnostic data, and application output.
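For context, the adaptive service executor under test is opt-in; the 3.6 option name is net.serviceExecutor, and the default is synchronous. A minimal sketch of a mongod configuration enabling it (the replica set name here is illustrative, not from this ticket) looks like:

```yaml
# Minimal sketch: enable the (experimental, 3.6) adaptive service executor.
net:
  port: 27017
  serviceExecutor: adaptive
replication:
  replSetName: rs0   # hypothetical replica set name for illustration
```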
| Comments |
| Comment by Githook User [ 01/Mar/18 ] |
Author: Jonathan Reams (jbreams) &lt;jbreams@mongodb.com&gt;
Message: (cherry picked from commit d65dde869399fe13d440be986209517ceea9efa3)
| Comment by Githook User [ 05/Jan/18 ] |
Author: Jonathan Reams (jbreams) &lt;jbreams@mongodb.com&gt;
Message:
| Comment by Githook User [ 05/Jan/18 ] |
Author: Jonathan Reams (jbreams) &lt;jbreams@mongodb.com&gt;
Message:
| Comment by Jonathan Reams [ 18/Dec/17 ] |
I believe what's going on here is that, under heavy load, pending threads are preventing the starvation avoidance in the executor controller from ever kicking in. In addition, spurious wakeups were preventing stuck-thread detection from running at its required interval. As a result, we can't spawn threads quickly enough to deal with the many long-running tasks in a majority-write load test.

In an unsuccessful run, we see enough threads started to handle the insert load, followed by rapid thread starvation while trying to handle the update load; we also see the number of pending threads spike just before the failure. In a successful run, we see many more threads started by starvation avoidance, no pending threads at all, and far fewer stuck threads during the update portion of the test than during the insert portion.
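To illustrate the spurious-wakeup problem described above, here is a minimal sketch (hypothetical names throughout; this is not the actual ServiceExecutorAdaptive code) of a periodic controller loop that computes an absolute deadline once per cycle and re-enters the wait on any early wakeup, so a spurious wakeup can neither skip the stuck-thread check nor push it off its interval:

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of a controller that must run a periodic
// stuck-thread check at a fixed interval despite spurious wakeups.
class ExecutorController {
public:
    explicit ExecutorController(std::chrono::milliseconds interval)
        : _checkInterval(interval) {}

    void run() {
        std::unique_lock<std::mutex> lk(_mutex);
        while (!_shutdown) {
            auto deadline = std::chrono::steady_clock::now() + _checkInterval;
            // Re-enter the wait on any wakeup before the deadline; a
            // spurious wakeup therefore cannot skip or delay the check.
            while (!_shutdown &&
                   _cv.wait_until(lk, deadline) != std::cv_status::timeout) {
            }
            if (!_shutdown)
                _checkForStuckThreads();
        }
    }

    void shutdown() {
        {
            std::lock_guard<std::mutex> lg(_mutex);
            _shutdown = true;
        }
        _cv.notify_all();
    }

private:
    void _checkForStuckThreads() {
        // Placeholder: inspect worker threads and spawn replacements for
        // tasks running longer than the stuck threshold.
    }

    std::mutex _mutex;
    std::condition_variable _cv;
    bool _shutdown{false};
    const std::chrono::milliseconds _checkInterval;
};
```

The key design choice in the sketch is waiting on an absolute deadline with wait_until inside a loop, rather than wait_for with a relative timeout: an early wakeup re-waits for the remainder of the same interval instead of restarting the timer, which is one way to keep a periodic check running on schedule.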