[SERVER-55486] Design new integration test facility to use Device-mapper to reproduce disk outage Created: 24/Mar/21 Updated: 06/Dec/22 Resolved: 23/Apr/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Andrew Shuvalov (Inactive) | Assignee: | [DO NOT USE] Backlog - Sharding NYC |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | sharding-product-sync | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Sharding NYC
|
||||||||
| Participants: | |||||||||
| Description |
Background

We have a HELP ticket scenario in which a faulty disk blocked all disk I/O indefinitely, which put the process into the uninterruptible sleep state. The main problem with this state is that SIGKILL does not kill the process, because it is blocked in a syscall. The user sent `kill -9` to the primary mongod server, but it was not killed. Thirteen minutes after the SIGKILL, the user had to shut down the Amazon EC2 instance to break the hung sessions from multiple mongos proxies to the faulty primary. This happened with v4.0.

More background on why `kill -9` will not kill a process while it is in the uninterruptible sleep state:

Various tricks people use to simulate the uninterruptible sleep state:

More background on why the kernel prevents killing a process in this kind of state: and LWN article: https://lwn.net/Articles/288056/

Implementation details

The process of setting up the device mapper has multiple steps:

This procedure was already done manually and fully reproduced the production outage.

Not the same as network proxy

Please note that we already have mongobridge to simulate network errors; however, this is not the same. mongobridge cannot cause an outage in the mongod itself; it can only make the client think that mongod has an outage, which is very different from the scenario in the HELP ticket. |
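A test built on this idea would probably need to confirm that mongod has actually entered the uninterruptible sleep state before asserting anything else. The following is a minimal sketch of one way to check that on Linux, assuming the state is read from `/proc/<pid>/stat` (state `D`). The function names are hypothetical and not part of any existing test facility. Because mongod is multithreaded, the sketch inspects every thread under `/proc/<pid>/task`, since the blocked thread is not necessarily the main one.

```python
from pathlib import Path
from typing import List


def parse_proc_state(stat_line: str) -> str:
    """Return the one-letter state field from a /proc/<pid>/stat line."""
    # Field 2 (the command name) is wrapped in parentheses and may itself
    # contain spaces or ')' characters, so split after the LAST ')'.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def threads_in_uninterruptible_sleep(pid: int) -> List[int]:
    """Return the thread ids of `pid` that are currently in state 'D'."""
    blocked = []
    for task_dir in Path(f"/proc/{pid}/task").iterdir():
        try:
            state = parse_proc_state((task_dir / "stat").read_text())
        except OSError:
            # The thread exited between listing the directory and reading it.
            continue
        if state == "D":
            blocked.append(int(task_dir.name))
    return blocked
```

A test could poll `threads_in_uninterruptible_sleep(mongod_pid)` after triggering the outage and proceed once the list is non-empty.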
| Comments |
| Comment by Andrew Shuvalov (Inactive) [ 24/Mar/21 ] |
|
The trick is that a device-mapper device can be suspended, while a loopback device cannot. But device mapper cannot map directly from a file. So the topology is: file -> loopback device -> device mapper |
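The topology in this comment can be sketched as a sequence of standard `truncate`, `losetup`, and `dmsetup` invocations. This is only an illustration under stated assumptions, not the procedure that was run manually: the helper names, the backing file path, the loop device, and the mapping name are all hypothetical, and a `linear` target is assumed for the mapping. The sketch only builds the command lines; running them requires root, and a filesystem would still have to be created on `/dev/mapper/<name>` and mounted as the mongod dbpath.

```python
import shlex
import subprocess
from typing import List

SECTOR_SIZE = 512  # device-mapper tables are expressed in 512-byte sectors


def linear_table(loop_device: str, size_bytes: int) -> str:
    """Build a 'linear' table that maps the whole loop device 1:1."""
    sectors = size_bytes // SECTOR_SIZE
    return f"0 {sectors} linear {loop_device} 0"


def setup_commands(backing_file: str, size_bytes: int,
                   loop_device: str, dm_name: str) -> List[List[str]]:
    """file -> loopback device -> device mapper."""
    return [
        # 1. Create a (sparse) backing file of the requested size.
        ["truncate", "-s", str(size_bytes), backing_file],
        # 2. Attach the file to a loopback device.
        ["losetup", loop_device, backing_file],
        # 3. Map the loopback device through device mapper.
        ["dmsetup", "create", dm_name, "--table",
         linear_table(loop_device, size_bytes)],
    ]


def outage_commands(dm_name: str) -> List[List[str]]:
    """Suspend the mapping so that I/O issued to it blocks."""
    return [["dmsetup", "suspend", dm_name]]


def recovery_commands(dm_name: str) -> List[List[str]]:
    """Resume the mapping so that blocked I/O can complete."""
    return [["dmsetup", "resume", dm_name]]


def teardown_commands(loop_device: str, dm_name: str) -> List[List[str]]:
    """Remove the mapping, then detach the loopback device."""
    return [["dmsetup", "remove", dm_name], ["losetup", "-d", loop_device]]


def run_all(commands: List[List[str]]) -> None:
    """Execute the commands in order (requires root); stop on first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them; all values are examples.
    for cmd in setup_commands("/tmp/disk.img", 1024 * 1024 * 1024,
                              "/dev/loop7", "faulty_disk"):
        print(shlex.join(cmd))
```

Keeping command construction separate from `run_all` lets the sequence be reviewed or unit-tested without root privileges.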
| Comment by Andrew Shuvalov (Inactive) [ 24/Mar/21 ] |
|
References: this is related to HELP-22913. The document on mongo bridge is: |