[SERVER-61097] SizeStorer can cause deadlocks with cache eviction Created: 28/Oct/21 Updated: 29/Oct/23 Resolved: 16/Nov/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.2.0, 4.4.0 |
| Fix Version/s: | 5.2.0, 5.0.6, 4.4.15, 4.2.21 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Louis Williams | Assignee: | Gregory Wlodarek |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v5.0, v4.4, v4.2 |
| Sprint: | Execution Team 2021-11-15, Execution Team 2021-11-29 |
| Participants: | |
| Linked BF Score: | 19 |
| Description |
|
Note: This ticket does not fully fix the deadlock described. A complete fix was introduced in

This is a follow-up to

There's a mutex inside the SizeStorer that serializes access to a global WT session and cursor that we keep open forever. We let multiple threads share that session and cursor, which is why the mutex is needed. In general, it's not a good idea to hold an exclusive lock while calling into the storage engine. The larger problem is that the SizeStorer uses a WT_SESSION that is not the one owned by the calling operation, which may also have its own WT_SESSION.

In practice, this has only shown up in importCollection. After the operation has performed a catalog write, it gets stuck inside SizeStorer::load(), holding this mutex and blocking on cache eviction.

WiredTiger will roll back transactions that have written data, but it will not roll back read-only transactions. WiredTiger cannot roll back the SizeStorer::load() call because the SizeStorer uses an entirely separate WT_SESSION from the one that importCollection uses. So even though importCollection has written data, it cannot be rolled back, even if it is causing cache issues.

Using more than one WT_SESSION per thread is a bug that we've seen before. We should just get rid of this global session + cursor and require that callers pass their own OperationContext. If that's not possible for some reason, we'll need to use "cache_max_wait_ms" to allow the operation to time itself out. |
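The interaction described above can be illustrated with a small toy model. This is only a sketch: it is not MongoDB or WiredTiger code, and every name in it (`ToySession`, `SharedSessionSizeStorer`, `CallerSessionSizeStorer`, `RollbackError`, the `cache_full` flag) is hypothetical. It assumes only the rule stated in the description: under cache pressure the engine can roll back a session whose transaction has written data, but a read-only session simply waits. Instead of actually blocking, the toy returns the string "would block" so the example stays deterministic.

```python
import threading
from dataclasses import dataclass


class RollbackError(Exception):
    """Raised when the toy engine rolls back a session's transaction."""


@dataclass
class ToySession:
    """Hypothetical stand-in for a storage-engine session (not the WT API)."""
    name: str
    has_written: bool = False

    def write(self) -> None:
        # Mark this session's transaction as having written data.
        self.has_written = True

    def read_under_cache_pressure(self, cache_full: bool) -> str:
        # Rule taken from the ticket: a writing transaction can be rolled
        # back to relieve cache pressure; a read-only one just waits.
        if not cache_full:
            return "ok"
        if self.has_written:
            raise RollbackError(self.name)
        return "would block"


class SharedSessionSizeStorer:
    """Problem pattern: one global read-only session behind a mutex."""

    def __init__(self) -> None:
        self._mutex = threading.Lock()
        self._session = ToySession("global")  # never writes

    def load(self, caller_session: ToySession, cache_full: bool) -> str:
        # The caller's own session is ignored, so the engine sees only a
        # read-only session and cannot roll anything back. The mutex is
        # held for the whole wait.
        with self._mutex:
            return self._session.read_under_cache_pressure(cache_full)


class CallerSessionSizeStorer:
    """Proposed direction: do the read on the caller's own session."""

    def load(self, caller_session: ToySession, cache_full: bool) -> str:
        return caller_session.read_under_cache_pressure(cache_full)


def try_load(storer, session: ToySession) -> str:
    """Run load() under simulated cache pressure and report the outcome."""
    try:
        return storer.load(session, cache_full=True)
    except RollbackError:
        return "rolled back"
```

With an operation that has already written (as importCollection has after its catalog write), the shared-session variant reports that it would block while holding the mutex, whereas the caller-session variant gives the engine a transaction it is able to roll back.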
| Comments |
| Comment by Louis Williams [ 27/Jun/22 ] |
|
Note that these commits did not fix the problem in the description regarding the SizeStorer load() function. Instead, a change was made to a different function, flush, that did not address the problem. |
| Comment by Githook User [ 19/May/22 ] |
|
Author: {'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}
Message: (cherry picked from commit 57e351357ea73450df84b905674b127508aeb064) |
| Comment by Githook User [ 19/May/22 ] |
|
Author: {'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}
Message: (cherry picked from commit 57e351357ea73450df84b905674b127508aeb064) |
| Comment by Githook User [ 15/Dec/21 ] |
|
Author: {'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}
Message: (cherry picked from commit 57e351357ea73450df84b905674b127508aeb064) |
| Comment by Githook User [ 16/Nov/21 ] |
|
Author: {'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}
Message: |