[SERVER-61097] SizeStorer can cause deadlocks with cache eviction Created: 28/Oct/21  Updated: 29/Oct/23  Resolved: 16/Nov/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.0, 4.4.0
Fix Version/s: 5.2.0, 5.0.6, 4.4.15, 4.2.21

Type: Bug Priority: Major - P3
Reporter: Louis Williams Assignee: Gregory Wlodarek
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
related to WT-8245 Fix eviction hang during importCollec... Closed
related to SERVER-61116 Audit and add assertions against usin... Backlog
related to SERVER-60334 Avoid caching the cursor and session ... Closed
is related to SERVER-67514 SizeStorer load() can get stuck in pa... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v5.0, v4.4, v4.2
Sprint: Execution Team 2021-11-15, Execution Team 2021-11-29
Participants:
Linked BF Score: 19

 Description   

Note: This ticket does not fully fix the deadlock described. A complete fix was introduced in SERVER-60334.

This is a follow-up to WT-8245.

There's a mutex inside the SizeStorer that serializes access to a global WT session and cursor that we keep open forever. Multiple threads share that session and cursor, which is why the mutex exists. In general, it's not a good idea to hold an exclusive lock while calling into the storage engine, because the call can block for an unbounded amount of time.
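
As an illustration of the general point about locks, here is a minimal sketch of swapping buffered state out under a mutex and only then making the potentially blocking call. `PendingSizes` and `persistToEngine` are hypothetical names invented for this example; this is not the real SizeStorer, and it assumes the engine call does not itself depend on state guarded by the mutex.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <string>

// Hypothetical stand-in for a storage engine call that may block
// (for example, on cache eviction). Here it only counts the entries.
inline int persistToEngine(const std::map<std::string, long long>& entries) {
    return static_cast<int>(entries.size());
}

// Hypothetical buffer of pending size updates; NOT the real SizeStorer.
class PendingSizes {
public:
    void store(const std::string& ident, long long numRecords) {
        std::lock_guard<std::mutex> lk(_mutex);
        _dirty[ident] = numRecords;
    }

    // Move the dirty entries out while holding the mutex, then call the
    // engine after the mutex has been released.
    int flush() {
        std::map<std::string, long long> toWrite;
        {
            std::lock_guard<std::mutex> lk(_mutex);
            toWrite.swap(_dirty);
        }  // The mutex is released here, before the blocking call.
        return persistToEngine(toWrite);
    }

private:
    std::mutex _mutex;
    std::map<std::string, long long> _dirty;
};
```

With this shape, a thread that stalls inside `persistToEngine` no longer prevents other threads from calling `store`.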

The larger problem is that the SizeStorer uses a WT_SESSION that is not the one owned by the calling operation, which may also have its own WT_SESSION.

In practice, this has only shown up in importCollection. After the operation has performed a catalog write, it gets stuck inside of SizeStorer::load(), holds this mutex, and blocks on cache eviction. WiredTiger will roll back transactions that have written data, but it will not roll back read-only transactions. WiredTiger cannot roll back the SizeStorer::load() because the SizeStorer uses an entirely separate WT_SESSION from the one that importCollection uses. So even though importCollection has written data, it cannot be rolled back, even if it is causing cache issues.
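
The rollback rule described above can be sketched as a toy model. `ToySession`, `canBeRolledBack`, and `engineCanUnblock` are hypothetical names for this illustration only; the model encodes just the rule stated in this ticket and does not reflect WiredTiger's actual eviction logic.

```cpp
#include <cassert>

// Toy model of a session; NOT a WiredTiger type.
struct ToySession {
    bool hasWrites;       // The session's transaction has written data.
    bool blockedOnCache;  // The thread is currently stuck in this session.
};

// Rule from the ticket: only transactions that have written data are
// candidates for rollback; read-only transactions are not.
inline bool canBeRolledBack(const ToySession& s) {
    return s.hasWrites;
}

// The engine can only unstick a thread by rolling back the session the
// thread is actually blocked in.
inline bool engineCanUnblock(const ToySession& blockedIn) {
    return blockedIn.blockedOnCache && canBeRolledBack(blockedIn);
}
```

In this model, the importCollection session has writes but is not the session that is blocked, while the SizeStorer session is blocked but read-only, so neither can be used to free the thread.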

Using more than one WT_SESSION per thread is a bug that we've seen before.

We should just get rid of this global session + cursor and require that callers pass their own OperationContext. If that's not possible for some reason, we'll need to use "cache_max_wait_ms" to allow the operation to time itself out.
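
The timeout fallback can be sketched in isolation as a bounded wait that gives up instead of blocking forever. `ToyCache`, `FlushResult`, and `flushWithTimeout` are hypothetical names for this example; the sketch does not call WiredTiger and only assumes that a timed-out operation returns an error the caller can handle, for example by releasing the mutex and retrying later.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Toy stand-in for the cache; NOT a WiredTiger type.
struct ToyCache {
    std::atomic<bool> hasSpace{false};
};

enum class FlushResult { kDone, kTimedOut };

// Wait for cache space for at most maxWait, then give up so the caller
// is never blocked indefinitely.
inline FlushResult flushWithTimeout(const ToyCache& cache,
                                    std::chrono::milliseconds maxWait) {
    const auto deadline = std::chrono::steady_clock::now() + maxWait;
    while (!cache.hasSpace.load()) {
        if (std::chrono::steady_clock::now() >= deadline) {
            return FlushResult::kTimedOut;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    return FlushResult::kDone;
}
```

A timeout bounds how long the mutex is held, but it does not remove the underlying use of a second session.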



 Comments   
Comment by Louis Williams [ 27/Jun/22 ]

Note that these commits did not fix the problem in the description regarding the SizeStorer load() function. Instead, a change was made to a different function, flush, that did not address the problem. SERVER-67514 was opened to fix the original bug in load().

Comment by Githook User [ 19/May/22 ]

Author:

{'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}

Message: SERVER-61097 SizeStorer flush uses 'operation_timeout_ms' to avoid deadlocks with cache eviction

(cherry picked from commit 57e351357ea73450df84b905674b127508aeb064)
(cherry picked from commit 7c65da785398047cb6569170bc614c99abf4ca07)
Branch: v4.2
https://github.com/mongodb/mongo/commit/88ea6629a5980694ff4f4e83781fb4ae310671c1

Comment by Githook User [ 19/May/22 ]

Author:

{'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}

Message: SERVER-61097 SizeStorer flush uses 'operation_timeout_ms' to avoid deadlocks with cache eviction

(cherry picked from commit 57e351357ea73450df84b905674b127508aeb064)
Branch: v4.4
https://github.com/mongodb/mongo/commit/e961f38055121370927ec172007cbd829ccc3014

Comment by Githook User [ 15/Dec/21 ]

Author:

{'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}

Message: SERVER-61097 SizeStorer flush uses 'operation_timeout_ms' to avoid deadlocks with cache eviction

(cherry picked from commit 57e351357ea73450df84b905674b127508aeb064)
Branch: v5.0
https://github.com/mongodb/mongo/commit/144f3b1bab6434d4610c8671d89623a612d42465

Comment by Githook User [ 16/Nov/21 ]

Author:

{'name': 'Gregory Wlodarek', 'email': 'gregory.wlodarek@mongodb.com', 'username': 'GWlodarek'}

Message: SERVER-61097 SizeStorer flush uses 'operation_timeout_ms' to avoid deadlocks with cache eviction
Branch: master
https://github.com/mongodb/mongo/commit/57e351357ea73450df84b905674b127508aeb064

Generated at Thu Feb 08 05:51:31 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.