[SERVER-53376] [4.4] dbHash can live lock an aborting index build Created: 15/Dec/20  Updated: 29/Oct/23  Resolved: 06/Jan/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.4.4

Type: Bug Priority: Major - P3
Reporter: Louis Williams Assignee: Louis Williams
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-58969 [4.4] Lower dbHash and background val... Closed
is related to SERVER-53445 [4.4] impose lock acquisition timeout... Closed
is related to SERVER-57192 [4.4] Lower dbHash and background val... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution Team 2021-01-11
Participants:
Linked BF Score: 22

 Description   

dbHash is allowed to hold open storage snapshots indefinitely while waiting for collection locks. Multi-document transactions can do the same, but they are bounded by lock acquisition deadlines.

This behavior introduces a very specific live lock in 4.4:

  • dbHash opens a read snapshot using the current cluster time
  • An index build aborts. While holding an exclusive (X) collection lock, the index build attempts to set a ghost commit timestamp for its catalog write, using the same cluster time.
  • Because of an assertion in WiredTiger (WT), setting the ghost timestamp fails while an open transaction (dbHash) is reading at the same timestamp.
  • The index build retries indefinitely, waiting for the dbHash reader to finish.
  • The dbHash operation is unable to make progress because it is blocked by the X lock.
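The cycle above can be reproduced in a toy, single-threaded model. This is a sketch under stated assumptions, not MongoDB or WiredTiger code: `StorageEngine`, `try_set_commit_timestamp`, and `simulate_live_lock` are hypothetical names, the cluster time of 100 is arbitrary, and each loop iteration stands in for one retry by each of the two concurrent operations.

```python
# Toy, single-threaded model of the live lock. Hypothetical names throughout;
# this is not MongoDB or WiredTiger code.


class StorageEngine:
    """Models the WT assertion: a commit timestamp may not equal the
    timestamp of any open read snapshot."""

    def __init__(self):
        self.open_read_timestamps = set()

    def open_snapshot(self, ts):
        self.open_read_timestamps.add(ts)

    def close_snapshot(self, ts):
        self.open_read_timestamps.discard(ts)

    def try_set_commit_timestamp(self, ts):
        return ts not in self.open_read_timestamps


def simulate_live_lock(rounds):
    """Run `rounds` retry rounds; return (index_build_done, dbhash_done)."""
    engine = StorageEngine()
    cluster_time = 100  # arbitrary

    engine.open_snapshot(cluster_time)  # dbHash reads at the cluster time
    x_lock_holder = "indexBuild"        # aborting index build holds the X lock

    index_build_done = False
    dbhash_done = False
    for _ in range(rounds):
        # The index build retries the ghost commit timestamp at the same time.
        if not index_build_done and engine.try_set_commit_timestamp(cluster_time):
            index_build_done = True
            x_lock_holder = None  # abort finishes, X lock released
        # dbHash needs the collection lock before it can finish and close
        # its snapshot; with no deadline, it simply keeps waiting.
        if not dbhash_done and x_lock_holder is None:
            dbhash_done = True
            engine.close_snapshot(cluster_time)
        if index_build_done and dbhash_done:
            break
    return index_build_done, dbhash_done


print(simulate_live_lock(rounds=1000))  # (False, False): neither side progresses
```

No matter how many rounds run, each operation waits on a resource the other holds, so neither completes.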

In general, I believe we should impose a lock timeout such that dbHash cannot hold open snapshots and block indefinitely, much like we already do for multi-document transactions.
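A minimal sketch of the proposed direction, assuming a Python `threading.Lock` stands in for the collection lock and a per-attempt timeout stands in for the lock acquisition deadline. `LockTimeout`, `run_with_lock_deadline`, and the 50 ms deadline are hypothetical; the actual commit's mechanism and timeout value are not described here. Once dbHash gives up on the lock, it releases its snapshot, which lets the index build set its ghost timestamp and drop the X lock; a retried dbHash then proceeds.

```python
import threading
import time


class LockTimeout(Exception):
    """Hypothetical error raised when a lock deadline expires."""


class StorageEngine:
    """Toy rule: no commit timestamp may equal an open read snapshot's
    timestamp. Guarded by a mutex so both threads can use it."""

    def __init__(self):
        self._mutex = threading.Lock()
        self._open_read_timestamps = set()

    def open_snapshot(self, ts):
        with self._mutex:
            self._open_read_timestamps.add(ts)

    def close_snapshot(self, ts):
        with self._mutex:
            self._open_read_timestamps.discard(ts)

    def try_set_commit_timestamp(self, ts):
        with self._mutex:
            return ts not in self._open_read_timestamps


def run_with_lock_deadline(lock_timeout_secs=0.05):
    engine = StorageEngine()
    collection_lock = threading.Lock()
    cluster_time = 100  # arbitrary
    events = []

    engine.open_snapshot(cluster_time)  # dbHash's snapshot is already open
    collection_lock.acquire()           # held on behalf of the index build (X)

    def index_build():
        try:
            # Retry until no reader is open at cluster_time.
            while not engine.try_set_commit_timestamp(cluster_time):
                time.sleep(0.001)
            events.append("indexBuild: ghost timestamp set")
        finally:
            collection_lock.release()  # a threading.Lock may be released by any thread

    def db_hash():
        try:
            # Bounded wait instead of waiting forever.
            if not collection_lock.acquire(timeout=lock_timeout_secs):
                raise LockTimeout("dbHash could not acquire the collection lock")
            collection_lock.release()
            events.append("dbHash: hashed")
        except LockTimeout:
            events.append("dbHash: LockTimeout")
        finally:
            # Failing the command releases the snapshot, unblocking the index build.
            engine.close_snapshot(cluster_time)

    threads = [threading.Thread(target=index_build), threading.Thread(target=db_hash)]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout=5)

    # The caller retries dbHash; the collection lock is now free.
    if collection_lock.acquire(timeout=lock_timeout_secs):
        collection_lock.release()
        events.append("dbHash retry: hashed")
    return events


print(run_with_lock_deadline())
# ['dbHash: LockTimeout', 'indexBuild: ghost timestamp set', 'dbHash retry: hashed']
```

Bounding the lock wait turns an unbounded wait into a retryable error, the same trade-off multi-document transactions make: dbHash may occasionally fail under contention, but it can no longer pin a snapshot while blocked.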

The alternative to imposing a lock deadline would be to fix the ghost timestamping behavior, which would be a 4.4-only change. I believe changing dbHash avoids the risk of modifying 4.4 index build code that has already been removed in master. Note that adding a lock timeout to dbHash assumes the index build ghost timestamping behavior has no other consequences. The same bug also affects background validation, which will need the same lock timeout change (SERVER-53445).



 Comments   
Comment by Ian Whalen (Inactive) [ 07/Jan/21 ]

Author: Louis Williams (louis.williams@mongodb.com), via evrg-bot-webhook

Message: SERVER-53376 Impose maximum lock timeout for dbHash
Branch: v4.4
https://github.com/mongodb/mongo/commit/1034e166dd7cbe72b2a9b772b24f582fd701a14e

Comment by Louis Williams [ 06/Jan/21 ]

https://github.com/mongodb/mongo/commit/1034e166dd7cbe72b2a9b772b24f582fd701a14e

Generated at Thu Feb 08 05:30:44 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.