[SERVER-44250] startIndexBuild oplog write and thread pool scheduling are not serialized between concurrent threads on primaries Created: 25/Oct/19  Updated: 29/Oct/23  Resolved: 13/Nov/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.3.2

Type: Bug Priority: Major - P3
Reporter: Louis Williams Assignee: Louis Williams
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-43692 enable two phase index builds by default Closed
Related
related to SERVER-45262 make IndexBuildsCoordinator thread po... Closed
related to SERVER-74953 Explore avoiding stepdowns during the... Closed
is related to SERVER-44609 Replicate startIndexBuild oplog entry... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution Team 2019-11-18
Participants:
Linked BF Score: 13

 Description   

Secondaries serialize all oplog commands, which means that the two steps in startIndexBuild, 1) writing the "startIndexBuild" oplog entry and 2) scheduling the task on the thread pool, cannot race with other threads doing the same thing.

On primaries, however, these two operations are not protected from running concurrently, so two threads can interleave them. This can lead to the situation described below, where the thread pool size is only 1:

  • Start and replicate a "startIndexBuild" oplog entry for index A
    • The secondary starts building index A
  • Start and replicate a "startIndexBuild" oplog entry for index B
  • Schedule index build B on the thread pool on the primary
    • The primary starts building index B
  • Queue up index build A on the primary because all threads are in use, and block.
  • Commit and replicate "commitIndexBuild" for index B
  • The secondary attempts to apply this oplog entry and blocks because index B has not started
    • Index B cannot start until index A commits
    • Index A cannot commit until it replicates the commitIndexBuild oplog entry, leading to a deadlock scenario.
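
The interleaving above can be replayed deterministically. The following is a minimal Python sketch, not server code: `replay`, the step names, and the two lists are hypothetical stand-ins for the oplog and the thread pool queue. It only shows that, without serialization, the oplog order (which secondaries follow) and the primary's scheduling order can disagree.

```python
def replay(steps):
    """Replay (index, action) steps and return (oplog order, pool queue order)."""
    oplog, pool_queue = [], []
    for index_name, action in steps:
        if action == "write_oplog":
            # Stands in for replicating the "startIndexBuild" oplog entry.
            oplog.append(index_name)
        elif action == "schedule":
            # Stands in for scheduling the build on the primary's thread pool.
            pool_queue.append(index_name)
        else:
            raise ValueError(f"unknown action: {action}")
    return oplog, pool_queue


# Two primary threads interleave: A writes its oplog entry first,
# but B reaches the thread pool first.
racy_steps = [
    ("A", "write_oplog"),
    ("B", "write_oplog"),
    ("B", "schedule"),
    ("A", "schedule"),
]

oplog_order, schedule_order = replay(racy_steps)
print("oplog order:   ", oplog_order)
print("schedule order:", schedule_order)
```

With a pool of size 1, the secondary runs A first (oplog order) while the primary runs B first (scheduling order), which is the precondition for the deadlock described above.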

 

The following original description does not accurately describe the full problem:

We limit the maximum number of index build worker threads to 10, but there is no high-level restriction on the number of active index build threads.

This is problematic for secondaries in the following scenario:

  • Start, but do not commit 10 index builds on the primary, replicating 10 "startIndexBuild" oplog entries and starting 10 worker threads.
  • Start and commit an 11th index build on the primary, replicating a "startIndexBuild" and "commitIndexBuild" oplog entry.
    • Because there are already 10 index builds active on the secondary, this index build will queue up in "_pendingTasks", but it will not start.
  • Replication of the "commitIndexBuild" oplog entry will wait for the 11th index build's thread to join, blocking until it does.
    • This in turn blocks other "commitIndexBuild" oplog entries from joining other index build threads, causing this hang.
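
The queueing behavior in this scenario can be modeled with a small Python sketch. `BoundedPool` and its methods are hypothetical and only approximate a fixed-size pool with a pending queue (the `_pendingTasks` mentioned above); the real thread pool is not shown here.

```python
from collections import deque


class BoundedPool:
    """Toy model of a fixed-size worker pool with a FIFO pending queue."""

    def __init__(self, max_workers):
        self.max_workers = max_workers
        self.active = []        # builds that currently own a worker thread
        self.pending = deque()  # builds waiting for a free worker

    def schedule(self, build):
        if len(self.active) < self.max_workers:
            self.active.append(build)
            return "started"
        self.pending.append(build)
        return "queued"

    def finish(self, build):
        # A worker frees up only when an active build finishes.
        self.active.remove(build)
        if self.pending:
            self.active.append(self.pending.popleft())


secondary = BoundedPool(max_workers=10)
# Eleven "startIndexBuild" entries arrive; the first ten are never committed.
results = [secondary.schedule(f"build-{i}") for i in range(1, 12)]

# Applying "commitIndexBuild" for build-11 needs its thread, which never started.
commit_blocked = "build-11" not in secondary.active
print(results[-1], commit_blocked)
```

In this model, build-11 can only start after one of the first ten builds finishes, but their "commitIndexBuild" entries sit behind the blocked one in the oplog, so nothing frees a worker.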

We should do one of the following:

  • Limit the maximum number of active index builds allowed on the primary
    • This should be the same as the maximum number of worker threads. We would enforce this by either returning an error to the user or blocking until resources are available. This would prevent the problem on secondaries only as long as the limits are identical.
  • Do not limit the maximum number of index build worker threads


 Comments   
Comment by Githook User [ 13/Nov/19 ]

Author:

{'username': 'louiswilliams', 'email': 'louis.williams@mongodb.com', 'name': 'Louis Williams'}

Message: SERVER-44250 serialize startIndexBuild oplog write and thread pool scheduling between concurrent threads on primaries
Branch: master
https://github.com/mongodb/mongo/commit/5074799696cdff95f46b81f054f04b2a55a1e2bc

Comment by Louis Williams [ 13/Nov/19 ]

We're going to use a mutex for now to enable test coverage. It behaves correctly, but it depends on thread pool behavior that is subject to change in the future. The plan is to follow up with SERVER-44609 to implement the solution that does not depend on the thread pool's internal queueing.

Comment by Louis Williams [ 08/Nov/19 ]

There are two ways I see of fixing this bug:

  1. Use a mutex to protect index build initialization (i.e. replicating "startIndexBuild") and thread pool scheduling
  2. Move index initialization into the builder thread. It seems like these two were intentionally separated by SERVER-39369, so we may want to explore if it's possible to put them back together.
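
Option 1 can be sketched as follows. This is an illustrative Python model under stated assumptions, not the actual C++ change: `IndexBuildStarter`, `start_index_build`, and `run_concurrent_starts` are hypothetical names, and the two lists stand in for the oplog and the thread pool's FIFO queue.

```python
import threading


class IndexBuildStarter:
    """Holds one mutex across the oplog write and the thread pool scheduling."""

    def __init__(self):
        self._mutex = threading.Lock()
        self.oplog = []       # stand-in for replicated oplog entries
        self.pool_queue = []  # stand-in for the thread pool's FIFO queue

    def start_index_build(self, name):
        # Both steps run under the same lock, so no other thread can
        # slip its oplog write or its scheduling in between them.
        with self._mutex:
            self.oplog.append(("startIndexBuild", name))
            self.pool_queue.append(name)


def run_concurrent_starts(count):
    starter = IndexBuildStarter()
    threads = [
        threading.Thread(target=starter.start_index_build, args=(f"index_{i}",))
        for i in range(count)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return starter


starter = run_concurrent_starts(8)
oplog_names = [name for _, name in starter.oplog]
print(oplog_names == starter.pool_queue)
```

Whatever order the threads acquire the lock in, the oplog order and the scheduling order come out identical. As noted in the later comment, this still relies on the pool running tasks in the order they were queued.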