[SERVER-65969] Migration completion must not be signaled before releasing the ActiveMigrationRegistry
Created: 26/Apr/22  Updated: 29/Oct/23  Resolved: 17/May/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.0.0-rc8, 6.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Pierlauro Sciarelli Assignee: Paolo Polato
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.0
Sprint: Sharding EMEA 2022-05-16, Sharding EMEA 2022-05-30
Participants:

 Description   

This race condition is probably very rare, but it is worth recording because it took a lot of investigation to diagnose a failing test in a patch. It can happen if a CSRS node steps down during any test that issues two consecutive moveChunk commands on different ranges (e.g. here).

When a _shardsvrMoveRange command (moveChunk in previous versions) joins an ongoing migration, it waits for the original migration to complete. That completion, however, is signaled before the ActiveMigrationRegistry is released, so a waiter can be woken up while the registry still reports an active migration.

As a result, the following sequence of events can occur:

  1. Router sends moveChunk to CSRS node A
  2. CSRS node A sends _shardsvrMoveRange to the shard
  3. CSRS node A steps down and CSRS node B steps up
  4. Router receives an error from CSRS node A and retries the moveChunk
  5. CSRS node B sends _shardsvrMoveRange to the shard, joining the ongoing migration
  6. The ongoing migration succeeds and signals completion before releasing the ActiveMigrationRegistry
  7. [very fast] Router receives success from CSRS node B and sends a new moveChunk for a different range
  8. [very fast] CSRS node B forwards the new operation to the shard
  9. Shard replies with an error because the ActiveMigrationRegistry has not been released yet (so the test fails)
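
The ordering problem in steps 6 to 9 can be illustrated with a small single-threaded model. This is only a sketch, not the server code: apart from the name ActiveMigrationRegistry, every class, method, and parameter below is hypothetical, and a synchronous callback stands in for the "very fast" router and CSRS path that submits the next migration as soon as completion is signaled.

```python
from typing import Callable, List, Optional


class ActiveMigrationRegistry:
    """Toy registry (hypothetical API): at most one active migration."""

    def __init__(self) -> None:
        self._active_range: Optional[str] = None

    def register(self, chunk_range: str) -> bool:
        # Returns False to model the error the shard replies with in step 9.
        if self._active_range is not None:
            return False
        self._active_range = chunk_range
        return True

    def release(self) -> None:
        self._active_range = None


class Migration:
    """Toy migration that joiners can wait on (hypothetical API)."""

    def __init__(self, registry: ActiveMigrationRegistry) -> None:
        self._registry = registry
        self._waiters: List[Callable[[], None]] = []

    def join(self, on_complete: Callable[[], None]) -> None:
        self._waiters.append(on_complete)

    def _signal_completion(self) -> None:
        for waiter in self._waiters:
            waiter()

    def finish(self, release_first: bool) -> None:
        if release_first:
            # Ordering requested by this ticket: release, then signal.
            self._registry.release()
            self._signal_completion()
        else:
            # Problematic ordering: waiters run while the slot is still held.
            self._signal_completion()
            self._registry.release()


def run_scenario(release_first: bool) -> bool:
    """Return whether the follow-up migration on another range is accepted."""
    registry = ActiveMigrationRegistry()
    assert registry.register("range-1")  # steps 1-2: original migration
    migration = Migration(registry)

    outcome: List[bool] = []
    # Step 5: the retried command joins; steps 7-8: once notified, the
    # caller immediately submits a migration for a different range.
    migration.join(lambda: outcome.append(registry.register("range-2")))

    migration.finish(release_first)  # step 6
    return outcome[0]


if __name__ == "__main__":
    print("signal before release:", run_scenario(release_first=False))
    print("release before signal:", run_scenario(release_first=True))
```

In this model the follow-up migration is rejected when completion is signaled first and accepted when the registry is released first, which is the ordering the ticket title asks for.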


 Comments   
Comment by Githook User [ 29/May/22 ]

Author:

{'name': 'Paolo Polato', 'email': 'paolo.polato@mongodb.com', 'username': 'ppolato'}

Message: SERVER-65969 Fix data race in ScopedDonateChunk

(cherry picked from commit bfee7c7eaef29fef0a1ec443d0527e335c18d756)
(cherry picked from commit cb69b7f2bcfe4d7eb50da54d6e308b61ff29e6f6)
Branch: v6.0
https://github.com/mongodb/mongo/commit/491f15aaca5d9c4f8a808f6a0a449ad6499467e3

Comment by Githook User [ 17/May/22 ]

Author:

{'name': 'Paolo Polato', 'email': 'paolo.polato@mongodb.com', 'username': 'ppolato'}

Message: SERVER-65969 Fix data race in ScopedDonateChunk
Branch: master
https://github.com/mongodb/mongo/commit/bfee7c7eaef29fef0a1ec443d0527e335c18d756

Generated at Thu Feb 08 06:04:07 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.