[SERVER-76848] [false alarm] $out does not ensure the node remains primary throughout the internal rename Created: 04/May/23  Updated: 29/Oct/23  Resolved: 11/Jul/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 7.1.0-rc0, 6.0.6, 5.0.17, 4.4.21, 7.0.0-rc1
Fix Version/s: 7.1.0-rc0

Type: Bug
Priority: Major - P3
Reporter: Gil Alon
Assignee: Silvia Surroca
Resolution: Fixed
Votes: 0
Labels: shardingemea-qw
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File SERVER73316_test.js    
Issue Links:
Backports
Depends
depends on SERVER-77545 Wrap DB DDL lock acquisition under a ... Closed
Duplicate
duplicates SERVER-77545 Wrap DB DDL lock acquisition under a ... Closed
Related
is related to SERVER-76626 Investigate test failures for concurr... Closed
Tested
tested by SERVER-78852 Test movePrimary and $out running con... Closed
Assigned Teams:
Sharding EMEA
Backwards Compatibility: Fully Compatible
Backport Requested:
v7.0, v6.3, v6.0, v5.0, v4.4
Sprint: Sharding EMEA 2023-07-10, Sharding EMEA 2023-07-24, QI 2023-05-15
Participants:
Story Points: 2

 Description   

The implementation of $out created a special internal rename command (InternalRenameIfOptionsAndIndexesMatchCmd). This command implements its own locking to avoid concurrent modifications, but the implementation has an error: it calls assertIsPrimaryShardForDb, yet nothing guarantees that this node remains the primary shard for the database through the entire execution of $out. The usual pattern to ensure the node remains the primary is:

  1. Wait for ShardingDDLCoordinator service recovery.
  2. Take database DDL lock to serialize with concurrent movePrimary operations that would change the db primary shard.
  3. Check if this shard is primary for the database.
  4. Acquire additional DDL locks if needed.
  5. Execute operation while holding the locks.
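
The five steps above can be sketched as a toy model. This is not server code: ToyShard, run_ddl, move_primary and the in-memory primaries map are hypothetical names invented for illustration, and plain thread locks stand in for the DDL locks. The point it shows is that the primary check (step 3) is only meaningful because it happens while the database DDL lock (step 2) is held.

```python
import threading
from contextlib import ExitStack, contextmanager


class ToyShard:
    """Toy stand-in for a shard node; every name here is hypothetical."""

    def __init__(self, name, primaries):
        self.name = name
        self.primaries = primaries          # shared map: db name -> primary shard name
        self.recovered = threading.Event()  # stands in for DDL coordinator service recovery
        self._ddl_locks = {}                # resource name -> lock
        self._guard = threading.Lock()      # protects the _ddl_locks map itself

    @contextmanager
    def ddl_lock(self, resource):
        """Hold the (non-reentrant) DDL lock for one resource."""
        with self._guard:
            lock = self._ddl_locks.setdefault(resource, threading.Lock())
        with lock:
            yield

    def move_primary(self, db, to_shard):
        """Change the db primary; serialized with run_ddl by the db DDL lock."""
        with self.ddl_lock(db):
            self.primaries[db] = to_shard

    def run_ddl(self, db, operation, extra_resources=()):
        # 1. Wait for service recovery.
        self.recovered.wait()
        # 2. Take the database DDL lock; a concurrent move_primary now blocks.
        with self.ddl_lock(db):
            # 3. Check, under the lock, that this shard is the db primary.
            if self.primaries.get(db) != self.name:
                raise RuntimeError(f"{self.name} is not the primary shard for {db}")
            # 4. Acquire additional DDL locks in a fixed order to avoid deadlock.
            #    extra_resources must not contain db itself (locks are not reentrant).
            with ExitStack() as stack:
                for resource in sorted(extra_resources):
                    stack.enter_context(self.ddl_lock(resource))
                # 5. Execute the operation while every lock is still held.
                return operation()


primaries = {"test": "shardA"}
shard_a = ToyShard("shardA", primaries)
shard_a.recovered.set()
result = shard_a.run_ddl("test", lambda: "renamed", extra_resources=["test.out"])
```

Because move_primary takes the same database lock, it cannot change the primary between the check and the end of the operation in this model. Performing the check without the lock would leave exactly the window described in this ticket.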

However, there is an existing _shardsvrRenameCollection command that already has the correct locking mechanism and ensures that this shard is the primary shard for the database. We should either use _shardsvrRenameCollection in $out or fix $out to work with concurrent movePrimary commands. We will also need to expand our testing: the current tests don't allow $out to run in suites that kill the primary node, and we should add movePrimary commands to the current concurrency test.

This came up in SERVER-76626 during a bug investigation in which concurrent rename and shard collection commands were failing with $out writing to time-series collections.

-------------

[UPDATE - 8th of September 2023]: This is not a bug. movePrimary and the internal rename of $out are correctly serialized (here and here) through a check of the isMovePrimaryInProgress flag.
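
As a rough illustration of flag-based serialization, the toy model below has the rename check an in-progress flag under the same mutex that movePrimary uses to set it. It is a sketch under assumptions, not the server's implementation: ToyDatabaseState, MovePrimaryInProgress, begin_move_primary, end_move_primary and internal_rename are hypothetical names, and a single thread lock stands in for the database lock.

```python
import threading


class MovePrimaryInProgress(Exception):
    """Hypothetical error raised when a rename meets a running movePrimary."""


class ToyDatabaseState:
    """Toy per-database state; all names are invented for illustration."""

    def __init__(self, collections=()):
        self._mutex = threading.Lock()         # stands in for the database lock
        self.move_primary_in_progress = False  # models the isMovePrimaryInProgress flag
        self.collections = set(collections)

    def begin_move_primary(self):
        # The flag is set under the mutex, so no rename is mid-flight afterwards.
        with self._mutex:
            self.move_primary_in_progress = True

    def end_move_primary(self):
        with self._mutex:
            self.move_primary_in_progress = False

    def internal_rename(self, source, target):
        # The check and the rename happen under one mutex acquisition, so
        # movePrimary cannot start between them.
        with self._mutex:
            if self.move_primary_in_progress:
                raise MovePrimaryInProgress("movePrimary is running; retry later")
            if source not in self.collections:
                raise KeyError(source)
            self.collections.remove(source)
            self.collections.add(target)


state = ToyDatabaseState(collections={"tmp.agg_out"})
state.internal_rename("tmp.agg_out", "out")
```

In this model a rename either completes before movePrimary sets the flag or fails cleanly once the flag is set, so the two never interleave.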


Generated at Thu Feb 08 06:33:47 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.