[SERVER-60073] Manual chunks cleanup intervention race with other create collection attempts Created: 20/Sep/21  Updated: 10/Jun/22  Resolved: 10/Jun/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.0.3
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Tommaso Tocci Assignee: Pierlauro Sciarelli
Resolution: Won't Fix Votes: 0
Labels: sharding-wfbf-day, testing
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Operating System: ALL
Sprint: Sharding EMEA 2021-11-15, Sharding EMEA 2021-11-29, Sharding EMEA 2021-12-13, Sharding EMEA 2021-12-27, Sharding EMEA 2022-01-10, Sharding EMEA 2022-01-24, Sharding EMEA 2022-02-07, Sharding EMEA 2022-02-21, Sharding EMEA 2022-03-07, Sharding EMEA 2022-03-21, Sharding EMEA 2022-04-04, Sharding EMEA 2022-04-18, Sharding EMEA 2022-05-02, Sharding EMEA 2022-05-16, Sharding EMEA 2022-05-30, Sharding EMEA 2022-06-13
Participants:
Linked BF Score: 0

 Description   

In jstests/concurrency/fsm_workloads/random_DDL_setFCV_operations.js we can encounter a ManualInterventionRequired error when trying to shard a collection.

This means that a previous shard collection attempt in FCV 4.4 managed to create some chunks for a collection, but crashed or stepped down before actually writing the corresponding entry in config.collections, leaving orphaned chunks in config.chunks.

When this occurs, every thread that received the ManualInterventionRequired error will attempt to remove the orphaned chunk documents directly from config.chunks and then retry sharding the collection.

Since there is no synchronization between these threads, the following interleaving is possible:

  • T1 receives ManualInterventionRequired for coll1
  • T2 receives ManualInterventionRequired for coll1
  • T1 removes orphaned chunks for coll1
  • T1 re-issues the shard collection and correctly creates coll1 with its own chunks
  • T2 removes the chunks for coll1
  • T2 re-issues the shard collection, finds that the collection is already sharded, and does nothing

So T2 will leave coll1 with a collection entry in config.collections but no chunks in config.chunks.

In this situation, every node that tries to refresh its catalog cache for coll1 will encounter a ConflictingOperationInProgress error.
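
The interleaving above can be reproduced with a small in-memory model. This is only an illustrative sketch, not the test's actual code: the helper names (makeCatalog, removeChunks, shardCollection, simulateRace) and the chunk document shape are hypothetical, and the two "threads" are simply run in the problematic order rather than truly concurrently.

```javascript
// Hypothetical in-memory stand-ins for config.collections and config.chunks.
function makeCatalog() {
    return {collections: new Set(), chunks: new Map()};
}

// Models the manual cleanup: delete every chunk document for the namespace.
function removeChunks(catalog, ns) {
    catalog.chunks.delete(ns);
}

// Models a shard collection attempt: a no-op if the collection entry already exists.
function shardCollection(catalog, ns, owner) {
    if (catalog.collections.has(ns)) {
        return "already sharded";
    }
    catalog.chunks.set(ns, [{min: "MinKey", max: "MaxKey", createdBy: owner}]);
    catalog.collections.add(ns);
    return "created";
}

function simulateRace() {
    const catalog = makeCatalog();
    const ns = "db.coll1";

    // Initial state: orphaned chunks left by a crashed attempt, no collection entry.
    catalog.chunks.set(ns, [{min: "MinKey", max: "MaxKey", createdBy: "crashed attempt"}]);

    // Both T1 and T2 have already received ManualInterventionRequired at this point.
    removeChunks(catalog, ns);                      // T1 removes the orphaned chunks
    const t1 = shardCollection(catalog, ns, "T1");  // T1 creates coll1 with its own chunks
    removeChunks(catalog, ns);                      // T2 removes T1's valid chunks
    const t2 = shardCollection(catalog, ns, "T2");  // T2 sees the entry and does nothing

    return {
        t1: t1,
        t2: t2,
        hasCollectionEntry: catalog.collections.has(ns),
        chunkCount: (catalog.chunks.get(ns) || []).length,
    };
}

console.log(simulateRace());
```

In this model the final state has a collection entry but zero chunks, which is the inconsistent state described above: T2's cleanup cannot distinguish T1's freshly created chunks from the orphaned ones it originally intended to remove.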



 Comments   
Comment by Pierlauro Sciarelli [ 10/Jun/22 ]

tommaso.tocci@mongodb.com I would propose to close this as Won't Fix (or to keep it on the backlog), as the failure only happened once in the last year and there is no easy way to work around it without rewriting the test to ensure that no threads act concurrently on the same collection (which goes against the purpose of the test itself).

Generated at Thu Feb 08 05:48:53 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.