Core Server / SERVER-95717

Make removeShardFromZone resilient to config server replica set transitions

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Minor - P4
    • Affects Version/s: 5.0.0, 6.0.0, 7.0.0, 8.0.0, 8.1.0-rc0
    • Component/s: None
    • Assigned Teams: Catalog and Routing
    • Operating System: ALL
    • Sprint: CAR Team 2024-10-28

      An unlikely sequence of config server stepdowns and concurrent operations during the execution of the removeShardFromZone command can leave the cluster in an inconsistent state: a zone that is referenced by a key range but has no shards associated with it. Normally, removeShardFromZone refuses to remove the last shard from a zone that is still in use and fails with a ZoneStillInUse error.
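
      The underlying failure is a check-then-commit race: removeShardFromZone first verifies that the zone would not be left in use without any shard (the check that normally produces ZoneStillInUse), and only then commits the removal. The stepdown sequence below opens a window between the two phases in which updateZoneKeyRange can slip in. The following is a minimal, self-contained model of that race; every name in it is illustrative, and none of it is the server's actual code:

      // Self-contained model of the check-then-commit race (illustrative only).
      #include <chrono>
      #include <iostream>
      #include <mutex>
      #include <set>
      #include <string>
      #include <thread>

      std::mutex catalogMutex;                        // stands in for the config catalog
      std::set<std::string> zoneKeyRanges;            // zones referenced by a key range (config.tags)
      std::set<std::string> shardsInZone{"shard01"};  // shards tagged "NYC" (config.shards)

      bool zoneStillInUse() {  // the "check" phase of removeShardFromZone
          std::lock_guard lk(catalogMutex);
          return shardsInZone.size() == 1 && zoneKeyRanges.count("NYC") > 0;
      }

      void commitRemoveShard() {  // the "commit" phase
          std::lock_guard lk(catalogMutex);
          shardsInZone.erase("shard01");
      }

      int main() {
          std::thread removeShardFromZone([] {
              if (zoneStillInUse()) {
                  std::cout << "ZoneStillInUse\n";
                  return;
              }
              // hangRemoveShardFromZoneBeforeCommit pauses here (steps 4-8).
              std::this_thread::sleep_for(std::chrono::milliseconds(100));
              commitRemoveShard();  // commits a decision based on a stale check
          });
          std::thread updateZoneKeyRange([] {  // runs while the other command hangs (step 7)
              std::this_thread::sleep_for(std::chrono::milliseconds(50));
              std::lock_guard lk(catalogMutex);
              zoneKeyRanges.insert("NYC");
          });
          removeShardFromZone.join();
          updateZoneKeyRange.join();
          // Prints "shards in NYC: 0, key ranges: 1" -- the orphaned zone.
          std::cout << "shards in NYC: " << shardsInZone.size()
                    << ", key ranges: " << zoneKeyRanges.size() << "\n";
      }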

      Reproduction steps:

      1. Add a fail point hangRemoveShardFromZoneBeforeCommit to the removeShardFromZone implementation, immediately before the command commits its update to config.shards (see the sketch after this list).
      2. Start a sharded cluster with at least two nodes in the config server replica set and --setParameter enableTestCommands=1.
      3. On the mongos router, add a shard to a zone: sh.addShardToZone("shard01", "NYC")
      4. On the CSRS primary, enable the fail point: db.adminCommand({configureFailPoint: 'hangRemoveShardFromZoneBeforeCommit', mode: "alwaysOn"})
      5. On the mongos router, remove the shard from the zone: sh.removeShardFromZone("shard01", "NYC"). The command will hang.
      6. Step down the CSRS primary: rs.stepDown()
      7. From the mongos router (in another shell), associate the zone with a key range: sh.updateZoneKeyRange("records.users", { zipcode: "10001" }, { zipcode: "10281" }, "NYC")
      8. Step down the CSRS primary (repeatedly, if necessary) until the original config server primary is elected primary again: rs.stepDown()
      9. From the CSRS primary, disable the fail point: db.adminCommand({configureFailPoint: 'hangRemoveShardFromZoneBeforeCommit', mode: "off"})
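
      The fail point from step 1 would sit between the zone-in-use check and the catalog write. Assuming the server's standard FailPoint machinery (MONGO_FAIL_POINT_DEFINE and pauseWhileSet), a sketch of the patch follows; the function name and signature are illustrative, not the actual source:

      // Illustrative patch sketch -- not the actual server source.
      #include "mongo/util/fail_point.h"

      namespace mongo {

      // Declared once; enabled at runtime via configureFailPoint (step 4).
      MONGO_FAIL_POINT_DEFINE(hangRemoveShardFromZoneBeforeCommit);

      void removeShardFromZoneImpl(OperationContext* opCtx,
                                   const std::string& shardName,
                                   const std::string& zoneName) {
          // Check phase: verify that removing this shard would not leave the
          // zone referenced by a key range with no shard (ZoneStillInUse).

          // Hang here while the fail point is enabled: after the check but
          // before the update to config.shards is committed.
          hangRemoveShardFromZoneBeforeCommit.pauseWhileSet(opCtx);

          // Commit phase: pull the tag from the shard's config.shards document.
      }

      }  // namespace mongo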


      At this point, the hung removeShardFromZone command completes successfully and the zone is orphaned: config.tags still holds a key range for NYC, but no document in config.shards carries the NYC tag.

      > db.getSiblingDB("config").tags.find()
      { "_id" : ObjectId("67082f58eee0b05d5418e785"), "ns" : "records.users", "min" : { "zipcode" : "10001" }, "max" : { "zipcode" : "10281" }, "tag" : "NYC" }
      > db.getSiblingDB("config").shards.find()
      { "_id" : "shard01", "host" : "shard01/localhost:27018", "state" : 1, "topologyTime" : Timestamp(1728589374, 13), "replSetConfigVersion" : NumberLong(-1), "tags" : [ ] }

      A possible fix is to add a call to opCtx->setAlwaysInterruptAtStepDownOrUp_UNSAFE() at the beginning of the command, so that a config server stepdown interrupts the operation instead of allowing it to commit a stale check later.
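
      Under the same caveats (only the setAlwaysInterruptAtStepDownOrUp_UNSAFE() call comes from this ticket; the surrounding function is illustrative), the fix would look roughly like:

      // Illustrative fix sketch -- only the marked call is the ticket's suggestion.
      void removeShardFromZoneImpl(OperationContext* opCtx,
                                   const std::string& shardName,
                                   const std::string& zoneName) {
          // Proposed fix: interrupt this operation on any config server
          // stepdown or stepup. The hung command in the repro would then be
          // killed at step 6 instead of committing its stale zone-in-use
          // check after the original primary is re-elected.
          opCtx->setAlwaysInterruptAtStepDownOrUp_UNSAFE();

          // ... zone-in-use check, fail point, and commit as sketched above ...
      }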

            Assignee: Joan Bruguera Micó (joan.mico@mongodb.com)
            Reporter: Joan Bruguera Micó (joan.mico@mongodb.com)
            Votes: 0
            Watchers: 2
