Core Server / SERVER-94141

Sporadic ReshardingCriticalSectionTimeout error

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 7.0 Required
    • Component/s: None
    • Operating System: ALL
    • Environment: Reproduced in 7.0.11 on FreeBSD 14.1 with default package set.

      The bug is that we sometimes encounter this error during resharding, after about 90 minutes. Background and details follow.

      command:

      db.adminCommand({
      	reshardCollection: "pb.work_dates",
      	key: {feedyard_id: "hashed", work_date: 1},
      	zones: [
      		{min: {feedyard_id: MinKey, work_date: MinKey},
      		 max: {feedyard_id: NumberLong("-4611686018427387904"), work_date: MinKey}, zone: "zone1"},
      		{min: {feedyard_id: NumberLong("-4611686018427387904"), work_date: MinKey},
      		 max: {feedyard_id: NumberLong("0"), work_date: MinKey}, zone: "zone2"},
      		{min: {feedyard_id: NumberLong("0"), work_date: MinKey},
      		 max: {feedyard_id: NumberLong("4611686018427387904"), work_date: MinKey}, zone: "zone3"},
      		{min: {feedyard_id: NumberLong("4611686018427387904"), work_date: MinKey},
      		 max: {feedyard_id: MaxKey, work_date: MaxKey}, zone: "zone4"},
      	],
      });
      

      output:

      {
              "ok" : 0,
              "errmsg" : "Resharding critical section timed out.",
              "code" : 342,
              "codeName" : "ReshardingCriticalSectionTimeout",
              "$clusterTime" : {
                      "clusterTime" : Timestamp(1724769403, 2245),
                      "signature" : {
                              "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                              "keyId" : NumberLong(0)
                      }
              },
              "operationTime" : Timestamp(1724769403, 2245)
      }
      

      Our current sharding scheme is suboptimal in a few critical ways, so we're planning an operation to reshard nearly all of our collections (22 in total), and just one of them hits this error. We have a test cluster that we can revert to a starting point with ZFS snapshots, which lets us run the exact same resharding process many times with controlled variations and test our project's code against the result.
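
      For reference, while these repeated runs are in flight the operation's progress can be watched with a $currentOp aggregation along these lines. This is only a sketch based on the standard resharding monitoring pipeline; nothing here is specific to us except the pb.work_dates namespace.

      db.getSiblingDB("admin").aggregate([
      	{$currentOp: {allUsers: true, localOps: false}},
      	{$match: {type: "op", "originatingCommand.reshardCollection": "pb.work_dates"}},
      ]);
      // Useful fields in the output include remainingOperationTimeEstimatedSecs,
      // documentsCopied, and oplogEntriesApplied.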

      One of our collections, called work_dates, can't be resharded by merely running reshardCollection. Doing so NEVER works and throws ReshardingCriticalSectionTimeout. However, if we drop two of the indexes (it doesn't matter which), it ALWAYS works. If we drop only one index, it SOMETIMES works, depending on which snapshot we clone from production; it never seems to depend on which index we drop. To dispel suspicions of coincidence: the "always", "sometimes", and "never" claims each come from around 10 attempts before claiming consistency.
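
      For illustration, the workaround looks roughly like this. It's only a sketch: the index key patterns below are hypothetical placeholders, since it doesn't seem to matter which two indexes we drop.

      const coll = db.getSiblingDB("pb").work_dates;
      // Drop two secondary indexes (placeholder key patterns, not our real ones).
      coll.dropIndex({pen_id: 1, work_date: 1});
      coll.dropIndex({lot_id: 1, work_date: 1});
      // Reshard with the same command as above; with two indexes gone it succeeds.
      db.adminCommand({
      	reshardCollection: "pb.work_dates",
      	key: {feedyard_id: "hashed", work_date: 1},
      	// zones: [...] as in the command above
      });
      // Recreate the dropped indexes afterwards.
      coll.createIndex({pen_id: 1, work_date: 1});
      coll.createIndex({lot_id: 1, work_date: 1});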

      Here's a brief idea of the "shape" of the database. Everything is sharded across 5 replica sets (4 shards plus the config server set), each a 3-member replica set (a primary and two secondaries, no arbiters). Currently the balancer chooses the ranges; during the resharding we specify explicit zone ranges to put exactly 1/4 of the range on each shard, as in the command above. work_dates has 6.4M documents, 262 GiB, and 7 indexes; all indexes besides the required _id are two-field compound indexes.

      All of the other collections reshard perfectly. Of those that work, one has more indexes (9) and one has more documents (281M), but none has more bytes. work_dates probably has the most bytes in indexes, which I suspect is why it's the only one to trigger ReshardingCriticalSectionTimeout.
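
      The "most bytes in indexes" suspicion can be checked with something like the following sketch, which just compares totalIndexSize() across the collections in our pb database; the output values aren't reproduced here.

      // Print total index bytes per collection to compare against work_dates.
      const pb = db.getSiblingDB("pb");
      pb.getCollectionNames().forEach(function (name) {
      	const gib = pb.getCollection(name).totalIndexSize() / (1024 * 1024 * 1024);
      	print(name + ": " + gib.toFixed(1) + " GiB of indexes");
      });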

      Although I've no idea what the error means, the fact that others reshard perfectly, and that we can make it work by dropping any two indexes, gives me confidence that it's not something weird about our configuration, but is a genuine bug.

            Assignee:
            Rhea Thorne (rhea.thorne@mongodb.com)
            Reporter:
            Leif Pedersen (leif@ofwilsoncreek.com)
            Votes:
            0
            Watchers:
            6

              Created:
              Updated:
              Resolved: