Type: Bug
Resolution: Fixed
Priority: Critical - P2
Fix Version/s: 5.0.11, 6.0.2, 6.1.0-rc0
Affects Version/s: None
Component/s: Sharding
Labels:
- sharding-nyc-subteam1

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:
v6.0, v5.0
Sprint:
Sharding 2022-08-08, Sharding 2022-08-22
Linked BF Score:
144
Story Points:
3
Confidence Status:
None
Work Order:
0
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

While a resharding operation is ongoing, every write to the collection being resharded is amplified to a "db.system.resharding.<uuid>" sharded collection (the temporary resharding collection). This is achieved in part by having ShardingWriteRouter::getReshardingDestinedRecipient() populate a "destinedRecipient" field in the oplog entries for inserts, updates, and deletes on the collection being resharded, so that the appropriate recipient shard can later fetch these oplog entries. To make this routing decision, ShardingWriteRouter calls CatalogCache::getCollectionRoutingInfo() rather than CatalogCache::getCollectionRoutingInfoWithRefresh(). This is safe if the primary of the donor shard hasn't changed, because that primary will already have refreshed the routing information for the temporary resharding collection earlier in the operation. However, if a new primary of the donor shard has been elected, then its routing information for the temporary resharding collection may be arbitrarily stale. Stale routing information is problematic for a couple of reasons:

- If the routing information for the temporary resharding collection says the collection is unsharded, then ShardingWriteRouter calling ChunkManager::findIntersectingChunkWithSimpleCollation() will result in a segmentation fault.
- If the routing information for the temporary resharding collection represents the chunk distribution from a prior resharding attempt, then the recipient shards may miss applying oplog entries and not end up consistent with the collection being resharded.
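
The two failure modes above can be illustrated with a small toy model. This is only a sketch and not the server's code: `RoutingTable`, `ToyCatalogCache`, and `destined_recipient` are hypothetical names, integer shard keys and the shard names are invented, and the unsharded case raises an exception where the real server would crash. It is also not a claim about how the fix was implemented; it only shows why a non-refreshing lookup can pick the wrong recipient.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

NEG_INF = float("-inf")


@dataclass
class RoutingTable:
    """Toy routing info: sorted (chunk_min, shard) pairs, or None if unsharded."""
    chunks: Optional[List[Tuple[float, str]]]

    def find_owner(self, key: float) -> str:
        if self.chunks is None:
            # Hypothetical stand-in for the crash described in the ticket.
            raise RuntimeError("routing info says the collection is unsharded")
        mins = [chunk_min for chunk_min, _ in self.chunks]
        # The owning chunk is the last one whose minimum is <= key.
        return self.chunks[bisect_right(mins, key) - 1][1]


class ToyCatalogCache:
    """Toy cache: 'authoritative' stands in for the up-to-date routing metadata."""

    def __init__(self, authoritative: Dict[str, RoutingTable],
                 cached: Dict[str, RoutingTable]) -> None:
        self._authoritative = authoritative
        self._cached = dict(cached)

    def get(self, ns: str) -> RoutingTable:
        # Returns whatever is cached, however old it is.
        return self._cached[ns]

    def get_with_refresh(self, ns: str) -> RoutingTable:
        # Replaces the cached entry with the authoritative one before returning it.
        self._cached[ns] = self._authoritative[ns]
        return self._cached[ns]


def destined_recipient(cache: ToyCatalogCache, ns: str, key: float,
                       refresh: bool) -> str:
    table = cache.get_with_refresh(ns) if refresh else cache.get(ns)
    return table.find_owner(key)


NS = "db.system.resharding.example"  # hypothetical namespace
current = RoutingTable([(NEG_INF, "recipientA"), (100, "recipientB")])
prior_attempt = RoutingTable([(NEG_INF, "recipientC")])

# A newly elected primary whose cache still holds the prior attempt's chunks.
stale_cache = ToyCatalogCache({NS: current}, {NS: prior_attempt})
wrong = destined_recipient(stale_cache, NS, 150, refresh=False)  # "recipientC"
right = destined_recipient(stale_cache, NS, 150, refresh=True)   # "recipientB"
```

In this model, the oplog entry tagged with `wrong` would be fetched by a shard that no longer owns the key's range, which mirrors the second bullet: the true owner never sees the write.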

Running the flushRouterConfig command on all mongod --shardsvr processes before re-attempting a failed resharding operation will prevent the routing information for the temporary resharding collection from being stale.
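
As a rough sketch of that workaround, the helper below issues flushRouterConfig against each process it is given. It assumes PyMongo is installed and that each shard-server process is reachable directly; the function names and the host strings in the usage note are hypothetical, and authentication and TLS options are omitted.

```python
from typing import Any, Dict, Iterable, Mapping


def flush_router_config_everywhere(clients: Mapping[str, Any]) -> Dict[str, bool]:
    """Run flushRouterConfig on each client and report whether it returned ok.

    `clients` maps a host label to an object exposing `admin.command(...)`,
    for example a pymongo.MongoClient connected directly to one mongod.
    """
    results: Dict[str, bool] = {}
    for host, client in clients.items():
        reply = client.admin.command("flushRouterConfig")
        results[host] = bool(reply.get("ok"))
    return results


def connect_directly(hosts: Iterable[str]) -> Dict[str, Any]:
    """Open one direct connection per mongod --shardsvr process."""
    # Imported lazily so the helper above can be exercised without PyMongo.
    from pymongo import MongoClient

    # directConnection=True targets that exact process instead of
    # discovering the replica set and routing to its primary.
    return {host: MongoClient(host, directConnection=True) for host in hosts}
```

Usage might look like `flush_router_config_everywhere(connect_directly(["shard0-a.example:27018", "shard0-b.example:27018"]))`, listing every member of every shard, run before re-attempting the failed resharding operation.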

Assignee: Max Hirschhorn
Reporter: Max Hirschhorn
Participants: Githook User, Max Hirschhorn
Votes: 0
Watchers: 5

Created: Aug 08 2022 12:05:15 PM UTC
Updated: Oct 29 2023 09:34:49 PM UTC
Resolved: Aug 09 2022 04:46:56 PM UTC
Confidence Status Last Update: 08/Aug/22 1:55 PM
