Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
- cs-subteam1
- sharding-nyc-subteam1

Assigned Teams:

Cluster Scalability
Operating System:
ALL
Story Points:
2
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

The sampling-based split policy finds split points by running a cluster aggregate command with a $sample stage to sample N = 10 x numChunks documents. When N is large, the getMore command on each shard can take more than 10 minutes to run. Due to how getMore commands for a cluster aggregate command are sent to shards at the same time and a new set of getMore commands are not sent until all shards have responded to the previous set of getMore commands, cursors on some shards can end up getting killed during the wait. For example, consider a cluster with two shards, shard0 and shard1, where the first getMore command takes 1 minute to run on shard0 but 12 minutes to run on shard1. The second getMore command would fail on shard0 with CursorNotFound since cursors by default are reaped after they have not been used for 10 minutes, which would cause the resharding operation to fail and get retried internally but retries are likely to fail again with the same error.

The proposal here is to run the aggregate command in a logical session since by design a cursor inside a logical session just doesn't get killed until the session expires.

Assignee:: [DO NOT USE] Backlog - Cluster Scalability
Reporter:: Cheahuychou Mao
Participants:: [DO NOT USE] Backlog - Cluster Scalability, Cheahuychou Mao
Votes:: 0 Vote for this issue
Watchers:: 4 Start watching this issue

Created:: Oct 13 2023 02:25:18 AM UTC
Updated:: Dec 12 2023 05:05:05 PM UTC

Details

Description

Attachments

Forms

Activity

People

Dates