Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-82160

Make SamplingBasedSplitPolicy run $sample aggregation in a logical session

    • Cluster Scalability
    • ALL
    • 2

      The sampling-based split policy finds split points by running a cluster aggregate command with a $sample stage to sample N = 10 x numChunks documents. When N is large, the getMore command on each shard can take more than 10 minutes to run. Due to how getMore commands for a cluster aggregate command are sent to shards at the same time and a new set of getMore commands are not sent until all shards have responded to the previous set of getMore commands, cursors on some shards can end up getting killed during the wait. For example, consider a cluster with two shards, shard0 and shard1, where the first getMore command takes 1 minute to run on shard0 but 12 minutes to run on shard1. The second getMore command would fail on shard0 with CursorNotFound since cursors by default are reaped after they have not been used for 10 minutes, which would cause the resharding operation to fail and get retried internally but retries are likely to fail again with the same error.

      The proposal here is to run the aggregate command in a logical session since by design a cursor inside a logical session just doesn't get killed until the session expires

            Assignee:
            backlog-server-cluster-scalability [DO NOT USE] Backlog - Cluster Scalability
            Reporter:
            cheahuychou.mao@mongodb.com Cheahuychou Mao
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated: