[SERVER-23408] Use a smarter batch size for small samples Created: 29/Mar/16  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Charlie Swanson Assignee: Backlog - Query Optimization
Resolution: Unresolved Votes: 3
Labels: dmd-perf, performance, query-44-grooming, storch
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-23661 $sample takes disproportionately long... Closed
is duplicated by SERVER-30342 Slow performance of $project in aggre... Closed
Related
Assigned Teams:
Query Optimization
Operating System: ALL
Sprint: Query 2017-08-21, Query 2017-09-11
Participants:
Case:

 Description   

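In cases where we can use the optimized sampling from a random cursor, the DocumentSourceCursor will still use the default batch size. This can significantly slow down aggregations with small sample sizes, because the cursor stage fills a full default-sized batch even when only a few documents are needed.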

Unfortunately, we cannot just pass down a limit to the DocumentSourceCursor, since the random cursor may give back duplicate documents, in which case we'd need to ask for more.
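
To illustrate why a plain limit is not enough, here is a minimal sketch in plain JavaScript. It is not the server's implementation: makeRandomCursor, sampleDistinct, and scriptedRng are hypothetical names, and the "random" sequence is scripted so that the duplicate is reproducible. It only shows that collecting N distinct documents from a cursor that can repeat itself may require more than N reads.

```javascript
// Hypothetical stand-in for a random cursor: each call to next() returns a
// randomly chosen document and may return the same document more than once.
function makeRandomCursor(docs, rng) {
    return {
        next() {
            return docs[Math.floor(rng() * docs.length)];
        },
    };
}

// Collect `size` distinct documents, de-duplicating by _id.
// Assumes size <= number of distinct documents; otherwise this never ends.
function sampleDistinct(cursor, size) {
    const seen = new Set();
    const result = [];
    let pulled = 0;
    while (result.length < size) {
        const doc = cursor.next();
        pulled++;
        if (!seen.has(doc._id)) {
            seen.add(doc._id);
            result.push(doc);
        }
    }
    return {result, pulled};
}

// Scripted sequence in [0, 1) used in place of a real random source.
function scriptedRng(values) {
    let i = 0;
    return () => values[i++ % values.length];
}

const docs = [{_id: 0}, {_id: 1}, {_id: 2}, {_id: 3}];
const cursor = makeRandomCursor(docs, scriptedRng([0.1, 0.1, 0.6, 0.9]));
const out = sampleDistinct(cursor, 2);
console.log(out.pulled, out.result.map((d) => d._id));
```

With this scripted sequence the second read repeats the first document, so a limit of 2 on the cursor would have produced only one distinct document. Any smarter batch size therefore has to leave room for asking the cursor for more.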



 Comments   
Comment by Louis Williams [ 06/Oct/20 ]

This problem is very easy to reproduce locally on a collection with 200 documents and a sample size of 1.

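// Populate a small collection with 200 tiny documents.
let bulk = db.test.initializeUnorderedBulkOp();
for (let i = 0; i < 200; i++) {
    bulk.insert({_id: i});
}
bulk.execute();
// Sample a single document and exhaust the cursor.
db.test.aggregate([{$sample: {size: 1}}]).itcount();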

This aggregation takes 900 ms locally and reads about 660 KB of data (docBytesRead: 662274) to return 1 document, as the slow query log entry shows:

{
  "type": "command",
  "ns": "test.test",
  "appName": "MongoDB Shell",
  "command": {
    "aggregate": "test",
    "pipeline": [
      {
        "$sample": {
          "size": 1.0
        }
      }
    ],
    "cursor": {},
    "lsid": {
      "id": {
        "$uuid": "89f59c83-8666-4fba-a526-f93f94ba2cbf"
      }
    },
    "$clusterTime": {
      "clusterTime": {
        "$timestamp": {
          "t": 1602003783,
          "i": 201
        }
      },
      "signature": {
        "hash": {
          "$binary": {
            "base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=",
            "subType": "0"
          }
        },
        "keyId": 0
      }
    },
    "$db": "test"
  },
  "planSummary": "MULTI_ITERATOR",
  "keysExamined": 0,
  "docsExamined": 0,
  "cursorExhausted": true,
  "numYields": 41,
  "nreturned": 1,
  "reslen": 244,
  "locks": {
    "ReplicationStateTransition": {
      "acquireCount": {
        "w": 43
      }
    },
    "Global": {
      "acquireCount": {
        "r": 43
      }
    },
    "Database": {
      "acquireCount": {
        "r": 43
      }
    },
    "Collection": {
      "acquireCount": {
        "r": 43
      }
    },
    "Mutex": {
      "acquireCount": {
        "r": 2
      }
    }
  },
  "storage": {},
  "operationMetrics": {
    "docBytesRead": 662274
  },
  "protocol": "op_msg",
  "durationMillis": 900
}
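
As a rough sanity check on the docBytesRead figure above, the arithmetic can be worked through as follows. This rests on two assumptions that the ticket does not confirm: that the shell stored each _id as a BSON double (so every {_id: i} document is 18 bytes), and that docBytesRead counts the full BSON size of every document read.

```javascript
// BSON size of {_id: <double>} under the assumptions stated above:
// int32 length prefix + type byte + "_id\0" key + 8-byte double + terminator.
const bytesPerDoc = 4 + 1 + 4 + 8 + 1;

// Value reported in the log entry.
const docBytesRead = 662274;

// Implied number of document reads if every read document is 18 bytes.
const docsRead = docBytesRead / bytesPerDoc;
console.log(bytesPerDoc, docsRead); // prints 18 36793
```

If those assumptions hold, the figure corresponds to 36,793 document reads against a 200-document collection in order to return a single document.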
