Spark Connector / SPARK-210

When inferring the schema make the pool size limitable

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • 2.1.4, 2.2.5, 2.3.1, 2.4.0
    • Affects Version/s: None
    • Component/s: None
    • Labels: None

      The sample pool size is currently hardcoded to 10,000 for MongoDB < 3.2, but for newer versions schema inference can sample from the whole collection. For large collections this is slow and inefficient. Allow a limit to be applied before sampling the data, and make it configurable so users can further reduce the cost of schema inference.
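
      A minimal sketch of the intended usage, assuming the new limit is exposed as a read configuration option (the option names "samplePoolSize" and "sampleSize" and their values are illustrative, not confirmed by this ticket):

          import com.mongodb.spark.MongoSpark
          import com.mongodb.spark.config.ReadConfig
          import org.apache.spark.sql.SparkSession

          object SchemaInferenceExample {
            def main(args: Array[String]): Unit = {
              val spark = SparkSession.builder()
                .master("local[*]")
                .appName("SamplePoolSizeExample")
                .config("spark.mongodb.input.uri", "mongodb://localhost/test.largeCollection")
                .getOrCreate()

              // Cap the pool of documents considered during schema inference.
              // The option names below are assumptions based on this ticket's intent;
              // check the released ReadConfig documentation for the final names and defaults.
              val readConfig = ReadConfig(
                Map(
                  "samplePoolSize" -> "10000", // documents the sample is drawn from
                  "sampleSize"     -> "1000"   // documents actually sampled for inference
                ),
                Some(ReadConfig(spark))
              )

              val df = MongoSpark.load(spark, readConfig)
              df.printSchema()

              spark.stop()
            }
          }

      Keeping the pool small bounds the work done by the sampling stage, which matters most when the collection is large and only a coarse schema is needed.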

      Behavior
      $sample uses one of two methods to obtain N random documents, depending on the size of the collection, the size of N, and $sample’s position in the pipeline.

      If all the following conditions are met, $sample uses a pseudo-random cursor to select documents:

      • $sample is the first stage of the pipeline
      • N is less than 5% of the total documents in the collection
      • The collection contains more than 100 documents

      If any of the above conditions are NOT met, $sample performs a collection scan followed by a random sort to select N documents. In this case, the $sample stage is subject to the sort memory restrictions.
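
      A short sketch of how a capped pool could interact with the pipeline, using the Java driver's aggregation builders from Scala (the $limit-before-$sample shape is an assumption about how the pool limit might be applied, not a description of the connector's actual implementation):

          import com.mongodb.client.MongoClients
          import com.mongodb.client.model.Aggregates
          import scala.collection.JavaConverters._

          object SamplePoolSketch {
            def main(args: Array[String]): Unit = {
              val client = MongoClients.create("mongodb://localhost")
              val coll   = client.getDatabase("test").getCollection("largeCollection")

              val samplePoolSize = 10000 // hypothetical cap on the documents considered
              val sampleSize     = 1000  // documents actually sampled

              // With $limit first, $sample is no longer the first stage, so it falls back to
              // the scan-and-random-sort path, but only over at most samplePoolSize documents.
              val pipeline = Seq(
                Aggregates.limit(samplePoolSize),
                Aggregates.sample(sampleSize)
              ).asJava

              coll.aggregate(pipeline).iterator().asScala.take(5).foreach(println)
              client.close()
            }
          }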

            Assignee: Ross Lawley (ross@mongodb.com)
            Reporter: Ross Lawley (ross@mongodb.com)
            Votes: 0
            Watchers: 2

              Created:
              Updated:
              Resolved: