Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-57988

Unfortunate index selection leads to too many index/document scans

    • ALL
    • QO 2021-09-20, QO 2021-10-04, QO 2021-10-18

      We have a query:

       

      {
         "$match":{
            "$and":[
               {
                  "$or":[
                     {
                        "RecordId":{
                           "$ne":"Some_RecordId_To_Remove"
                        }
                     },
                     {
                        "DataSource":{
                           "$ne":"Some_DataSource_To_Remove"
                        }
                     }
                  ]
               },
               {
                  "$or":[
                     {
                        "DataSource":{
                           "$not":"/.*Base_DataSource.*/i"
                        }
                     },
                     {
                        "RecordId":"Some_RecordId_To_Add",
                        "DataSource":"Some_DataSource_To_Add"
                     }
                  ]
               },
               {
                  "IsLatestRevision":true
               },
               {
                  "IsDeleted":false
               },
               {
                  "DeliveryId":"60d05da8db9d210104f31a6b"
               }
            ]
         }
      }

       

      Let's ignore the mind-boggling logic on the top part of the filter - the important thing to note is that there is a restriction on the "DeliveryId" field. Why is that of interest?

      We run this query against a collection that contains ~29m documents which are semantically split into individual "deliveries" using this "DeliveryId" field. We're talking roughly 20 deliveries or so, each of which consists of either ~6m documents or a substantially smaller number of documents, e.g. 150k.

      There is a bunch of indexes on the collection, too, one of which looks like

      { *DeliveryId: 1*, RecordId: 1, DataSource: 1, Revision: 1 }

      and, as per my understanding, should be helpful to avoid full scans of all documents whenever a query filters by "DeliveryId".

      What we are seeing in the logs sometimes is the below, however:

       

      [...]
      originatingCommand: { aggregate: "CollXyz", pipeline: [ { <THE ABOVE $MATCH STAGE> }, { $project: { _id: 1 } } ], allowDiskUse: true, cursor: {}, $db: "DbXyz", lsid: { id: UUID("631bb48c-991e-41ad-8092-8564b72d694e") } } planSummary: IXSCAN { DeliveryId: 1, RecordId: 1, DataSource: 1, Revision: 1 }, IXSCAN { RecordId: 1, DataSource: 1 } cursorid:4730283530601619740 keysExamined:28986788 docsExamined:28849489 cursorExhausted:1 numYields:241934 nreturned:137197 reslen:4004933 locks:{ ReplicationStateTransition: { acquireCount: { w: 241942 } }, Global: { acquireCount: { r: 241942 } }, Database: { acquireCount: { r: 241942 } }, Collection: { acquireCount: { r: 241941 } }, Mutex: { acquireCount: { r: 8 } } } storage:{ data: { bytesRead: 211830295502, timeReadingMicros: 802571385 } } protocol:op_msg 1467513ms

       

      So, every now and again, MongoDB appears to choose to use two indexes:

      • { *DeliveryId: 1*, _RecordId: 1, DataSource: 1_, Revision: 1 }
      • { _RecordId: 1, DataSource: 1_ }

      The things that I would claim are dubious here are

      • that we end up with keysExamined:28986788 and docsExamined:28849489 (so the whole collection gets read!!!)
      • but only nreturned:137197 which is basically just the effect of the "DeliveryId" filter part.

      I also wonder why the second index gets considered in the first place as the

      { _RecordId: 1, DataSource: 1_ }

      combination is already contained in the first index...

      Anyway, it would appear that once MongoDB decides to use a different index that does not contain the most important field, we end up somehow scanning the whole collection...?

       

      Some additional business background:

      It's probably worth noting that the filter on the top

       

                  "$or":[
                     {
                        "RecordId":{
                           "$ne":"Some_RecordId_To_Remove"
                        }
                     },
                     {
                        "DataSource":{
                           "$ne":"Some_DataSource_To_Remove"
                        }
                     }
                  ]

       

      is a filter of very limited use as it won't remove any documents from the result set at all in a lot of cases. The above query gets generated based on user input and unless the user specifies some meaningful values for either "RecordId" or "DataSource", it will match on any document in the entire collection. The intention here is to allow a user to remove individual documents from the result based on their "RecordId" and/or "DataSource".

      Also, the next part

       

                  "$or":[
                     {
                        "DataSource":{
                           "$not":"/.*Base_DataSource.*/i"
                        }
                     },
                     {
                        "RecordId":"Some_RecordId_To_Add",
                        "DataSource":"Some_DataSource_To_Add"
                     }
                  ]
      

       

      is an unusual filter as its purpose is to allow for a user to specify some values for "RecordId" and/or "DataSource" to add some records to a base set ("DataSource":{ "$not":"/.Base_DataSource./i" }) that shall remain included.

       

            Assignee:
            jacob.evans@mongodb.com Jacob Evans
            Reporter:
            daniel.hegener@gmx.net Daniel Hegener
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

              Created:
              Updated:
              Resolved: