Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Duplicate
Priority: Minor - P4
Fix Version/s: None
Affects Version/s: 4.4.2
Component/s: None
Labels:

Operating System:
ALL
Sprint:
QO 2021-09-20, QO 2021-10-04, QO 2021-10-18
Case:
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

We have a query:

{
   "$match":{
      "$and":[
         {
            "$or":[
               {
                  "RecordId":{
                     "$ne":"Some_RecordId_To_Remove"
                  }
               },
               {
                  "DataSource":{
                     "$ne":"Some_DataSource_To_Remove"
                  }
               }
            ]
         },
         {
            "$or":[
               {
                  "DataSource":{
                     "$not":"/.*Base_DataSource.*/i"
                  }
               },
               {
                  "RecordId":"Some_RecordId_To_Add",
                  "DataSource":"Some_DataSource_To_Add"
               }
            ]
         },
         {
            "IsLatestRevision":true
         },
         {
            "IsDeleted":false
         },
         {
            "DeliveryId":"60d05da8db9d210104f31a6b"
         }
      ]
   }
}

Let's ignore the mind-boggling logic on the top part of the filter - the important thing to note is that there is a restriction on the "DeliveryId" field. Why is that of interest?

We run this query against a collection that contains ~29m documents which are semantically split into individual "deliveries" using this "DeliveryId" field. We're talking roughly 20 deliveries or so, each of which consists of either ~6m documents or a substantially smaller number of documents, e.g. 150k.

There is a bunch of indexes on the collection, too, one of which looks like

{ *DeliveryId: 1*, RecordId: 1, DataSource: 1, Revision: 1 }

and, as per my understanding, should be helpful to avoid full scans of all documents whenever a query filters by "DeliveryId".

What we are seeing in the logs sometimes is the below, however:

[...]
originatingCommand: { aggregate: "CollXyz", pipeline: [ { <THE ABOVE $MATCH STAGE> }, { $project: { _id: 1 } } ], allowDiskUse: true, cursor: {}, $db: "DbXyz", lsid: { id: UUID("631bb48c-991e-41ad-8092-8564b72d694e") } } planSummary: IXSCAN { DeliveryId: 1, RecordId: 1, DataSource: 1, Revision: 1 }, IXSCAN { RecordId: 1, DataSource: 1 } cursorid:4730283530601619740 keysExamined:28986788 docsExamined:28849489 cursorExhausted:1 numYields:241934 nreturned:137197 reslen:4004933 locks:{ ReplicationStateTransition: { acquireCount: { w: 241942 } }, Global: { acquireCount: { r: 241942 } }, Database: { acquireCount: { r: 241942 } }, Collection: { acquireCount: { r: 241941 } }, Mutex: { acquireCount: { r: 8 } } } storage:{ data: { bytesRead: 211830295502, timeReadingMicros: 802571385 } } protocol:op_msg 1467513ms

So, every now and again, MongoDB appears to choose to use two indexes:

{ *DeliveryId: 1*, _RecordId: 1, DataSource: 1_, Revision: 1 }
{ _RecordId: 1, DataSource: 1_ }

The things that I would claim are dubious here are

that we end up with keysExamined:28986788 and docsExamined:28849489 (so the whole collection gets read!!!)
but only nreturned:137197 which is basically just the effect of the "DeliveryId" filter part.

I also wonder why the second index gets considered in the first place as the

{ _RecordId: 1, DataSource: 1_ }

combination is already contained in the first index...

Anyway, it would appear that once MongoDB decides to use a different index that does not contain the most important field, we end up somehow scanning the whole collection...?

Some additional business background:

It's probably worth noting that the filter on the top

            "$or":[
               {
                  "RecordId":{
                     "$ne":"Some_RecordId_To_Remove"
                  }
               },
               {
                  "DataSource":{
                     "$ne":"Some_DataSource_To_Remove"
                  }
               }
            ]

is a filter of very limited use as it won't remove any documents from the result set at all in a lot of cases. The above query gets generated based on user input and unless the user specifies some meaningful values for either "RecordId" or "DataSource", it will match on any document in the entire collection. The intention here is to allow a user to remove individual documents from the result based on their "RecordId" and/or "DataSource".

Also, the next part

            "$or":[
               {
                  "DataSource":{
                     "$not":"/.*Base_DataSource.*/i"
                  }
               },
               {
                  "RecordId":"Some_RecordId_To_Add",
                  "DataSource":"Some_DataSource_To_Add"
               }
            ]

is an unusual filter as its purpose is to allow for a user to specify some values for "RecordId" and/or "DataSource" to add some records to a base set ("DataSource":{ "$not":"/.Base_DataSource./i" }) that shall remain included.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

executionStats.json
4 kB
Jun 30 2021 09:54:14 PM UTC

duplicates

SERVER-14423 Plans which fetch different numbers of documents can tie

Closed

is duplicated by

SERVER-57989 Unfortunate index selection leads to too many index/document scans

Closed

Assignee:: Jacob Evans (Inactive)
Reporter:: Daniel Hegener
Participants:: Daniel Hegener, Edwin Zhou, Jacob Evans
Votes:: 0 Vote for this issue
Watchers:: 9 Start watching this issue

Created:: Jun 22 2021 08:58:05 PM UTC
Updated:: Jul 07 2022 03:06:24 PM UTC
Resolved:: Jul 07 2022 03:06:24 PM UTC

Details

Description

Attachments

Attachments

Issue Links

Forms

Activity

People

Dates