[SERVER-41714] $facet operator duplicates documents in the pipeline when preceded by $addFields and $match operators (in this exact order) Created: 13/Jun/19  Updated: 29/Oct/23  Resolved: 28/Jun/19

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 4.0.10
Fix Version/s: 4.3.1, 4.2.20

Type: Bug Priority: Major - P3
Reporter: Łukasz Karczewski Assignee: Xin Hao Zhang (Inactive)
Resolution: Fixed Votes: 0
Labels: Bug, query-44-grooming
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Operating System: Linux
Backport Requested:
v4.2
Sprint: Query 2019-07-01
Participants:

 Description   

db.isMaster() result: 

{
    "ismaster" : true,
    "maxBsonObjectSize" : 16777216,
    "maxMessageSizeBytes" : 48000000,
    "maxWriteBatchSize" : 100000,
    "localTime" : ISODate("2019-06-13T08:59:36.884Z"),
    "logicalSessionTimeoutMinutes" : 30,
    "minWireVersion" : 0,
    "maxWireVersion" : 7,
    "readOnly" : false,
    "ok" : 1
}

 

Such an operation:

 

db.users.aggregate([
  { $addFields: { id: '$_id' } },
  {
    $lookup: {
      as: 'items_check',
      from: 'items',
      let: { id: '$id' },
      pipeline: [
        { $addFields: { id: '$_id' } },
        { $match: { $expr: { $eq: ['$$id', '$owner'] } } },
        {
          $facet: {
            matched: [
              { $match: { $or: [{ name: 'Item 2' }, { name: 'Item 1' }] } },
            ],
            all: [{ $match: {} }],
          },
        },
      ],
    },
  },
  { $project: { items_check: 1 } },
])

 

On such collections:

{
 
 users:[{
   id: 'u1',
   name: 'John Smith',
   items: ['i1'],
   gadsden1: 'g1',
   gadsden2: 'g2',
   gadsden3: 'g3',
 },
 {
   id: 'u2',
   name: 'Jane Doe',
   items: ['i2'],
 },
 {
   id: 'u3',
   name: 'Christopher Gadsden',
   items: ['i3'],
 },
 ],
 
 items:[{
   id: 'i1',
   name: 'Item 1',
   owner: 'u1'
 },
 {
   id: 'i2',
   name: 'Item 2',
   owner: 'u2'
 },
 {
   id: 'i3',
   name: 'Item 3',
   owner: 'u3'
 },
 ],
 
 gadsdens:[{
   id: 'g1',
   user1: 'u1',
   user2: 'u2',
   user3: 'u3',
 },
 {
   id: 'g2',
   user1: 'u1',
   user2: 'u1',
   user3: 'u3',
 },
 {
   id: 'g3',
   user1: 'u1',
   user2: 'u2',
   user3: 'u1',
 },
 ],
 }

returns such a result:

[
  {
    "_id": "u1",
    "items_check": [
      {
        "matched": [
          { "_id": "i1", "name": "Item 1", "owner": "u1", "id": "i1" },
          { "_id": "i1", "name": "Item 1", "owner": "u1", "id": "i1" }
        ],
        "all": [
          { "_id": "i1", "name": "Item 1", "owner": "u1", "id": "i1" },
          { "_id": "i1", "name": "Item 1", "owner": "u1", "id": "i1" }
        ]
      }
    ]
  },
  {
    "_id": "u2",
    "items_check": [
      {
        "matched": [
          { "_id": "i2", "name": "Item 2", "owner": "u2", "id": "i2" }
        ],
        "all": [
          { "_id": "i2", "name": "Item 2", "owner": "u2", "id": "i2" }
        ]
      }
    ]
  },
  {
    "_id": "u3",
    "items_check": [
      {
        "matched": [],
        "all": [
          { "_id": "i3", "name": "Item 3", "owner": "u3", "id": "i3" }
        ]
      }
    ]
  }
]

 

As we can see, the first document in the pipeline has its items duplicated. However, if we swap the $addFields and $match stages inside the $lookup pipeline, everything works fine. The same is true when the $addFields stage is removed. If I exclude the document with _id equal to u1, then the document with _id equal to u2 gets duplicated items in its items_check field instead. If I remove the $facet stage from the $lookup pipeline, no documents are duplicated.



 Comments   
Comment by Githook User [ 27/Jun/19 ]

Author:

{'name': 'Xin Hao Zhang', 'username': 'xinhaoz', 'email': 'xinhao.zhang@mongodb.com'}

Message: SERVER-41714 DocumentSourceSequentialDocumentCache should not start serving the cache after returning EOF from source
Branch: master
https://github.com/mongodb/mongo/commit/a065876302c7d309613daa17b5bdb100cf2883d6

Comment by James Wahlin [ 20/Jun/19 ]

This behavior is caused by a difference in assumed aggregation getNext() behavior between the TeeBuffer class (used by DocumentSourceFacet via DocumentSourceTeeConsumer) and DocumentSourceSequentialDocumentCache.

When DocumentSourceSequentialDocumentCache is building its cache, it will iterate over its source until it hits EOF. At that point it will switch its SequentialDocumentCache mode from "building" to "serving" via a call to freeze() and will return EOF to the TeeBuffer.

The TeeBuffer loads documents in batches. When it retrieves a batch (via a call to TeeBuffer::loadNextBatch()), it continues to pull documents from its source until it either hits EOF or reaches its maximum batch size. When EOF is encountered, that return status is swallowed; TeeBuffer::getNext() then relies on finding an empty buffer after calling loadNextBatch() in order to return EOF to its consumer. This second, post-EOF call to loadNextBatch() results in a call to DocumentSourceSequentialDocumentCache::getNext() which, having already switched from "building" to "serving", returns its cached documents instead of EOF, so the same documents are returned twice.
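The interaction can be sketched with a minimal Python model (hypothetical class and function names; the real implementation is C++ inside the server). The model reproduces the two mismatched assumptions: the cache flips from "building" to "serving" as a side effect of returning EOF, while the buffer swallows EOF and probes its source one extra time:

```python
EOF = None  # sentinel standing in for an EOF getNext() result


class SequentialDocumentCacheModel:
    """Toy model of DocumentSourceSequentialDocumentCache (hypothetical)."""

    def __init__(self, source_docs):
        self._source = iter(source_docs)
        self._cache = []
        self._serving = False
        self._serve_pos = 0

    def get_next(self):
        if not self._serving:
            doc = next(self._source, EOF)
            if doc is EOF:
                # The problematic behavior modeled here: freeze() flips the
                # cache from "building" to "serving" as a side effect of
                # returning EOF to the consumer.
                self._serving = True
                return EOF
            self._cache.append(doc)
            return doc
        # Serving mode: replay cached documents from the start.
        if self._serve_pos < len(self._cache):
            doc = self._cache[self._serve_pos]
            self._serve_pos += 1
            return doc
        return EOF


def tee_buffer_drain(source, batch_size=2):
    """Toy model of TeeBuffer: swallows EOF inside loadNextBatch() and
    detects end-of-stream only via an empty batch on the *next* call."""
    out = []
    while True:
        batch = []
        for _ in range(batch_size):
            doc = source.get_next()
            if doc is EOF:
                break  # EOF status is swallowed here
            batch.append(doc)
        if not batch:
            return out  # an empty batch is what signals EOF to the consumer
        out.extend(batch)


docs = tee_buffer_drain(SequentialDocumentCacheModel([{"_id": "i1"}]))
# The single source document comes back twice: once from the building pass
# and once replayed from the cache on the post-EOF probe.
```

The second loadNextBatch() call in the model is the analogue of the post-EOF call described above: instead of a second EOF it receives the cached document again.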

In order to fix this issue we can do one of two things:
1) Modify TeeBuffer so that it no longer assumes that it can call DocumentSource::getNext() and receive EOF after already receiving an EOF from that DocumentSource.
2) Modify DocumentSourceSequentialDocumentCache so that it no longer implicitly switches state from "building" to "serving" on returning EOF. Users of DocumentSourceSequentialDocumentCache would then need to do this work themselves.

Comment by James Wahlin [ 19/Jun/19 ]

Here is a stripped down version of the above reproduction, which fails the assertion on line 37:

(function() {
    "use strict";
 
    db.users.drop();
    db.users.insert([
        {
          _id: "user_1",
        },
    ]);
 
    db.items.drop();
    db.items.insert([
        {_id: "item_1", owner: "user_1"},
    ]);
 
    const response = db.users
                         .aggregate([
                             {
                               $lookup: {
                                   as: 'items_check',
                                   from: 'items',
                                   let : {id: '$_id'},
                                   pipeline: [
                                       {$addFields: {id: '$_id'}},
                                       {$match: {$expr: {$eq: ['$$id', '$owner']}}},
                                       {
                                         $facet: {
                                             all: [{$match: {}}],
                                         },
                                       },
                                   ],
                               },
                             },
                         ])
                         .toArray();
 
    assert.eq(1, response[0].items_check[0].all.length, response);
})();

Comment by Danny Hatcher (Inactive) [ 14/Jun/19 ]

Hello,

Thank you for your report. Please note that unless a given issue refers only to a specific driver, it is best to open problems like these in our general SERVER project. I've forwarded this on to the Query team to take a look.

Generated at Thu Feb 08 04:58:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.