[SERVER-83614] $queue/$lookup pipeline can throw if database dropped at wrong time
Created: 27/Nov/23  Updated: 12/Jan/24  Resolved: 12/Jan/24

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Romans Kasperovics
Assignee: Henri Nikku
Resolution: Gone away
Votes: 0
Labels: greenerbuild
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-83658 Refactor acquisition of CRI across sh... Open
is related to SERVER-83461 Pipeline::canRunOnMongos triggers inv... Closed
is related to SERVER-83611 Do not obtain CRI multiple times in c... Closed
is related to SERVER-79583 Get rid of HostTypeRequirement::kPrim... Open
is related to SERVER-79718 Remove HostTypeRequirement::kPrimaryS... Closed
Assigned Teams:
Query Optimization
Operating System: ALL
Sprint: QO 2023-12-11, QO 2023-12-25, QO 2024-01-08, QO 2024-01-22
Participants:
Linked BF Score: 135

 Description   

An aggregation pipeline like

[{$documents: [{x: 1, z: 'a\0'}, {clean: 2147483647, 'netstat': 'off'}, {x: 3, z: [1, 2, 3]}]},
 {$lookup: {from: 'test_coll', as: 'fullDocument', localField: 'x', foreignField: 'x'}}];

should run on mongos if 'some_db.test_coll' does not exist, or on a shard otherwise. Currently, it throws if the database is dropped at the wrong moment during query execution.

The reason for this is the deferred construction of CollectionRoutingInfo in ClusterAggregate::runAggregate().
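
The race can be illustrated with a toy model in plain JavaScript. This is only a sketch: the `catalog` set and the `runWithDeferredLookup` function are hypothetical stand-ins and are not the server's CollectionRoutingInfo API. It assumes that the failure comes from two reads of the routing state disagreeing.

```javascript
// Toy model only: `catalog` stands in for the cluster's routing metadata.
// None of these names exist in the server code.
function runWithDeferredLookup(catalog, ns, betweenLookups) {
    // First read: decide where the pipeline should run.
    const existedAtTargeting = catalog.has(ns);
    const target = existedAtTargeting ? 'shard' : 'mongos';

    // A concurrent dropDatabase can land here.
    betweenLookups();

    // Second, deferred read: the earlier decision is checked against fresh state.
    const existsNow = catalog.has(ns);
    if (existsNow !== existedAtTargeting) {
        throw new Error('routing state for ' + ns + ' changed after targeting ' + target);
    }
    return target;
}

const catalog = new Set(['test.test_coll']);
let raceError = null;
try {
    // Simulate the drop happening between the two reads.
    runWithDeferredLookup(catalog, 'test.test_coll', () => catalog.delete('test.test_coll'));
} catch (e) {
    raceError = e;
}
```

In this model the call throws only because the state is read twice; with a no-op callback the same call returns 'shard'.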

One possible solution would be to acquire CollectionRoutingInfo only once during query optimization and rewrite the pipeline accordingly. For instance, if we know 'some_db.test_coll' does not exist, we can remove the '$lookup' stage.
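
A minimal sketch of the "acquire once" idea, again as a plain JavaScript toy model rather than server code. The names `acquireRoutingSnapshot` and `planPipeline` are hypothetical, and the sketch covers only the targeting decision, not the pipeline rewrite.

```javascript
// Toy model only: read the routing state once and pass the frozen result along,
// so every later decision sees the same view.
function acquireRoutingSnapshot(liveCatalog, ns) {
    return Object.freeze({ns: ns, exists: liveCatalog.has(ns)});
}

function planPipeline(pipeline, snapshot) {
    // All targeting decisions use the snapshot, never the live catalog.
    return {target: snapshot.exists ? 'shard' : 'mongos', stages: pipeline};
}

const liveCatalog = new Set(['test.test_coll']);
const snapshot = acquireRoutingSnapshot(liveCatalog, 'test.test_coll');

// A concurrent drop after the snapshot no longer changes the plan.
liveCatalog.delete('test.test_coll');

const plan = planPipeline(
    [{$lookup: {from: 'test_coll', as: 'fullDocument', localField: 'x', foreignField: 'x'}}],
    snapshot);
```

Here `plan.target` stays 'shard' even though the live catalog has changed. In a real cluster the stale decision would still have to be reconciled with the shards, which this sketch does not model.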

When this is done, we should consider replacing the uassert with a tassert in runPipelineOnMongoS(), so that the fuzzer tests can discover unexpected issues.



 Comments   
Comment by Henri Nikku [ 12/Jan/24 ]

Closing this as SERVER-83658 tracks the refactoring work around acquiring CollectionRoutingInfo. Generational fuzzers can't reproduce this issue as the aggregation grammars don't contain $out or $_internalSplitPipeline. Mutational fuzzers already tolerate the existing behavior, which is to uassert.

Comment by Romans Kasperovics [ 27/Nov/23 ]

Here is the script to reproduce the bug (inspired by Mihai's script to reproduce SERVER-82123):

const testDB = db.getSiblingDB('test');
assert.commandWorked(testDB.test_coll.insertOne({a: 15}));
 
const parallelShell = startParallelShell(function() {
    const pipe = [
        {
            $documents:
                [{x: 1, z: 'a\0'}, {clean: 2147483647, 'netstat': 'off'}, {x: 3, z: [1, 2, 3]}]
        },
        {$lookup: {from: 'test_coll', as: 'fullDocument', localField: 'x', foreignField: 'x'}}
    ];
    const result = db.getSiblingDB('test').aggregate(pipe).toArray();
    assert.gte(result.length, 0);
});
 
assert.commandWorked(testDB.dropDatabase());
sleep(4000);
assert.commandWorked(testDB.test_coll.insert({a: 1}));
parallelShell();

... and we need to add some sleeps to the server code:

--- a/src/mongo/s/query/cluster_aggregate.cpp
+++ b/src/mongo/s/query/cluster_aggregate.cpp
@@ -398,6 +398,7 @@ Status ClusterAggregate::runAggregate(OperationContext* opCtx,
         : startsWithQueue                     ? PipelineDataSource::kQueue
                                               : PipelineDataSource::kNormal;
 
+    sleepmillis(2000);
     // If the routing table is not already taken by the higher level, fill it now.
     if (!cri) {
         // If the routing table is valid, we obtain a reference to it. If the table is not valid,
@@ -452,6 +453,7 @@ Status ClusterAggregate::runAggregate(OperationContext* opCtx,
             return Status::OK();
         }
     }
+    sleepmillis(3000);
 
     boost::intrusive_ptr<ExpressionContext> expCtx;
     const auto pipelineBuilder = [&]() {

The command to run the test:

buildscripts/resmoke.py run --suites sharding_jscore_passthrough \
    jstests/noPassthrough/bf-30894.js --userFriendlyOutput=resmoke.txt
