[SERVER-63773] Per Shard cursor post batch resume token not set in getMore responses Created: 17/Feb/22  Updated: 29/Oct/23  Resolved: 10/Mar/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.0.0-rc0, 4.4.18, 4.2.24, 5.0.13

Type: Bug Priority: Major - P3
Reporter: Jennifer Peshansky (Inactive) Assignee: Jennifer Peshansky (Inactive)
Resolution: Fixed Votes: 0
Labels: pm-2419-commit
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
backported by SERVER-69002 [5.0] backport pm-2419 Per Shard Curs... Closed
backported by SERVER-69003 [4.4] backport pm-2419 Per Shard Curs... Closed
backported by SERVER-69004 [4.2] backport pm-2419 Per Shard Curs... Closed
Depends
is depended on by SERVER-64228 Per shard cursor assert.soon fails du... Closed
is depended on by SERVER-63771 Fix all temporarily blocked tests in ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: QE 2022-03-07, QE 2022-03-21
Participants:

 Description   

Failing test: start_after_invalidation_exception.js

Error: assert.soon failed: () => cursor.hasNext()

Symptoms: many "noop timed out" messages in the logs
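For context on the failure mode: the shell's assert.soon helper re-evaluates a predicate until it returns true or a timeout elapses, then fails with the predicate's source text (which is exactly the "assert.soon failed: () => cursor.hasNext()" message above). A minimal sketch of that retry loop, with names and timings assumed rather than taken from the actual shell implementation:

```javascript
// Hypothetical sketch of an assert.soon-style poller (not the actual
// mongo shell helper): retry the predicate until it returns true or the
// timeout elapses, then throw with the predicate's source text.
function assertSoon(pred, timeoutMs = 5000) {
    const deadline = Date.now() + timeoutMs;
    while (Date.now() < deadline) {
        if (pred()) return;
        // The real shell sleeps between attempts; this sketch just spins.
    }
    throw new Error("assert.soon failed: " + pred.toString());
}

// A predicate that never becomes true fails with the message seen above.
let failed = false;
try {
    assertSoon(() => false, 50);
} catch (e) {
    failed = /assert.soon failed/.test(e.message);
}
```

So a test hitting this error means the cursor never reported a next document within the timeout, not that a single call crashed.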

 

variant                  logs   run   notes
RHEL                     logs   2     passing logs
RHEL all feature flags   logs   3     passing logs
ubuntu ASAN AFF          logs   2
ubuntu UBSAN AFF         logs   2
RHEL all feature flags   logs   9     not this error but same cause


 Comments   
Comment by Githook User [ 08/Nov/22 ]

Author:

{'name': 'Mickey. J Winters', 'email': 'mickey.winters@mongodb.com', 'username': 'mjrb'}

Message: SERVER-69004 backport per shard cursors

SERVER-62400
SERVER-62681
SERVER-62738
SERVER-63781
SERVER-63774
SERVER-63772
SERVER-63773
SERVER-58673
SERVER-70633
SERVER-69785

both modified: src/mongo/s/query/cluster_aggregate.cpp
both modified: src/mongo/s/query/cluster_aggregation_planner.cpp
both modified: src/mongo/s/query/cluster_aggregation_planner.h
These files had irreconcilable differences, so the changes had to be reimplemented by hand.
This is due to some refactoring and AggregationTargeter not existing on this branch.

(cherry-picked from commit e424f3ed1266bcd772c52bcf23c518b2fe6c83ab)
(cherry-picked from commit ba8f9a28c8769dfe10f73b190c943ee4a57ee7a1)
(cherry-picked from commit 1219ff764c932f16a68c7ba1afd9b925f9f876c8)
Branch: v4.2
https://github.com/mongodb/mongo/commit/792f237975470ab61e3e9cd4b54ede87ed9377d8

Comment by Githook User [ 31/Oct/22 ]

Author:

{'name': 'Mickey. J Winters', 'email': 'mickey.winters@mongodb.com', 'username': 'mjrb'}

Message: SERVER-69003 backport per shard cursors

SERVER-62400
SERVER-62681
SERVER-62738
SERVER-63781
SERVER-63774
SERVER-63772
SERVER-63773
SERVER-58673
SERVER-70633
SERVER-69785

(cherry-picked from commit 4d5451b20fecaf4a054cca94de6962fcf28c577f)
(cherry-picked from commit ba8f9a28c8769dfe10f73b190c943ee4a57ee7a1)
(cherry-picked from commit 1219ff764c932f16a68c7ba1afd9b925f9f876c8)
Branch: v4.4
https://github.com/mongodb/mongo/commit/e424f3ed1266bcd772c52bcf23c518b2fe6c83ab

Comment by Githook User [ 07/Sep/22 ]

Author:

{'name': 'Mickey. J Winters', 'email': 'mickey.winters@mongodb.com', 'username': 'mjrb'}

Message: SERVER-69002 backport per shard cursors

SERVER-62400
SERVER-62681
SERVER-62738
SERVER-63781
SERVER-63774
SERVER-63772
SERVER-63773
SERVER-58673

(cherry-picked from commit fc54ebd0137a25ea664c022b51b685667dd037c7)
(cherry-picked from commit 53d7bceee61f73a1d6959edb5d490c3b338f3c0d)
(cherry-picked from commit 586663fec7c3a7d4a8b0185ff24825bd15e80dff)
(cherry-picked from commit ef2a62dcc27461d2be1b619c75bc04effa1f2021)
(cherry-picked from commit 4f3626ff4486e672569699dfde1cc0ae8c54d348)
(cherry-picked from commit 0f7683455bc06b153f14368a3f05f0b69671717e)
(cherry-picked from commit 11d01816f743d6764c4f12c42697f5edf813ce27)
(cherry-picked from commit 1fe77b5bd9fb13f9eb74275359dcc4ba69f2d5e9)
Branch: v5.0
https://github.com/mongodb/mongo/commit/4d5451b20fecaf4a054cca94de6962fcf28c577f

Comment by Githook User [ 10/Mar/22 ]

Author:

{'name': 'Jennifer Peshansky', 'email': 'jennifer.peshansky@mongodb.com', 'username': 'jenniferpeshansky'}

Message: SERVER-63773 Add getPostBatchResumeToken override to router_stage_merge
Branch: master
https://github.com/mongodb/mongo/commit/11d01816f743d6764c4f12c42697f5edf813ce27
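The commit message above points at the shape of the bug: the router's merge stage lacked a getPostBatchResumeToken override, so getMore responses from mongos carried an empty token even though the shard had set one. A hypothetical JavaScript sketch of that shape (all class and field names here are assumptions for illustration; the real code is C++ router stages):

```javascript
// Hypothetical sketch (names assumed; the real code is C++). The base
// stage's default accessor returns an empty token, so a subclass that
// does not override it reports {} in getMore responses even when its
// underlying shard cursor carries a real token.
class RouterStage {
    getPostBatchResumeToken() { return {}; }  // default: empty token
}

class RouterStageMerge extends RouterStage {
    constructor(shardToken) {
        super();
        this._shardToken = shardToken;  // token the shard actually sent
    }
    // Bug: no getPostBatchResumeToken() override here, so the base
    // class's empty token is what ends up in the getMore response.
}

class FixedRouterStageMerge extends RouterStageMerge {
    // Fix, per the commit message: override the accessor so the merge
    // stage surfaces the shard's token.
    getPostBatchResumeToken() { return this._shardToken; }
}

const token = { _data: "8262EXAMPLE" };  // placeholder resume token
const buggy = new RouterStageMerge(token);
const fixed = new FixedRouterStageMerge(token);
```

This matches the later comment showing the token populated on the shard but empty by the time cluster_find::runGetMore reads it on mongos.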

Comment by Jennifer Peshansky (Inactive) [ 28/Feb/22 ]

 

The postBatchResumeToken is set here:

[j0:s0:prim] {"t":{"$date":"2022-02-25T22:21:44.273+00:00"},"s":"I",  "c":"QUERY",    "id":12345,   "ctx":"conn50","msg":"1 set _postBatchResumeToken in PlanExecutorPipeline::_performChangeStreamsAccounting()","attr":{"token":{"_data":"8262195678000000232B022C0100296E5A1004FEF6C5F64FA943CD8D3F88D05EB93176461E5F696400290004","_typeBits":{"$binary":{"base64":"QA==","subType":"0"}}}}}

But immediately afterwards, when this line is called, the token is empty:

[j0:s] {"t":{"$date":"2022-02-25T22:21:44.273+00:00"},"s":"I",  "c":"QUERY",    "id":12345,   "ctx":"conn20","msg":"set postBatchResumeToken in cluster_find::runGetMore For non-$changeStream agg","attr":{"token":{}}}

This might be related to SERVER-63772.

 

 

Comment by Jennifer Peshansky (Inactive) [ 24/Feb/22 ]

Slow query lines that exist in the per shard cursor (PSC) version but don't exist in the passing logs:
------------------------

createIndexes

[j3:s0:prim] | 2022-02-14T20:33:31.088+00:00 I  COMMAND  51803   [ShardServerCatalogCacheLoader::runCollAndChunksTasks] "Slow query","attr":{"type":"command","ns":"config.cache.chunks.change_stream_check_resumability.test","command":{"createIndexes":"cache.chunks.change_stream_check_resumability.test","indexes":[{"name":"lastmod_1","key":{"lastmod":1}}],"$db":"config"},"numYields":0,"reslen":300,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":6}},"ReplicationStateTransition":{"acquireCount":{"w":6}},"Global":{"acquireCount":{"r":4,"w":2}},"Database":{"acquireCount":{"r":4,"w":2}},"Collection":{"acquireCount":{"r":3,"w":2}},"Mutex":{"acquireCount":{"r":8}}},"flowControl":{"acquireCount":1,"timeAcquiringMicros":5},"storage":{},"protocol":"op_msg","durationMillis":109}

------------------------
create: test (and create:test below)

[j3:s0:prim] | 2022-02-14T20:33:33.145+00:00 I  COMMAND  51803   [conn50] "Slow query","attr":{"type":"command","ns":"change_stream_check_resumability.test","appName":"MongoDB Shell","command":{"create":"test","capped":false,"readConcern":{"level":"local","provenance":"implicitDefault"},"writeConcern":{"w":"majority","wtimeout":300321,"provenance":"clientSupplied"},"databaseVersion":{"uuid":{"$uuid":"94363c61-44b7-4cb9-a690-f0259dddc8bc"},"timestamp":{"$timestamp":{"t":1644870811,"i":43}},"lastMod":1},"lsid":{"id":{"$uuid":"9bca64e5-d96d-461a-8267-2b537b883a2b"},"uid":{"$binary":{"base64":"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=","subType":"0"}}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1644870812,"i":2}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"$configTime":{"$timestamp":{"t":1644870812,"i":1}},"$topologyTime":{"$timestamp":{"t":1644870804,"i":4}},"$client":{"application":{"name":"MongoDB Shell"},"driver":{"name":"MongoDB Internal Client","version":"5.3.0-alpha4-31-g724d7f8-patch-620aa5c3850e61197d495bfc"},"os":{"type":"Linux","name":"Red Hat Enterprise Linux release 8.0 (Ootpa)","architecture":"x86_64","version":"Kernel 
4.18.0-80.1.2.el8_0.x86_64"},"mongos":{"host":"ip-10-122-29-8.ec2.internal:20752","client":"127.0.0.1:58162","version":"5.3.0-alpha4-31-g724d7f8-patch-620aa5c3850e61197d495bfc"}},"$db":"change_stream_check_resumability"},"numYields":0,"reslen":236,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":1}},"ReplicationStateTransition":{"acquireCount":{"w":2}},"Global":{"acquireCount":{"r":1,"w":1}},"Database":{"acquireCount":{"w":1}},"Collection":{"acquireCount":{"r":1,"w":1}},"Mutex":{"acquireCount":{"r":3,"W":1}}},"flowControl":{"acquireCount":1,"timeAcquiringMicros":5},"readConcern":{"level":"local","provenance":"implicitDefault"},"writeConcern":{"w":"majority","wtimeout":300321,"provenance":"clientSupplied"},"storage":{},"remote":"127.0.0.1:42460","protocol":"op_msg","durationMillis":174}

[j3:s] | 2022-02-14T20:33:33.145+00:00 I  COMMAND  51803   [conn24] "Slow query","attr":{"type":"command","ns":"change_stream_check_resumability.test","appName":"MongoDB Shell","command":{"create":"test","lsid":{"id":{"$uuid":"9bca64e5-d96d-461a-8267-2b537b883a2b"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1644870811,"i":43}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"writeConcern":{"w":"majority","wtimeout":300321},"$db":"change_stream_check_resumability"},"numYields":0,"reslen":163,"readConcern":{"level":"local","provenance":"implicitDefault"},"remote":"127.0.0.1:58162","protocol":"op_msg","durationMillis":205}

------------------------

createIndexes

[j3:s0:prim] | 2022-02-14T20:33:33.384+00:00 I  COMMAND  51803   [ShardingDDLCoordinator-1] "Slow query","attr":{"type":"command","ns":"change_stream_check_resumability.test","command":{"createIndexes":"test","indexes":[{"key":{"_id":"hashed"},"name":"_id_hashed"}],"writeConcern":{"w":"majority"},"$db":"change_stream_check_resumability"},"numYields":0,"reslen":300,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":11}},"ReplicationStateTransition":{"acquireCount":{"w":15}},"Global":{"acquireCount":{"r":15,"w":3}},"Database":{"acquireCount":{"r":9,"w":2}},"Collection":{"acquireCount":{"r":7,"w":1,"R":1,"W":1}},"Mutex":{"acquireCount":{"r":15,"W":1}}},"flowControl":{"acquireCount":2,"timeAcquiringMicros":8},"writeConcern":{"w":"majority","wtimeout":0,"provenance":"clientSupplied"},"storage":{},"protocol":"op_msg","durationMillis":105}

------------------------

createIndexes

[j3:s0:prim] | 2022-02-14T20:33:33.764+00:00 I COMMAND 51803 [ShardServerCatalogCacheLoader::runCollAndChunksTasks] "Slow query","attr":{"type":"command","ns":"config.cache.chunks.change_stream_check_resumability.test","command":{"createIndexes":"cache.chunks.change_stream_check_resumability.test","indexes":[{"name":"lastmod_1","key":{"lastmod":1}}],"$db":"config"},"numYields":0,"reslen":300,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":6}},"ReplicationStateTransition":{"acquireCount":{"w":6}},"Global":{"acquireCount":{"r":4,"w":2}},"Database":{"acquireCount":{"r":4,"w":2}},"Collection":{"acquireCount":{"r":3,"w":2}},"Mutex":{"acquireCount":{"r":8}}},"flowControl":{"acquireCount":1,"timeAcquiringMicros":3},"storage":{},"protocol":"op_msg","durationMillis":260}

------------------------

 

The only commands that do exist in the non-PSC case are:
_shardsvrCreateCollection (and shardCollection below)
_shardsvrDropDatabase (and dropDatabase below)
_shardsvrCreateCollection (and shardCollection below)
 

Comment by Jennifer Peshansky (Inactive) [ 18/Feb/22 ]

When I print out the cursor:

{
"_batch" : [ ],
"_cursorid" : NumberLong("4075347561929758806"),
"_batchSize" : undefined,
"_maxAwaitTimeMS" : undefined,
"_txnNumber" : undefined,
"_ns" : "change_stream_check_resumability.test",
"_db" : change_stream_check_resumability,
"_collName" : "test",
...

And the function hasNext() is:

"hasNext" : function() {
    if (!this._batch.length) {
        if (!this._cursorid.compare(NumberLong("0"))) {
            return false;
        }
        this._runGetMoreCommand();
    }
    return this._batch.length > 0;
},

So it looks like the culprit here is the empty batch combined with the fact that the cursor ID isn't 0. (A cursor ID of 0 is reserved to indicate that the cursor has been closed.) So the batch likely should not be empty.
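The stuck state described above can be reproduced with a small mock of the cursor logic (a sketch mirroring the hasNext() shown here, with the getMore behavior stubbed out; field names follow the cursor printout, and plain BigInt stands in for NumberLong):

```javascript
// Mock of the failure mode: the batch is empty but the cursor id is
// nonzero, so hasNext() issues a getMore; if every getMore response
// comes back with an empty batch, hasNext() stays false and the test's
// assert.soon spins until it times out.
function makeCursor(getMoreFn) {
    return {
        _batch: [],
        _cursorid: 4075347561929758806n,  // nonzero => cursor still open
        _runGetMoreCommand() { this._batch = getMoreFn(); },
        hasNext() {
            if (!this._batch.length) {
                if (this._cursorid === 0n) return false;  // 0 => closed
                this._runGetMoreCommand();
            }
            return this._batch.length > 0;
        },
    };
}

const stuck = makeCursor(() => []);            // getMore always empty: the bug
const healthy = makeCursor(() => [{ _id: 1 }]);  // getMore returns a doc
```

With the buggy getMore responses, hasNext() never becomes true, which is exactly the assert.soon timeout reported in the description.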

 

Generated at Thu Feb 08 05:58:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.