[SERVER-63751] Error when a sleep is added to AsyncResultsMerger::_processBatchResults Created: 16/Feb/22  Updated: 27/Oct/23  Resolved: 10/Mar/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jennifer Peshansky (Inactive) Assignee: Jennifer Peshansky (Inactive)
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Sprint: QE 2022-03-21
Participants:

 Description   

With this diff:

diff --git a/src/mongo/s/query/async_results_merger.cpp b/src/mongo/s/query/async_results_merger.cpp
index 9100df3c57d..3025e627f0e 100644
--- a/src/mongo/s/query/async_results_merger.cpp
+++ b/src/mongo/s/query/async_results_merger.cpp
@@ -761,6 +761,15 @@ void AsyncResultsMerger::_processBatchResults(WithLock lk,
     // Update the cursorId; it is sent as '0' when the cursor has been exhausted on the shard.
     remote.cursorId = cursorResponse.getCursorId();
 
+    // Adding sleep to repro SERVER-31978
+    if (remote.cursorId == 0 && _tailableMode == TailableModeEnum::kTailableAndAwaitData) {
+        std::cout << "AsyncResultsMerger()::_processBatchResults going to sleep" << std::endl;
+        sleepmillis(5000);
+        std::cout << "AsyncResultsMerger()::_processBatchResults woke up" << std::endl;
+    }
+
     // Save the batch in the remote's buffer.
     if (!_addBatchToBuffer(lk, remoteIndex, cursorResponse)) {
         return;

Several tests fail with the error:
uncaught exception: Error: [0] != [0] are equal : Cursor has been closed unexpectedly. Observed change stream events: [ null ] :

This only happens in the change_streams_per_shard_cursors_passthrough suite.

The failing tests are:
whole_db_resumability
apply_ops
lookup_post_image
whole_db_metadata_notifications
metadata_notifications

Filing this ticket to investigate the root cause.



 Comments   
Comment by Jennifer Peshansky (Inactive) [ 10/Mar/22 ]

This doesn't reproduce anymore, now that all the other bug fixes are in. Here's an Evergreen patch running this repro against current master. Closing this ticket as "Gone away."

Comment by Mickey Winters [ 28/Feb/22 ]

Long story short: I confirmed what Bernard said. If there is a long enough delay (over ~30s) between mongos sending a getMore and actually getting a response, mongos will return NetworkInterfaceExceededTimeLimit to the driver. So that's not a problem in itself, but it doesn't explain the error this ticket saw, which is a delay happening internally on mongos.

Comment by Mickey Winters [ 28/Feb/22 ]

Ah, I found a bug in the code I pasted above: I was turning off the failpoint on mongos when it had been turned on on mongod.

Comment by Mickey Winters [ 24/Feb/22 ]

// this code snippet had a bug

Simulating the delay on the mongod side fails on the first assert.soon(() => c.hasNext()); it doesn't seem to get any of the events after turning off the failpoint.

Generated at Thu Feb 08 05:58:34 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.