[SERVER-65317] mongod removes connection from connection pool after running simple $search query Created: 07/Apr/22  Updated: 29/Oct/23  Resolved: 07/Jul/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.0.2, 6.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Kevin Rosendahl Assignee: Amirsaman Memaripour
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.0
Steps To Reproduce:

Run an aggregation pipeline with a $search then a $limit where the $search would return many documents.

For example, add 2500 identical documents { a: "hello" } to a collection test.search with a search index definition with { dynamic: true }, and run

db.search.aggregate([ { $search: { text: { path: "a", query: "hello" }} },  {$limit: 10}])

This doesn't reproduce 100% of the time, but most executions should invoke the behavior.

Sprint: Service Arch 2022-05-16, Service Arch 2022-06-13, Service Arch 2022-06-27, Service Arch 2022-07-11
Participants:

 Description   

We've witnessed mongod log lines similar to the following when running simple $search queries.

{
  "t":{"$date":"2022-04-06T23:43:01.391+00:00"},
  "s":"I",
  "c":"CONNPOOL", "id":22566,
  "ctx":"MongotExecutor",
  "msg":"Ending connection due to bad connection status",
  "attr": {
    "hostAndPort":"mongot1:27027",
    "error": "CallbackCanceled: Callback was canceled",
    "numOpenConns":1
  }
}

This is most easily reproducible when running a $search aggregation with a small $limit.

We believe this may be related to TaskExecutorCursor's eager retrieval of a second batch of results, even if the first batch has not been fully consumed.

When running $search, mongod will immediately issue a search command to mongot, and mongot will return with a batch of 101 results. While the query layer is processing the results from the first batch, TaskExecutorCursor will fire off a getMore command on the batch.

It appears that if the query completes (e.g. can fulfill the _id lookup and $limit using the first batch) while this getMore is outstanding, that when returning the connection used for the getMore to the pool, the connection will be seen as having an error, and the connection will be closed.

mongod will then have to open a new connection to run killCursor for the TaskExecutorCursor. The logs below appear to show such a case.

mongot | 2022-04-06T22:09:21.844+0000 T MONGOT [c.x.m.s.wireprotocol.SearchCommand] Received search
mongot | 2022-04-06T22:09:21.850+0000 T MONGOT [c.x.m.s.wireprotocol.GetMoreCommand] Received getMore
mongod | {"t":{"$date":"2022-04-06T22:09:21.866+00:00"},"s":"I",  "c":"CONNPOOL", "id":22576,   "ctx":"MongotExecutor","msg":"Connecting","attr":{"hostAndPort":"mongot1:27027"}}
mongod | {"t":{"$date":"2022-04-06T22:09:21.866+00:00"},"s":"I",  "c":"CONNPOOL", "id":22566,   "ctx":"MongotExecutor","msg":"Ending connection due to bad connection status","attr":{"hostAndPort":"mongot1:27027","error":"CallbackCanceled: Callback was canceled","numOpenConns":1}}
mongot | 2022-04-06T22:09:21.867+0000 I MONGOT [c.x.m.s.w.WireProtocolQueryServer] Closing connection 0242acfffe160004-0000000c-00000015-3263563517157b20-68a2af4b
mongot | 2022-04-06T22:09:21.867+0000 I MONGOT [c.x.m.s.w.o.op.OperationManager] killing ops associated with channel 0242acfffe160004-0000000c-00000015-3263563517157b20-68a2af4b
mongot | 2022-04-06T22:09:21.867+0000 I MONGOT [c.x.m.s.w.o.op.OperationManager] killing operation: 135
mongot | 2022-04-06T22:09:21.874+0000 I MONGOT [c.x.m.s.w.WireProtocolQueryServer] Received connection 0242acfffe160004-0000000c-00000016-185a4190571511cf-619ce6d9 from /172.22.0.3:58152
mongot | 2022-04-06T22:09:21.883+0000 T MONGOT [c.x.m.s.w.KillCursorsCommand] Received killCursors

This does not appear to produce a correctness issue, but does create a lot of connection churn between mongod and mongot, and the log lines can be concerning for users.

We were able to reproduce this on at least 5.0 and 5.2.

Acceptance Criteria: Write a test that reproduces this behavior (use the Steps To Reproduce field) then fix the behavior, using the test as a way to prove correctness.



 Comments   
Comment by Githook User [ 09/Aug/22 ]

Author:

{'name': 'Amirsaman Memaripour', 'email': 'amirsaman.memaripour@mongodb.com', 'username': 'samanca'}

Message: SERVER-65317 `TaskExecutorCursor` should not cancel prefetching getMores

(cherry picked from commit a7ce4c0c9b3abf5bc27675a4b5edde401371a2fd)
Branch: v6.0
https://github.com/mongodb/mongo/commit/05a3cac3162258651b57b0c1e4d40cbac85aa38f

Comment by Githook User [ 07/Jul/22 ]

Author:

{'name': 'Amirsaman Memaripour', 'email': 'amirsaman.memaripour@mongodb.com', 'username': 'samanca'}

Message: SERVER-65317 `TaskExecutorCursor` should not cancel prefetching getMores
Branch: master
https://github.com/mongodb/mongo/commit/a7ce4c0c9b3abf5bc27675a4b5edde401371a2fd

Generated at Thu Feb 08 06:02:25 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.