[CDRIVER-4077] OP_KILLCURSORS incorrectly used to destroy change stream during resume attempt Created: 13/Jul/21 Updated: 15/Apr/22 |
|
| Status: | Backlog |
| Project: | C Driver |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Unknown |
| Reporter: | Kaitlin Mahar | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
| Description |
|
It appears that there may be a libmongoc bug in how maxWireVersion is tracked for servers when connected to a sharded cluster. I am seeing this specifically with a single-mongos cluster backed by 3 replica sets, started via mlaunch with:

mlaunch init --replicaset --sharded 3 --setParameter enableTestCommands=1

We have a Swift test which does the following, to test change streams' automatic resume behavior:

, "data":{"errorCode":10107,"failCommands":["getMore"],"errorLabels":["ResumableChangeStreamError"]}}

On server latest (I first saw this on v5.0.0-alpha0-1541-ga8cf4f3 and have observed it on newer versions as well), this test started to fail because an extra aggregate attempt was observed: the first resume attempt would consistently fail with an error from libmongoc with the domain MONGOC_ERROR_STREAM: "Failed to send \"aggregate\" command with database \"test\": Failed to read 4 bytes: socket error or timeout".

Of note, as part of the resume process, drivers including libmongoc attempt to kill the original cursor. I noticed that the server recently merged in

Specifically, here server_stream->sd->max_wire_version is incorrectly 0, so the else block is hit.

I first witnessed this with server latest + libmongoc-1.18.0-alpha2. However, I have now tested as far back as server 4.4.3 + libmongoc 1.16.2 and saw that the branch using OP_KILLCURSORS is also taken on those versions in this particular scenario. This was not an issue until now because the earlier server versions would still accept OP_KILLCURSORS without closing the connection.

I'll also note that this seems to be specific to the resume code path: from my printf testing, cleaning up a change stream normally via mongoc_change_stream_destroy does not appear to take the OP_KILLCURSORS path.

Let me know if you need any more information or help reproducing this.

Slack thread for context: https://mongodb.slack.com/archives/C72LB5RPV/p1626133519311700 |
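To illustrate why a max_wire_version of 0 leads to the OP_KILLCURSORS path, here is a minimal sketch of the kind of branch described above. It is not libmongoc's actual code: the type and function names (all prefixed sketch_) are hypothetical, and the threshold of 4 is an assumption based on the killCursors command having been introduced in MongoDB 3.2 (wire version 4).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the killCursors command is usable from wire version 4
 * (MongoDB 3.2) onward. The constant name is illustrative only. */
#define SKETCH_WIRE_VERSION_KILLCURSORS_CMD 4

typedef enum {
   SKETCH_KILL_VIA_COMMAND,       /* send the killCursors command */
   SKETCH_KILL_VIA_OP_KILLCURSORS /* send the legacy OP_KILLCURSORS opcode */
} sketch_kill_path_t;

/* Hypothetical stand-in for the server description whose
 * max_wire_version field is consulted in the report. */
typedef struct {
   int32_t max_wire_version;
} sketch_server_description_t;

/* Pick how to kill a cursor based on the recorded wire version.
 * A stale or unset value of 0 falls through to the legacy opcode. */
sketch_kill_path_t
sketch_choose_kill_path (const sketch_server_description_t *sd)
{
   assert (sd != NULL);
   if (sd->max_wire_version >= SKETCH_WIRE_VERSION_KILLCURSORS_CMD) {
      return SKETCH_KILL_VIA_COMMAND;
   }
   return SKETCH_KILL_VIA_OP_KILLCURSORS;
}
```

With a description whose max_wire_version was never populated (0), this sketch selects the legacy opcode even if the server actually supports the command, which matches the behavior reported here.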