[CDRIVER-3563] Cursor may use an invalidated server description Created: 10/Mar/20  Updated: 27/Oct/23  Resolved: 27/Jun/23

Status: Closed
Project: C Driver
Component/s: libmongoc
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kevin Albertson Assignee: Kevin Albertson
Resolution: Gone away Votes: 0
Labels: FY21Q4, planned-maintenance-detectable-bug
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to NODE-3648 AbstractCursor getMores should run se... Closed
is related to CDRIVER-3404 Assertion hit when handshake runs aga... Closed
is related to CDRIVER-3653 Connections should use server descrip... Closed
is related to CDRIVER-3529 Do not mark server as Unknown during ... Closed
Epic Link: Improve Developer Experience
Quarter: FY24Q2
Case:

 Description   

Since a mongoc_cursor_t ties itself to a server_id, it is currently possible that it may attempt to send a command using a server description that has been marked as UNKNOWN.

I have been able to reproduce such a situation by modifying example-client.c. Here is the salient bit:

   printf ("initializing agg, this ties cursor to the server, but does not send the command\n");
   cursor = mongoc_collection_aggregate (collection, MONGOC_QUERY_NONE, &empty_doc, NULL /* opts */, NULL /* read_prefs */);
   if (mongoc_cursor_error (cursor, &error)) {
      printf ("agg error: %s\n", error.message);
   }
 
   printf ("simulating a server being marked as UNKNOWN\n");
   mongoc_topology_invalidate_server (client->topology, cursor->server_id, &error);
   printf ("done\n");
   
   printf ("sending aggregate command\n");
   mongoc_cursor_next (cursor, &doc);
   if (mongoc_cursor_error (cursor, &error)) {
      printf ("error: %s\n", error.message);
   }

This prints:

initializing agg, this ties cursor to the server, but does not send the command
simulating a server being marked as UNKNOWN
sending aggregate command
error: "aggregate" command does not support readConcern with wire version 0, wire version 4 is required

The test case forces a server being marked as UNKNOWN. But I believe this can happen in a two situations:
1. A > 4.0 server receives a "not master" error. The server description is marked as unknown, but the connections are left open. In <= 4.0 I don't believe this is an issue since the connections are also reset, so the next attempt to send a command on the cursor will recreate the connection and do another handshake.
2. The background monitor receives a network error and marks the server as unknown. Because of CDRIVER-3529, this is more likely to happen on a transient network error.

This bug is extremely similar to CDRIVER-3404. I think it is worth investigating if this could surface in other parts of the codebase, since we have other wire version checks. I think it's worth considering rethinking how we're invalidating server descriptions / doing wire version checks.



 Comments   
Comment by Kevin Albertson [ 27/Jun/23 ]

I am unable to reproduce the same error in C driver 1.24.1 with these tests: https://github.com/mongodb/mongo-c-driver/compare/master...kevinAlbs:mongo-c-driver:C3563?expand=1

CDRIVER-3653 changed to use the server description obtained from the connection handshake rather than the shared topology description. Callers of mongoc_cluster_stream_for_server get a mongoc_server_stream_t with the server description obtained from the connection handshake.

Comment by Jeremy Mikola [ 11/Jun/20 ]

This may be relevant to Sending an equivalent command for the second attempt in the Retryable Reads spec, which addresses whether drivers can simply send the same command to the newly selected server for the retry attempt.

I think this (and the issue at hand) is relevant to any helper that creates a synthetic cursor before communicating with the server. With respect to aggregate, the server selection is just happening to perform wire version checks and validate options, the errors from which are relayed through the first mongoc_cursor_error call before mongoc_cursor_next kicks off the command.

I don't believe it applies to the traditional find case, as that cursor is derived from a server response and we needn't worry about retrying getMore commands today.

Generated at Wed Feb 07 21:18:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.