[CDRIVER-2552] Race condition causes assert failure after secondary removed from replica set
Created: 15/Mar/18  Updated: 28/Oct/23  Resolved: 12/Apr/18

Status: Closed
Project: C Driver
Component/s: libmongoc, replset
Affects Version/s: 1.5.0
Fix Version/s: 1.10.0

Type: Bug
Priority: Major - P3
Reporter: A. Jesse Jiryu Davis
Assignee: A. Jesse Jiryu Davis
Resolution: Fixed
Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

I made a mistake implementing CDRIVER-562 that can cause an assert failure when the driver is connected to a replica set. I'm not certain which race condition, or perhaps which replica set misconfiguration, triggers it, but I can describe the sequence of events and a fix.

1. The driver discovers a replica set with a primary A and at least one secondary B.
2. The driver opens a new connection to A (either because of a prior disconnect in single mode, or while expanding the pool in pooled mode).
3. The driver handshakes the new connection by calling isMaster on A.
4. A's reply, for some reason, no longer includes B in its host list, perhaps because B was removed from the configuration, or because of some persistent misconfiguration.
5. The driver marks B as "retired" so it can be removed from the topology description soon (this is my implementation of updateRSFromPrimary in the Server Discovery and Monitoring Spec, with a simplification to avoid a crash, CDRIVER-789).
6. The driver's connection to B is disconnected somehow.
7. During the next scan, in mongoc_topology_scanner_node_setup, the driver sees that B is disconnected, and before it begins reconnecting, it asserts that B is not marked "retired". The assert fails.
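
The sequence above can be modelled in a few lines of C. This is only a sketch under stated assumptions: toy_node_t, toy_update_from_handshake, and toy_setup_would_assert are hypothetical names, not libmongoc types or functions, and the check returns a flag instead of aborting so the failing condition can be observed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a scanner node; not the real libmongoc struct. */
typedef struct {
   const char *host;
   bool connected;
   bool retired;
} toy_node_t;

/* Steps 3-5: a handshake reply from the primary lists the current hosts.
 * Any known node missing from that list is marked retired. */
static void
toy_update_from_handshake (toy_node_t *nodes,
                           size_t n_nodes,
                           const char **hosts,
                           size_t n_hosts)
{
   for (size_t i = 0; i < n_nodes; i++) {
      bool listed = false;
      for (size_t j = 0; j < n_hosts; j++) {
         if (strcmp (nodes[i].host, hosts[j]) == 0) {
            listed = true;
         }
      }
      if (!listed) {
         nodes[i].retired = true;
      }
   }
}

/* Step 7: node setup reconnects disconnected nodes and assumes none of them
 * is retired. Returns true when that assumption is violated, that is, when
 * an assert on "not retired" would fire. */
static bool
toy_setup_would_assert (const toy_node_t *node)
{
   assert (node != NULL);
   return !node->connected && node->retired;
}
```

With nodes A and B both connected, a handshake reply listing only A retires B. The check stays false until B's connection drops (step 6), and only then reports the failing condition.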

The point of the assert was to validate my understanding of the "retired" field. I thought it was always cleared after a scan. I didn't realize that CDRIVER-562, which I implemented recently, had a side effect: it allowed "retired" to be set before the next scan began.

The solution is probably to remove, at the beginning of each scan, any nodes that have been retired by handshakes. The assert can remain in place.
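
A minimal sketch of that idea, assuming the scanner's nodes live in a plain array. The names scan_node_t, scan_purge_retired, and scan_start are hypothetical and do not correspond to libmongoc's actual data structures. Retired nodes are dropped first, so the original assert can stay in the setup loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified scanner node. */
typedef struct {
   const char *host;
   bool connected;
   bool retired;
} scan_node_t;

/* Compact the array in place, dropping retired nodes.
 * Returns the number of nodes kept. */
static size_t
scan_purge_retired (scan_node_t *nodes, size_t n_nodes)
{
   size_t kept = 0;
   for (size_t i = 0; i < n_nodes; i++) {
      if (!nodes[i].retired) {
         nodes[kept++] = nodes[i];
      }
   }
   return kept;
}

/* Start of a scan: purge nodes retired by handshakes, then set up the rest.
 * The invariant "no node being set up is retired" holds again. */
static size_t
scan_start (scan_node_t *nodes, size_t n_nodes)
{
   n_nodes = scan_purge_retired (nodes, n_nodes);
   for (size_t i = 0; i < n_nodes; i++) {
      assert (!nodes[i].retired);
      if (!nodes[i].connected) {
         nodes[i].connected = true; /* stand-in for reconnecting */
      }
   }
   return n_nodes;
}
```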

In addition, there was a bug not only when _mongoc_topology_update_from_handshake retires a node, but also when it adds one. The newly added node has a mongoc_async_cmd_t created for it, even though it's outside the scanner loop. Therefore, in mongoc_topology_scanner_start, the assert (!node->cmd) fails. This was fixed as a side effect of CDRIVER-1972 during 1.10 development.

This fix will not be backported.



 Comments   
Comment by Githook User [ 24/Mar/18 ]

Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis)

Message: CDRIVER-2552 don't reconcile scanner nodes from handshake

Simpler solution: don't risk retiring a scanner node or creating an async_cmd_t
while processing a handshake at all. Wait until we're about to scan before
reconciling scanner nodes with the updated topology description.

Also factor the steps required to start a scan into a new function,
mongoc_topology_scan_once.
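
The approach in this commit message can be illustrated with a toy model. Everything here is an assumption for illustration: the sketch_* types and functions are hypothetical, the topology description is reduced to a host list, and sketch_scan_once is not the real mongoc_topology_scan_once. The point is only that handshake processing updates the description, while the scanner's node list changes in one place, just before a scan.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX_HOSTS 8

/* Hypothetical topology description: just the hosts the primary reported. */
typedef struct {
   const char *hosts[SKETCH_MAX_HOSTS];
   size_t n_hosts;
} sketch_description_t;

/* Hypothetical scanner: one entry per host it will contact (no duplicates). */
typedef struct {
   const char *nodes[SKETCH_MAX_HOSTS];
   size_t n_nodes;
} sketch_scanner_t;

static bool
sketch_contains (const char *const *list, size_t n, const char *host)
{
   for (size_t i = 0; i < n; i++) {
      if (strcmp (list[i], host) == 0) {
         return true;
      }
   }
   return false;
}

/* Handshake handling touches only the description, never the scanner. */
static void
sketch_update_from_handshake (sketch_description_t *td,
                              const char **hosts,
                              size_t n_hosts)
{
   assert (n_hosts <= SKETCH_MAX_HOSTS);
   td->n_hosts = n_hosts;
   for (size_t i = 0; i < n_hosts; i++) {
      td->hosts[i] = hosts[i];
   }
}

/* Drop scanner nodes absent from the description, then add missing ones. */
static void
sketch_reconcile (sketch_scanner_t *ts, const sketch_description_t *td)
{
   size_t kept = 0;
   for (size_t i = 0; i < ts->n_nodes; i++) {
      if (sketch_contains (td->hosts, td->n_hosts, ts->nodes[i])) {
         ts->nodes[kept++] = ts->nodes[i];
      }
   }
   ts->n_nodes = kept;
   for (size_t i = 0; i < td->n_hosts; i++) {
      if (!sketch_contains (ts->nodes, ts->n_nodes, td->hosts[i])) {
         assert (ts->n_nodes < SKETCH_MAX_HOSTS);
         ts->nodes[ts->n_nodes++] = td->hosts[i];
      }
   }
}

/* Reconcile immediately before scanning; a real scan would follow. */
static void
sketch_scan_once (sketch_scanner_t *ts, const sketch_description_t *td)
{
   sketch_reconcile (ts, td);
}
```

In this model a handshake that swaps B for C leaves the scanner untouched until sketch_scan_once runs, so no node is retired or created outside the scan.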
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/1e25a1d72e69c91b02cbc85731624b957bff9bc8

Comment by Githook User [ 24/Mar/18 ]

Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis)

Message: CDRIVER-2552 test adding node from handshake
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/ef895faadfdc4c6a447ccf0a94b26541a91f8975

Comment by Githook User [ 21/Mar/18 ]

Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis)

Message: CDRIVER-2552 rare assert failure during RS reconfig, r1.6
Branch: CDRIVER-2552-on-r1.6
https://github.com/mongodb/mongo-c-driver/commit/a101fd1d89e76118955079bb75097180536f8fa5

Comment by Githook User [ 21/Mar/18 ]

Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis)

Message: CDRIVER-2552 test add secondary in handshake, r1.6
Branch: CDRIVER-2552-on-r1.6
https://github.com/mongodb/mongo-c-driver/commit/e97a567555eddf96ea7cb9ff9e0b09ccfa692036

Comment by Githook User [ 21/Mar/18 ]

Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis)

Message: CDRIVER-2552 test del secondary in handshake, r1.6
Branch: CDRIVER-2552-on-r1.6
https://github.com/mongodb/mongo-c-driver/commit/14b1a2065806cf6dac3cfb65c231da27f464464c

Comment by Githook User [ 19/Mar/18 ]

Author: A. Jesse Jiryu Davis <jesse@mongodb.com> (ajdavis)

Message: CDRIVER-2552 rare assert failure during RS reconfig
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/e1b2861946fcb8dd278a0d6aeb596da965c4698b
