[CDRIVER-2657] Coverity analysis defect 103440: LOCK Created: 21/May/18  Updated: 28/Oct/23  Resolved: 05/Jul/18

Status: Closed
Project: C Driver
Component/s: None
Affects Version/s: None
Fix Version/s: 1.12.0

Type: Bug Priority: Minor - P4
Reporter: Coverity Collector User Assignee: Roberto Sanchez
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Epic Link: Resolve Coverity Warnings

 Description   

LOCK

Defect 103440 (STATIC_C)
Checker LOCK (subcategory double_unlock)
File: /data/mongo-c-driver/src/mongoc/mongoc-topology.c
Function mongoc_topology_select_server_id
/data/mongo-c-driver/src/mongoc/mongoc-topology.c, line: 693
"_mongoc_topology_do_blocking_scan" unlocks "topology->mutex" while it is unlocked.

                _mongoc_topology_do_blocking_scan (topology, &scanner_error);
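
For context on what the LOCK/double_unlock checker flags (this snippet is an illustration, not part of the Coverity output): mongoc_mutex_t wraps a pthread mutex on POSIX builds, and unlocking a pthread mutex that is not locked is undefined behavior for a default mutex. A minimal standalone sketch that makes the pattern visible by using an error-checking mutex, where the second unlock returns EPERM instead:

#include <errno.h>
#include <pthread.h>
#include <stdio.h>

int
main (void)
{
   pthread_mutexattr_t attr;
   pthread_mutex_t mutex;
   int rc;

   /* Use an error-checking mutex so the misuse is reported; with a
    * default mutex the second unlock would be undefined behavior. */
   pthread_mutexattr_init (&attr);
   pthread_mutexattr_settype (&attr, PTHREAD_MUTEX_ERRORCHECK);
   pthread_mutex_init (&mutex, &attr);

   pthread_mutex_lock (&mutex);

   rc = pthread_mutex_unlock (&mutex);
   printf ("first unlock:  %d\n", rc); /* 0 */

   rc = pthread_mutex_unlock (&mutex); /* unlock while already unlocked */
   printf ("second unlock: %s\n", rc == EPERM ? "EPERM" : "unexpected");

   pthread_mutex_destroy (&mutex);
   pthread_mutexattr_destroy (&attr);
   return 0;
}

Compile with cc demo.c -pthread; the first unlock succeeds and the second reports EPERM.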



 Comments   
Comment by Roberto Sanchez [ 08/Jul/18 ]

Thanks for the confirmation. I went ahead and marked it as a false positive.

Comment by A. Jesse Jiryu Davis [ 08/Jul/18 ]

That's exactly right: Coverity is worried about a sleep() instigated by test_topology_scanner_blocking_initiator to simulate a slow connection. We've marked this warning as a false positive before. It seems that your code changes altered the scenario enough that Coverity doesn't recognize that we've already suppressed this warning.

Comment by Roberto Sanchez [ 08/Jul/18 ]

jesse, kevin.albertson, it looks like the change I made resulted in a new Coverity defect:

http://coverity.mongodb.com/query/defects.htm?project=C+Driver&cid=103668

It looks like the real culprit is test_topology_scanner_blocking_initiator() in test-mongoc-topology-scanner.c. The question I have is whether this should be fixed (e.g., perhaps by changing the mechanics of the lock/unlock behavior), or whether we would consider this an artifact of the test (e.g., the sleep call is to simulate some sort of delay that would be encountered in the real world).

Comment by Githook User [ 05/Jul/18 ]

Author:

{'username': 'rcsanchez97', 'name': 'Roberto C. Sánchez', 'email': 'roberto@connexer.com'}

Message: CDRIVER-2657 fix double unlock defect
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/dab078a653eb33036b348ceb54f23e653bea4c0e

Comment by Roberto Sanchez [ 29/Jun/18 ]

And after further investigation I found that Coverity first detected this defect on March 24th, or about 6 weeks before the change I identified, not in mid-May as I had originally thought. That change is clearly not the cause.

Comment by Roberto Sanchez [ 29/Jun/18 ]

My apologies for not expressing myself more clearly. Prior to your change, if the mutex was locked on entry to mongoc_topology_select_server_id, then the call to _mongoc_topology_do_blocking_scan (which in turn calls mongoc_topology_scan_once) does not look like it would cause the double unlock. After your change, however, a mutex that was locked on entry would be left unlocked before the call to _mongoc_topology_do_blocking_scan, resulting in a double unlock. It seems the question I should have asked was "is it assumed (or possible) to enter mongoc_topology_select_server_id with the mutex locked?"

Either way, I will take another closer look, as it is entirely likely that I am just confused about the proper flow.

Comment by Kevin Albertson [ 29/Jun/18 ]

However, the change to mongoc_topology_select_server_id in commit bf80fd809c breaks that assumption.

The change you're referring to adds a balanced lock/unlock pair:

+   mongoc_mutex_lock (&topology->mutex);
+   /* It isn't strictly necessary to lock here, because if the topology
+    * is invalid, it will never become valid. Lock anyway for consistency. */
    if (!mongoc_topology_scanner_valid (ts)) {
       if (error) {
          mongoc_topology_scanner_get_error (ts, error);
          error->domain = MONGOC_ERROR_SERVER_SELECTION;
          error->code = MONGOC_ERROR_SERVER_SELECTION_FAILURE;
       }
-
+      mongoc_mutex_unlock (&topology->mutex);
       return 0;
    }
+   mongoc_mutex_unlock (&topology->mutex);

It didn't lock here before, so I don't think this is where the double unlock was introduced. But then again, if Coverity only found this recently, maybe I'm missing something.

I think we should lock the mutex in _mongoc_topology_do_blocking_scan, as Jesse suggested. I also agree with Jesse that mongoc_topology_scan_once shouldn't assume the mutex is locked on entry and then leave it unlocked on return, unless there's a really good reason.

Comment by A. Jesse Jiryu Davis [ 28/Jun/18 ]

It appears to me, based on the code in _mongoc_topology_run_background being older, that the expectation is for the mutex to be locked on entry to mongoc_topology_scan_once.

I agree. Let's document that assumption in a comment block.

Additionally, scan_once looks buggy to me: it assumes the mutex will be locked when we enter scan_once, and then scan_once unlocks the mutex. I think that function should restore the mutex to its locked state before returning. Let's do that and check if scan_once's callers need to change in response.

Finally, in order to obey scan_once's assumption that the mutex is locked, I think we should lock around the scan_once call in _mongoc_topology_do_blocking_scan: that's in single-threaded mode, so it's unnecessary but harmless to lock there for the sake of a consistent state when entering scan_once.
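
A rough sketch of that combination, assuming the simplified shape below for _mongoc_topology_do_blocking_scan (its real body, its error propagation, and the obey_cooldown argument are elided or guessed here; this is not the committed fix):

/* Sketch only: scan_once keeps its precondition (mutex locked on entry)
 * but now also restores the locked state before returning, and the
 * blocking-scan path takes the lock around the call so the precondition
 * holds in single-threaded mode too. */
static void
mongoc_topology_scan_once (mongoc_topology_t *topology, bool obey_cooldown)
{
   /* precondition: topology->mutex is locked by the caller */
   mongoc_topology_reconcile (topology);
   mongoc_topology_scanner_start (topology->scanner, obey_cooldown);

   /* scanning locks and unlocks the mutex itself until the scan is done */
   mongoc_mutex_unlock (&topology->mutex);
   mongoc_topology_scanner_work (topology->scanner);

   mongoc_mutex_lock (&topology->mutex);
   _mongoc_topology_scanner_finish (topology->scanner);
   topology->last_scan = bson_get_monotonic_time ();
   topology->stale = false;
   /* postcondition: the mutex is still locked; the trailing unlock from
    * the current code is gone, so each caller releases the lock itself */
}

static void
_mongoc_topology_do_blocking_scan (mongoc_topology_t *topology,
                                   bson_error_t *error)
{
   (void) error; /* error propagation elided in this sketch */

   mongoc_mutex_lock (&topology->mutex);
   mongoc_topology_scan_once (topology, true /* obey cooldown */);
   mongoc_mutex_unlock (&topology->mutex);
}

The other caller, _mongoc_topology_run_background, would then need a matching mongoc_mutex_unlock after its scan_once call, which is the caller adjustment mentioned above.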

Comment by Roberto Sanchez [ 27/Jun/18 ]

jesse, I have done some analysis of this ticket and here are my findings.

First, the issue appears to be that mongoc_topology_scan_once makes an unconditional call to mongoc_mutex_unlock:

static void
mongoc_topology_scan_once (mongoc_topology_t *topology, bool obey_cooldown)
{
   /* since the last scan, members may be added or removed from the topology
    * description based on ismaster responses in connection handshakes, see
    * _mongoc_topology_update_from_handshake. retire scanner nodes for removed
    * members and create scanner nodes for new ones. */
   mongoc_topology_reconcile (topology);
   mongoc_topology_scanner_start (topology->scanner, obey_cooldown);
 
   /* scanning locks and unlocks the mutex itself until the scan is done */
   mongoc_mutex_unlock (&topology->mutex);
   mongoc_topology_scanner_work (topology->scanner);
 
   mongoc_mutex_lock (&topology->mutex);
 
   _mongoc_topology_scanner_finish (topology->scanner);
 
   topology->last_scan = bson_get_monotonic_time ();
   topology->stale = false;
   mongoc_mutex_unlock (&topology->mutex);
}

That function, mongoc_topology_scan_once, is called from only two places: _mongoc_topology_do_blocking_scan and _mongoc_topology_run_background. There does not appear to be any locking in _mongoc_topology_do_blocking_scan, while there is locking in _mongoc_topology_run_background. Commit bf80fd809c (CDRIVER-2286 fix rare race in topology scanner) from Kevin made this change:

diff --git a/src/mongoc/mongoc-topology.c b/src/mongoc/mongoc-topology.c
index 5f2ddbbd0..060dc3dd5 100644
--- a/src/mongoc/mongoc-topology.c
+++ b/src/mongoc/mongoc-topology.c
@@ -625,15 +625,19 @@ mongoc_topology_select_server_id (mongoc_topology_t *topology,
    BSON_ASSERT (topology);
    ts = topology->scanner;
 
+   mongoc_mutex_lock (&topology->mutex);
+   /* It isn't strictly necessary to lock here, because if the topology
+    * is invalid, it will never become valid. Lock anyway for consistency. */
    if (!mongoc_topology_scanner_valid (ts)) {
       if (error) {
          mongoc_topology_scanner_get_error (ts, error);
          error->domain = MONGOC_ERROR_SERVER_SELECTION;
          error->code = MONGOC_ERROR_SERVER_SELECTION_FAILURE;
       }
-
+      mongoc_mutex_unlock (&topology->mutex);
       return 0;
    }
+   mongoc_mutex_unlock (&topology->mutex);
 
    heartbeat_msec = topology->description.heartbeat_msec;
    local_threshold_ms = topology->local_threshold_msec;
@@ -1449,3 +1453,13 @@ _mongoc_topology_end_sessions_cmd (mongoc_topology_t *topology, bson_t *cmd)

It would appear that the addition of mongoc_mutex_unlock in that change resulted in the double unlock behavior. Back over in _mongoc_topology_run_background there is this bit near the top:

   /* we exit this loop when shutdown_requested, or on error */
   for (;;) {
      /* unlocked after starting a scan or after breaking out of the loop */
      mongoc_mutex_lock (&topology->mutex);
      if (!mongoc_topology_scanner_valid (topology->scanner)) {
         mongoc_mutex_unlock (&topology->mutex);
         goto DONE;
      }

The end of the loop and the end of the function look like this:

      topology->scan_requested = false;
      mongoc_topology_scan_once (topology, false /* obey cooldown */);
 
      last_scan = bson_get_monotonic_time ();
   }
 
DONE:
   return NULL;
}

It appears to me, based on the code in _mongoc_topology_run_background being older, that the expectation is for the mutex to be locked on entry to mongoc_topology_scan_once. However, the change to mongoc_topology_select_server_id in commit bf80fd809c breaks that assumption.
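
Annotating the two call paths with the state of topology->mutex may make the broken assumption clearer (a reading of the excerpts above, not actual driver code or Coverity output):

/* threaded path, _mongoc_topology_run_background: */
mongoc_mutex_lock (&topology->mutex);          /* locked */
mongoc_topology_scan_once (topology, false);   /* its final unlock balances
                                                  the lock above */

/* single-threaded path, mongoc_topology_select_server_id: */
mongoc_mutex_lock (&topology->mutex);          /* locked (pair added in bf80fd809c) */
mongoc_mutex_unlock (&topology->mutex);        /* unlocked again */
_mongoc_topology_do_blocking_scan (topology, &scanner_error);
/* no lock is taken before scan_once on this path, so the unlock at the
 * end of scan_once is the "double unlock" Coverity reports at line 693 */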

Would it be possible to remove the call to mongoc_mutex_unlock outside of the if statement? kevin.albertson, do you think that would break the change you made?

As an additional note, it appears that I need a Synopsys login to view the Coverity defect report. Also, is there a documented process for reproducing the checks locally or via patch builds? That would let me experiment with tweaks to determine whether they fix the problem, and also identify any regressions that might arise.
