Uploaded image for project: 'PHP Driver: Extension'
  1. PHP Driver: Extension
  2. PHPC-1152

Aggregate may fail to use the same session for getMore commands

    • Type: Icon: Bug Bug
    • Resolution: Fixed
    • Priority: Icon: Major - P3 Major - P3
    • 1.4.3
    • Affects Version/s: 1.4.0
    • Component/s: None
    • Labels:
      None
    • Environment:
      PHPC 1.4.2, PHPLIB 1.3.1, PHP 7.1.15
      MongoDB 3.6.3 replica set using FCV 3.6 and authentication

      In mongodb/mongo-php-library#517, a user reported a case where the getMore for an aggregate command produces the following server error:

      Cursor session id (142df95e-b30c-409b-ba3c-213d19c8bbee - TDTj2DYnZ/sCdyr5i0hFB2XrtJGBm4wdvewCHRKZkn0=) is not the same as the operation context's session id (55fa9062-7ad7-4416-a169-4b10415a3e5e - TDTj2DYnZ/sCdyr5i0hFB2XrtJGBm4wdvewCHRKZkn0=)

      This error was alluded to in CDRIVER-2506, although that issue was resolved in libmongoc 1.9.3, which is included in our 1.4.2 release. As the user states, this error only appears with the aggregate command. I was also able to reproduce this in a local environment, so perhaps CDRIVER-2506 did not fully resolve the issue (or there was a problem with the backport to 1.9.x).

      Consider the user's script:

      $client = new \MongoDB\Client(REPLICASET_AUTH_URI);
      
      // test.test collection contains > 101 documents
      $cursor = $client->test->test->find();
      $recordsScanned = 0;
      foreach ($cursor as $row) {
          ++$recordsScanned;
          echo 'A:' . $recordsScanned . "\n";
      }
      
      $cursor = $client->test->test->aggregate([]);
      $recordsScanned = 0;
      foreach ($cursor as $row) {
          ++$recordsScanned;
          echo 'B:' . $recordsScanned . "\n";
      }
      

      The had the following observations:

      • The error still happens if the first query find() is replaced with aggregate()
      • The error does not happen if the second query aggregate() is replaced with find()
      • The error does not happen if the second query stores result into a differently named variable (e.g. $cursor2 instead of $cursor)

      I believe the third bullet is related to the first cursor being freed and releasing the implicit session back to the LIFO session pool. When different variable names are used for each cursor, the find cursor is not freed until the end of the script and its session remains checked out of the pool. That allows the aggregate cursor to select a new session from the pool and re-use that session for its getMore.

      When both cursor variables use the same name, the first cursor is not freed until after the second cursor is already assigned. This means that the aggregate command gets a new session from the pool; however, its getMore operation will then get the session originally used for the first operation (it was more recently returned to the pool – LIFO).


      In my own testing, I saw the error occur even if both operations used explicit sessions (i.e. "session" option provided) sessions. This revealed what I believe to be a root cause of the issue: the getMore for aggregate is not guaranteed to reuse the same session as its original original aggregate command. This appears to be unique for aggregate (which we execute using executeCommand() and friends). I could not reproduce the issue with find commands (which use executeQuery()).

      Consider the following script (where "test.test" contains > 101 documents):

      class SessionLogger implements MongoDB\Driver\Monitoring\CommandSubscriber
      {
          public function commandStarted(MongoDB\Driver\Monitoring\CommandStartedEvent $event)
          {
              if (isset($event->getCommand()->lsid)) {
                  printf("%s command with lsid: %s\n", $event->getCommandName(), bin2hex((string) $event->getCommand()->lsid->id));
              }
          }
      
          public function commandSucceeded(MongoDB\Driver\Monitoring\CommandSucceededEvent $event)
          {
          }
      
          public function commandFailed(MongoDB\Driver\Monitoring\CommandFailedEvent $event)
          {
          }
      }
      
      MongoDB\Driver\Monitoring\addSubscriber(new SessionLogger);
      
      $client = new \MongoDB\Client(REPLICASET_AUTH_URI);
      
      echo "Testing find commands with implicit sessions\n";
      $cursor = $client->test->test->find([]);
      $cursor->toArray();
      $cursor = $client->test->test->find([]);
      $cursor->toArray();
      
      echo "\nTesting find commands with explicit sessions\n";
      $session = $client->startSession();
      printf("Explicit session with lsid: %s\n", bin2hex((string) $session->getLogicalSessionId()->id));
      $cursor = $client->test->test->find([], ['session' => $session]);
      $cursor->toArray();
      $cursor = $client->test->test->find([], ['session' => $session]);
      $cursor->toArray();
      
      echo "\nTesting aggregate commands with implicit sessions\n";
      $cursor = $client->test->test->aggregate([]);
      try {
          $cursor->toArray();
      } catch (MongoDB\Driver\Exception\RuntimeException $e) {
          printf("RuntimeException(%d): %s\n", $e->getCode(), $e->getMessage());
      }
      $cursor = $client->test->test->aggregate([]);
      try {
          $cursor->toArray();
      } catch (MongoDB\Driver\Exception\RuntimeException $e) {
          printf("RuntimeException(%d): %s\n", $e->getCode(), $e->getMessage());
      }
      
      echo "\nTesting aggregate commands with explicit sessions\n";
      $session = $client->startSession();
      printf("Explicit session with lsid: %s\n", bin2hex((string) $session->getLogicalSessionId()->id));
      $cursor = $client->test->test->aggregate([], ['session' => $session]);
      try {
          $cursor->toArray();
      } catch (MongoDB\Driver\Exception\RuntimeException $e) {
          printf("RuntimeException(%d): %s\n", $e->getCode(), $e->getMessage());
      }
      $cursor = $client->test->test->aggregate([], ['session' => $session]);
      try {
          $cursor->toArray();
      } catch (MongoDB\Driver\Exception\RuntimeException $e) {
          printf("RuntimeException(%d): %s\n", $e->getCode(), $e->getMessage());
      }
      

      This produces the following output:

      Testing find commands with implicit sessions
      find command with lsid: f8fedb095d6a4985af13b53a425c8d32
      getMore command with lsid: f8fedb095d6a4985af13b53a425c8d32
      find command with lsid: 625e9fdc4d3348e08cf1e9e359ba17a1
      getMore command with lsid: 625e9fdc4d3348e08cf1e9e359ba17a1
      
      Testing find commands with explicit sessions
      Explicit session with lsid: f8fedb095d6a4985af13b53a425c8d32
      find command with lsid: f8fedb095d6a4985af13b53a425c8d32
      getMore command with lsid: f8fedb095d6a4985af13b53a425c8d32
      find command with lsid: f8fedb095d6a4985af13b53a425c8d32
      getMore command with lsid: f8fedb095d6a4985af13b53a425c8d32
      
      Testing aggregate commands with implicit sessions
      aggregate command with lsid: 625e9fdc4d3348e08cf1e9e359ba17a1
      getMore command with lsid: 625e9fdc4d3348e08cf1e9e359ba17a1
      aggregate command with lsid: 11e88ca9f7b548d79e8e95511a1e6a22
      getMore command with lsid: 625e9fdc4d3348e08cf1e9e359ba17a1
      RuntimeException(13): Cursor session id (11e88ca9-f7b5-48d7-9e8e-95511a1e6a22 - pzarhKY2owve6qhN2jyojBF8O3wMotRKbXhCmcmZSlE=) is not the same as the operation context's session id (625e9fdc-4d33-48e0-8cf1-e9e359ba17a1 - pzarhKY2owve6qhN2jyojBF8O3wMotRKbXhCmcmZSlE=)
      
      Testing aggregate commands with explicit sessions
      Explicit session with lsid: 11e88ca9f7b548d79e8e95511a1e6a22
      aggregate command with lsid: 11e88ca9f7b548d79e8e95511a1e6a22
      killCursors command with lsid: 625e9fdc4d3348e08cf1e9e359ba17a1
      getMore command with lsid: 625e9fdc4d3348e08cf1e9e359ba17a1
      RuntimeException(13): Cursor session id (11e88ca9-f7b5-48d7-9e8e-95511a1e6a22 - pzarhKY2owve6qhN2jyojBF8O3wMotRKbXhCmcmZSlE=) is not the same as the operation context's session id (625e9fdc-4d33-48e0-8cf1-e9e359ba17a1 - pzarhKY2owve6qhN2jyojBF8O3wMotRKbXhCmcmZSlE=)
      aggregate command with lsid: 11e88ca9f7b548d79e8e95511a1e6a22
      killCursors command with lsid: 625e9fdc4d3348e08cf1e9e359ba17a1
      getMore command with lsid: 625e9fdc4d3348e08cf1e9e359ba17a1
      RuntimeException(13): Cursor session id (11e88ca9-f7b5-48d7-9e8e-95511a1e6a22 - pzarhKY2owve6qhN2jyojBF8O3wMotRKbXhCmcmZSlE=) is not the same as the operation context's session id (625e9fdc-4d33-48e0-8cf1-e9e359ba17a1 - pzarhKY2owve6qhN2jyojBF8O3wMotRKbXhCmcmZSlE=)
      killCursors command with lsid: 625e9fdc4d3348e08cf1e9e359ba17a1
      

      The killCursors output is the result of a previous cursor object being freed because it has a non-zero cursor ID. You'll note that the session ID used for killCursors corresponds to the previous getMore responsible for the preceding exception.

      Here we see that find cursors have no problem reusing the correct session ID for their getMore commands; however, aggregate cursors seem to check a new implicit session out of the LIFO pool each time, regardless of whether the original aggregate used an implicit or explicit session.

            Assignee:
            jmikola@mongodb.com Jeremy Mikola
            Reporter:
            jmikola@mongodb.com Jeremy Mikola
            Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

              Created:
              Updated:
              Resolved: