Uploaded image for project: 'Ruby Driver'
  1. Ruby Driver
  2. RUBY-1169

Stale Replica Set Config After Failover

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.3.1
    • Fix Version/s: 12_01_17
    • Component/s: Server Selection
    • Labels:
      None
    • Environment:
      Rails 4.2.7.1, Mongoid 5.1.4, MongoDB 2.6.12
    • Sprint:
      Ruby Sprint 44

      Description

      After we perform a failover in our cluster (rs.stepDown) our app throws an error "not master and slaveOk=false". Looking closer, failover seems to happen normally (we have 9 separate replica sets and therefore clients configured)

      ```
      D, 2016-11-03T11:36:31.299424 #9937 DEBUG – : MONGODB | IOError
      D, 2016-11-03T11:36:31.406014 #9937 DEBUG – : MONGODB | IOError
      D, 2016-11-03T11:36:31.410145 #9937 DEBUG – : MONGODB | IOError
      D, 2016-11-03T11:36:32.053781 #9937 DEBUG – : MONGODB | IOError
      D, 2016-11-03T11:36:32.055543 #9937 DEBUG – : MONGODB | IOError
      D, 2016-11-03T11:36:32.360623 #9937 DEBUG – : MONGODB | IOError
      D, 2016-11-03T11:36:32.464632 #9937 DEBUG – : MONGODB | IOError
      D, 2016-11-03T11:36:33.145646 #9937 DEBUG – : MONGODB | IOError
      D, 2016-11-03T11:36:33.147382 #9937 DEBUG – : MONGODB | IOError
      D, 2016-11-03T11:36:41.172013 #9937 DEBUG – : MONGODB | Server 1234567890-a0.domain.com:51936 elected as primary in rs-1234567890.
      D, 2016-11-03T11:36:41.586878 #9937 DEBUG – : MONGODB | Server 1234567890-a0.domain.com:51936 elected as primary in rs-1234567890.
      D, 2016-11-03T11:36:41.599024 #9937 DEBUG – : MONGODB | Server 1234567890-a0.domain.com:51936 elected as primary in rs-1234567890.
      D, 2016-11-03T11:36:41.949060 #9937 DEBUG – : MONGODB | Server 1234567890-a0.domain.com:51936 elected as primary in rs-1234567890.
      D, 2016-11-03T11:36:42.090859 #9937 DEBUG – : MONGODB | Server 1234567890-a0.domain.com:51936 elected as primary in rs-1234567890.
      D, 2016-11-03T11:36:42.312174 #9937 DEBUG – : MONGODB | Server 1234567890-a0.domain.com:51936 elected as primary in rs-1234567890.
      D, 2016-11-03T11:36:42.707659 #9937 DEBUG – : MONGODB | Server 1234567890-a0.domain.com:51936 elected as primary in rs-1234567890.
      D, 2016-11-03T11:36:43.195367 #9937 DEBUG – : MONGODB | Server 1234567890-a0.domain.com:51936 elected as primary in rs-1234567890.
      D, 2016-11-03T11:36:43.198079 #9937 DEBUG – : MONGODB | Server 1234567890-a0.domain.com:51936 elected as primary in rs-1234567890.
      ```

      However on seemingly random queries after the failover (maybe 5%?) we get the above mentioned error. The error occurs because the mongo driver is returning a stale replica set configuration during those certain queries from our app.

      So if I throw in a `puts config` at https://github.com/mongodb/mongo-ruby-driver/blob/master/lib/mongo/server/description.rb#L443 I get a stream of statements showing that that on all 9 clients the correct primary is set to `1234567890-a0.domain.com:51936` which is correct:

      ```

      {"setName"=>"rs-1234567890", "setVersion"=>6, "ismaster"=>false, "secondary"=>true, "hosts"=>["1234567890-a1.domain.com:51936", "1234567890-a0.domain.com:51936"], "arbiters"=>["1234567890-ar0.domain.com:51936"], "primary"=>"1234567890-a0.domain.com:51936", "me"=>"1234567890-a1.domain.com:51936", "maxBsonObjectSize"=>16777216, "maxMessageSizeBytes"=>48000000, "maxWriteBatchSize"=>1000, "localTime"=>2016-11-03 18:54:49 UTC, "maxWireVersion"=>2, "minWireVersion"=>0, "ok"=>1.0}

      ```

      However, if I hit our app, certain queries produce the following config object in the logs and try to hit the old primary (now secondary):
      ```

      {"setName"=>"rs-1234567890", "setVersion"=>6, "ismaster"=>false, "secondary"=>true, "hosts"=>["1234567890-a0.domain.com:51936", "1234567890-a1.domain.com:51936"], "arbiters"=>["1234567890-ar0.domain.com:51936"], "primary"=>"1234567890-a1.domain.com:51936", "me"=>"1234567890-a0.domain.com:51936", "maxBsonObjectSize"=>16777216, "maxMessageSizeBytes"=>48000000, "maxWriteBatchSize"=>1000, "localTime"=>2016-11-03 17:42:25 UTC, "maxWireVersion"=>2, "minWireVersion"=>0, "ok"=>1.0}

      ```

      notice that `localTime` is significantly behind the time of the "good config" - no matter how many times I try to hit that endpoint, I will get get the same old primary/old config back from the mongo driver.

        Attachments

          Activity

            People

            Assignee:
            Unassigned
            Reporter:
            lddemi Luke Demi
            Participants:
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved: