Core Server / SERVER-8476

slaveDelay with Ghostsync

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 2.3.2
    • Component/s: Replication
    • Labels:
    • Environment:
      CentOS6.2 x86_64
    • Operating System: ALL

      Problem

      We hit a serious problem: clients can end up reading STALE DATA.

      This is caused by the combination of slaveDelay and ghost sync.

      Situation

      Replica set members

      "members" : [
      {  "_id" : 0,
         "host" : "192.168.159.133:27017",
         "priority" : 2
      },{"_id" : 1,
         "host" : "192.168.159.134:27017"
      },{"_id" : 2,
         "host" : "192.168.159.135:27017",
         "priority" : 0,
         "slaveDelay" : 300
      }]
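The setup above can be brought up from the mongo shell with rs.initiate(). A sketch for reference; the member list is taken from this report, but the replica set name is an assumption:

```javascript
// Sketch: initiate the replica set described above from the mongo shell.
// rs.initiate() is the standard shell helper; the hosts, priorities and
// slaveDelay come from the member list in this report.
rs.initiate({
  _id: "rs0",   // set name is an assumption, not stated in the report
  members: [
    { _id: 0, host: "192.168.159.133:27017", priority: 2 },
    { _id: 1, host: "192.168.159.134:27017" },
    { _id: 2, host: "192.168.159.135:27017", priority: 0, slaveDelay: 300 }
  ]
});
```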
      

      Problem 1: syncFrom

      rs.syncFrom('192.168.159.135:27017')
      {
         "syncFromRequested" : "192.168.159.135:27017",
         "warning" : "requested member is more than 10 seconds behind us",
         "prevSyncTarget" : "192.168.159.133:27017",
         "ok" : 1
      }
      

      We see this warning when we request a sync target that is already far behind.
      But the warning does not appear when the replica set is idle (no writes), because the delayed member's optime has caught up with the primary:

      rs.syncFrom('192.168.159.135:27017')
      {
         "syncFromRequested" : "192.168.159.135:27017",
         "prevSyncTarget" : "192.168.159.133:27017",
         "ok" : 1
      }
      

      This problem invites human error.
      But it is bearable, because we can avoid it ourselves.
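Until the warning is reliable, the lag of the intended sync target can be checked by hand before calling rs.syncFrom(), using the optimeDate fields that replSetGetStatus reports per member. A minimal sketch in plain JavaScript; the helper name and the sample status document are illustrative, not output from this cluster:

```javascript
// Compute how many seconds each secondary trails the primary,
// given a replSetGetStatus-style document (members[].optimeDate).
function lagSeconds(status) {
  var primary = status.members.filter(function (m) {
    return m.stateStr === "PRIMARY";
  })[0];
  var lags = {};
  status.members.forEach(function (m) {
    if (m.stateStr === "SECONDARY") {
      // Date - Date yields milliseconds in JavaScript.
      lags[m.name] = (primary.optimeDate - m.optimeDate) / 1000;
    }
  });
  return lags;
}

// Hypothetical status snapshot: the slaveDelay member trails by 300 seconds.
var status = {
  members: [
    { name: "192.168.159.133:27017", stateStr: "PRIMARY",   optimeDate: new Date("2013-02-07T12:05:00Z") },
    { name: "192.168.159.134:27017", stateStr: "SECONDARY", optimeDate: new Date("2013-02-07T12:05:00Z") },
    { name: "192.168.159.135:27017", stateStr: "SECONDARY", optimeDate: new Date("2013-02-07T12:00:00Z") }
  ]
};
console.log(lagSeconds(status)["192.168.159.135:27017"]); // 300
```

Refusing to syncFrom any member whose lag exceeds ~10 seconds would reproduce the warning the shell should have shown.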

      Problem 2: Automatic ghost sync caused by network trouble

      On 192.168.159.133, simulate a network failure that blocks traffic from 192.168.159.134:

      iptables -A INPUT -p tcp --dport 27017 -s 192.168.159.134 -j DROP
      

      Then 192.168.159.134 is still available!
      192.168.159.134 changes its sync target from the primary (192.168.159.133) to the slaveDelay secondary (192.168.159.135) and STAYS ALIVE in spite of now being delayed!

      But we (the mongo client) cannot tell that 192.168.159.134 is now delayed.
      We think the node should appear dead (unreachable from clients) instead of becoming unexpectedly delayed.
      Then we (the client) could read fresh data from the primary instead.
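The behaviour the report asks for, treating an over-delayed member as dead for reads, was later addressed in MongoDB drivers by the maxStalenessSeconds read preference. A client-side sketch of the same idea; the function name, the threshold, and the sample data are hypothetical:

```javascript
// Filter a replSetGetStatus-style member list down to members that are
// acceptable read targets: the primary, plus any secondary whose lag is
// within a staleness budget. Sketch only; not actual driver behaviour.
function readableMembers(status, maxLagSeconds) {
  var primary = status.members.filter(function (m) {
    return m.stateStr === "PRIMARY";
  })[0];
  return status.members
    .filter(function (m) {
      if (m.stateStr === "PRIMARY") return true;
      if (m.stateStr !== "SECONDARY") return false;
      return (primary.optimeDate - m.optimeDate) / 1000 <= maxLagSeconds;
    })
    .map(function (m) { return m.name; });
}

// Hypothetical snapshot after the partition: .134 has silently fallen
// ~290s behind because it is ghost-syncing from the slaveDelay member.
var status = {
  members: [
    { name: "192.168.159.133:27017", stateStr: "PRIMARY",   optimeDate: new Date("2013-02-07T12:05:00Z") },
    { name: "192.168.159.134:27017", stateStr: "SECONDARY", optimeDate: new Date("2013-02-07T12:00:10Z") },
    { name: "192.168.159.135:27017", stateStr: "SECONDARY", optimeDate: new Date("2013-02-07T12:00:00Z") }
  ]
};
console.log(readableMembers(status, 60)); // [ '192.168.159.133:27017' ]
```

With a 60-second budget both delayed secondaries are excluded, so reads fall back to the primary, which is exactly the outcome the report argues for.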

            Assignee:
            backlog-server-repl [DO NOT USE] Backlog - Replication Team
            Reporter:
            crumbjp Hiroaki
            Votes:
            0
            Watchers:
            5
