  Core Server / SERVER-4927

Slaves stop oplog sync if another slave uses fsyncLock

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 2.0.2
    • Component/s: Replication
    • Environment:
      Linux 2.6.32-38-server, Ubuntu 10.04, MongoDB 2.0.1, Replicaset with 4 Nodes, NUMA, 2x XEON E5620 , 24 GB RAM

      This is our setup:
      2 shards, each a replica set with 4 nodes. 1 node is dedicated to backups (priority: 0, hidden: true).
      When we start a backup, we send the fsyncLock command to the backup node and then start an rsync of the filesystem.
      After the backup has finished, we send the fsyncUnlock command to the backup node.
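
      For reference, a minimal sketch of this backup procedure, assuming hypothetical hostnames and paths (db.fsyncLock() and db.fsyncUnlock() are the mongo shell helpers for the lock/unlock steps described above):

          #!/bin/bash
          # Flush pending writes and block further writes on the hidden backup member.
          mongo --host backup-node.example.com --eval 'printjson(db.fsyncLock())'

          # Copy the data files while the backup node is write-locked.
          rsync -a /var/lib/mongodb/ /backups/mongodb-$(date +%F)/

          # Release the lock so the backup node can resume applying the oplog.
          mongo --host backup-node.example.com --eval 'printjson(db.fsyncUnlock())'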

      After a master switch (due to an upgrade or a failure) in the replica set, we encounter the problem that some or all slaves stop oplog syncing when the backup node starts its backup. It happens at exactly the moment we issue the fsyncLock command: the replication lag is the same for the backup node and for the other slaves that stop syncing. When the backup is finished, the other slaves start syncing again as well.
      db.currentOp() doesn't show the fsyncLock on the slaves, only on the backup node.
      To get rid of this problem we have to restart the non-backup slave. After this restart, the slave runs well and never stops syncing together with the backup node again.
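
      A quick way to confirm which members actually hold the fsync lock is to ask each one directly; db.currentOp() only reports the fsyncLock flag while the lock is held (hostnames below are hypothetical):

          # Only the backup node should report the lock during a backup window.
          for host in node1.example.com node2.example.com backup-node.example.com; do
              echo -n "$host: "
              mongo --quiet --host "$host" --eval 'print(db.currentOp().fsyncLock ? "fsync-locked" : "not locked")'
          done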

      This is the second time we've encountered this problem. Since this is our production environment, we don't want to force a master switch unless it is needed.

      It seems that the cause of this problem is the master switch in the replica set.

      Regards,
      Steffen

            Assignee: backlog-server-repl [DO NOT USE] Backlog - Replication Team
            Reporter: Steffen (steffen)