  Core Server / SERVER-35774

Initial sync not replicating old oplog entries may cause a stale node to give up resync and permanently stay in RECOVERING state

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 3.0.0
    • Component/s: Internal Code
    • Labels: None

      The following is the shortest schedule that can trigger the bug:

       

      Environment Setup:

      The environment has 2 TB hard disk drives, and all server nodes run on the same host. This seems to cause slow startup times on some nodes (creating the preallocated 'local' files can take a few minutes). The OS is Ubuntu 16.04 and the MongoDB version is 3.0.0. I also configured the replica set with chaining turned off.
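
      For reference, here is a minimal sketch, using PyMongo, of initiating a three-node replica set with chaining disabled. The replica set name, hosts and ports are assumptions, and the driver version is assumed to be compatible with the server in use.

      {code:python}
      # Minimal sketch: initiate a 3-node replica set with chaining disabled.
      # Assumptions: replica set name "rs0", nodes on localhost:27017-27019.
      from pymongo import MongoClient

      node0 = MongoClient("localhost", 27017, directConnection=True)
      config = {
          "_id": "rs0",
          "members": [
              {"_id": 0, "host": "localhost:27017"},
              {"_id": 1, "host": "localhost:27018"},
              {"_id": 2, "host": "localhost:27019"},
          ],
          # Secondaries must sync directly from the primary (no chaining).
          "settings": {"chainingAllowed": False},
      }
      node0.admin.command("replSetInitiate", config)
      {code}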

      Initially, start three nodes, say node 0, 1 and 2. Create 3 documents initialized as {"_id" : "key0", "value" : "0"}, {"_id" : "key1", "value" : "1"} and {"_id" : "key2", "value" : "2"}, respectively.
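
      As an illustrative sketch (the database and collection names and the port are assumptions), the three documents can be created through the primary like this:

      {code:python}
      # Sketch: create the three initial documents on the primary (node 0).
      # Assumptions: database "test", collection "kv", primary on localhost:27017.
      from pymongo import MongoClient

      primary = MongoClient("localhost", 27017, directConnection=True)
      primary["test"]["kv"].insert_many([
          {"_id": "key0", "value": "0"},
          {"_id": "key1", "value": "1"},
          {"_id": "key2", "value": "2"},
      ])
      {code}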

       

      Then, do the following steps:

      1. Assuming node 0 is the primary, crash nodes 1 and 2. Then invoke updateOne repeatedly to update the document with "_id" : "key0", incrementally setting "value" : "1000.1", "value" : "1000.2", ... "value" : "1000.9" (see the sketch after this list). Crash node 0.
      2. Restart nodes 0 and 1. Wait for their optimes to converge.
      3. Crash node 1. Then invoke updateOne repeatedly to update the document with "_id" : "key1", incrementally setting "value" : "1001.1", "value" : "1001.2", ... "value" : "1001.9". Crash node 0.
      4. Restart nodes 0 and 2. Wait for their optimes to converge.
      5. Crash node 2. Then invoke updateOne repeatedly to update the document with "_id" : "key2", incrementally setting "value" : "1002.1", "value" : "1002.2", ... "value" : "1002.9". Crash node 0.
      6. Restart nodes 1 and 2. Wait for their optimes to converge.
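
      For concreteness, here is a rough sketch of the repeated updateOne calls used in steps 1, 3 and 5, together with a crude check for the optime convergence waited on in steps 2, 4 and 6. The database and collection names, hosts and ports are made up for illustration.

      {code:python}
      # Sketch of the repeated updates and a crude optime-convergence check.
      # Assumptions: database "test", collection "kv", nodes on localhost:27017-27019.
      from pymongo import MongoClient

      primary = MongoClient("localhost", 27017, directConnection=True)  # node 0
      coll = primary["test"]["kv"]

      def bump(key, base):
          # e.g. bump("key0", 1000) sets "value" to "1000.1", "1000.2", ..., "1000.9".
          for i in range(1, 10):
              coll.update_one({"_id": key}, {"$set": {"value": "%d.%d" % (base, i)}})

      def newest_optime(port):
          # Newest entry in the node's oplog, i.e. its last applied operation.
          node = MongoClient("localhost", port, directConnection=True)
          entry = node.local["oplog.rs"].find_one(sort=[("$natural", -1)])
          return entry["ts"]

      bump("key0", 1000)                                        # step 1
      converged = newest_optime(27017) == newest_optime(27018)  # step 2
      {code}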

      At step 2, node 1 bumps up its optime to, say, 10. At step 4, node 2 bumps up its optime to, say, 20. Note that both node 1 and node 2 resynced from node 0 via initial sync because their oplogs were empty. Their oplogs were empty because, due to the slow startup of nodes 1 and 2 during the initialization phase, nothing was ever replicated from node 0. (By inspecting the log messages, I found that they ran very slowly and never got a chance to replicate even the initial create commands.)

      Based on my reading of the source code, the function _initialSync() in rs_initialsync.cpp just clones the data and adds only the oplog entry for the last operation on the corresponding collection. Hence, after step 2, node 1 has only the oplog entry for the update at optime 10, and after step 4, node 2 has only the oplog entry for the update at optime 20. Neither node has any older oplog entries.
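
      This is easy to observe from the outside by looking at the node's oplog. The sketch below (port is an assumption) prints the oldest and newest entries; after an initial sync of this kind, the oplog of the freshly synced node is expected to contain only the most recent operation.

      {code:python}
      # Sketch: inspect how far back a node's oplog goes (port is an assumption).
      from pymongo import MongoClient

      node1 = MongoClient("localhost", 27018, directConnection=True)
      oplog = node1.local["oplog.rs"]

      oldest = oplog.find_one(sort=[("$natural", 1)])   # earliest entry still kept
      newest = oplog.find_one(sort=[("$natural", -1)])  # last applied operation
      print("oldest:", oldest["ts"], oldest["op"], oldest.get("ns"))
      print("newest:", newest["ts"], newest["op"], newest.get("ns"))
      {code}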

      As a result, at step 6, node 2's oldest oplog entry is at optime 20, while node 1's last applied optime is 10. Node 1 therefore refuses to sync from node 2, because the entries it needs (optimes 11 through 19) are no longer available. This would not be a problem if node 1's oplog were empty, since it would then fall back to initial sync. However, because node 1 does have some oplog entries, it never falls back to initial sync.
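
      The staleness condition described above can be checked by comparing node 1's newest applied oplog timestamp with the oldest timestamp still present in node 2's oplog. In the sketch below the ports are assumptions, and it assumes the local database is still readable on the stale member; the shell's rs.printReplicationInfo() shows similar information.

      {code:python}
      # Sketch: the oplog-gap check described above (ports are assumptions).
      from pymongo import MongoClient

      node1 = MongoClient("localhost", 27018, directConnection=True)
      node2 = MongoClient("localhost", 27019, directConnection=True)

      newest_on_1 = node1.local["oplog.rs"].find_one(sort=[("$natural", -1)])["ts"]
      oldest_on_2 = node2.local["oplog.rs"].find_one(sort=[("$natural", 1)])["ts"]

      # Compare the (time, inc) tuples of the BSON timestamps.
      if (newest_on_1.time, newest_on_1.inc) < (oldest_on_2.time, oldest_on_2.inc):
          print("node 1 is too stale to catch up from node 2's oplog")
      else:
          print("node 1 can catch up from node 2")
      {code}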

      Therefore, node 1 stays in the RECOVERING state permanently. This blocks convergence, and clients cannot read from node 1 indefinitely. Recovering requires manual intervention, i.e. explicitly triggering an initial sync with other utilities. Initial sync is expensive, and that is probably why MongoDB is designed to stay stuck rather than automatically trigger an initial sync in the situation described above.
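
      Detecting this situation from the outside is straightforward; for example, the sketch below (port is an assumption) uses replSetGetStatus to check whether the member is stuck in RECOVERING. The usual manual remedy is to force a fresh initial sync, e.g. by stopping the member and emptying its data directory before restarting it.

      {code:python}
      # Sketch: detect a member stuck in RECOVERING (port is an assumption).
      from pymongo import MongoClient

      node1 = MongoClient("localhost", 27018, directConnection=True)
      status = node1.admin.command("replSetGetStatus")
      me = next(m for m in status["members"] if m.get("self"))
      if me["stateStr"] == "RECOVERING":
          print("member is RECOVERING; a manual resync (fresh initial sync) may be needed")
      {code}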

      Still, to reduce the chance of ending up in a situation that requires an expensive initial sync, I think initial sync should also replicate some number of older oplog entries.

            Assignee: Nick Brewer (nick.brewer)
            Reporter: anaud (beomheyn.kim@gmail.com)
            Votes: 0
            Watchers: 6
