  Core Server / SERVER-35774

Initial sync not replicating old oplog entries may cause a stale node to give up resync and permanently stay in RECOVERING state

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 3.0.0
    • Component/s: Internal Code
    • Labels: None

      The following is the shortest schedule that can trigger the bug:

       

      Environment Setup:

      The environment has 2 TB hard disk drives, and all server nodes run on the same host. This seems to cause slow startup times on some nodes (creating the preallocated 'local' files can take a few minutes). The OS is Ubuntu 16.04 and the MongoDB version is 3.0.0. I also configured the replica set with chaining turned off.
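
      For reference, here is a minimal sketch, using PyMongo, of initiating a three-node replica set with chaining disabled. The replica set name, hosts and ports are assumptions, and the driver version is assumed to be compatible with the server in use.

      {code:python}
      # Minimal sketch: initiate a 3-node replica set with chaining disabled.
      # Assumptions: replica set name "rs0", nodes on localhost:27017-27019.
      from pymongo import MongoClient

      node0 = MongoClient("localhost", 27017, directConnection=True)
      config = {
          "_id": "rs0",
          "members": [
              {"_id": 0, "host": "localhost:27017"},
              {"_id": 1, "host": "localhost:27018"},
              {"_id": 2, "host": "localhost:27019"},
          ],
          # Secondaries must sync directly from the primary (no chaining).
          "settings": {"chainingAllowed": False},
      }
      node0.admin.command("replSetInitiate", config)
      {code}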

      Initially, start three nodes, say node 0, 1 and 2. Create 3 documents initialized as {"_id" : "key0", "value" : "0"}, {"_id" : "key1", "value" : "1"} and {"_id" : "key2", "value" : "2"}, respectively.
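
      As an illustrative sketch (the database and collection names and the port are assumptions), the three documents can be created through the primary like this:

      {code:python}
      # Sketch: create the three initial documents on the primary (node 0).
      # Assumptions: database "test", collection "kv", primary on localhost:27017.
      from pymongo import MongoClient

      primary = MongoClient("localhost", 27017, directConnection=True)
      primary["test"]["kv"].insert_many([
          {"_id": "key0", "value": "0"},
          {"_id": "key1", "value": "1"},
          {"_id": "key2", "value": "2"},
      ])
      {code}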

       

      Then, do the following steps:

      1. Assuming node 0 is the primary, crash nodes 1 and 2. Then invoke updateOne repeatedly to update the document with "_id" : "key0", incrementally setting "value" : "1000.1", "value" : "1000.2", ... "value" : "1000.9" (see the sketch after this list). Crash node 0.
      2. Restart nodes 0 and 1. Wait for their optimes to converge.
      3. Crash node 1. Then invoke updateOne repeatedly to update the document with "_id" : "key1", incrementally setting "value" : "1001.1", "value" : "1001.2", ... "value" : "1001.9". Crash node 0.
      4. Restart nodes 0 and 2. Wait for their optimes to converge.
      5. Crash node 2. Then invoke updateOne repeatedly to update the document with "_id" : "key2", incrementally setting "value" : "1002.1", "value" : "1002.2", ... "value" : "1002.9". Crash node 0.
      6. Restart nodes 1 and 2. Wait for their optimes to converge.
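
      For concreteness, here is a rough sketch of the repeated updateOne calls used in steps 1, 3 and 5, together with a crude check for the optime convergence waited on in steps 2, 4 and 6. The database and collection names, hosts and ports are made up for illustration.

      {code:python}
      # Sketch of the repeated updates and a crude optime-convergence check.
      # Assumptions: database "test", collection "kv", nodes on localhost:27017-27019.
      from pymongo import MongoClient

      primary = MongoClient("localhost", 27017, directConnection=True)  # node 0
      coll = primary["test"]["kv"]

      def bump(key, base):
          # e.g. bump("key0", 1000) sets "value" to "1000.1", "1000.2", ..., "1000.9".
          for i in range(1, 10):
              coll.update_one({"_id": key}, {"$set": {"value": "%d.%d" % (base, i)}})

      def newest_optime(port):
          # Newest entry in the node's oplog, i.e. its last applied operation.
          node = MongoClient("localhost", port, directConnection=True)
          entry = node.local["oplog.rs"].find_one(sort=[("$natural", -1)])
          return entry["ts"]

      bump("key0", 1000)                                        # step 1
      converged = newest_optime(27017) == newest_optime(27018)  # step 2
      {code}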

      At step 2, node 1 bumps up its optime to, say, 10. At step 4, node 2 bumps up its optime to, say, 20. Note that both node 1 and node 2 resynced from node 0 via initial sync because their oplogs were empty. Their oplogs were empty because, due to the slow startup of nodes 1 and 2 during the initialization phase, nothing was ever replicated from node 0. (By inspecting the log messages, I found that they ran very slowly and never got a chance to replicate even the initial create commands.)

      Based on my reading of the source code, the function _initialSync() in rs_initialsync.cpp just clones the data and adds only the oplog entry for the last operation on the corresponding collection. Hence, after step 2, node 1 has only the oplog entry for the update at optime 10, and after step 4, node 2 has only the oplog entry for the update at optime 20. Neither node has any older oplog entries.
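
      This is easy to observe from the outside by looking at the node's oplog. The sketch below (port is an assumption) prints the oldest and newest entries; after an initial sync of this kind, the oplog of the freshly synced node is expected to contain only the most recent operation.

      {code:python}
      # Sketch: inspect how far back a node's oplog goes (port is an assumption).
      from pymongo import MongoClient

      node1 = MongoClient("localhost", 27018, directConnection=True)
      oplog = node1.local["oplog.rs"]

      oldest = oplog.find_one(sort=[("$natural", 1)])   # earliest entry still kept
      newest = oplog.find_one(sort=[("$natural", -1)])  # last applied operation
      print("oldest:", oldest["ts"], oldest["op"], oldest.get("ns"))
      print("newest:", newest["ts"], newest["op"], newest.get("ns"))
      {code}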

      As a result, at step 6, node 2's oldest oplog entry is at optime 20, while node 1's last applied optime is 10. Node 1 therefore refuses to sync from node 2, because the entries it needs (optimes 11 through 19) are no longer available. This would not be a problem if node 1's oplog were empty, since it would then fall back to initial sync. However, because node 1 does have some oplog entries, it never falls back to initial sync.
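
      The staleness condition described above can be checked by comparing node 1's newest applied oplog timestamp with the oldest timestamp still present in node 2's oplog. In the sketch below the ports are assumptions, and it assumes the local database is still readable on the stale member; the shell's rs.printReplicationInfo() shows similar information.

      {code:python}
      # Sketch: the oplog-gap check described above (ports are assumptions).
      from pymongo import MongoClient

      node1 = MongoClient("localhost", 27018, directConnection=True)
      node2 = MongoClient("localhost", 27019, directConnection=True)

      newest_on_1 = node1.local["oplog.rs"].find_one(sort=[("$natural", -1)])["ts"]
      oldest_on_2 = node2.local["oplog.rs"].find_one(sort=[("$natural", 1)])["ts"]

      # Compare the (time, inc) tuples of the BSON timestamps.
      if (newest_on_1.time, newest_on_1.inc) < (oldest_on_2.time, oldest_on_2.inc):
          print("node 1 is too stale to catch up from node 2's oplog")
      else:
          print("node 1 can catch up from node 2")
      {code}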

      Therefore, node 1 stays in the RECOVERING state permanently. This blocks convergence, and clients cannot read from node 1 indefinitely. Recovering requires manual intervention, i.e. explicitly triggering an initial sync with other utilities. Initial sync is expensive, and that is probably why MongoDB is designed to stay stuck rather than automatically trigger an initial sync in the situation described above.
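
      Detecting this situation from the outside is straightforward; for example, the sketch below (port is an assumption) uses replSetGetStatus to check whether the member is stuck in RECOVERING. The usual manual remedy is to force a fresh initial sync, e.g. by stopping the member and emptying its data directory before restarting it.

      {code:python}
      # Sketch: detect a member stuck in RECOVERING (port is an assumption).
      from pymongo import MongoClient

      node1 = MongoClient("localhost", 27018, directConnection=True)
      status = node1.admin.command("replSetGetStatus")
      me = next(m for m in status["members"] if m.get("self"))
      if me["stateStr"] == "RECOVERING":
          print("member is RECOVERING; a manual resync (fresh initial sync) may be needed")
      {code}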

      Still, to reduce the chance of ending up in a situation that requires an expensive initial sync, I think initial sync should also replicate some number of older oplog entries.

            Assignee: Nick Brewer (nick.brewer)
            Reporter: anaud (beomheyn.kim@gmail.com)
            Votes: 0
            Watchers: 6
