Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 3.0.12
Component/s: Replication
Labels:
- RF
- initialSync

Operating System:
ALL
Sprint:
Repl 2019-07-01
Case:
Confidence Status:
None
Work Order:
3

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Summary
When a member attempts an initial sync from a replica set with high document churn, in particular one where documents are frequently updated and then deleted (the same document is updated, then deleted later), it suffers from a low replication rate when it first switches to catching up on the oplog. The replication rate can be so low that the initial sync eventually fails in this final stage.

Scenario
Start with a collection of a few hundred thousand documents. Run a script that updates documents somewhat randomly, but also deletes and inserts some. In particular, it is important that documents which are being updated are also candidates for deletion, with new documents being inserted to take their place. The idea is to make the syncing member's copy of the collection highly skewed across time, so that when it starts to apply the oplog it encounters updates for documents that exist neither in its own store nor on any remote member. This results in a message like the following:

[repl writer worker 1] missing object not found on source. presumably deleted later in oplog

These messages are not fatal and the scenario can be recovered, but their appearance is correlated with both very slow oplog application and connection churn on the sync source, caused by connections from the member being synced.

The oplog application speeds up dramatically once it gets past the crossover point where no more operations are occurring for deleted documents.
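
The crossover can be illustrated with a small in-memory model. This is only a sketch, not server code: Maps stand in for the local and source collections, the oplog entry shape ({op, _id, set, doc}) is invented for illustration, and it assumes that each update against a locally missing document costs one lookup on the sync source.

```javascript
// Hypothetical model of oplog application during initial sync; not server code.
// local:  Map of documents the syncing member has cloned so far.
// source: Map of documents that currently exist on the sync source.
function applyOplog(local, source, oplog) {
  const stats = { applied: 0, sourceLookups: 0, missingOnSource: 0, lastMissIndex: -1 };
  oplog.forEach((entry, index) => {
    if (entry.op === "i") {
      local.set(entry._id, { ...entry.doc });
      stats.applied += 1;
    } else if (entry.op === "d") {
      local.delete(entry._id); // deleting an absent document is a no-op
      stats.applied += 1;
    } else if (entry.op === "u") {
      if (!local.has(entry._id)) {
        // Assumed cost: one lookup on the source per locally missing document.
        stats.sourceLookups += 1;
        if (!source.has(entry._id)) {
          // Corresponds to "missing object not found on source".
          stats.missingOnSource += 1;
          stats.lastMissIndex = index;
          return; // nothing to update, move on to the next entry
        }
        local.set(entry._id, { ...source.get(entry._id) });
      }
      Object.assign(local.get(entry._id), entry.set);
      stats.applied += 1;
    }
  });
  return stats;
}

// Document "b" was deleted on the source before the cloner reached it,
// so the local copy never had it, yet the oplog still holds its updates.
const local = new Map([["a", { v: 0 }]]);
const source = new Map([["a", { v: 2 }], ["c", { v: 1 }]]);
const oplog = [
  { op: "u", _id: "b", set: { v: 1 } },
  { op: "u", _id: "a", set: { v: 1 } },
  { op: "u", _id: "b", set: { v: 2 } },
  { op: "d", _id: "b" },
  { op: "i", _id: "c", doc: { v: 0 } },
  { op: "u", _id: "c", set: { v: 1 } },
  { op: "u", _id: "a", set: { v: 2 } },
];
const stats = applyOplog(local, source, oplog);
console.log(stats.missingOnSource, stats.lastMissIndex); // prints: 2 2
```

In this toy run both misses occur before the delete of "b" at index 3. Every entry after that point applies without a source lookup, which is the crossover behaviour described above.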

Reproduction
The attached script can reproduce this as follows:

Create a single member replica-set.
Run tdb.insertUntil(500000) in the shell and go read a book (this is safe to run from multiple mongo shells to improve throughput)
Add a new member, which should start sync'ing
Run tdb.execChurner(1000,20,50) in the shell - this loops 1000 times over the following:
- Find 50 sort-of random documents, make an array of their _id
- Apply a random update to each document in the _id array, delete the last document in the array (remove the document from MongoDB and pop that array entry), and repeat until the array is empty
- Sleep for 20ms
Watch what happens when the sync'ing member switches to applying the oplog.
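
For readers without the attachment, the loop body described above can be sketched roughly as follows. This is not the attached tdb.js: a Map stands in for the collection, an array stands in for the oplog, and all function names are hypothetical. The replacement insert is an assumption taken from the Scenario section, and the 20ms sleep between passes is left out so the sketch stays synchronous.

```javascript
// Hypothetical stand-in for the churn loop; NOT the attached tdb.js.
// A Map plays the role of the collection, an array the role of the oplog.

// Small seeded PRNG (mulberry32) so that runs are repeatable.
function makeRng(seed) {
  let a = seed >>> 0;
  return function () {
    a = (a + 0x6D2B79F5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pick `count` distinct _id values with a partial Fisher-Yates shuffle.
function pickIds(coll, count, rng) {
  const ids = Array.from(coll.keys());
  const n = Math.min(count, ids.length);
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(rng() * (ids.length - i));
    [ids[i], ids[j]] = [ids[j], ids[i]];
  }
  return ids.slice(0, n);
}

// One pass of the loop body. Returns the next unused _id.
function churnOnce(coll, oplog, batchSize, rng, nextId) {
  const ids = pickIds(coll, batchSize, rng);
  while (ids.length > 0) {
    // Apply a random update to every document still in the array.
    for (const id of ids) {
      coll.get(id).value = rng();
      oplog.push({ op: "u", _id: id });
    }
    // Delete the last document and pop its entry from the array.
    const victim = ids.pop();
    coll.delete(victim);
    oplog.push({ op: "d", _id: victim });
    // Assumption: insert a replacement so the collection size stays constant.
    coll.set(nextId, { value: rng() });
    oplog.push({ op: "i", _id: nextId });
    nextId += 1;
  }
  return nextId;
}

// Usage: 1000 documents, one pass over a batch of 50.
const coll = new Map();
for (let i = 0; i < 1000; i++) coll.set(i, { value: 0 });
const oplog = [];
const rng = makeRng(42);
const nextId = churnOnce(coll, oplog, 50, rng, 1000);
const count = (op) => oplog.filter((e) => e.op === op).length;
console.log(count("u"), count("d"), count("i"), coll.size); // prints: 1275 50 50 1000
```

A single pass over 50 documents produces 1275 updates (50 + 49 + ... + 1), and every one of them targets a document that is deleted by the end of the same pass. That is the pattern that leaves a syncing member with oplog updates for documents it can no longer find anywhere.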

If this is monitored closely, the syncing member is likely to continue falling further behind because it applies oplog entries too slowly. As the number of "missing object not found on source" messages drops, the replication rate increases. Once no more messages of that type occur, the replication rate skyrockets.

tdb.js
7 kB
Jun 09 2016 06:17:06 AM UTC

related to

SERVER-15410 Batch fetch missing documents during initial sync, with retries

Closed

Assignee: Evin Roesle
Reporter: Andrew Ryder (Inactive)
Participants: Andrew Ryder, Evin Roesle, Scott Hernandez
Votes: 3
Watchers: 24

Created: Jun 09 2016 06:17:06 AM UTC
Updated: Jan 03 2020 09:48:13 PM UTC
Resolved: Jan 03 2020 09:48:13 PM UTC
