  1. Core Server
  2. SERVER-24482

Initial sync during high document update/churn causes repl-worker slowness, connection churn

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • None
    • Affects Version/s: 3.0.12
    • Component/s: Replication
    • Labels:
    • ALL
    • Repl 2019-07-01

      Summary
      In a replica set with high document churn, in particular where documents are frequently updated and later deleted (the same document is updated, then deleted), a member that attempts an initial sync suffers from a low replication rate when it first switches to catching up on the oplog. The replication rate can be so low that the initial sync eventually fails in this final stage.

      Scenario
      Start with a collection containing a few hundred thousand documents. Run a script that updates documents somewhat randomly, but also deletes and inserts some. In particular, it is important that documents being updated are also candidates for deletion, with new documents inserted to take their place. The idea is to make the syncing member's collection copy highly skewed across time, so that when it starts to apply the oplog it encounters updates for documents that exist neither in its own store nor on any remote member. This results in a message like the following:

      [repl writer worker 1] missing object not found on source. presumably deleted later in oplog
      

      These messages are not fatal and the scenario is recoverable, but their appearance is correlated with both very slow oplog application and connection churn observed on the sync source, originating from the member being synced.

      The oplog application speeds up dramatically once it gets past the crossover point where no more operations are occurring for deleted documents.
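
      The crossover point can be illustrated with a toy model. This is a sketch only, not MongoDB's replication code: the function name replayOplog, the entry shape { op, _id } and the "fetch from source" step are assumptions made for illustration (the log message suggests that the syncing member looks for the missing document on its sync source). It replays a list of operations against the set of _ids the clone captured and counts the updates whose target is missing both locally and on the source.

```javascript
// Toy model (hypothetical names, not server code): replay an oplog
// against the _ids captured by the initial clone.
function replayOplog(oplog, clonedIds, sourceIds) {
  const local = new Set(clonedIds);
  let missingOnSource = 0;   // updates whose document exists nowhere
  let lastMissingIndex = -1; // position of the last such update

  oplog.forEach((entry, i) => {
    if (entry.op === "i") {
      local.add(entry._id);
    } else if (entry.op === "d") {
      local.delete(entry._id);
    } else if (entry.op === "u" && !local.has(entry._id)) {
      if (sourceIds.has(entry._id)) {
        // Assumption: the document can be fetched from the source.
        local.add(entry._id);
      } else {
        // "missing object not found on source" case.
        missingOnSource += 1;
        lastMissingIndex = i;
      }
    }
  });
  return { missingOnSource, lastMissingIndex };
}

// Document 1 was updated twice and then deleted before the clone saw it.
const sampleOplog = [
  { op: "u", _id: 1 }, { op: "u", _id: 2 }, { op: "u", _id: 1 },
  { op: "d", _id: 1 }, { op: "u", _id: 2 }, { op: "u", _id: 3 },
];
const sampleResult = replayOplog(sampleOplog, [2, 3], new Set([2, 3]));
console.log(sampleResult); // { missingOnSource: 2, lastMissingIndex: 2 }
```

      In this model, every entry after lastMissingIndex applies without a failed lookup, which corresponds to the point where the observed replication rate recovers.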

      Reproduction
      The attached script can reproduce this as follows:

      1. Create a single member replica-set.
      2. Run tdb.insertUntil(500000) in the shell and go read a book (this is safe to run from multiple mongo shells to improve throughput)
      3. Add a new member, which should start sync'ing
      4. Run tdb.execChurner(1000,20,50) in the shell - this loops 1000 times over the following:
        • Find 50 sort-of random documents, make an array of their _id
        • Apply a random update to each document in the _id array, then delete the last document in the array (remove the document from MongoDB and pop its array entry); repeat until the array is empty
        • Sleep for 20ms
      5. Watch what happens when the sync'ing member switches to applying the oplog.
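
      The attached script is not reproduced here. The churn loop in step 4 can be sketched against an in-memory stand-in for the collection, as below. Everything in this sketch is hypothetical: makeCollection and churnOnce are illustrative names, a Map replaces the real collection, and the replacement insert after each delete is an assumption taken from the Scenario text rather than from the script itself.

```javascript
// In-memory stand-in for a collection: _id -> document (hypothetical).
function makeCollection(n) {
  const docs = new Map();
  for (let i = 0; i < n; i++) docs.set(i, { _id: i, counter: 0 });
  return docs;
}

// One pass of the churn loop. Returns the operations it issued, in
// order, plus the next unused _id.
function churnOnce(docs, batchSize, nextId, rand = Math.random) {
  const ops = [];
  const candidates = Array.from(docs.keys());

  // Pick up to batchSize distinct _ids at random.
  const ids = [];
  while (ids.length < batchSize && candidates.length > 0) {
    const k = Math.floor(rand() * candidates.length);
    ids.push(candidates.splice(k, 1)[0]);
  }

  while (ids.length > 0) {
    // Update every document still in the array.
    for (const id of ids) {
      docs.get(id).counter += 1;
      ops.push({ op: "u", _id: id });
    }
    // Delete the last one and pop its array entry.
    const victim = ids.pop();
    docs.delete(victim);
    ops.push({ op: "d", _id: victim });
    // Assumption: insert a fresh document to take its place.
    docs.set(nextId, { _id: nextId, counter: 0 });
    ops.push({ op: "i", _id: nextId });
    nextId += 1;
  }
  return { ops, nextId };
}

const churnDocs = makeCollection(1000);
const churnResult = churnOnce(churnDocs, 50, 1000);
const updateCount = churnResult.ops.filter((o) => o.op === "u").length;
const deleteCount = churnResult.ops.filter((o) => o.op === "d").length;
console.log(updateCount, deleteCount, churnDocs.size); // 1275 50 1000
```

      With a batch of 50, one pass issues 50 + 49 + ... + 1 = 1275 updates but only 50 deletes, and every updated document is deleted before the pass ends. This is the kind of oplog in which many updates target documents that no longer exist anywhere.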

      If this is monitored closely, the syncing member is likely to fall further and further behind because it applies oplog entries too slowly. As the number of "missing object not found on source" messages drops, the replication rate increases. Once no more messages of that type occur, the replication rate skyrockets.

            Assignee:
            evin.roesle@mongodb.com Evin Roesle
            Reporter:
            andrew.ryder@mongodb.com Andrew Ryder (Inactive)
            Votes:
            3
            Watchers:
            24
