Core Server / SERVER-28709

Initial sync from MMAP to WT fails if a large document is grown over the 16MB limit by an oplog update

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Major - P3
    • Affects Version/s: 3.4.3
    • Component/s: Replication
    • Labels: None
      Steps to reproduce:
      • Create a large(ish) document (<16MB) in an MMAPv1 replica set.
      • Initiate an initial sync (for example by adding a new WiredTiger member).
      • While the sync is running, apply an update that would push the document size over the 16MB limit.
      • Wait for the initial copy to finish; the oplog catch-up will then fail. A reproduction sketch follows this list.
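
      A minimal reproduction sketch of the steps above, assuming the pymongo driver, a replica set named "rs0" on localhost, and placeholder database, collection and field names (the actual update in the log excerpts below is snipped):

      # Hypothetical reproduction sketch; names and sizes are illustrative only.
      from pymongo import MongoClient

      client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
      coll = client["test"]["large_docs"]
      ONE_MB = 1024 * 1024

      # Step 1: create a large(ish) document well under the 16MB BSON limit (~14MB).
      coll.insert_one({"_id": 1, "chunks": ["x" * ONE_MB] * 14})

      # Step 2: add the new WiredTiger member now (rs.add() in the mongo shell),
      # so that its initial sync runs while the document keeps growing.
      input("Initial sync started? Press Enter to apply the growth update... ")

      # Step 3: grow the document while the new member is still syncing. The update
      # succeeds on the primary (~15.5MB), but if the syncing member replays it
      # against a copy of the document that already contains the growth, the result
      # can exceed 16MB and the batch fails with error 17419, as in the logs below.
      coll.update_one({"_id": 1}, {"$push": {"chunks": "x" * (ONE_MB + ONE_MB // 2)}})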

      Initial sync from an MMAPv1 replica set to a new WiredTiger secondary, both running 3.4.3:

      2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread] db version v3.4.3
      2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread] git version: f07437fb5a6cca07c10bafa78365456eb1d6d5e1
      2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread] OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
      2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread] allocator: tcmalloc
      2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread] modules: none
      2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread] build environment:
      2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread]     distmod: ubuntu1404
      2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread]     distarch: x86_64
      2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread]     target_arch: x86_64
      

      The initial copy of the data was successful, but the oplog catchup stage failed due to an update that breached the 16MB limit:

      2017-04-06T16:04:04.054+0000 F REPL     [repl writer worker 11] warning: log line attempted (147kB) over max size (10kB), printing beginning and end ... writer worker caught exception:  :: caused by :: 17419 Resulting document after update is larger than 16777216 on: { ts: Timestamp 1491429857000|404, h: 8450293174279352223, v: 2, op: "u", ns: "[SNIP - large update]", o2: { _id: ObjectId('58961877c141580404e7f604') }, o: { $set: { [SNIP] } }
      2017-04-06T16:04:04.055+0000 I REPL     [repl writer worker 8] missing object not found on source. presumably deleted later in oplog
      2017-04-06T16:04:04.055+0000 I REPL     [repl writer worker 8] o2: { _id: ObjectId('58e4e2e9bc44b6c99ed9bb2b') }
      2017-04-06T16:04:04.055+0000 I REPL     [repl writer worker 8] o firstfield: $set
      2017-04-06T16:04:04.067+0000 E REPL     [replication-59] Failed to apply batch due to 'Location17419: error applying batch: Resulting document after update is larger than 16777216'
      2017-04-06T16:04:04.075+0000 I REPL     [replication-61] Finished fetching oplog during initial sync: CallbackCanceled: fetcher was shut down after previous batch was processed. Last fetched optime and hash: { ts: Timestamp 1491494643000|612, t: -1 }[1653548606848978015]
      2017-04-06T16:04:04.075+0000 I REPL     [replication-61] Initial sync attempt finishing up.
      2017-04-06T16:04:04.082+0000 I REPL     [replication-61] warning: log line attempted (491kB) over max size (10kB), printing beginning and end ... Initial Sync Attempt Statistics: { failedInitialSyncAttemp
      ts: 0, maxFailedInitialSyncAttempts: 10, initialSyncStart: new Date(1491397073200), initialSyncAttempts: [], fetchedMissingDocs: 110321, appliedOps: 27140000, initialSyncOplogStart: Timestamp 149139707400
      0|51, initialSyncOplogEnd: Timestamp 1491494643000|612, databases: { databasesCloned: 44, [SNIP]: { collections: 1, clonedCollections: 1, start: new Date(1491397074124), end: new Date(1491397074208), elapsedMillis: 84, [SNIP]: { documentsToCopy: 14, documentsCopied: 14, indexes: 1, fetchedBatches: 1, start: new Date(1491397074124), end: new Date(1491397074208), elapsedMillis: 84 } }, [SNIP]: { collections: 51, clonedCollections: 51, start: new Date(1491397074208), end: new Date(1491397729132), elapsedMillis: 654924, [SNIP]: { documentsToCopy: 1, documentsCopied: 1, indexes: 1, fetchedBatches: 1, start: new Date(1491397074209), end: new Date(1491397074287), elapsedMillis: 78 }, [SNIP]: { documentsToCopy: 310824, documentsCopied: 310834, indexes: 2, fetchedBatches: 129, start: new Date(1491397074287), end: new Date(1491397082539), elapsedMillis: 8252 }, [SNIP]: { documentsToCopy: 2236029, documentsCopied: 2236405, indexes: 2, fetchedBatches: 109, start: new Date(1491397082539), end: new Date(1491397133928), elapsedMillis: 51389 }, [SNIP]: { documentsToCopy: 30770, documentsCopied: 30770, indexes: 2, fetchedBatches: 1, start: new Date(1491397133928), end: new Date(1491397134604), elapsedMillis: 676 }, [SNIP]: { documentsToCopy: 1188, documentsCopied: 1188, indexes: 1, fetchedBatches: 1, start: new Date(1491397134604), end: new Date(1491397134679), elapsedMillis: 75 }, [SNIP]: { documentsToCopy: 111810, documentsCopied: 111811, indexes: 2, fetchedBatches: 5, start: new Date(1491397134679), end: new Date(1491397137511), elapsedMillis: 2832 }, [SNIP]: { documentsToCopy: 16767, documentsCopied: 16767, indexes: 1, fetchedBatches: 1, start: new Date(1491397137511), end: new Date(1491397137922), elapsedMillis: 411 }, [SNIP]: { documentsToCopy: 85965, documentsCopied: 85969, indexes: 2, fetchedBatches: 11, start: new Date(1491397137921), end: new Date(1491397146202), elapsedMillis: 8281 }, [SNIP]: { documentsToCopy: 67820, documentsCopied: 67826, indexes: 2, fetchedBatches: 15, start: new Date(1491397146202), end: new Date(1491397156706), elapsedMillis: 10504 }, [SNIP]: { documentsToCopy: 5, documentsCopied: 5, indexes: 2, fetchedBatches: 1, start: new Date(1491397156706), end: new Date(1491397156765), elapsedMillis: 59 }, [SNIP]: { documentsToCopy: 18827, documentsCopied: 18827, indexes: 2, fetchedBatches: 1, start: new Date(1491397156765), end: new Date(1491397157548), elapsedMillis: 783 }, [SNIP]: { documentsToCopy: 9603, documentsCopied: 9603, indexes: 1, fetchedBatches: 1, start: new Date(1491397157547), end: new Date(1491397157821), elapsedMillis: 274 }, [SNIP]: { documentsToCopy: 86, documentsCopied: 86, indexes: 1, fetchedBatches: 1, start: new Date(1491397157821), end: new Date(1491397157842), e .......... 
alerts.krwfihep_client_alerts: { documentsToCopy: 0, documentsCopied: 0, indexes: 7, fetchedBatches: 0, start: new Date(1491492468641), end: new Date(1491492468683), elapsedMillis: 42 }, [SNIP]: { documentsToCopy: 513, documentsCopied: 513, indexes: 7, fetchedBatches: 1, start: new Date(1491492468683), end: new Date(1491492468943), elapsedMillis: 260 }, [SNIP]: { documentsToCopy: 7, documentsCopied: 7, indexes: 7, fetchedBatches: 1, start: new Date(1491492468943), end: new Date(1491492469036), elapsedMillis: 93 }, [SNIP]: { documentsToCopy: 3245, documentsCopied: 3245, indexes: 7, fetchedBatches: 1, start: new Date(1491492469036), end: new Date(1491492470635), elapsedMillis: 1599 }, [SNIP]: { documentsToCopy: 921, documentsCopied: 921, indexes: 7, fetchedBatches: 1, start: new Date(1491492470635), end: new Date(1491492470715), elapsedMillis: 80 }, [SNIP]: { documentsToCopy: 1747, documentsCopied: 1747, indexes: 7, fetchedBatches: 1, start: new Date(1491492470715), end: new Date(1491492470993), elapsedMillis: 278 }, [SNIP]: { documentsToCopy: 0, documentsCopied: 0, indexes: 7, fetchedBatches: 0, start: new Date(1491492470993), end: new Date(1491492471052), elapsedMillis: 59 }, [SNIP]: { documentsToCopy: 21, documentsCopied: 21, indexes: 7, fetchedBatches: 1, start: new Date(1491492471052), end: new Date(1491492471102), elapsedMillis: 50 }, [SNIP]: { documentsToCopy: 0, documentsCopied: 0, indexes: 7, fetchedBatches: 0, start: new Date(1491492471102), end: new Date(1491492471168), elapsedMillis: 66 }, [SNIP]: { documentsToCopy: 4, documentsCopied: 4, indexes: 7, fetchedBatches: 1, start: new Date(1491492471168), end: new Date(1491492471241), elapsedMillis: 73 }, [SNIP]: { documentsToCopy: 6, documentsCopied: 6, indexes: 7, fetchedBatches: 1, start: new Date(1491492471240), end: new Date(1491492471317), elapsedMillis: 77 }, [SNIP]: { documentsToCopy: 0, documentsCopied: 0, indexes: 7, fetchedBatches: 0, start: new Date(1491492471317), end: new Date(1491492471389), elapsedMillis: 72 }, [SNIP]: { documentsToCopy: 0, documentsCopied: 0, indexes: 7, fetchedBatches: 0, start: new Date(1491492471389), end: new Date(1491492471468), elapsedMillis: 79 }, [SNIP]: { documentsToCopy: 1385, documentsCopied: 1385, indexes: 7, fetchedBatches: 1, start: new Date(1491492471468), end: new Date(1491492471827), elapsedMillis: 359 }, [SNIP]: { documentsToCopy: 598, documentsCopied: 598, indexes: 7, fetchedBatches: 1, start: new Date(1491492471827), end: new Date(1491492472019), elapsedMillis: 192 }, [SNIP]: { documentsToCopy: 100, documentsCopied: 100, indexes: 7, fetchedBatches: 1, start: new Date(1491492472019), end: new Date(1491492472094), elapsedMillis: 75 }, [SNIP]: { documentsToCopy: 79333, documentsCopied: 79333, indexes: 7, fetchedBatches: 14, start: new Date(1491492472094), end: new Date(1491492492457), elapsedMillis: 20363 } } } }
      2017-04-06T16:04:04.082+0000 E REPL     [replication-61] Initial sync attempt failed -- attempts left: 9 cause: Location17419: error applying batch: Resulting document after update is larger than 16777216
      2017-04-06T16:04:05.083+0000 I REPL     [replication-59] Starting initial sync (attempt 2 of 10)
      2017-04-06T16:04:05.133+0000 I REPL     [replication-59] sync source candidate: mongodbsvr1-p2:27028
      2017-04-06T16:04:06.220+0000 I STORAGE  [replication-59] dropAllDatabasesExceptLocal 44
      2017-04-06T16:04:24.446+0000 I REPL     [replication-59] ******
      2017-04-06T16:04:24.446+0000 I REPL     [replication-59] creating replication oplog of size: 51200MB...
      2017-04-06T16:04:24.451+0000 I STORAGE  [replication-59] WiredTigerRecordStoreThread local.oplog.rs already started
      2017-04-06T16:04:24.451+0000 I STORAGE  [replication-59] The size storer reports that the oplog contains 0 records totaling to 0 bytes
      2017-04-06T16:04:24.451+0000 I STORAGE  [replication-59] Scanning the oplog to determine where to place markers for truncation
      2017-04-06T16:04:24.515+0000 I REPL     [replication-59] ******
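
      The namespace and _id of the offending document are snipped in the log above. One rough way to locate documents on the MMAPv1 source that are close to the 16MB limit is to measure their BSON size directly; a sketch assuming a recent pymongo (bson.encode), with the connection string taken from the sync source named in the log and a 1MB safety threshold chosen arbitrarily:

      import bson
      from pymongo import MongoClient

      LIMIT = 16 * 1024 * 1024            # 16777216 bytes, the BSON document limit
      THRESHOLD = LIMIT - 1024 * 1024     # flag anything within 1MB of the limit

      # Allow reads even if the source member is a secondary.
      client = MongoClient("mongodb://mongodbsvr1-p2:27028/", readPreference="secondaryPreferred")
      for db_name in client.list_database_names():
          if db_name in ("admin", "local", "config"):
              continue
          db = client[db_name]
          for coll_name in db.list_collection_names():
              # Full collection scan: slow on big collections, but shows the idea.
              for doc in db[coll_name].find():
                  size = len(bson.encode(doc))
                  if size >= THRESHOLD:
                      print(db_name + "." + coll_name, doc.get("_id"), size)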
      

      The failed batch caused the whole initial sync to fail, dropping all of the copied data and starting again from the beginning. If the oplog contains an entry that the MMAP engine coped with when replicating to its existing secondaries, then the initial sync should cope with it in the same way (for example by rejecting the update and carrying on).
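
      For comparison, on a normal write path the same kind of oversized update is simply rejected and the server carries on, which is roughly the behaviour suggested above for the initial-sync applier. A sketch of what a client sees, again with placeholder names and sizes:

      from pymongo import MongoClient
      from pymongo.errors import WriteError

      coll = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")["test"]["large_docs"]
      try:
          # An update that would take an already-large document past 16MB.
          coll.update_one({"_id": 1}, {"$set": {"extra": "y" * (3 * 1024 * 1024)}})
      except WriteError as exc:
          # The single write fails (e.g. "Resulting document after update is
          # larger than 16777216") while the primary keeps serving traffic.
          print("update rejected:", exc.code, exc.details.get("errmsg"))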

            Assignee: backlog-server-repl [DO NOT USE] Backlog - Replication Team
            Reporter: David Henderson (david.henderson@triggeredmessaging.com)
            Votes: 0
            Watchers: 10