[SERVER-28709] Initial Sync from MMAP to WT fails if large document created by oplog Created: 10/Apr/17  Updated: 06/Dec/22  Resolved: 10/Apr/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: David Henderson Assignee: Backlog - Replication Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-7515 idempotence violation when intermedia... Open
Assigned Teams:
Replication
Operating System: ALL
Steps To Reproduce:
  • Create a large(ish) document (close to, but under, the 16MB limit) in an MMAPv1 replica set
  • Initiate an initial sync to a new WiredTiger node
  • While this is happening, apply an update that would push the document size over the 16MB limit
  • Wait for the initial copy to finish; catching up from the oplog will then fail (see the illustrative sketch below).
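
As an illustration only (not part of the original report): the following pymongo sketch generates a source-side workload of roughly this shape. The replica set name rs0, the namespace test.bigdocs, and the field names are assumptions, and it is not a deterministic reproduction, since the failure depends on where the cloner and the missing-document fetch land relative to these updates.

# Hedged, illustrative sketch (not from the original report): a source-side
# workload of roughly the shape described above. Assumes an MMAPv1 replica set
# "rs0" on localhost:27017 and an invented namespace test.bigdocs.
from pymongo import MongoClient

MB = 1024 * 1024
client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
coll = client.test.bigdocs

doc_id = coll.insert_one({"stub": 1}).inserted_id

# Run these while the new WiredTiger node is initial-syncing. Each individual
# state of the document stays under the 16MB (16777216 byte) limit on the
# source, so the primary accepts every update. If the syncing node later has
# to fetch the document at its newest state and then replay the older $set of
# payload_a on top of it (a SERVER-7515-style non-idempotent replay), the
# combined document can exceed the limit on the syncing node only.
coll.update_one({"_id": doc_id}, {"$set": {"payload_a": "x" * (9 * MB)}})
coll.update_one({"_id": doc_id}, {"$unset": {"payload_a": ""}})
coll.update_one({"_id": doc_id}, {"$set": {"payload_b": "y" * (9 * MB)}})
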
Participants:

 Description   

Initial sync from an MMAPv1 replica set to a new WiredTiger secondary, both running 3.4.3:

2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread] db version v3.4.3
2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread] git version: f07437fb5a6cca07c10bafa78365456eb1d6d5e1
2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread] OpenSSL version: OpenSSL 1.0.1f 6 Jan 2014
2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread] allocator: tcmalloc
2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread] modules: none
2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread] build environment:
2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread]     distmod: ubuntu1404
2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread]     distarch: x86_64
2017-04-10T06:00:01.604+0000 I CONTROL  [signalProcessingThread]     target_arch: x86_64

The initial copy of the data was successful, but the oplog catch-up stage failed due to an update that breached the 16MB limit:

2017-04-06T16:04:04.054+0000 F REPL     [repl writer worker 11] warning: log line attempted (147kB) over max size (10kB), printing beginning and end ... writer worker caught exception:  :: caused by :: 17419 Resulting document after update is larger than 16777216 on: { ts: Timestamp 1491429857000|404, h: 8450293174279352223, v: 2, op: "u", ns: "[SNIP - large update]", o2: { _id: ObjectId('58961877c141580404e7f604') }, o: { $set: { [SNIP] } }
2017-04-06T16:04:04.055+0000 I REPL     [repl writer worker 8] missing object not found on source. presumably deleted later in oplog
2017-04-06T16:04:04.055+0000 I REPL     [repl writer worker 8] o2: { _id: ObjectId('58e4e2e9bc44b6c99ed9bb2b') }
2017-04-06T16:04:04.055+0000 I REPL     [repl writer worker 8] o firstfield: $set
2017-04-06T16:04:04.067+0000 E REPL     [replication-59] Failed to apply batch due to 'Location17419: error applying batch: Resulting document after update is larger than 16777216'
2017-04-06T16:04:04.075+0000 I REPL     [replication-61] Finished fetching oplog during initial sync: CallbackCanceled: fetcher was shut down after previous batch was processed. Last fetched optime and hash: { ts: Timestamp 1491494643000|612, t: -1 }[1653548606848978015]
2017-04-06T16:04:04.075+0000 I REPL     [replication-61] Initial sync attempt finishing up.
2017-04-06T16:04:04.082+0000 I REPL     [replication-61] warning: log line attempted (491kB) over max size (10kB), printing beginning and end ... Initial Sync Attempt Statistics: { failedInitialSyncAttemp
ts: 0, maxFailedInitialSyncAttempts: 10, initialSyncStart: new Date(1491397073200), initialSyncAttempts: [], fetchedMissingDocs: 110321, appliedOps: 27140000, initialSyncOplogStart: Timestamp 149139707400
0|51, initialSyncOplogEnd: Timestamp 1491494643000|612, databases: { databasesCloned: 44, [SNIP]: { collections: 1, clonedCollections: 1, start: new Date(1491397074124), end: new Date(1491397074208), elapsedMillis: 84, [SNIP]: { documentsToCopy: 14, documentsCopied: 14, indexes: 1, fetchedBatches: 1, start: new Date(1491397074124), end: new Date(1491397074208), elapsedMillis: 84 } }, [SNIP]: { collections: 51, clonedCollections: 51, start: new Date(1491397074208), end: new Date(1491397729132), elapsedMillis: 654924, [SNIP]: { documentsToCopy: 1, documentsCopied: 1, indexes: 1, fetchedBatches: 1, start: new Date(1491397074209), end: new Date(1491397074287), elapsedMillis: 78 }, [SNIP]: { documentsToCopy: 310824, documentsCopied: 310834, indexes: 2, fetchedBatches: 129, start: new Date(1491397074287), end: new Date(1491397082539), elapsedMillis: 8252 }, [SNIP]: { documentsToCopy: 2236029, documentsCopied: 2236405, indexes: 2, fetchedBatches: 109, start: new Date(1491397082539), end: new Date(1491397133928), elapsedMillis: 51389 }, [SNIP]: { documentsToCopy: 30770, documentsCopied: 30770, indexes: 2, fetchedBatches: 1, start: new Date(1491397133928), end: new Date(1491397134604), elapsedMillis: 676 }, [SNIP]: { documentsToCopy: 1188, documentsCopied: 1188, indexes: 1, fetchedBatches: 1, start: new Date(1491397134604), end: new Date(1491397134679), elapsedMillis: 75 }, [SNIP]: { documentsToCopy: 111810, documentsCopied: 111811, indexes: 2, fetchedBatches: 5, start: new Date(1491397134679), end: new Date(1491397137511), elapsedMillis: 2832 }, [SNIP]: { documentsToCopy: 16767, documentsCopied: 16767, indexes: 1, fetchedBatches: 1, start: new Date(1491397137511), end: new Date(1491397137922), elapsedMillis: 411 }, [SNIP]: { documentsToCopy: 85965, documentsCopied: 85969, indexes: 2, fetchedBatches: 11, start: new Date(1491397137921), end: new Date(1491397146202), elapsedMillis: 8281 }, [SNIP]: { documentsToCopy: 67820, documentsCopied: 67826, indexes: 2, fetchedBatches: 15, start: new Date(1491397146202), end: new Date(1491397156706), elapsedMillis: 10504 }, [SNIP]: { documentsToCopy: 5, documentsCopied: 5, indexes: 2, fetchedBatches: 1, start: new Date(1491397156706), end: new Date(1491397156765), elapsedMillis: 59 }, [SNIP]: { documentsToCopy: 18827, documentsCopied: 18827, indexes: 2, fetchedBatches: 1, start: new Date(1491397156765), end: new Date(1491397157548), elapsedMillis: 783 }, [SNIP]: { documentsToCopy: 9603, documentsCopied: 9603, indexes: 1, fetchedBatches: 1, start: new Date(1491397157547), end: new Date(1491397157821), elapsedMillis: 274 }, [SNIP]: { documentsToCopy: 86, documentsCopied: 86, indexes: 1, fetchedBatches: 1, start: new Date(1491397157821), end: new Date(1491397157842), e .......... 
alerts.krwfihep_client_alerts: { documentsToCopy: 0, documentsCopied: 0, indexes: 7, fetchedBatches: 0, start: new Date(1491492468641), end: new Date(1491492468683), elapsedMillis: 42 }, [SNIP]: { documentsToCopy: 513, documentsCopied: 513, indexes: 7, fetchedBatches: 1, start: new Date(1491492468683), end: new Date(1491492468943), elapsedMillis: 260 }, [SNIP]: { documentsToCopy: 7, documentsCopied: 7, indexes: 7, fetchedBatches: 1, start: new Date(1491492468943), end: new Date(1491492469036), elapsedMillis: 93 }, [SNIP]: { documentsToCopy: 3245, documentsCopied: 3245, indexes: 7, fetchedBatches: 1, start: new Date(1491492469036), end: new Date(1491492470635), elapsedMillis: 1599 }, [SNIP]: { documentsToCopy: 921, documentsCopied: 921, indexes: 7, fetchedBatches: 1, start: new Date(1491492470635), end: new Date(1491492470715), elapsedMillis: 80 }, [SNIP]: { documentsToCopy: 1747, documentsCopied: 1747, indexes: 7, fetchedBatches: 1, start: new Date(1491492470715), end: new Date(1491492470993), elapsedMillis: 278 }, [SNIP]: { documentsToCopy: 0, documentsCopied: 0, indexes: 7, fetchedBatches: 0, start: new Date(1491492470993), end: new Date(1491492471052), elapsedMillis: 59 }, [SNIP]: { documentsToCopy: 21, documentsCopied: 21, indexes: 7, fetchedBatches: 1, start: new Date(1491492471052), end: new Date(1491492471102), elapsedMillis: 50 }, [SNIP]: { documentsToCopy: 0, documentsCopied: 0, indexes: 7, fetchedBatches: 0, start: new Date(1491492471102), end: new Date(1491492471168), elapsedMillis: 66 }, [SNIP]: { documentsToCopy: 4, documentsCopied: 4, indexes: 7, fetchedBatches: 1, start: new Date(1491492471168), end: new Date(1491492471241), elapsedMillis: 73 }, [SNIP]: { documentsToCopy: 6, documentsCopied: 6, indexes: 7, fetchedBatches: 1, start: new Date(1491492471240), end: new Date(1491492471317), elapsedMillis: 77 }, [SNIP]: { documentsToCopy: 0, documentsCopied: 0, indexes: 7, fetchedBatches: 0, start: new Date(1491492471317), end: new Date(1491492471389), elapsedMillis: 72 }, [SNIP]: { documentsToCopy: 0, documentsCopied: 0, indexes: 7, fetchedBatches: 0, start: new Date(1491492471389), end: new Date(1491492471468), elapsedMillis: 79 }, [SNIP]: { documentsToCopy: 1385, documentsCopied: 1385, indexes: 7, fetchedBatches: 1, start: new Date(1491492471468), end: new Date(1491492471827), elapsedMillis: 359 }, [SNIP]: { documentsToCopy: 598, documentsCopied: 598, indexes: 7, fetchedBatches: 1, start: new Date(1491492471827), end: new Date(1491492472019), elapsedMillis: 192 }, [SNIP]: { documentsToCopy: 100, documentsCopied: 100, indexes: 7, fetchedBatches: 1, start: new Date(1491492472019), end: new Date(1491492472094), elapsedMillis: 75 }, [SNIP]: { documentsToCopy: 79333, documentsCopied: 79333, indexes: 7, fetchedBatches: 14, start: new Date(1491492472094), end: new Date(1491492492457), elapsedMillis: 20363 } } } }
2017-04-06T16:04:04.082+0000 E REPL     [replication-61] Initial sync attempt failed -- attempts left: 9 cause: Location17419: error applying batch: Resulting document after update is larger than 16777216
2017-04-06T16:04:05.083+0000 I REPL     [replication-59] Starting initial sync (attempt 2 of 10)
2017-04-06T16:04:05.133+0000 I REPL     [replication-59] sync source candidate: mongodbsvr1-p2:27028
2017-04-06T16:04:06.220+0000 I STORAGE  [replication-59] dropAllDatabasesExceptLocal 44
2017-04-06T16:04:24.446+0000 I REPL     [replication-59] ******
2017-04-06T16:04:24.446+0000 I REPL     [replication-59] creating replication oplog of size: 51200MB...
2017-04-06T16:04:24.451+0000 I STORAGE  [replication-59] WiredTigerRecordStoreThread local.oplog.rs already started
2017-04-06T16:04:24.451+0000 I STORAGE  [replication-59] The size storer reports that the oplog contains 0 records totaling to 0 bytes
2017-04-06T16:04:24.451+0000 I STORAGE  [replication-59] Scanning the oplog to determine where to place markers for truncation
2017-04-06T16:04:24.515+0000 I REPL     [replication-59] ******

This then caused the whole initial sync attempt to fail, dropping all the copied data and starting again from the beginning. If there is an entry in the oplog that the MMAPv1 engine coped with (when replicating it to its own secondaries), then initial sync should cope with it in the same way (rejecting the update and carrying on?).
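
(Not part of the original report.) As a stop-gap while this is open, one might audit the source for documents close enough to the limit to be at risk before retrying the sync; a minimal pymongo sketch, with an invented namespace and an arbitrary 4MB margin:

# Hedged sketch: flag source documents near the 16777216-byte BSON limit.
# Namespace and margin are illustrative assumptions, not from the report.
import bson  # ships with pymongo; bson.encode() needs pymongo >= 3.9
from pymongo import MongoClient

LIMIT = 16777216          # max BSON document size, as in the error above
MARGIN = 4 * 1024 * 1024  # arbitrary safety margin

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
for doc in client.test.bigdocs.find():
    size = len(bson.encode(doc))
    if size > LIMIT - MARGIN:
        print(doc["_id"], size)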



 Comments   
Comment by Eric Milkie [ 10/Apr/17 ]

One way this can happen is via SERVER-7515.
