Core Server / SERVER-31584

limit initial sync lookahead buffer by size during oplog application phase

    • Type: Bug
    • Resolution: Won't Fix
    • Priority: Major - P3
    • Affects Version/s: 3.4.6, 3.6.0
    • Component/s: Replication

      Initial sync uses a lookahead buffer (introduced in SERVER-26191) to read from the temporary oplog buffer collection during the oplog application phase. The size of this buffer is currently specified as the number of documents it can hold. Because a single document can be as large as 16 MB (the maximum BSON document size), the current default of 10,000 documents means the buffer can hold up to roughly 160 GB of data, which may result in out-of-memory (OOM) errors.
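      The 160 GB figure is just the document-count limit multiplied by the maximum BSON document size. The sketch below spells out that arithmetic; `worstCaseBufferBytes` and the constant names are hypothetical illustrations, not server code.

```cpp
#include <cstdint>

// Maximum size of a single BSON document: 16 MiB.
constexpr std::int64_t kMaxBsonDocumentBytes = 16LL * 1024 * 1024;

// Hypothetical helper: worst-case memory retained by a buffer that is
// bounded only by document count, assuming every slot holds a
// maximum-size document.
constexpr std::int64_t worstCaseBufferBytes(std::int64_t maxDocuments) {
    return maxDocuments * kMaxBsonDocumentBytes;
}

// With the 10,000-document default this is 160,000 MiB (about 156 GiB),
// which is the "up to 160 GB" figure in round numbers.
static_assert(worstCaseBufferBytes(10000) == 160000LL * 1024 * 1024,
              "10,000 max-size documents occupy 160,000 MiB");
```

      In practice oplog entries are usually far smaller than 16 MB. The point is that a count-only limit places no meaningful bound on memory.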

      We should either reduce the default to a more reasonable number, or constrain the size of the buffer by the total size of all buffered documents.
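      The second option, bounding the buffer by the total size of its documents, could look roughly like the sketch below. It is a hypothetical illustration, not the server's implementation. `SizeBoundedLookaheadBuffer` is an invented name, and `std::string` stands in for an owned BSON document. In the server, the size would presumably come from the document's BSON size.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>
#include <string>
#include <utility>

// Hypothetical sketch: a FIFO lookahead buffer bounded by both document
// count and total bytes. std::string stands in for an owned BSON document.
class SizeBoundedLookaheadBuffer {
public:
    SizeBoundedLookaheadBuffer(std::size_t maxDocs, std::size_t maxBytes)
        : _maxDocs(maxDocs), _maxBytes(maxBytes) {}

    // Returns false (leaving the buffer unchanged) if adding `doc` would
    // exceed either limit. An empty buffer always accepts one document,
    // so a single oversized document cannot stall the applier.
    bool tryPush(std::string doc) {
        if (!_docs.empty() &&
            (_docs.size() >= _maxDocs || _bytes + doc.size() > _maxBytes)) {
            return false;
        }
        _bytes += doc.size();
        _docs.push_back(std::move(doc));
        return true;
    }

    // Removes and returns the oldest document, releasing its bytes.
    std::optional<std::string> pop() {
        if (_docs.empty()) return std::nullopt;
        std::string front = std::move(_docs.front());
        _docs.pop_front();
        assert(_bytes >= front.size());  // byte accounting must not underflow
        _bytes -= front.size();
        return front;
    }

    std::size_t count() const { return _docs.size(); }
    std::size_t bytes() const { return _bytes; }

private:
    std::size_t _maxDocs;
    std::size_t _maxBytes;
    std::size_t _bytes = 0;
    std::deque<std::string> _docs;
};
```

      Keeping the document-count limit alongside the byte limit preserves the existing behavior for small entries, while the byte limit caps memory when entries are large. Always admitting one document into an empty buffer is a design choice that guarantees forward progress, at the cost of briefly exceeding the byte limit by at most one document.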

      ------------------------- OLD DESCRIPTION BELOW -------------------

      In the following example, the oplog application phase used ~6 GB of memory, which can lead to OOM during initial sync.

      In the above example, "A" corresponds to the following log line:

      2017-10-13T03:44:10.775-0500 I REPL     [replication-140] Applying operations until { : Timestamp 1507884249000|1 } before initial sync can complete. (starting at { : Timestamp 1507832798000|1 })
      

      The memory was allocated by this stack:

      2017-10-13T03:44:11.064-0500 I -        [ftdc] heapProfile stack1651: { 0: "tc_malloc", 1: "mongo::mongoMalloc", 2: "mongo::BSONObj::copy", 3: "mongo::BSONObj::getOwned", 4: "0x7f734da6c793", 5: "mongo::repl::StorageInterfaceImpl::findDocuments", 6: "mongo::repl::OplogBufferCollection::_peek_inlock", 7: "mongo::repl::OplogBufferCollection::peek", 8: "mongo::repl::OplogBufferProxy::peek", 9: "mongo::repl::InitialSyncer::_getNextApplierBatch_inlock", 10: "mongo::repl::InitialSyncer::_getNextApplierBatchCallback", 11: "std::_Function_handler<void ", 12: "mongo::executor::ThreadPoolTaskExecutor::runCallback", 13: "0x7f734dd2866b", 14: "mongo::ThreadPool::_doOneTask", 15: "mongo::ThreadPool::_consumeTasks", 16: "mongo::ThreadPool::_workerThreadBody", 17: "0x7f734ea0a0f0", 18: "0x7f7349767dc5", 19: "clone" }
      

            Assignee: backlog-server-repl [DO NOT USE] Backlog - Replication Team
            Reporter: Bruce Lucas (Inactive), bruce.lucas@mongodb.com
            Votes: 0
            Watchers: 20
