Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-18842

WiredTiger & indexing: "kernel: Out of memory: Kill process 32011 (mongod) score 966 or sacrifice child"

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 3.0.3
    • Fix Version/s: None
    • Component/s: Indexing, WiredTiger
    • Labels:
      None
    • Operating System:
      ALL

      Description

      The hidden secondary server I use for backups cannot sync its data from the replica set. It's a replica set of 3 servers including the one which get killed by the kernel. All are running mongod 3.0.3.

      I'm currently trying to migrate from mmapv1 to WiredTiger. The two main servers use mmapv1. The initial sync goes fine, copying our 292 million documents, mongod takes around 80% of the RAM. Then it starts building the index on _id. After a few %, the kernel kills mongod. I attached the logs from /var/log/messages.

      If I relaunch mongod, it retries building the index, then exits because of (the famous ?):

      Index Build "Index rebuilding did not complete: 11000 E11000 duplicate key error collection: rum.reports index: _id_ dup key: { : "11jjgtfncmtn9w47mco7vtbbj" }"
      

      If I relaunch mongod with indexBuildRetry=false, it restarts cloning the collection from the beginning after showing:

      _2015-06-05T16:17:37.683+0000 W REPL     [ReplicationExecutor] Failed to load timestamp of most recently applied operation; NoMatchingDocument Did not find any entries in local.oplog.rs_
      

      This server has only 16GB RAM, and can do a sync with mmapv1 without any problem. The database size is 946GB on mmapv1, and 243GB on WiredTiger (snappy).

      I started running the following command when the cloning was at 236M docs. I attached the result.

      mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(10*1000)}" > ss.log
      

      Cloning finishes at 2015-06-05T16:09:38.003+0000.
      Index creation starts at 2015-06-05T16:09:45.606+0000.

      The last log before the kill:

      2015-06-05T16:12:00.565+0000 I -        [rsSync]   Index Build: 13494400/292298635 4%
      

      On another replica set, this migration worked fine with a database size of 402GB mmapv1 / 133GB WT. For the record, the sync also took all the RAM, but index creation went fine. What's interesting is that after a restart of mongod on this server, RAM usage went from 100% to 17%. After a few days of continuous writes from the oplog, it's now at 43% RAM. I did not launch any query or make any backup from this secondary.

      The two replica sets contain the same kind of documents, and the same indexes.

      My storage configuration:

      storage:
        dbPath: /a/b/c
        indexBuildRetry: true
        journal:
          enabled: true
        directoryPerDB: false # default
        syncPeriodSecs: 60 # default
       
        engine: wiredTiger
       
        wiredTiger:
          engineConfig:
      #      cacheSizeGB: 12 # Default: the maximum of half of physical RAM or 1 gigabyte.
            statisticsLogDelaySecs: 0 # If set to 0, WiredTiger does not log statistics.
            journalCompressor: snappy # Default: snappy
            directoryForIndexes: false # Default: false
          collectionConfig:
            blockCompressor: snappy # Default: snappy
          indexConfig:
            prefixCompression: true # Default: true
      

      There's this storage.wiredTiger.engineConfig.cacheSizeGB setting. I don't really get what it means. I tried setting the default, or setting it at 14. RAM usage still goes beyond the server capacity.

      This issue could be related to SERVER-17456 and SERVER-17424. I created a new issue, because this one happens during an initial sync, not because of benchmarks.

      In case someone at MongoDB is interested, the servers are monitored by MMS:

      • The failing group is 556dc2ffe4b0ba305ae49354. The hidden Secondary is the only one running on WiredTiger. It's the one that I try to sync again and again.
      • The ok group is 556dce19e4b0eea5366e6ae5. The hidden Secondary is the only one running on WiredTiger.

        Attachments

        1. ss.log.gz
          472 kB
        2. WT sync with cacheSizeGB at 1.png
          WT sync with cacheSizeGB at 1.png
          44 kB
        3. Cache Usage and Activity 1GB.png
          Cache Usage and Activity 1GB.png
          94 kB
        4. Memory usage over time with WiredTiger.png
          Memory usage over time with WiredTiger.png
          46 kB

          Issue Links

            Activity

              People

              Assignee:
              ramon.fernandez Ramon Fernandez Marina
              Reporter:
              nicolas_ Nicolas Fouché
              Participants:
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: