Core Server / SERVER-16546

Mongod memory grows until killed with wiredTiger repl set with small oplog

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Fixed
    • Affects Version/s: 2.8.0-rc2
    • Fix Version/s: 2.8.0-rc4
    • Component/s: Replication, Storage
    • Backwards Compatibility: Fully Compatible
    • Steps To Reproduce:

      ycsb load of 400M documents, 32 threads. mongod was started with the options below, as a single-node replica set. Ignore the string "Shard" in the replset name – it was left over from an earlier experiment. (A shell sketch for initiating the set and checking the oplog follows the options.)

       "cmdLineOpts": {
              "parsed": {
                  "replication": {
                      "oplogSizeMB": 200,
                      "replSetName": "BenchmarkShard_0"
                  },
                  "processManagement": {"fork": true},
                  "systemLog": {
                      "path": "/mongolog/BenchmarkCluster_BenchmarkShard_0_81/mongodb.log",
                      "destination": "file"
                  },
                  "config": "/data/BenchmarkCluster_BenchmarkShard_0_81/automation-mongod.conf",
                  "net": {
                      "port": 27000,
                      "http": {"enabled": false}
                  },
                  "storage": {
                      "engine": "wiredTiger",
                      "dbPath": "/data/BenchmarkCluster_BenchmarkShard_0_81"
                  }
              }
       }

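      Not in the original report, but for completeness: a node started this way still needs its replica set initiated, and the small oplog can be confirmed from a mongo shell. A minimal sketch, assuming a shell connected to port 27000:

        // Illustrative shell steps (assumed, not from the report).
        rs.initiate()                        // a one-member set defaults to this host

        // Once the node is PRIMARY, confirm the 200 MB oplog cap:
        var info = db.getReplicationInfo()   // shell helper; inspects local.oplog.rs
        print("configured oplog size (MB): " + info.logSizeMB)
        print("oplog currently used (MB):  " + info.usedMB)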

      Description

      Running the ycsb load phase against a single-node replica set with a small oplog, memory grows steadily until mongod is killed.

      Observed with the MCI build for githash e067fff4e3f3079d070ec168f32c24db9a51a944 and with RC2.

      mongod runs on an Amazon EC2 Linux AMI: a c3.4xlarge server instance with 32 GB of RAM and data on SSD, with the ycsb client on an m3.medium instance. mongod uses up to 30 GB of memory before being killed or descheduled.
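      A simple way to watch the growth while the load runs is to poll serverStatus from a mongo shell. The loop below is illustrative (not part of the report); it prints resident memory next to WiredTiger cache usage once a minute:

        // Illustrative monitoring loop (not from the original report).
        while (true) {
            var ss = db.serverStatus();
            print(new Date().toISOString() +
                  "  resident MB: " + ss.mem.resident +
                  "  WT cache bytes: " + ss.wiredTiger.cache["bytes currently in the cache"]);
            sleep(60 * 1000);   // mongo shell sleep() takes milliseconds
        }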

      Attachments

      1. 16546.js
        1 kB
        Darren Wood
      2. 16546.stats.tgz
        128 kB
        Darren Wood
      3. gdbmon.rc3.log
        73.66 MB
        David Daly
      4. iostat.rc3.log
        1.15 MB
        David Daly
      5. mongodb.rc3.log
        4.26 MB
        David Daly
      6. rc3-single-repl.pdf
        153 kB
        David Daly
      7. rc3-single-repl-dbstats.pdf
        81 kB
        David Daly
      8. rc3-single-repl-hardware.pdf
        141 kB
        David Daly
      9. ss.rc3.log
        23.95 MB
        David Daly
      10. Screen Shot 2014-12-18 at 10.37.02 am.png
        49 kB


          Activity

          david.daly David Daly added a comment -

          Attaching logs and MMS graphs.

          mark.benvenuto Mark Benvenuto added a comment -

          Summary: __evict_server/__evict_has_work needs to run when it owns the oldest __split_oldest_gen.

          Repro:

          1. Set up mongod as a single-node replica set with a small cache size. Example:

            mongod.exe --storageEngine=wiredTiger "--wiredTigerEngineConfig=cache_size=1GB" --replSet FooSet --oplogSize=200

          2. Grab YCSB from https://github.com/achille/YCSB and follow the directions in mongodb/README.md
          3. Run the following:

            ./bin/ycsb load mongodb -s -p workload=com.yahoo.ycsb.workloads.CoreWorkload -p updateretrycount=1 -p mongodb.database=ycsb -p recordcount=200000000 -p exportmeasurementsinterval=30000 -p fieldcount=10 -p timeseries.granularity=100 -p threadcount=32 -p insertretrycount=10 -p readretrycount=1 -p ignoreinserterrors=true -p reconnectionthroughput=10 -p mongodb.url=mongodb://localhost -p fieldnameprefix=f -p maxexecutiontime=72000 -p fieldlength=10 -p reconnectiontime=1000 -p operationcount=50000000 -p insertstart=0 -p insertcount=200000000  > a.out

          4. To validate that the bug has been triggered, run the query below. In my test environment it takes approximately 5 minutes after starting YCSB; the issue occurs sometime after the cache fills up.

            db.serverStatus().wiredTiger.reconciliation

          When you see high counts in the "awaiting" counters, like those below, you know you have hit the problem:

          FooSet:PRIMARY> db.serverStatus().wiredTiger.reconciliation
          {
                  "page reconciliation calls" : 335308,
                  "page reconciliation calls for eviction" : 220201,
                  "split bytes currently awaiting free" : 9747827,
                  "split objects currently awaiting free" : 8564
          }
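
          To spot this without eyeballing serverStatus output, a small shell loop can poll the counters. This is an illustrative sketch, not from the original comment; the 1 MB threshold and 5-second interval are assumptions:

            // Illustrative detection loop; threshold and interval are assumptions.
            var prev = 0;
            for (var i = 0; i < 120; i++) {
                var r = db.serverStatus().wiredTiger.reconciliation;
                var bytes = r["split bytes currently awaiting free"];
                print("awaiting free: " + bytes + " bytes, " +
                      r["split objects currently awaiting free"] + " objects");
                if (bytes > 1024 * 1024 && bytes >= prev) {
                    print("-> awaiting-free counters stuck or climbing; bug likely triggered");
                }
                prev = bytes;
                sleep(5000);
            }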

          In my repro, I see the following in the debugger:

          0:002> dt -a39  0x00000019`b087a040 __wt_session_impl split*
          mongod!__wt_session_impl
          [0] @ 00000019`b087a040 
          ---------------------------------------------
             +0x250 split_stash : (null) 
             +0x258 split_stash_cnt : 0
             +0x260 split_stash_alloc : 0
             +0x268 split_gen : 0
           
          [1] @ 00000019`b087a2c0 
          ---------------------------------------------
             +0x250 split_stash : 0x00000019`91c0c010 __wt_split_stash
             +0x258 split_stash_cnt : 0x33a5
             +0x260 split_stash_alloc : 0x78000
             +0x268 split_gen : 0x3178
           
          [2] @ 00000019`b087a540 
          ---------------------------------------------
             +0x250 split_stash : (null) 
             +0x258 split_stash_cnt : 0
             +0x260 split_stash_alloc : 0
             +0x268 split_gen : 0
           
          [3] @ 00000019`b087a7c0 
          ---------------------------------------------
             +0x250 split_stash : (null) 
             +0x258 split_stash_cnt : 0
             +0x260 split_stash_alloc : 0
             +0x268 split_gen : 0
          ...

          Note that session [1] still holds a nonzero split_gen (0x3178) with 0x33a5 stash entries (0x78000 bytes tracked) waiting to be freed. This is consistent with the summary above: the session owning the oldest split generation (here, apparently the eviction server's) never runs the code that would advance its generation, so the stashed memory is never released.


            People

            • Votes: 0
            • Watchers: 15