matt.lee, I have looked more closely at the diagnostic data you attached to this ticket.
Several unfortunate things are going on here. In particular, there is more than 56GB of dirty data in the cache later in the workload, which means checkpoints are taking a very long time to complete. In addition, and more problematic for performance, there is a very large amount of memory that has been freed by MongoDB but is neither being reused nor returned to the operating system (over 44GB at peak). This results in paging, leading to poor performance.
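If it helps to watch this while the workload runs, one rough way is to sample the WiredTiger cache statistics from the mongo shell (a sketch; the statistic names are the ones WiredTiger reports through serverStatus() and may vary slightly between versions):

// Rough sketch: sample WiredTiger cache statistics via serverStatus().
// Statistic names are those reported by WiredTiger; adjust if your version labels them differently.
var cache = db.serverStatus().wiredTiger.cache;
print("configured maximum bytes : " + cache["maximum bytes configured"]);
print("bytes currently in cache : " + cache["bytes currently in the cache"]);
print("tracked dirty bytes      : " + cache["tracked dirty bytes in the cache"]);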
We are currently working on general solutions to both of these issues. In the meantime, can you please try running this workload with a smaller cache size? For example:
mongod --wiredTigerCacheSizeGB 30
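If you prefer to change it in the config file instead of on the command line, this corresponds to the cacheSizeGB setting under storage.wiredTiger.engineConfig, for example:

storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 30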
Please let us know if that improves this workload.
Hi Michael Cahill, Alexander Corrod,
>> Configuration used to start mongod
Nothing special in the configuration.
replication:
  oplogSizeMB: 102400
  replSetName: testreplset3

storage:
  syncPeriodSecs: 60
  directoryPerDB: true
  dbPath: /data/mongodb
  journal:
    enabled: true
    commitIntervalMs: 100
  engine: wiredTiger
  wiredTiger:
    engineConfig:
      cacheSizeGB: 40
      journalCompressor: none
      directoryForIndexes: false
    collectionConfig:
      blockCompressor: snappy
    indexConfig:
      prefixCompression: false
>> Whether running in a replication group (if so, whether the issue is specific to the master or slaves)
There are 4 shards, and each shard has 3 replicas in its replica set. This issue happens on the primary node. (Queries are sent only to the primary, not to the secondaries.)
>> Whether running in a sharded environment
Yes, it is sharded. I have attached the sharding status as "sharding_status.txt".
>> Can you describe the workload you are running when this issue happens?
Read:
testdb.testcoll.find({ pid: ? }).sort({ guid: -1 }).limit(100)

Write:
testdb.testcoll.update({ pid: ?, key: ? }, { $set: { pid: ?, key: ?, guid: ?, ... } }, { upsert: true })
There is only one collection, and that collection has 5 indexes:
{ _id: 1 },
{ pid: "hashed" },
{ pid: 1, key: 1 },
{ pid: 1, guid: 1 },
{ pid: 1, .. }
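For reference, one way to check that the read above uses the { pid: 1, guid: 1 } index rather than an in-memory sort is to run it through explain (the pid value below is just a placeholder):

// Placeholder pid value; look for an IXSCAN on { pid: 1, guid: 1 } in the
// winningPlan and confirm there is no in-memory SORT stage.
db.testcoll.find({ pid: 12345 }).sort({ guid: -1 }).limit(100).explain("executionStats")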
>> Is there any chance that we can reproduce it locally?
Unfortunately, this is real service traffic, so I cannot prepare a test scenario that reproduces this issue.
The MongoDB log file is too big, so I copied the head of the log file. (The log file also contains our real service data.)
2016-04-25T10:53:25.971+0900 I CONTROL [initandlisten] MongoDB starting : pid=129951 port=27017 dbpath=/data/mongodb 64-bit host=test03-mongo2
2016-04-25T10:53:25.971+0900 I CONTROL [initandlisten] db version v3.2.5
2016-04-25T10:53:25.971+0900 I CONTROL [initandlisten] git version: 34e65e5383f7ea1726332cb175b73077ec4a1b02
2016-04-25T10:53:25.971+0900 I CONTROL [initandlisten] allocator: tcmalloc
2016-04-25T10:53:25.971+0900 I CONTROL [initandlisten] modules: none
2016-04-25T10:53:25.971+0900 I CONTROL [initandlisten] build environment:
2016-04-25T10:53:25.971+0900 I CONTROL [initandlisten] distarch: x86_64
2016-04-25T10:53:25.971+0900 I CONTROL [initandlisten] target_arch: x86_64
2016-04-25T10:53:25.971+0900 I CONTROL [initandlisten] options: { config: "/etc/mongod.conf", net: { port: 27017 }, operationProfiling: { mode: "slowOp", slowOpThresholdMs: 100 }, processManagement: { fork: true, pidFilePath: "/data/mongodb/mongod.pid" }, replication: { oplogSizeMB: 102400, replSetName: "testreplset3" }, sharding: { clusterRole: "shardsvr" }, storage: { dbPath: "/data/mongodb", directoryPerDB: true, engine: "wiredTiger", journal: { commitIntervalMs: 100, enabled: true }, syncPeriodSecs: 60.0, wiredTiger: { collectionConfig: { blockCompressor: "snappy" }, engineConfig: { cacheSizeGB: 40, directoryForIndexes: false, journalCompressor: "none" }, indexConfig: { prefixCompression: false } } }, systemLog: { destination: "file", logAppend: false, path: "/data/mongodb/mongod.log" } }
2016-04-25T10:53:25.992+0900 W - [initandlisten] Detected unclean shutdown - /data/mongodb/mongod.lock is not empty.
2016-04-25T10:53:25.992+0900 W STORAGE [initandlisten] Recovering data from the last clean checkpoint.
2016-04-25T10:53:25.992+0900 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=40G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=none),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
During the issue, there is nothing special in the log file except a few "Finding the split vector" messages:
2016-05-02T21:20:07.307+0900 W SHARDING [conn47385] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 22871 took 166ms
2016-05-02T21:22:09.698+0900 W SHARDING [conn47385] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 31714 took 103ms
2016-05-02T21:22:33.966+0900 W SHARDING [conn47369] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 34239 took 105ms
2016-05-02T21:23:27.379+0900 W SHARDING [conn47369] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 29688 took 120ms
2016-05-02T21:27:37.989+0900 W SHARDING [conn47475] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 45094 took 4485ms
2016-05-02T21:27:38.529+0900 W SHARDING [conn47476] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 45094 took 4718ms
2016-05-02T21:28:14.707+0900 W SHARDING [conn47492] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 117 took 9617ms
2016-05-02T21:28:15.041+0900 W SHARDING [conn47496] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 117 took 4877ms
2016-05-02T21:28:19.681+0900 W SHARDING [conn47499] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 20872 took 3382ms
2016-05-02T21:28:19.879+0900 W SHARDING [conn47369] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 20872 took 8743ms
2016-05-02T21:29:26.750+0900 W SHARDING [conn47521] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 49862 took 10096ms
2016-05-02T21:29:26.858+0900 W SHARDING [conn47523] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 49863 took 3678ms
2016-05-02T21:29:28.758+0900 W SHARDING [conn47524] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 49863 took 3113ms
2016-05-02T21:29:44.800+0900 W SHARDING [conn47496] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 41594 took 8938ms
2016-05-02T21:29:44.809+0900 W SHARDING [conn47492] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 41594 took 7376ms
2016-05-02T21:29:53.196+0900 W SHARDING [conn47476] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 51153 took 11125ms
2016-05-02T21:29:53.633+0900 W SHARDING [conn47475] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 51154 took 3709ms
2016-05-02T21:29:54.270+0900 W SHARDING [conn47529] Finding the split vector for db.coll over { profile_id: "hashed" } keyCount: 125672 numSplits: 1 lookedAt: 51154 took 3811ms
...
Thanks.
Matt.
<<I changed the server names and the database/collection names, so some values may not match across the logs.>>