-
Type: Bug
-
Resolution: Unresolved
-
Priority: Major - P3
-
None
-
Affects Version/s: None
-
Component/s: None
-
None
-
Server Triage
-
ALL
Problem Statement:
We are currently undergoing a migration from MongoDB version 4.4 to 5.0, and during this process, we have encountered some intriguing observations that deviate from our anticipated outcomes. Despite consulting various forums, existing MongoDB JIRA tickets, and documentation, we have been unable to pinpoint concrete reasons for these unexpected findings.
We have 5 member replicaSet , 4 data bearing and 1 is ARBITER. 2 members on each data centers.
- defaultWriteConcern configured as w:1, timeout:0
- minSnapshotHistoryWindowInSeconds set 1sec
- enableFlowControl set to False
members when one of the clusters goes down, we see WiredTigetHS.wt starts growing which is stored in the dbPath . Even though we set the minSnapshotHistoryWindowInSeconds to 1s, the file growing every second. Once it touches the oplog size, both oplog and this file grows simultaneously and filling up the disk fast. It depends on the work load, but in our case, it fills quickly in 5-10 mins
Once the majority members are UP, we see that WiredTigerHS.wt back to normal (less than 5MB).
We have the following questions.
- Why doesn't WiredTiger History Store (WT HS) adhere to the minSnapshotHistoryWindowInSeconds configuration during failover or when a majority is not available?
- What is causing the continuous growth, and is there any specific configuration that might help control this expansion?
- Is there a missing configuration setting to manage this growth, . We tried setting minSnapshotHistoryWindowInSeconds to 0 but no change.
- Why does the oplog simultaneously expand when WiredTigerHS.wt reaches the exact size as the oplog file?
- Particularly, since our application doesn't utilize snapshot reads, what is the procedure for completely disabling the WiredTiger History Store?
- Even when the cluster-wide defaultWriteConcern is set to w:1 (not a majority), why does MongoDB still aim to maintain the "Majority Commit Point" and allow the oplog to grow?
Out of 5 members, 2 members still can go down as per the mongoDB fault tolerance and we expect no impact to the running primary and secondary. But starting from 5.0 due to the HS and oplog growth, we see mongoDB cant tolerate when majority is not available. Note that we set the defaultWriteConcern to 1 (Default is changed to majority starting from 5.0).
Before Failover
=========sessionmgr09==============
{
"ns" : "local.oplog.rs",
"size" : NumberLong("5348387447"),
"count" : 1280626,
"avgObjSize" : 4176,
"storageSize" : NumberLong("6390153216"),
"freeStorageSize" : 1371807744,
"capped" : true,
"max" : 0,
"maxSize" : NumberLong("5368709120"),
total 9.7G
rw------ 1 root root 6.0G Dec 11 20:49 collection-18-7685412961092358367.wt
rw------ 1 root root 879M Dec 11 20:48 collection-74-7685412961092358367.wt
rw------ 1 root root 878M Dec 11 20:48 collection-98-7685412961092358367.wt
rw------ 1 root root 876M Dec 11 20:48 collection-90-7685412961092358367.wt
tmpfs 95G 36G 59G 38% /var/data/sessions.1
=========sessionmgr10==============
{
"ns" : "local.oplog.rs",
"size" : NumberLong("5360045160"),
"count" : 1283416,
"avgObjSize" : 4176,
"storageSize" : NumberLong("5466677248"),
"freeStorageSize" : 488841216,
"capped" : true,
"max" : 0,
"maxSize" : NumberLong("5368709120"),
total 8.8G
rw------ 1 root root 5.1G Dec 11 20:48 collection-18--7079402678170407987.wt
rw------ 1 root root 876M Dec 11 20:49 collection-98--7079402678170407987.wt
rw------ 1 root root 875M Dec 11 20:49 collection-82--7079402678170407987.wt
rw------ 1 root root 874M Dec 11 20:49 collection-74--7079402678170407987.wt
tmpfs 95G 36G 59G 38% /var/data/sessions.1
Once the 2 members were brought down
=========sessionmgr09==============
{
"ns" : "local.oplog.rs",
"size" : NumberLong("5362478203"),
"count" : 1284154,
"avgObjSize" : 4175,
"storageSize" : NumberLong("6390153216"),
"freeStorageSize" : 1267474432,
"capped" : true,
"max" : 0,
"maxSize" : NumberLong("5368709120"),
total 11G
rw------ 1 root root 6.0G Dec 11 20:53 collection-18-7685412961092358367.wt
rw------ 1 root root 882M Dec 11 20:53 WiredTigerHS.wt
rw------ 1 root root 879M Dec 11 20:53 collection-74-7685412961092358367.wt
rw------ 1 root root 878M Dec 11 20:53 collection-98-7685412961092358367.wt
tmpfs 95G 37G 58G 40% /var/data/sessions.1
=========sessionmgr10==============
{
"ns" : "local.oplog.rs",
"size" : NumberLong("5319883881"),
"count" : 1273920,
"avgObjSize" : 4175,
"storageSize" : NumberLong("5466677248"),
"freeStorageSize" : 307101696,
"capped" : true,
"max" : 0,
"maxSize" : NumberLong("5368709120"),
total 9.7G
rw------ 1 root root 5.1G Dec 11 20:53 collection-18--7079402678170407987.wt
rw------ 1 root root 876M Dec 11 20:53 collection-98--7079402678170407987.wt
rw------ 1 root root 875M Dec 11 20:53 WiredTigerHS.wt
rw------ 1 root root 875M Dec 11 20:53 collection-82--7079402678170407987.wt
tmpfs 95G 37G 58G 39% /var/data/sessions.1
Within 15 mins the size grows in GB. Note that the oplog still remains same.
=========sessionmgr09==============
{
"ns" : "local.oplog.rs",
"size" : NumberLong("5809646773"),
"count" : 1391589,
"avgObjSize" : 4174,
"storageSize" : NumberLong("6390140928"),
"freeStorageSize" : 1050791936,
"capped" : true,
"max" : 0,
"maxSize" : NumberLong("5368709120"),
total 15G
rw------ 1 root root 6.0G Dec 11 21:05 collection-18-7685412961092358367.wt
rw------ 1 root root 5.2G Dec 11 21:05 WiredTigerHS.wt
rw------ 1 root root 879M Dec 11 21:05 collection-74-7685412961092358367.wt
rw------ 1 root root 878M Dec 11 21:05 collection-98-7685412961092358367.wt
tmpfs 95G 42G 53G 45% /var/data/sessions.1
=========sessionmgr10==============
{
"ns" : "local.oplog.rs",
"size" : NumberLong("5815976272"),
"count" : 1393115,
"avgObjSize" : 4174,
"storageSize" : NumberLong("5466664960"),
"freeStorageSize" : 70377472,
"capped" : true,
"max" : 0,
"maxSize" : NumberLong("5368709120"),
total 14G
rw------ 1 root root 5.2G Dec 11 21:05 WiredTigerHS.wt
rw------ 1 root root 5.1G Dec 11 21:05 collection-18--7079402678170407987.wt
rw------ 1 root root 876M Dec 11 21:05 collection-98--7079402678170407987.wt
rw------ 1 root root 875M Dec 11 21:05 collection-82--7079402678170407987.wt
tmpfs 95G 41G 54G 44% /var/data/sessions.1
Once the WTHS.wt reaches oplog size, oplog size also increasing on sessionmgr10.
=========sessionmgr09==============
{
"ns" : "local.oplog.rs",
"size" : NumberLong("6104465326"),
"count" : 1462085,
"avgObjSize" : 4175,
"storageSize" : NumberLong("6390140928"),
"freeStorageSize" : 876965888,
"capped" : true,
"max" : 0,
"maxSize" : NumberLong("5368709120"),
total 16G
rw------ 1 root root 6.0G Dec 11 21:05 collection-18-7685412961092358367.wt
rw------ 1 root root 5.6G Dec 11 21:05 WiredTigerHS.wt
rw------ 1 root root 879M Dec 11 21:05 collection-74-7685412961092358367.wt
rw------ 1 root root 878M Dec 11 21:05 collection-98-7685412961092358367.wt
tmpfs 95G 42G 53G 45% /var/data/sessions.1
=========sessionmgr10==============
{
"ns" : "local.oplog.rs",
"size" : NumberLong("6111182492"),
"count" : 1463674,
"avgObjSize" : 4175,
"storageSize" : NumberLong("5601685504"),
"freeStorageSize" : 17518592,
"capped" : true,
"max" : 0,
"maxSize" : NumberLong("5368709120"),
total 15G
rw------ 1 root root 5.5G Dec 11 21:05 WiredTigerHS.wt
rw------ 1 root root 5.3G Dec 11 21:05 collection-18--7079402678170407987.wt
rw------ 1 root root 876M Dec 11 21:05 collection-98--7079402678170407987.wt
rw------ 1 root root 875M Dec 11 21:05 collection-82--7079402678170407987.wt
tmpfs 95G 41G 54G 44% /var/data/sessions.1
Now the sessionmgr09's oplog also start growing as WTHS.wt reaches the oplog size
=========sessionmgr09==============
{
"ns" : "local.oplog.rs",
"size" : NumberLong("6942499375"),
"count" : 1662602,
"avgObjSize" : 4175,
"storageSize" : NumberLong("6490021888"),
"freeStorageSize" : 16769024,
"capped" : true,
"max" : 0,
"maxSize" : NumberLong("5368709120"),
total 17G
rw------ 1 root root 6.3G Dec 11 21:07 WiredTigerHS.wt
rw------ 1 root root 6.1G Dec 11 21:07 collection-18-7685412961092358367.wt
rw------ 1 root root 879M Dec 11 21:07 collection-74-7685412961092358367.wt
rw------ 1 root root 878M Dec 11 21:07 collection-98-7685412961092358367.wt
tmpfs 95G 43G 52G 45% /var/data/sessions.1
=========sessionmgr10==============
{
"ns" : "local.oplog.rs",
"size" : NumberLong("6948652674"),
"count" : 1664076,
"avgObjSize" : 4175,
"storageSize" : NumberLong("6466686976"),
"freeStorageSize" : 14090240,
"capped" : true,
"max" : 0,
"maxSize" : NumberLong("5368709120"),
total 16G
rw------ 1 root root 6.2G Dec 11 21:07 WiredTigerHS.wt
rw------ 1 root root 6.1G Dec 11 21:07 collection-18--7079402678170407987.wt
rw------ 1 root root 876M Dec 11 21:07 collection-98--7079402678170407987.wt
rw------ 1 root root 875M Dec 11 21:07 collection-82--7079402678170407987.wt
tmpfs 95G 43G 53G 45% /var/data/sessions.1
Within few mins
=========sessionmgr09===Mon Dec 11 21:16:42 UTC 2023===========
{
"ns" : "local.oplog.rs",
"size" : NumberLong("5315937817"),
"count" : 1274037,
"avgObjSize" : 4172,
"storageSize" : NumberLong("9466707968"),
"freeStorageSize" : NumberLong("4267544576"),
"capped" : true,
"max" : 0,
"maxSize" : NumberLong("5368709120"),
total 21G
rw------ 1 root root 8.9G Dec 11 21:16 collection-18-7685412961092358367.wt
rw------ 1 root root 8.0G Dec 11 21:15 WiredTigerHS.wt
rw------ 1 root root 879M Dec 11 21:16 collection-74-7685412961092358367.wt
rw------ 1 root root 878M Dec 11 21:16 collection-98-7685412961092358367.wt
tmpfs 95G 47G 48G 50% /var/data/sessions.1
=========sessionmgr10===Mon Dec 11 21:16:43 UTC 2023===========
{
"ns" : "local.oplog.rs",
"size" : NumberLong("5320907873"),
"count" : 1275216,
"avgObjSize" : 4172,
"storageSize" : NumberLong("9433522176"),
"freeStorageSize" : NumberLong("4194209792"),
"capped" : true,
"max" : 0,
"maxSize" : NumberLong("5368709120"),
total 21G
rw------ 1 root root 8.8G Dec 11 21:16 collection-18--7079402678170407987.wt
rw------ 1 root root 8.4G Dec 11 21:15 WiredTigerHS.wt
rw------ 1 root root 876M Dec 11 21:16 collection-98--7079402678170407987.wt
rw------ 1 root root 875M Dec 11 21:15 collection-82--7079402678170407987.wt
tmpfs 95G 48G 47G 51% /var/data/sessions.1
Brought back the down members, on the secondary sm10, the WTHS.wt reduced to 4MB. After Sometime the sm09 (PRIMARY) WTHS.wt also reduced to 4MB.
=========sessionmgr09===Mon Dec 11 21:22:16 UTC 2023===========
{
"ns" : "local.oplog.rs",
"size" : NumberLong("5366640045"),
"count" : 1286026,
"avgObjSize" : 4173,
"storageSize" : NumberLong("9466695680"),
"freeStorageSize" : NumberLong("4469882880"),
"capped" : true,
"max" : 0,
"maxSize" : NumberLong("5368709120"),
total 21G
rw------ 1 root root 8.9G Dec 11 21:21 collection-18-7685412961092358367.wt
rw------ 1 root root 8.0G Dec 11 21:21 WiredTigerHS.wt
rw------ 1 root root 879M Dec 11 21:21 collection-74-7685412961092358367.wt
rw------ 1 root root 878M Dec 11 21:22 collection-98-7685412961092358367.wt
tmpfs 95G 47G 48G 50% /var/data/sessions.1
=========sessionmgr10===Mon Dec 11 21:22:17 UTC 2023===========
{
"ns" : "local.oplog.rs",
"size" : NumberLong("5367983767"),
"count" : 1286360,
"avgObjSize" : 4173,
"storageSize" : NumberLong("9433522176"),
"freeStorageSize" : NumberLong("4348190720"),
"capped" : true,
"max" : 0,
"maxSize" : NumberLong("5368709120"),
total 13G
rw------ 1 root root 8.8G Dec 11 21:21 collection-18--7079402678170407987.wt
rw------ 1 root root 876M Dec 11 21:22 collection-98--7079402678170407987.wt
rw------ 1 root root 875M Dec 11 21:21 collection-82--7079402678170407987.wt
rw------ 1 root root 874M Dec 11 21:21 collection-74--7079402678170407987.wt
tmpfs 95G 39G 56G 42% /var/data/sessions.1
Runtime parameters
/usr/bin/mongod --ipv6 --slowms 500 --storageEngine wiredTiger --wiredTigerCacheSizeGB 12 --setParameter oplogFetcherUsesExhaust=false --setParameter wiredTigerConcurrentReadTransactions=16 --setParameter wiredTigerConcurrentWriteTransactions=16 --setParameter maxIndexBuildMemoryUsageMegabytes=500 --bind_ip_all --keyFile /root/.dbkey --port 27717 --dbpath=/var/data/sessions.1/e --replSet set01e --fork --pidfilepath /var/run/sessionmgr-27717.pid --oplogSize 5120 --logpath /var/log/mongodb-27717.log --logappend --quiet
- is related to
-
SERVER-95247 Replica Set Version Field Limit Reached Prevents Reconfiguration
- Closed