[SERVER-21782] WiredTiger freezing at 100% cache in use until service restart (daily) Created: 07/Dec/15  Updated: 06/Mar/16  Resolved: 06/Mar/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Anthony Pastor Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian Wheezy.
MongoDB 3.0.7


Attachments: JPEG File 21782.2632.jpg     File 5jan16-logs.tar.gz     PNG File Dec18.png     File capture-2016-01-09.tar.gz     File coll-stats-12jan16.tar.gz     PNG File combined_usage-day.png     File logging_20151208.tar.gz     File ls-lR-memc5-30dec15.txt.gz     Text File memc2-100pct-logs.txt     PNG File mongodb_wt_cache_usage.png     File mongostat-18dec15.log.gz     File sar.out     File server-status-18dec15.log.gz     PNG File ss_mongodb.png     PNG File timeseries-2040.png    
Issue Links:
Related
Operating System: Linux
Participants:

 Description   

On a daily basis we have to restart MongoDB nodes (secondaries and primary) because performance degrades dramatically.

We can observe that the percentage of WiredTiger cache in use (the %use column of mongostat) freezes at 100%, and it never decreases on its own unless we restart the node.

After the node has been restarted, the %use generally decreases to ~80%.
I've attached a graph displaying the %use on our servers.

We tried to increase the cacheSizeGB parameter. Here, on a 64GB server:
wiredTiger:
  engineConfig:
    cacheSizeGB: 56

The issue is still the same even with this configuration.
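
For reference, the raw numbers behind mongostat's %use column can also be read directly from serverStatus; a minimal shell sketch (assuming the cache statistic names exposed by the WiredTiger section in 3.0):

    var cache = db.serverStatus().wiredTiger.cache;
    var usedPct = 100 * cache["bytes currently in the cache"] / cache["maximum bytes configured"];
    print("cache used: " + usedPct.toFixed(1) + "% of " +
          (cache["maximum bytes configured"] / (1024 * 1024 * 1024)).toFixed(1) + " GB configured");
    print("tracked dirty bytes: " + cache["tracked dirty bytes in the cache"]);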



 Comments   
Comment by Ramon Fernandez Marina [ 06/Mar/16 ]

Thanks for the update antoine@stickyadstv.com, glad to hear 3.2 is working better for you. A lot of work went into making WiredTiger the default storage engine in 3.2, so while 3.0 may not have worked well for your use case I believe you'll be better off using 3.2 anyway, especially since you're using WiredTiger. And you get a bunch of cool new features as well!

Since the main issue reported in this ticket (100% cache usage and slowdown) is not present in 3.2 I'm going to close this ticket for the time being. If the same issue appears again please post here and we can either reopen the ticket or create a new one as needed.

Thanks,
Ramón.

Comment by Antoine Bonavita [ 06/Mar/16 ]

Hello Ramon,

We successfully completed tests with 3.2.3 in our preproduction environment. So, earlier this week we deployed 3.2.3 on one of our secondaries in production. For now (5 days later) we haven't seen the infamous 100% cache in use come back on this machine. It even looks like the rest of the replica set is in better shape (fewer restarts necessary because of 100% cache use). The next step will be to upgrade the primary of this replica set.

I'll keep you posted.

Thanks for the help.

A.

Comment by Ramon Fernandez Marina [ 05/Mar/16 ]

Apologies for the radio silence antoine@stickyadstv.com. Unfortunately I have no more information for you at this time. Have you had a chance to see if the numerous improvements made for 3.2 take care of this behavior?

Thanks,
Ramón.

Comment by Antoine Bonavita [ 17/Feb/16 ]

Hello Ramon,

We haven't been able to test with 3.2 yet but we are still seeing the problem and this morning it happened on the primary (which has been at 100% for hours).

Here is the currentOp output (I removed ops that had been running for 0 secs for easier reading):

sfx:PRIMARY> db.currentOp({secs_running:{$gt:0}})
{
	"inprog" : [
		{
			"desc" : "conn88337",
			"threadId" : "0x58d82b260",
			"connectionId" : 88337,
			"opid" : -1006602359,
			"active" : true,
			"secs_running" : 13,
			"microsecs_running" : NumberLong(13242197),
			"op" : "query",
			"ns" : "stats.tmp.mr.ad_bidder_zone_geo_daily_241257",
			"query" : {
				"$msg" : "query not recording (too large)"
			},
			"client" : "172.16.18.4:45930",
			"msg" : "m/r: merge sort and reduce",
			"numYields" : 0,
			"locks" : {
				"Global" : "w",
				"Database" : "w"
			},
			"waitingForLock" : true,
			"lockStats" : {
				"Global" : {
					"acquireCount" : {
						"r" : NumberLong(585),
						"w" : NumberLong(583)
					},
					"acquireWaitCount" : {
						"w" : NumberLong(1)
					},
					"timeAcquiringMicros" : {
						"w" : NumberLong(765819)
					}
				},
				"Database" : {
					"acquireCount" : {
						"r" : NumberLong(1),
						"w" : NumberLong(582),
						"W" : NumberLong(2)
					},
					"acquireWaitCount" : {
						"r" : NumberLong(1),
						"w" : NumberLong(52),
						"W" : NumberLong(2)
					},
					"timeAcquiringMicros" : {
						"r" : NumberLong(5521487),
						"w" : NumberLong(4495799),
						"W" : NumberLong(6078)
					}
				},
				"Collection" : {
					"acquireCount" : {
						"r" : NumberLong(1),
						"w" : NumberLong(290),
						"W" : NumberLong(1)
					}
				},
				"oplog" : {
					"acquireCount" : {
						"w" : NumberLong(291)
					}
				}
			}
		},
		{
			"desc" : "conn35123",
			"threadId" : "0x1b705e080",
			"connectionId" : 35123,
			"opid" : -1006599710,
			"active" : true,
			"secs_running" : 3,
			"microsecs_running" : NumberLong(3190796),
			"op" : "query",
			"ns" : "stats.tmp.mr.ad_bidder_zone_geo_daily_241259",
			"query" : {
				"$msg" : "query not recording (too large)"
			},
			"client" : "172.16.18.2:34083",
			"msg" : "m/r: merge sort and reduce",
			"numYields" : 0,
			"locks" : {
				"Global" : "w",
				"Database" : "w"
			},
			"waitingForLock" : true,
			"lockStats" : {
				"Global" : {
					"acquireCount" : {
						"r" : NumberLong(247),
						"w" : NumberLong(247)
					},
					"acquireWaitCount" : {
						"w" : NumberLong(1)
					},
					"timeAcquiringMicros" : {
						"w" : NumberLong(767093)
					}
				},
				"Database" : {
					"acquireCount" : {
						"w" : NumberLong(246),
						"W" : NumberLong(2)
					},
					"acquireWaitCount" : {
						"w" : NumberLong(16),
						"W" : NumberLong(2)
					},
					"timeAcquiringMicros" : {
						"w" : NumberLong(205220),
						"W" : NumberLong(182268)
					}
				},
				"Collection" : {
					"acquireCount" : {
						"w" : NumberLong(122),
						"W" : NumberLong(1)
					}
				},
				"oplog" : {
					"acquireCount" : {
						"w" : NumberLong(123)
					}
				}
			}
		}
	]
}

Comment by Antoine Bonavita [ 09/Feb/16 ]

Ramon,

We looked at currentOp on one of the secondaries that was at 100% before restarting it:

sfx4:SECONDARY> db.currentOp();
{
    "inprog" : [
        {
            "desc" : "conn15091",
            "threadId" : "0xd5d56680",
            "connectionId" : 15091,
            "opid" : 660748152,
            "active" : true,
            "secs_running" : 4,
            "microsecs_running" : NumberLong(4732072),
            "op" : "getmore",
            "ns" : "local.oplog.rs",
            "query" : {
                "ts" : {
                    "$gte" : Timestamp(1455026317, 9386)
                }
            },
            "client" : "172.16.18.12:42581",
            "numYields" : 0,
            "locks" : {
               
            },
            "waitingForLock" : false,
            "lockStats" : {
                "Global" : {
                    "acquireCount" : {
                        "r" : NumberLong(10)
                    }
                },
                "Database" : {
                    "acquireCount" : {
                        "r" : NumberLong(5)
                    }
                },
                "oplog" : {
                    "acquireCount" : {
                        "r" : NumberLong(5)
                    }
                }
            }
        }
    ]
}

It was at 100% for more than 4 seconds.

Comment by Antoine Bonavita [ 09/Feb/16 ]

Hello Ramon,

We'll try that when we get a chance and see if killOp could be enough to solve the issue.

A.

Comment by Ramon Fernandez Marina [ 09/Feb/16 ]

antoine-sticky, above I wrote:

there are long-running operations that are preventing data in the cache from being evicted. The rawAggregate function above shows the default is to not use any cursor timeout, so that's the main suspect (as of right now)

This is not accurate: eviction is working, and is working overtime.

If you kill the long-running aggregation operation, do your nodes recover? If they don't, that could indicate a bug; if they do, that would indicate this operation is taxing your nodes significantly. If the latter, we can investigate whether more yielding is needed for these operations.
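
For what it's worth, a minimal shell sketch of that experiment (the 60-second threshold is only an illustrative cutoff, not a recommendation):

    // Find operations that have been running longer than an arbitrary threshold
    // and ask the server to interrupt them.
    db.currentOp({secs_running: {$gt: 60}}).inprog.forEach(function(op) {
        print("killing opid " + op.opid + " (" + op.op + " on " + op.ns + ", " + op.secs_running + "s)");
        db.killOp(op.opid);  // the op is interrupted at its next interrupt check, not instantly
    });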

Comment by Antoine Bonavita [ 06/Feb/16 ]

Hello,

After a few days running with 13G cache on the 32G machine, things are not much better: the freeze happens as often as it was happening with 28G cache.

Is there any other option than upgrading to 3.2?
Do you have other users who were facing similar issues and saw an improvement with 3.2?

A.

Comment by Antoine Bonavita [ 02/Feb/16 ]

Hello Ramon,

Your summary is accurate.
We used to run with the default cache size but tried to increase it to solve the issue.
We had the problem again a few minutes ago. The secondary had been at 100% for 50 minutes. I reconfigured the cache to 13G (on a 32G machine). I'll keep you posted if this changes anything.

We can try 3.2, but it might take a bit of time as we need to check it on our preproduction servers first.

In any case, thanks for your help. It's really appreciated.

Comment by Ramon Fernandez Marina [ 01/Feb/16 ]

anthony.pastor, antoine@stickyadstv.com, I went over the data in this ticket, and if I understand correctly this ticket can be summarized as follows:

  • you experience a performance degradation in your MongoDB servers, which correlates with the WiredTiger cache filling up
  • restarting the affected nodes fixes the issue, but it appears again with the same symptoms

The data you've so far shared with us indicates the following:

  • the attempt to grow the size of the cache may be counterproductive, as it may increase swapping and adversely affect performance; I'd recommend you either use the default size (50%/60% of available RAM in 3.0/3.2) or even lower it below the default. While this may not address the issue at hand, it may help node performance
  • there are long-running operations that are preventing data in the cache from being evicted. The rawAggregate function above shows the default is to not use any cursor timeout, so that's the main suspect (as of right now)

Seeing how 3.0.9 didn't help in this respect, would you consider running an experiment with 3.2.1? We've observed much better performance and stability in MongoDB 3.2 when the cache is under stress. If you could upgrade a secondary to 3.2.1 and confirm whether the behavior is the same or different, it would help determine how to approach the impact of long-running operations and whether we need a workaround for your aggregation needs.
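
If you do run that experiment, a quick post-upgrade sanity check could look like the sketch below (it only confirms the binary version and storage engine on the upgraded secondary):

    // Run against the upgraded secondary.
    print("version: " + db.version());
    print("storage engine: " + db.serverStatus().storageEngine.name);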

Thanks,
Ramón.

Comment by Antoine Bonavita [ 01/Feb/16 ]

Hello Dan (and all),

We upgraded our cluster to 3.0.9 and things did not get any better.

Here is how we connect to MongoDB:

    public function init() {
        $options = $this->getOptions();
        try {
            $params = $options;
            unset($params['dsn']);
            unset($params['db']);
 
            $mongoWrite = new MongoClient($options['dsn'], $params);
            $mongoWrite->connect();
            $mongoRead = new MongoClient($options['dsn'], $params);
            $mongoRead->setReadPreference(MongoClient::RP_SECONDARY_PREFERRED);
            $mongoRead->connect();
 
            return array(
                'clientRead' => $mongoRead,
                'clientWrite' => $mongoWrite,
                'dbConfig' => $options['db'],
            );
        } catch (MongoConnectionException $e) {
            throw new Zend_Exception($e->getMessage());
        }
    } 

Where the config looks like this:

resources.mongo.dsn = "mongodb://10.1.18.4:27021,10.1.18.2:27021/?connectTimeoutMS=1000"
resources.mongo.socketTimeoutMS = 80000

We set a timeout for the queries we run:

    public function rawAggregate(array $pipeline, array $options = array(), $cursorTimeout = null) {
        $this->_startProfiler();
        if ($cursorTimeout !== null) {
            $currentTimeout = MongoCursor::$timeout;
            MongoCursor::$timeout = $cursorTimeout;
        }
        $results = $this->getMongoCollection()->aggregate($pipeline,
                                                               $this->_mergeDefaultCursorOptions($options));
        if ($cursorTimeout !== null) {
            MongoCursor::$timeout = $currentTimeout;
        }
        $this->_profile('aggregate', $pipeline);
        return $results;
    }
 
    private function _mergeDefaultCursorOptions(array $options) {
        $defaultOptions = array(
            'allowDiskUse' => true,
        );
 
        $timeout = MongoCursor::$timeout;
        if ($timeout > 0) {
            $defaultOptions['maxTimeMS'] = $timeout;
        }
        return array_merge($defaultOptions, $options);
    }
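
For reference, a shell-level sketch of the server-side limit this helper sets (the collection name, pipeline and the 80000 ms value are placeholders, not our real ones):

    var pipeline = [ { $match: {} } ];            // placeholder pipeline
    db.getCollection("someCollection").aggregate(pipeline, {
        allowDiskUse: true,                       // same default option as in the PHP helper above
        maxTimeMS: 80000                          // server-side limit, mirroring socketTimeoutMS
    });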

Comment by rayshen [ 27/Jan/16 ]

Thanks Michael. But New Year is coming and we have a holiday; after a week I will try to upgrade to 3.0.8 or 3.0.9. If it is useful, I'll add my information and let you know.

Comment by Michael Cahill (Inactive) [ 27/Jan/16 ]

rayshen, what you are seeing is different to what is reported on this ticket.

If possible, one thing to try is upgrading to a newer release of MongoDB: version 3.0.8 fixed some issues relating to memory use with WiredTiger, and 3.0.9 was just released with additional fixes.

If upgrading is not possible, or does not help, please open a new SERVER ticket with your information and we will help you out there.

Comment by Daniel Pasette (Inactive) [ 26/Jan/16 ]

Hi Antoine, I understand. Can you show me how the connection is configured and how you typically use maxTimeMS in a query?

Comment by Antoine Bonavita [ 26/Jan/16 ]

Hello Dan,

I won't be able to post the exact call config as I have not been able to tie the 100% cache in use to a specific query. If I could, it would make all this much easier. I'll get you a currentOp() next time we get it.

Comment by Daniel Pasette (Inactive) [ 26/Jan/16 ]

Typically, you should only need the maxTimeMS setting, because it should cause the operation to be killed on the server and return a usable status to the application. However, from our earlier analysis, it appears that the aggregation is still running. Can you post the exact call/config you are using from your application, and can you also run db.currentOp() from the mongo shell when the problem is occurring?

Comment by Antoine Bonavita [ 26/Jan/16 ]

Hi Dan,

We are using the PHP driver and are actually setting both (MongoCursor::$timeout and maxTimeMS) to the same value. This way we are hoping to get the best of both worlds: an exception on the client side and resources not locked up on the server side.

A.

Comment by Daniel Pasette (Inactive) [ 25/Jan/16 ]

Hi Antoine, sorry I missed this comment from you before, but you mention you have a timeout set on your aggregation query. Can you elaborate on whether this is using maxTimeMS or a socket timeout?

Comment by Antoine Bonavita [ 25/Jan/16 ]

Dan,

Original poster here. I checked and we don't see the same exceptions you do in the logs. I also checked our swap graphs and we never go beyond 800M (on a 64G machine). Even when swapping is at its heaviest, the machine is at 30 pages per second, and:
1 - it doesn't last;
2 - it doesn't coincide with the cache going up to 100% and staying there.

We might have the same symptoms but different problems.

A.

Comment by rayshen [ 25/Jan/16 ]

Hi Dan Pasette, thanks for your reply.
Sorry for the confusion. Yes, we use master-slave mode: there are 32 master nodes and 32 slave nodes.
The application layer reaches the 32 shards through a hash; we do not use the official auto-sharding.
Every shard has only one collection, with 3 billion documents. The data is very big...

I'm in China; it is 02:10 AM in Beijing and I haven't slept yet, because the WiredTiger cache is frozen at 100% again...
I want to know why "used" normally stays around 80%, and what causes it to reach 100%. Can you help me list some reasons for "used" hitting 100%?

Our queries/updates are very simple, like this:

"2016-01-26T01:55:17.048+0800 I QUERY    [conn86313] query ins_shard7.xxxxxxx query: { aa_id: ObjectId('xxxxxxxxxxxxxxx'), bbid: "xxxxxxxxxxxxxxxxx" } planSummary: IXSCAN { aa_id: 1.0, bbid: 1.0 } ntoskip:0 nscanned:1 nscannedObjects:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:1 reslen:201 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } 173ms"

2016-01-26T01:56:33.597+0800 I WRITE    [conn86246] update ins_shard7.installations query: { _id: ObjectId('xxxxxxxxxxxxxxxx') } update: { $set: { last_ld: "2016-01-24,2016-01-25,2016-01-26" } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 writeConflicts:0 numYields:1 locks:{ Global: { acquireCount: { r: 3, w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 3 } } } 412ms

There are a lot of slow queries when the cache "used" reaches 100%. Our application layer is an Apache Storm cluster (real-time), so we have very strict requirements on query time.

Hope to get your help.

Just happened:

insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn     time
    *0   839    797     *0       0   799|0     0.4  100.0       0 97.7G 96.4G   0|0   2|0  365k   301k  168 01:19:18
    *0   834    798     *0       0   803|0     0.4  100.0       0 97.7G 96.4G   0|0   2|0  365k   302k  168 01:19:19
    *0   888    853     *0       0   855|0     0.4   99.8       0 97.8G 96.5G   0|0   1|0  388k   321k  168 01:19:20
    *0  1135   1079     *0       0  1081|0     0.4   99.8       0 97.8G 96.5G   0|0   1|0  495k   401k  168 01:19:21
    *0  1186   1134     *0       0  1137|0     0.4   99.9       0 97.8G 96.5G   0|0   1|1  517k   422k  168 01:19:22
    *0  1071   1036     *0       0  1038|0     0.4   99.9       0 97.8G 96.5G   0|0   1|1  469k   386k  168 01:19:23
    *0   889    827     *0       0   823|0     0.3   99.9       0 97.8G 96.5G   0|0  35|9  372k   310k  168 01:19:24
    *0   964    957     *0       0   968|0     0.2  100.0       0 97.8G 96.5G   0|1   2|0  437k   361k  168 01:19:25
    *0   654    636     *0       0   638|0     0.3   99.9       0 97.8G 96.5G   0|0   1|0  287k   244k  168 01:19:26
    *0  1064   1022     *0       0  1023|0     0.3  100.0       0 97.8G 96.5G   0|1   1|0  465k   382k  168 01:19:27

mongod --port 27623 --master --wiredTigerCacheSizeGB 100 --journalCommitInterval 300 --oplogSize 100000 --storageEngine wiredTiger --fork --dbpath /disk1/xxxx --logpath /disk1/xxxx --logappend --wiredTigerDirectoryForIndexes --directoryperdb

$free -m
                     total       used       free     shared    buffers     cached
Mem:        193799     193096        703          0         14      88099
-/+ buffers/cache:     104982      88817
Swap:         1952        835       1117

Maybe I should go to bed, Good night !

Comment by Daniel Pasette (Inactive) [ 25/Jan/16 ]

Hi rayshen, thanks for adding your information. You mention that you are running as a standalone, but I notice that you are using the startup parameter --master. Is that intentional?

Comment by rayshen [ 25/Jan/16 ]

Hi,
I have also met this problem, but mine is not a replica set setup; it's many stand-alone nodes (32+). I can only restart the mongod instances when the WiredTiger cache "used" goes up to 100%, which is terrible for performance.

my configuration:

mongo version: 3.0.7
os ram: 256G
os : centos/redhat 6.2
parameters: mongod --port 27612 --master --wiredTigerCacheSizeGB 200 --journalCommitInterval 300 --oplogSize 100000 --storageEngine wiredTiger --fork --dbpath /disk1/xxxxx --logpath /disk1/logs/xxxxx_3.0.log --logappend --wiredTigerDirectoryForIndexes --directoryperdb

I can confirm that the requests from the applications only include queries and updates, no range queries.

$free -m
total used free shared buffers cached
Mem: 258448 257695 753 0 44 75318
-/+ buffers/cache: 182331 76117
Swap: 2047 2047 0

I found some errors in the dmesg logs:
[21992237.723372] swapper: page allocation failure. order:1, mode:0x20
[21992237.729716] Pid: 0, comm: swapper Not tainted 2.6.32-220.23.2.ali878.el6.x86_64 #1
[21992237.737684] Call Trace:
[21992237.740463] <IRQ> [<ffffffff8112580a>] ? __alloc_pages_nodemask+0x77a/0x930
[21992237.748043] [<ffffffffa024d500>] ? bond_dev_queue_xmit+0x10/0x220 [bonding]
[21992237.755493] [<ffffffff8115f6f2>] ? kmem_getpages+0x62/0x170
[21992237.761490] [<ffffffff8116030a>] ? fallback_alloc+0x1ba/0x270
... ...

Is this because there is no available swap space?
I changed wiredTigerCacheSizeGB to 100G (or the default, 1/2 of RAM) and restarted, and it worked fine.
But after some days the WiredTiger cache froze at 100% again, with the same errors in dmesg, even though swap had enough space.

$du -sh *
556G ins_shard12 #my database, only one collection with 3 billion documents
201M journal
28G local
40K _mdb_catalog.wt
4.0K mongod.lock
44K sizeStorer.wt
4.0K storage.bson
4.0K WiredTiger
4.0K WiredTiger.basecfg
4.0K WiredTiger.lock
4.0K WiredTiger.turtle
64K WiredTiger.wt

sorry for my bad description

Comment by Antoine Bonavita [ 19/Jan/16 ]

Hello,

Is there anything I should do to help you guys identify the source of the problem?

A.

Comment by Antoine Bonavita [ 12/Jan/16 ]

Hello,

The data stored is analytics data. Mostly, the application uses "recent" data but once in a while someone will indeed get "older" data. I understand this can trigger a spike in the cache usage.
However, we have a timeout on our aggregation queries, so after the timeout shouldn't the query stop and the cache usage go down?

I've attached a file with the documents for all our collections. I don't know if it makes a difference, but I ran it on both the primary (memc1) and the secondary (memc5). Please note that memc5 was restarted this morning, so I don't know if the information will help.

Thanks for your help on this.

A.

Comment by Susan LoVerso [ 12/Jan/16 ]

Hello antoine-sticky. Thank you very much for the data you collected and sent. It was very helpful. In reviewing it, and based on your earlier description, it appears an aggregation is triggering the stalls. Can you give us more information about the aggregation? Do you know how much data it is traversing? Are there indexes involved too?

Based on our analysis of the information, we see that the application and WiredTiger are not stalled. The start of what we believe is an aggregation causes a very large amount of load that slows things down significantly. It appears to be reading a lot of content that is not only outside the WiredTiger cache, but also not in the OS file system cache. We see a large and sustained increase in OS page faults and disk I/O, particularly reads. WiredTiger is also writing data out of the cache in order to make space for the data being read, which results in additional I/O.

Could you gather collection statistics for the collections in the database (https://docs.mongodb.org/manual/reference/method/db.collection.stats)? If you don't want to upload those to the ticket, we can give you an alternative mechanism. (This does not need to be done at a time when the issue is exhibited, anytime will do.)
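
In case it helps, a minimal shell loop for gathering these (a sketch; run it against the database holding the collections and redirect the shell's output to a file):

    // Print stats for every collection in the current database.
    db.getCollectionNames().forEach(function(name) {
        print("=== " + name + " ===");
        printjson(db.getCollection(name).stats());
    });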

The listing of your data directory that you sent a couple of weeks ago shows about 1.2TB of data. The WiredTiger cache is 28GB, which is only about 2% of your dataset size, so the size of the aggregation and what it touches is important.

Comment by Antoine Bonavita [ 11/Jan/16 ]

Hello Ramon,

The cache size is actually different depending on the machine, because they have different amounts of RAM:

  • memc1 is 64G, cacheSize is 52G
  • memc2 is 64G, cacheSize is 52G
  • memc5 is 32G, cacheSize is 28G

And the recording I attached for Jan 9th is for memc5. Therefore, the 28G seems pretty normal to me.

Comment by Ramon Fernandez Marina [ 11/Jan/16 ]

Thanks for the additional information antoine-sticky. Here's the subset of metrics that looks relevant from the last dataset you sent us, but I haven't been able to find anything wrong with them yet.

You mention a 56GB cache size, but the highest usage is around 28GB.

Comment by Antoine Bonavita [ 11/Jan/16 ]

Hello,

Sorry for the delayed follow-up, but although I tried many times I have not been able to reproduce the situation on a secondary by restarting the other secondary.
However, I was able to capture the requested information on Saturday night (Jan 9th) between 22:00 and 23:00 Paris time. The attached tar.gz contains the following files for the secondary that exhibited the 100% cache in use:

  • iostat.log for the output of iostat -k -t -x 1
  • mongostat.log for the output of mongostat -h
  • ss_mongodb.log for the output of db.serverStatus({tcmalloc:1}). Timestamps are in GMT.
  • sar.log generated with sar -o sar.log -b -B -r -R -S 1. Please note that I have not been able to restrict the capture and that it contains data from 10:40 to 23:39. Sorry about that.

I hope this gives you what you need to figure out what is going on.

Antoine.

Comment by Susan LoVerso [ 05/Jan/16 ]

The one statistic that stands out from the 18 Dec server stats is page faults. That measures major page faults (i.e. those requiring I/O in the OS). While running at 100%, the average number of OS page faults/sec was 20x what it was during the earlier 80% period.

Can you collect a set of information from one of the secondaries - please run the capture for several minutes before the transition, while at 80%, as well as for several minutes after, at 100%? In particular I want to collect and look at OS-related information at the time this happens and correlate it with the mongod server stats. There may also be a few minutes of ramp-up going from 80% to 100%. Here is the info to collect so that we can compare it all together from a single run, ideally no more than 30 minutes of data:

  • Collect mongod server statistics like you did on 18 Dec.
  • Collect iostat like you did during 17 Dec. (There were several times I saw I/O at 100% for seconds at a time.)
  • Run sar -o sar.log -i $delay -b -B -r -R -S & and send sar.log.

    delay=1
    mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep($delay*1000)}" >ss.log &
    iostat -k -t -x $delay >iostat.log &
    sar -o sar.log -i $delay -b -B -r -R -S &
    

  • mongostat output.
  • mongod log.

Also let us know the timestamp that the transition happened.

Comment by Antoine Bonavita [ 05/Jan/16 ]

Hello Sue,

I tried what you suggested on one of the secondaries (memc2) this morning and it triggered the 100% condition on memc5. Therefore I restarted memc5 and it did not have any impact on the primary.

You will find attached the logs of the 3 servers while I was doing that.

It looks to me like restarting can trigger the condition but it's not systematic.

Do you guys have any idea what might be going wrong here?

Thanks,

Comment by Susan LoVerso [ 04/Jan/16 ]

antoine-sticky, that is interesting. While I realize you may not want to do this experiment, if you restart one of the secondaries while it is not yet at 100%, does it still exhibit the cascading effect on the other secondary and then the primary? If so, it may at least give us a way to gather statistics and force a transition to the 100% cache usage in a controlled manner.

Comment by Antoine Bonavita [ 04/Jan/16 ]

Over the weekend, I noticed something weird which is probably worth mentioning:

  • One of the secondaries was at 100% used, so I restarted it.
  • Just after that, the other secondary went up to 100% used. So I restarted it as well.
  • Just after that the primary went from 80% up to 100%. Of course, I had to restart it as well.

I don't know if this is linked and/or if this can help but I figured you might be interested in knowing it.

Comment by Antoine Bonavita [ 31/Dec/15 ]

No, they are not. They are going to collections.

Comment by Daniel Pasette (Inactive) [ 30/Dec/15 ]

Are all the map/reduces inline?

Comment by Antoine Bonavita [ 30/Dec/15 ]

The map/reduces are run on the primary, and we are seeing more freezes on the secondaries than on the primary. Could it be happening when a secondary replicates the temp tables used by the primary during the map/reduce process? If so, is there a way to deactivate the replication of those temp tables (after all, they are just temp tables)?

Comment by Daniel Pasette (Inactive) [ 30/Dec/15 ]

Hi Antoine, yes, the map/reduce commands would do a lot of creates and renames. That helps explain things.

Comment by Antoine Bonavita [ 30/Dec/15 ]

Hello Sue,

Please find attached the output of running ls -lR on our dbpath on the secondary for which you already have data for Dec 18th.

Our code in itself does not create/drop collections. That being said, we do a lot of map/reduces (which should run on the primary) and aggregations (which should run on the secondaries), and some of them are so big they require the creation of temporary tables. Could that explain it?
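
For context, a sketch of the two map/reduce output modes being discussed (the map/reduce functions, the "source" collection and the "target" name are placeholders):

    var mapFn = function() { emit(this.key, 1); };                   // placeholder map
    var reduceFn = function(k, vals) { return Array.sum(vals); };    // placeholder reduce

    // Inline output: results come back in the command reply; no output collection is created.
    db.getCollection("source").mapReduce(mapFn, reduceFn, { out: { inline: 1 } });

    // Output to a collection: the server builds results in temporary collections
    // (the stats.tmp.mr.* namespaces seen elsewhere on this ticket) before merging into the target.
    db.getCollection("source").mapReduce(mapFn, reduceFn, { out: { merge: "target" } });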

Comment by Susan LoVerso [ 29/Dec/15 ]

Hello antoine-sticky, I took a look at the server status log from 18 Dec. Can you attach an ls -l of your database directory? It may help us correlate some things I see in the stats.

One thing I notice is that at the time that the problem appears to happen, the number of commands for drop, create and renameCollection appears to go up by an order of magnitude over its average number. Is there something in your code that manipulates a lot of collections or indexes on a daily basis or anything else you can think of with those commands?

Comment by Antoine Bonavita [ 29/Dec/15 ]

Hello Alexander,

The workload characteristics do not change. We have seen the problem happen on primaries and secondaries without distinction, and sometimes at "low usage" hours. Our code does not run listIndexes directly. What could cause it?

Comment by Alexander Gorrod [ 28/Dec/15 ]

antoine-sticky I've taken another look at the statistics you uploaded. It appears as though a listIndexes command happens right before the system transitions to the low throughput state. Does your workload change characteristics around the time the performance profile changes?

Comment by Alexander Gorrod [ 21/Dec/15 ]

Thanks for the statistics, I'm reviewing them now, and attempting to generate a reproducer as well.

Comment by Antoine Bonavita [ 18/Dec/15 ]

It happened on one of the secondaries. Attaching output of server status and mongostat. Please note the server is in Central European Time (so GMT+1 currently).

Let us know if you need more.

Comment by Antoine Bonavita [ 18/Dec/15 ]

OK. I started the capture on the secondaries as well, so I'll send you the data from whichever node fails first.

Comment by Alexander Gorrod [ 18/Dec/15 ]

Thanks. The server status should come from the mongod where the symptom is present. I'd like to correlate the mongostat cache usage statistics with the statistics from server status. If the symptom is present on a secondary, it would be fine to capture all statistics there.

Comment by Antoine Bonavita [ 18/Dec/15 ]

I restarted a capture on the primary with a more robust approach.
We have the same behavior on the secondaries. Is it OK if the serverStatus capture comes from one of the secondaries?

Comment by Alexander Gorrod [ 18/Dec/15 ]

Thanks for the data. I agree that the swap usage looks OK.

Unfortunately, I'd really need the server status log information covering the period where the cache usage increases from ~80% to 100% to have a good chance of diagnosing the problem. Could you try again with the data capture? I don't need to see the free information again.

Comment by Antoine Bonavita [ 17/Dec/15 ]

Hello,

Our primary just exhibited the problem. Here (http://cdn.stickyadstv.com/one-shot/17dec2015-mongo-logs.tar.gz) is a tar.gz containing:

  • free_mongo.log - output of free -h since last midnight.
  • iostat_mongodb.log - output of iostat -k -t since last midnight
  • mongostats.log - output of mongostat since last midnight
  • ss_mongodb.log - output of db.serverStatus({tcmalloc:1}). Unfortunately, we don't have this since last midnight as the process crashed. This is for a few seconds when the server was at 100%.

Sorry, but I could not upload the file on JIRA.

Let me know if you need more info to investigate.

A.

Comment by Antoine Bonavita [ 16/Dec/15 ]

It happened on one of the secondaries. Attaching output of free, mongostat and serverStatus.

Comment by Antoine Bonavita [ 14/Dec/15 ]

Hello Alexander,

Thanks for looking into this. Swap looks good to me at this point (but the system is currently fine):
total used free shared buffers cached
Mem: 62G 62G 854M 0B 136M 33G
-/+ buffers/cache: 28G 34G
Swap: 19G 109M 18G

I'm logging those values every second and will post values I get when the system reaches 100% cache.

Comment by Alexander Gorrod [ 14/Dec/15 ]

Hi antoine-sticky I have reviewed the log files, and I can't diagnose any issues that would cause a stall. Can you check to see whether the system is using swap memory? To do that I'd use the Linux free command.

Comment by Antoine Bonavita [ 14/Dec/15 ]

Hello,

I work with Anthony and I'm taking over as he is on holiday this week. We keep seeing the problem. Is there anything we can do to help diagnose it?
Or are there settings we should try?

Comment by Anthony Pastor [ 08/Dec/15 ]

Updated graph (20151208) displaying wt %cache used.

Comment by Anthony Pastor [ 08/Dec/15 ]

Hi,

I've attached the logs (logging_20151208.tar.gz) and an updated graph (mongodb_wt_cache_usage.png)

Comment by Anthony Pastor [ 07/Dec/15 ]

Hi Ramon,

This is a replica set (we use a sharded cluster with 2 replica sets).
On our replica set we have one primary and two secondaries.

I've started the log collection for serverStatus, iostat & mongostat.
I'll send it when the issue happens again.

Comment by Ramon Fernandez Marina [ 07/Dec/15 ]

Hi anthony.pastor, can you provide some more details about this deployment? Is this a stand-alone node or a replica set?

Also we'll need more information to investigate this issue. Can you please collect serverStatus logs for the affected node from a restart until you see the issue? Same drill as for SERVER-20159, but please keep the 1-second interval.
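
For reference, a minimal 1-second capture loop, similar to the mongo --eval command shown in the 05/Jan/16 comment above (run it in the mongo shell with output redirected to a file):

    // Poll serverStatus once per second and print it as one JSON document per line.
    while (true) {
        print(JSON.stringify(db.serverStatus()));
        sleep(1000);
    }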

Thanks,
Ramón.

Comment by Anthony Pastor [ 07/Dec/15 ]

This graph displays the %use (WiredTiger cache in use) on our MongoDB nodes.
