[SERVER-24775] Degraded performance in 3.2.7 (probably after 3.2.5) - GlobalLock? Created: 24/Jun/16  Updated: 21/Mar/18  Resolved: 12/Feb/18

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: 3.2.5, 3.2.6, 3.2.7
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Lucas Assignee: Kelsey Schubert
Resolution: Incomplete Votes: 8
Labels: RF
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 3.2.4100uptime.png     File diagnostic.7z     File mongologs.7z     PNG File s24775.png     PNG File updatetry.png    
Operating System: ALL
Steps To Reproduce:

Upgrade MongoDB from 3.2.4 to 3.2.7. I have not tested versions 3.2.5 and 3.2.6, but if SERVER-22964 is the cause, 3.2.5 and 3.2.6 probably have the same problem.
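
For reference, here is a minimal sketch of how a specific 3.2.x patch release can be pinned on Ubuntu with apt (assuming the mongodb-org packages from the MongoDB repository are in use; adjust the version as needed):

# stop mongod, install an exact patch release of the mongodb-org packages, restart
sudo service mongod stop
sudo apt-get install -y mongodb-org=3.2.7 mongodb-org-server=3.2.7 mongodb-org-shell=3.2.7 mongodb-org-mongos=3.2.7 mongodb-org-tools=3.2.7
sudo service mongod start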

Participants:

 Description   

Hello.

I have an application (real-time analytics) that uses MongoDB with heavy reading/writing (millions of operations per day). It needs to run aggregations/counts/finds based on several filters (we have around 10-40 indexes on each collection), and we have many distributed machines inserting and updating data.
The application has several collections (about 700), some with up to 8 million documents, though most collections have fewer than 10 thousand documents. The number of collections grows with application use, but rarely are more than 5 collections created in a day.
This application doesn't drop collections (with only very few exceptions).
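
For context, a rough way to see the collection and index counts described above (a sketch run against one database, shipyard, the one shown in the logs below):

# count collections and total indexes in the database
mongo shipyard --quiet --eval '
    var idx = 0;
    db.getCollectionNames().forEach(function(c) { idx += db.getCollection(c).getIndexes().length; });
    print(db.getCollectionNames().length + " collections, " + idx + " indexes");'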

My present architecture is a replica set - these machines run only MongoDB:

	1 Primary (Dedicated):
		OS: Ubuntu 14.04.4 LTS
		RAM: 128GB DDR4 ECC 2133
		CPU: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz 12 cores
		DISK: 2x480 GB SSD - 960GB software RAID 0 (ext4)
		MongoDB version: v3.2.4 with WiredTiger
		Network: 1Gbps
	1 Secondary (Dedicated):
		OS: Ubuntu 14.04.4 LTS
		RAM: 128GB DDR4 ECC 2133
		CPU: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz 12 cores
		DISK: 2x480 GB SSD - 960GB software RAID 0 (ext4)
		MongoDB version: v3.2.4 with WiredTiger
		Network: 1Gbps
	1 Arbiter (Virtual)
		OS: Ubuntu 14.04.3 LTS
		MongoDB version: v3.2.0 WiredTiger + nojournal

I have been using MongoDB in this application for a long time and I have never had the problem I describe here.
Recently we realized that our database was growing and the two disks in RAID 0 wouldn't hold out much longer. My infrastructure team started evaluating other machines so we could migrate our database in one last scaling step, so we would be able to plan and start our sharding process.

The problem: every machine we tested showed VERY degraded performance. After only a short runtime (5 hours at most), ALL operations started taking a very long time to run (operations that used to run in ~110ms were taking 20,000+ms), and this made our application unusable.

First, I would like to clarify that thorough hardware tests (disk, CPU, RAM, etc.) were performed on all machines by my infrastructure team and no errors were found.

Here are the machines we've tested and their results:

Machine 1:
	OS: Ubuntu 14.04.4 LTS
	RAM: 128GB DDR4 ECC 2133
	CPU: Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50GHz 12 cores
	DISK: 4x800 GB SSD - 3200GB hardware RAID 0 (ext4)
	MongoDB version: v3.2.7 with WiredTiger
	Network: 1Gbps

Result: degraded performance. We thought the cause could be the 4-disk RAID 0 (the MongoDB production notes suggest RAID 10), so we then tested a second machine:

Machine 2:
	OS: Ubuntu 14.04.4 LTS
	RAM: 256GB DDR3 2Rx4
	CPU: 2x 2GHz Intel Xeon-SandyBridge (E5-2620-HexCore)
	DISK: 6x1.2 TB SSD - 3.3 TB hardware RAID 10 (ext4)
	MongoDB version: v3.2.7 with WiredTiger
	Network: 1Gbps
  
Machine 3:
	OS: Ubuntu 14.04.4 LTS
	RAM: 512GB DDR3
	CPU: 4x Xeon E7-4850 (10 cores, 2GHz)
	DISK: 6x1.2 TB SSD - 3.3 TB hardware RAID 10 (ext4)
	MongoDB version: v3.2.7 with WiredTiger
	Network: 1Gbps

Same result. At this point we started to think the cause could be the RAID type (hardware) or the high number of disks in use. We even suspected our application needed more RAM/CPU (which is why we tested the third machine), but that wasn't it.

After that, we tested a fourth machine (this one without SSDs):

Machine 4:
	OS: Ubuntu 14.04.4 LTS
	RAM: 256 GB DDR4 ECC 2133
	CPU: 2x Intel Xeon E5-2660v3 20c/40t 2.6/3.3 GHz
	DISK: 2x4 TB SAS - 8 TB software RAID 0 (ext4)
	MongoDB version: v3.2.7 with WiredTiger
	Network: 1Gbps

This one degraded even faster (since it uses SAS disks) and didn't work as expected either.

Then we tried a fifth machine. It has the same configuration as our current one (the present architecture), differing only in disk size and in using hardware RAID. Result: degraded performance. We let this one run for 17 hours and 19 minutes and it produced a 313MB log file (many requests above 100ms).

At this point we were confused about what the cause could be. We had tried many different machines and all showed the same result. We concluded two things: MongoDB can't run on hardware RAID, and our application demands more I/O than SAS disks can provide (which was expected, but was nice to confirm).

While reviewing the MongoDB logs I noticed that all operations were taking a long time to execute.

For example, here is one log entry for a single query (one of the fastest in my current architecture) on the fifth machine while performance was degraded:

2016-06-19T18:14:51.401-0500 I COMMAND  [conn1430] query shipyard.current_universes query: { account_id: 5704147139559424, universe_id: 5942933362573312 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:5090 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1}, timeAcquiringMicros: { r: 6339285 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 14257ms

Collection stats:

{
        "ns" : "shipyard.current_universes",
        "count" : 515,
        "size" : 1323315,
        "avgObjSize" : 2569,
        "storageSize" : 1486848,
        "capped" : false,
        "nindexes" : 4,
 				"totalIndexSize" : 188416,
        "ok" : 1
}
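
Stats like these can be reproduced with something like the following (a sketch, run against the affected node):

# print the collection stats shown above
mongo shipyard --quiet --eval 'printjson(db.current_universes.stats())'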

It is quite obvious that this type of request shouldn't take this long on this collection.

The timeAcquiringMicros on the global lock was unbelievable. After checking the MongoDB release notes, I saw that revision 3.2.5 includes changes to the global lock system (SERVER-22964), and I think this is the cause.

Now, a grep of a portion of the log that shows how the response time increases over just a few seconds. I focused on this query because it is fast, but the same thing happens throughout the entire log and for all operations (I will upload it here):

2016-06-19T18:14:25.230-0500 I COMMAND  [conn1551] query shipyard.current_universes query: { account_id: 5888892137897984, universe_id: 6728455165050880 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:916 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 177ms
2016-06-19T18:14:25.230-0500 I COMMAND  [conn1547] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 175ms
2016-06-19T18:14:25.231-0500 I COMMAND  [conn1535] query shipyard.current_universes query: { account_id: 6566762430595072, universe_id: 6403331475898368 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:670 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 174ms
2016-06-19T18:14:25.472-0500 I COMMAND  [conn1577] query shipyard.current_universes query: { account_id: 6566762430595072, universe_id: 6403331475898368 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:670 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 714ms
2016-06-19T18:14:29.148-0500 I COMMAND  [conn1431] query shipyard.current_universes query: { account_id: 5704147139559424, universe_id: 5942933362573312 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:5090 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1}, timeAcquiringMicros: { r: 211664 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 3046ms
2016-06-19T18:14:29.148-0500 I COMMAND  [conn1533] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 382ms
2016-06-19T18:14:29.148-0500 I COMMAND  [conn1427] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2985ms
2016-06-19T18:14:29.148-0500 I COMMAND  [conn1528] query shipyard.current_universes query: { account_id: 5389779860455424, universe_id: 6100803039264768 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:2481 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2984ms
2016-06-19T18:14:29.149-0500 I COMMAND  [conn1428] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2984ms
2016-06-19T18:14:29.149-0500 I COMMAND  [conn1429] query shipyard.current_universes query: { account_id: 5888892137897984, universe_id: 6728455165050880 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:916 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2984ms
2016-06-19T18:14:29.149-0500 I COMMAND  [conn1577] query shipyard.current_universes query: { account_id: 6566762430595072, universe_id: 6688873937633280 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:653 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2308ms
2016-06-19T18:14:29.153-0500 I COMMAND  [conn1432] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2328ms
2016-06-19T18:14:29.153-0500 I COMMAND  [conn1430] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2310ms
2016-06-19T18:14:29.153-0500 I COMMAND  [conn1536] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2308ms
2016-06-19T18:14:29.153-0500 I COMMAND  [conn1426] query shipyard.current_universes query: { account_id: 5888892137897984, universe_id: 6728455165050880 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:916 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2311ms
2016-06-19T18:14:29.153-0500 I COMMAND  [conn1576] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2986ms
2016-06-19T18:14:29.153-0500 I COMMAND  [conn1529] query shipyard.current_universes query: { account_id: 5888892137897984, universe_id: 6728455165050880 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:916 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2310ms
2016-06-19T18:14:29.153-0500 I COMMAND  [conn1516] query shipyard.current_universes query: { account_id: 6566762430595072, universe_id: 6403331475898368 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:670 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2986ms
2016-06-19T18:14:29.153-0500 I COMMAND  [conn1534] query shipyard.current_universes query: { account_id: 4734030863925248, universe_id: 4965384511815680 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:5715 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2637ms
2016-06-19T18:14:29.153-0500 I COMMAND  [conn1433] query shipyard.current_universes query: { account_id: 6652771465953280, universe_id: 6024704036962304 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:3124 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 1378ms
2016-06-19T18:14:32.214-0500 I COMMAND  [conn1529] query shipyard.current_universes query: { account_id: 5888892137897984, universe_id: 6728455165050880 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:916 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 609ms
2016-06-19T18:14:32.215-0500 I COMMAND  [conn1543] query shipyard.current_universes query: { account_id: 5888892137897984, universe_id: 6728455165050880 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:916 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 464ms
2016-06-19T18:14:32.215-0500 I COMMAND  [conn1577] query shipyard.current_universes query: { account_id: 6566762430595072, universe_id: 6688873937633280 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:653 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 563ms
2016-06-19T18:14:32.215-0500 I COMMAND  [conn1516] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 563ms
2016-06-19T18:14:32.215-0500 I COMMAND  [conn1537] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 577ms
2016-06-19T18:14:32.215-0500 I COMMAND  [conn1426] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 609ms
2016-06-19T18:14:32.216-0500 I COMMAND  [conn1434] query shipyard.current_universes query: { account_id: 6566762430595072, universe_id: 6403331475898368 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:670 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 606ms
2016-06-19T18:14:34.233-0500 I COMMAND  [conn1534] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 1953ms
2016-06-19T18:14:34.233-0500 I COMMAND  [conn1516] query shipyard.current_universes query: { account_id: 5888892137897984, universe_id: 6728455165050880 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:916 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 1948ms
2016-06-19T18:14:34.618-0500 I COMMAND  [conn1543] query shipyard.current_universes query: { account_id: 6566762430595072, universe_id: 6403331475898368 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:670 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2310ms
2016-06-19T18:14:35.388-0500 I COMMAND  [conn1551] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 1902ms
2016-06-19T18:14:35.389-0500 I COMMAND  [conn1576] query shipyard.current_universes query: { account_id: 6652771465953280, universe_id: 6024704036962304 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:3124 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2673ms
2016-06-19T18:14:35.488-0500 I COMMAND  [conn1433] query shipyard.current_universes query: { account_id: 5704147139559424, universe_id: 5942933362573312 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:5090 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 2764ms
2016-06-19T18:14:39.588-0500 I COMMAND  [conn1426] query shipyard.current_universes query: { account_id: 5888892137897984, universe_id: 6728455165050880 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:916 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 4899ms
2016-06-19T18:14:39.588-0500 I COMMAND  [conn1554] query shipyard.current_universes query: { account_id: 5766858980458496, universe_id: 5519831352737792 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:736 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 5301ms
2016-06-19T18:14:39.588-0500 I COMMAND  [conn1535] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 4899ms
2016-06-19T18:14:39.590-0500 I COMMAND  [conn1428] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 4908ms
2016-06-19T18:14:51.401-0500 I COMMAND  [conn1430] query shipyard.current_universes query: { account_id: 5704147139559424, universe_id: 5942933362573312 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:5090 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1}, timeAcquiringMicros: { r: 6339285 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 14257ms
2016-06-19T18:14:51.401-0500 I COMMAND  [conn1551] query shipyard.current_universes query: { account_id: 5704147139559424, universe_id: 5942933362573312 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:5090 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1}, timeAcquiringMicros: { r: 6337315 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 15440ms
2016-06-19T18:14:51.401-0500 I COMMAND  [conn1516] query shipyard.current_universes query: { account_id: 6566762430595072, universe_id: 6403331475898368 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:670 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 6337664 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 15849ms
2016-06-19T18:14:51.401-0500 I COMMAND  [conn1576] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1}, timeAcquiringMicros: { r: 6340964 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 15970ms
2016-06-19T18:14:51.402-0500 I COMMAND  [conn1531] query shipyard.current_universes query: { account_id: 5389779860455424, universe_id: 6100803039264768 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:2481 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1}, timeAcquiringMicros: { r: 6340091 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 15181ms
2016-06-19T18:14:51.402-0500 I COMMAND  [conn1433] query shipyard.current_universes query: { account_id: 5888892137897984, universe_id: 6728455165050880 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:916 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 6340505 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 15846ms
2016-06-19T18:14:51.402-0500 I COMMAND  [conn1432] query shipyard.current_universes query: { account_id: 5888892137897984, universe_id: 6728455165050880 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:916 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 6341490 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 15950ms
2016-06-19T18:14:51.402-0500 I COMMAND  [conn1573] query shipyard.current_universes query: { account_id: 6566762430595072, universe_id: 6688873937633280 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:653 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 6340672 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 15950ms
2016-06-19T18:14:51.402-0500 I COMMAND  [conn1537] query shipyard.current_universes query: { account_id: 5888892137897984, universe_id: 6728455165050880 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:916 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 6339975 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 15949ms
2016-06-19T18:14:51.402-0500 I COMMAND  [conn1534] query shipyard.current_universes query: { account_id: 5704147139559424, universe_id: 5942933362573312 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:5090 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1}, timeAcquiringMicros: { r: 6340494 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 15580ms
2016-06-19T18:14:51.402-0500 I COMMAND  [conn1527] query shipyard.current_universes query: { account_id: 5345363120619520, universe_id: 6278419935395840 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:1931 locks:{ Global: { acquireCount: { r: 4 }, acquireWaitCount: { r: 1}, timeAcquiringMicros: { r: 6336434 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 15950ms
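
A quick way to summarize these lock waits over the whole log is something like the following (a sketch; adjust the log file name as needed):

# count slow entries for this collection and show the largest global-lock waits (in microseconds)
grep -c 'query shipyard.current_universes' mongod.log
grep -o 'timeAcquiringMicros: { r: [0-9]*' mongod.log | grep -o '[0-9]*$' | sort -n | tail -5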

What did I do after that? I took the fifth machine and went back to MongoDB 3.2.4. There it is: problem solved. MongoDB works perfectly fine on hardware RAID, and probably all of the machines tested before (with the exception of the SAS one) would deliver the performance we were used to.

We try to stay on the most recent revision of our release (as the MongoDB production notes/checklist suggest), and I hadn't expected such a sudden drop in performance after updating a REVISION (not an entire version) of MongoDB.

I'm sure the problem isn't with my indexes or my documents, since the fifth machine works perfectly fine with MongoDB 3.2.4, and it has always worked in my current architecture.

Just to prove the point, here is the longest request of the same type shown in the previous logs, after downgrading the fifth machine to MongoDB 3.2.4 (3 days running so far as the primary node, with a 67MB log file):

2016-06-23T22:47:37.752-0500 I COMMAND  [conn599] query shipyard.current_universes query: { account_id: 5335678606901248, universe_id: 5368858286227456 } planSummary: IXSCAN { universe_id: 1, account_id: 1 } ntoskip:0 keysExamined:1 docsExamined:1 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:638 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 } }, Collection: { acquireCount: { r: 2 } } } 116ms

I know we run some slow aggregations (rarely up to 200 seconds), but nothing on the level we saw on machine 5 with MongoDB 3.2.7.

I don't want to stay stuck on MongoDB 3.2.4, and I would at least like an explanation of what is happening in MongoDB 3.2 revision 5+ that almost DESTROYED my entire application, because the first few times this happened it was on my production replica set.

Lastly, I would like to volunteer as a tester for new versions of MongoDB (mainly replica set deployments in high-performance systems).



 Comments   
Comment by Ramon Fernandez Marina [ 12/Feb/18 ]

I don't think there's anything actionable in this ticket so I'm going to close it for the time being. The 3.2 branch is up to 3.2.19, and the 3.4 branch has received multiple performance improvements since – any further performance issues against latest versions can be investigated in separate tickets.

Regards,
Ramón.

Comment by Lucas [ 05/May/17 ]

Hello anonymous.user. Unfortunately I can't upgrade this particular instance because it is a production environment.
I will migrate this cluster to another one in order to resolve any kind of performance issue. For now we will stay on 3.2.4 here.

Sorry. I really can't upgrade this one right now.

Comment by Kelsey Schubert [ 04/May/17 ]

Hi lucasoares,

Please note that MongoDB 3.2.13 contains the fixes mentioned above as well. If this is still an issue, would you please upgrade to the latest version of MongoDB 3.2 or 3.4 and let us know whether it resolves the problem?

Thank you,
Thomas

Comment by Kelsey Schubert [ 28/Mar/17 ]

Hi lucasoares,

Today, we released MongoDB 3.4.3. This release includes a number of improvements to WiredTiger. In particular, I believe the following tickets are likely to significantly improve the behavior you are observing:

  • WT-3115 improve concurrency for accessing the list of trees in cache
  • WT-3139 enhances our workload generator to stress individual tables
  • WT-3148 improves eviction efficiency with many idle trees in cache
  • WT-3150 reduces the impact of checkpoints on eviction efficiency

Users in SERVER-27700 and SERVER-27753 have seen improvements after upgrading to take advantage of these code changes, and our internal benchmarking likewise has shown promising results. In addition, MongoDB 3.4.3 includes additional diagnostic information that helps us more quickly identify the root cause of issues.

Would you please upgrade to MongoDB 3.4.3 and confirm if it resolves the issue?

Thank you,
Thomas

Comment by Lucas [ 11/Nov/16 ]

Hello anonymous.user.

We tried to upgrade our environment to 3.2.10, but unfortunately the performance isn't the same as on 3.2.4.

I think the problem is our database size. We have around 1TB split across hundreds of collections, each with dozens of indexes, but only 256GB of memory. We know that we need a sharded cluster, but this is a work in progress.

On 3.2.4, I think that when memory fills up with cached data, all CPU cores get stuck at 100% (without really doing any work) and the database becomes inaccessible. When this happens we need to restart our MongoDB server. It hasn't happened for a few days because we significantly decreased the size of the databases, but it doesn't happen all the time anyway.

On 3.2.10 the database isn't usable at all. Queries that take 100ms on 3.2.4 take tens of seconds on 3.2.10 after some running time.

We know we have a large database for a single mongod process, but on 3.2.4 it works well enough even with those problems. On the other hand, on 3.2.10 nothing really works.

My team is hesitant to upgrade our environment, because we don't know whether it really is the database size that is causing these performance issues. We have another replica set with half of this data and it is stable all the time.
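
To check whether cache pressure is what we are hitting, something like this can be run on the primary (a sketch using serverStatus; the field names are the ones WiredTiger reports):

# compare the configured WiredTiger cache size with what is currently cached
mongo --quiet --eval '
    var c = db.serverStatus().wiredTiger.cache;
    print("maximum bytes configured:     " + c["maximum bytes configured"]);
    print("bytes currently in the cache: " + c["bytes currently in the cache"]);
    print("tracked dirty bytes in cache: " + c["tracked dirty bytes in the cache"]);'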

I uploaded two images:


Server with 1/2 of data and 100% uptime (3.2.4).


We tried to update to 3.2.10 (during that period the application was offline the whole time, since all operations took tens of seconds).

I will try to run iostat on both versions to help you, but not until Wednesday.

Comment by Kelsey Schubert [ 03/Oct/16 ]

Hi lucasoares,

We have not succeeded in creating a workload which reproduces this issue.

However, in MongoDB 3.2.10 we have made significant improvements to the performance of WiredTiger. In particular, SERVER-25070, which sallgeud mentioned previously on this ticket, is likely to be resolved by the fixes in MongoDB 3.2.10.

If this is still an issue for you, would you please consider upgrading to MongoDB 3.2.10, or provide diagnostic.data and iostat output as I requested above?

Thank you,
Thomas

Comment by Kelsey Schubert [ 28/Jul/16 ]

Hi lucasoares,

Unfortunately, from the data you have provided we have not yet been able to determine whether a backport has caused the regression you have reported.

We are attempting to create a workload which reproduces this issue to help us identify the root cause.

I understand that it takes time to run the tests that Bruce requested. But, if you or others have observed the same issue and are able to provide additional details, we will be able to better understand the cause of this behavior, and make progress towards a resolution more quickly.

Specifically, please upload iostat output, diagnostic.data, and the MongoDB logs, running the same workload on the same machine, first on an unaffected version and then on the affected version of MongoDB.

To collect iostats, you can run the following script:

# sample interval in seconds
delay=1
# -k: report KB/s, -t: include timestamps, -x: extended device statistics; run in the background
iostat -k -t -x ${delay:?} >iostat.log &

This will help us to correlate the CPU numbers to events recorded in the diagnostic.data and MongoDB logs.
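
To package the diagnostic data for upload, something along these lines can be used (a sketch; replace /var/lib/mongodb with your actual dbpath):

# archive the FTDC diagnostic.data directory from the dbpath
tar czf diagnostic.data.tar.gz -C /var/lib/mongodb diagnostic.data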

Thank you,
Thomas

Comment by Lucas [ 15/Jul/16 ]

Yes. My node has been running perfectly as primary on MongoDB 3.2.4 since 2016-06-23 01:41:52.

2016-06-19T18:14:51 - this timestamp (from the log) is an example of when this node was running as primary on MongoDB 3.2.7 with performance problems.

Before downgrading to MongoDB 3.2.4, this node was the primary at least for some time (which is where the log files showing the performance issue came from).

Comment by Alexander Gorrod [ 15/Jul/16 ]

lucasoares We have been trying to see something helpful from the data you uploaded, but so far have not been able to. The life-cycle of the machine for which you uploaded diagnostic data appears to have been:

2016-06-20 06:53:44: Start of diagnostic data, running MongoDB 3.2.7. Running as a secondary.
2016-06-22 03:32:48: Node switched to MongoDB 3.2.4, running as a secondary.
2016-06-23 01:41:52: Node appears to switch from a secondary to a primary.

Can you confirm that you are specifically looking for a problem while the node was running 3.2.7 - i.e., the first phase of the diagnostic data?

Comment by Lucas [ 15/Jul/16 ]

bruce.lucas Did you read my last reply? The data are from the same machine =)

Comment by Chad Kreimendahl [ 14/Jul/16 ]

I think there's probably a strong correlation between this issue and ours: SERVER-25070

We appear to have found two pieces to the same puzzle.

Comment by Lucas [ 04/Jul/16 ]

Hi bruce.lucas,

These files are from the same machine, I assure you. What happened is that we changed both the DNS name and the hostname when we changed the MongoDB version. Files older than 2016-06-20T04-50-41 may not be (since those came from another machine, all copied over with scp), but mongo-endpoint7 is the same machine as mongo-endpoint3b.

It's hard to do that now; it costs both time and money. Maybe the fact that they are the same machine can help? I really have already uploaded the diagnostic files from 3.2.4 and 3.2.7 runs on the same machine.

Just assume that mongo-endpoint7 is the SAME machine as mongo-endpoint3b.

Comment by Bruce Lucas (Inactive) [ 01/Jul/16 ]

Hi Lucas,

We've been looking into your data but are not yet able to identify a cause.

The log data that you provided for 3.2.7 is from a machine mongo-endpoint7, whereas the diagnostic.data is from a machine mongo-endpoint3b, so we won't be able to correlate the slow queries in the mongod log with the data in diagnostic.data in order to investigate the cause.

Did you observe any slow queries on mongo-endpoint3b while it was running 3.2.7, from about 06-20 to about 06-22? If you can upload that mongod log and point out the slow queries there we can take a look in the diagnostic.data at those times.

Also, it looks like the load on mongo-endpoint3b was substantially different, and much lower, while it was running 3.2.4, which makes comparison difficult. Would you be able to run the same workload on the same machine, both on 3.2.7 and 3.2.4, and then upload diagnostic.data and mongod logs from both runs, pointing out the problem spots (slow queries) for us to investigate? By running the same workload and providing full coordinated data (mongod logs and diagnostic.data) for each run, we will have the best chance of understanding the difference.

Thanks,
Bruce

Comment by Ramon Fernandez Marina [ 24/Jun/16 ]

Thanks for uploading the information lucasoares, we're investigating.

Comment by Lucas [ 24/Jun/16 ]

Here you are.

About the RAID configuration, I'm sure it works. We have the machine (where the logs came from) that showed degraded performance on 3.2.7 (as did several other machines) but works perfectly on 3.2.4. It can't be a coincidence, since we tested so many machines (including one identical to our current architecture).

The mongod log file with degraded performance was: mongo.log.2016-06-20T04-50-41.
The diagnostic file will probably have a similar or later date.

Comment by Ramon Fernandez Marina [ 24/Jun/16 ]

Thanks for the detailed report lucasoares, sorry you ran into trouble with the upgrade.

In order to diagnose this problem we'll need more information: do you still have the dbpath for the MongoDB 3.2.7 primary? I'm interested in seeing the contents of the diagnostic.data directory; if you have it, can you please upload it to the ticket?

If you no longer have this data we'll have to try to reproduce this behavior locally, but note that so far we haven't seen the behavior you describe.

Another thing I'd like to mention is that, when switching RAID subsystems, one has to pay attention to block sizes and read-ahead settings. Clearly if the same new box with hardware RAID performs well with 3.2.4 but not with 3.2.7 then we need to look into that, but I figured I'd mention it.

Thanks,
Ramón.
