[SERVER-46902] Single-server replica set responds about 3 times slower than a standalone server on the same hardware Created: 16/Mar/20  Updated: 27/Oct/23  Resolved: 24/Mar/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Andre M Assignee: Dmitry Agranat
Resolution: Community Answered Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongo-rs-vs-sa-insert.js     PNG File mongod-excessive-journal-flush.png     PNG File mongodb-200k-ins-only-rs-sa-perfmon.png     PNG File mongodb-80k-find-ins-rs.png     PNG File mongodb-80k-find-ins-sa.png    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

I cannot share the application code that demonstrates the 3x difference, and running the Mongo Shell script doesn't show as large a difference, so I'm not sure whether you will be able to reproduce it.

Anyway, what I did is this:

On Windows, add and remove these two lines in mongod.cfg between tests:

replication:
  replSetName: rs0 

After each change to these lines, I ran either the application test or the attached test script (not both in the same server run).

The test script attempts to mimic the application calls, but with much smaller documents; in the real application test, the find commands run against a different collection.

Participants:

 Description   

This might be a Mongo C driver issue, not a database issue, but since the only variable changed between my tests is the server configuration, I created this under SERVER.

I am running Mongo DB v4.2.1 on Windows and the Mongo C driver v1.16.1-1.el8 on CentOS 8. This is a development machine, so the Mongo DB server is configured to run as a replica set with a single server. CentOS is running in Hyper-V, but judging by the CPU usage in the VTune profiling results below, this is not caused by CPU overload.

What I noticed is that when I run an application test against this server configured as a standalone server, the test runs about 3 times faster than when it is configured as a replica set.

I tried to reproduce the same problem with a small Mongo Shell script; while it doesn't show the same difference, the test still runs about 1.3-1.5 times slower against a replica set (~32s vs. ~48s). Sometimes the difference is as little as 1.2 times, but never as small as on Windows. The same test run from the Mongo Shell on Windows differs only by about 1.1 times and is more consistent in terms of variance.

The two VTune screenshots below are bottom-up stacks showing the same application test doing about 80K look-ups by _id and single-document inserts in a database that contains about 2M documents. Same database in both tests. The only change is adding and removing replSetName in the configuration file and restarting the database server.
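For reference, each iteration of the application test looks roughly like this in libmongoc terms (a simplified sketch, not the actual application or the attached script; the URI, database/collection names and the integer _id scheme are placeholders):

/* Simplified sketch of the find-by-_id + single-insert workload (placeholders only). */
#include <mongoc/mongoc.h>
#include <stdio.h>

int main (void)
{
   mongoc_init ();
   /* Drop the replicaSet option to hit the server as a standalone. */
   mongoc_client_t *client =
      mongoc_client_new ("mongodb://192.168.1.10:27017/?replicaSet=rs0");
   mongoc_collection_t *coll =
      mongoc_client_get_collection (client, "testdb", "docs");
   bson_error_t error;

   for (int64_t i = 0; i < 80000; i++) {
      /* Look up one existing document by _id. */
      bson_t *query = BCON_NEW ("_id", BCON_INT64 (i));
      mongoc_cursor_t *cursor =
         mongoc_collection_find_with_opts (coll, query, NULL, NULL);
      const bson_t *found;
      mongoc_cursor_next (cursor, &found);
      mongoc_cursor_destroy (cursor);
      bson_destroy (query);

      /* Insert a single small document. */
      bson_t *doc = BCON_NEW ("seq", BCON_INT64 (i), "payload", BCON_UTF8 ("x"));
      if (!mongoc_collection_insert_one (coll, doc, NULL, NULL, &error))
         fprintf (stderr, "insert failed: %s\n", error.message);
      bson_destroy (doc);
   }

   mongoc_collection_destroy (coll);
   mongoc_client_destroy (client);
   mongoc_cleanup ();
   return 0;
}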

This VTune screenshot is against a standalone server. The bottom two green lines show CPU usage for the application's 2 threads interacting with the Mongo C driver. You can see that each is using about 50-60% of a CPU and the test runs for 90 seconds. The wait time in the poll function is 61.8 seconds for both threads combined.

This VTune screenshot shows the exact same application test running against the exact same Mongo DB server, but configured as a replica set. Notice on the CPU usage scale that the duration of the test is now 260 seconds and CPU usage for each application thread is barely 25%. The wait time now is 418 seconds for both threads combined.

The call stack is exactly the same for both screenshots.

Looking at the Mongo C driver source, I see that _mongoc_socket_wait uses a fixed timeout, which is consistent with the VTune profiling results, but I'm not sure whether it's the server that changes send/receive timing or the client that polls differently. I didn't have a chance to profile the driver.
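For clarity, by a fixed-timeout wait I mean something like the following (a generic POSIX illustration only, not the actual libmongoc code):

/* Generic illustration of a fixed-timeout socket wait (not the driver source).
 * If the server's reply is slow, the caller spends wall-clock time blocked in
 * poll() rather than burning CPU, which matches the wait time VTune reports. */
#include <poll.h>

int wait_readable_fixed_timeout (int fd, int timeout_ms)
{
   struct pollfd pfd = {.fd = fd, .events = POLLIN, .revents = 0};
   /* Returns >0 when readable, 0 on timeout, -1 on error. */
   return poll (&pfd, 1, timeout_ms);
}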

Let me know if you want me to try anything in my configuration.



 Comments   
Comment by Andre M [ 13/Apr/20 ]

I guess I just have to wait until somebody at MongoDB realizes that addressing this issue drastically improves server performance for pretty much everyone who pumps a lot of data into MongoDB. In my setup the difference was 3-4 times, but that's just the limitation of my test system - a standalone server did utilize most of the allocated resources, while a replica set was able to utilize only a fraction of the CPU and disk throughput capacity because of the way it flushes the journal. This difference will likely grow as more hardware is allocated to server machines.

Disappointing, really, but I suppose this is as much as one can plead for an issue. I will leave it alone now. My apologies for being persistent.

Comment by Andre M [ 31/Mar/20 ]

Can you please change the incorrect and misleading "Community Answered" resolution to "Won't Fix", which would at least reflect accurately that it is still an issue, even though Mongo DB is just not willing to fix it?

Comment by Andre M [ 26/Mar/20 ]

Is that it for this issue? If it is, it's like putting square wheels on a Ferrari - the car is beautifully designed for hundreds of km/hr, but because they were out of good wheels users have to settle for a trade-off and stay off a race track.

Comment by Andre M [ 24/Mar/20 ]

Thank you for jumping in, Geert. It is much appreciated.

One can write the same file from multiple threads at specified offsets with seek and write calls to ensure both order and durability. Flushing the cache with FlushFileBuffers makes mongod absolutely non-scalable - no matter how much hardware you throw at it, it writes at 3 MB/s and uses 12-30% of a single core. This doesn't sound to me like a good work-around, any way you look at it.
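To illustrate what I mean (a rough sketch only, not mongod code; the file name, sizes and thread count are made up), several threads can each write their own pre-assigned offsets of one write-through file, so ordering comes from the layout rather than from a global FlushFileBuffers:

/* Sketch: 4 threads write 1K chunks at disjoint, pre-assigned offsets of the
 * same file opened with FILE_FLAG_WRITE_THROUGH; no FlushFileBuffers calls. */
#include <windows.h>
#include <process.h>
#include <stdint.h>

#define CHUNK          1024
#define CHUNKS_PER_THR 4096
#define THREADS        4

static unsigned __stdcall writer (void *arg)
{
   int id = (int) (intptr_t) arg;
   HANDLE h = CreateFileA ("journal-sketch.bin", GENERIC_WRITE,
                           FILE_SHARE_READ | FILE_SHARE_WRITE, NULL,
                           OPEN_ALWAYS, FILE_FLAG_WRITE_THROUGH, NULL);
   char chunk[CHUNK] = {0};
   for (int i = 0; i < CHUNKS_PER_THR; i++) {
      OVERLAPPED pos = {0};
      pos.Offset = (DWORD) ((id * CHUNKS_PER_THR + i) * CHUNK); /* disjoint range per thread */
      DWORD written = 0;
      WriteFile (h, chunk, CHUNK, &written, &pos); /* positioned, write-through */
   }
   CloseHandle (h);
   return 0;
}

int main (void)
{
   HANDLE th[THREADS];
   for (int t = 0; t < THREADS; t++)
      th[t] = (HANDLE) _beginthreadex (NULL, 0, writer, (void *) (intptr_t) t, 0, NULL);
   WaitForMultipleObjects (THREADS, th, TRUE, INFINITE);
   for (int t = 0; t < THREADS; t++)
      CloseHandle (th[t]);
   return 0;
}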

Having said that, I'm not trying to start a technical discussion of what is doable and what is not - I'm not familiar enough with mongod specifics to offer an opinion beyond what I have expressed in this thread. I can only add that transaction logs are not something new, and other database servers handle them without exhibiting this pattern - give them more hardware and they run faster.

With MongoDB, there is really no way to resolve this other than infrastructure work-arounds, such as running in-memory engines in front, or development hacks, like putting a journal in front of a standalone mongod instance.

I'm really hoping you guys revisit this issue. It will not go unnoticed, and everyone running mongod under high load will benefit tremendously from such a solution.

Comment by Geert Bosch [ 24/Mar/20 ]

We need to ensure not just that all writes are durable, but also that all writes become durable in the correct order. It's not OK for a primary to perform Write1, Write2 and Write3 and for secondaries to replicate those writes if the primary only has Write1 and Write3 durable. So, we have to make a trade-off between how often we sync the journal and replica-set lag. Similarly, there is a trade-off between the total throughput of many clients and the throughput of a single client. We have primarily focused on many clients in replica sets with at least 3 nodes. We have made improvements in MongoDB 4.4 to allow replication before journaling while still honoring durability guarantees. This should help all replica sets regardless of the number of nodes.

Comment by Andre M [ 24/Mar/20 ]

I find it eyebrow-raising that this severe bug drastically affecting mongod performance is closed. After the actual cause of the performance degradation was identified, I might add.

This has nothing to do with community support. Can you please bring development and architecture into this?

Comment by Dmitry Agranat [ 24/Mar/20 ]

Hi cis74633@bell.net,

Thank you for executing additional tests. If you need further assistance with performance tuning of your cluster running on Windows, I encourage you to ask our community by posting on the MongoDB Community Forums or on Stack Overflow with the mongodb tag.

"Do you mind elaborating on what results the server is waiting for? There are no secondaries in this case, so it cannot be the network; it cannot be the data either, as that is the same as in the standalone case, which leaves only the journal."

There are other components involved in replication; I've mentioned some examples in this comment. We also have several internal projects aiming to improve performance in this area in MongoDB 4.4.

Thanks for creating SERVER-47066 to investigate the reported memory leak. I will go ahead and close this ticket and we'll start looking at SERVER-47066 shortly.

Regards,
Dima

Comment by Andre M [ 23/Mar/20 ]

I think I see the problem. I profiled mongod in VTune and I can see that a lot of the wait is in Win32 I/O completion. I ran procmon against the journal file and I can see multiple threads stomping on each other while trying to flush journal writes.

This is a pretty brutal way to ensure file writes are durable. Microsoft highlights it right on the FlushFileBuffers page (see the Remarks section):

https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-flushfilebuffers

Out of interest, I wrote a small test program that writes 1K chunks from 4 threads using FlushFileBuffers, and I see the same poor disk write performance of about 7 MB/s as with mongod (mongod writes at about 3 MB/s, but it does more than my test does). In this case the 4 threads finished writing a 100 MB file (the same size as the journal) in 115 seconds.

Removing FlushFileBuffers and using write-through flags finishes the same test in less than 10 seconds with a write speed of 61 MB/s, while keeping the journal durable.

This is a fairly benign fix - it just adds a couple of flags and removes the explicit flush call, but I think it will make a lot of difference in performance on Windows.
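For reference, the comparison I ran was along these lines (a sketch with made-up file names, single-threaded for brevity - the actual test used 4 threads; FILE_FLAG_NO_BUFFERING could be added as well, but it requires sector-aligned buffers and offsets):

/* Sketch: write 100 MB in 1K chunks twice - once flushing after every write,
 * once on a write-through handle with no explicit flush - and time both. */
#include <windows.h>
#include <stdio.h>

static ULONGLONG write_100mb (DWORD create_flags, int flush_each_write, const char *path)
{
   HANDLE h = CreateFileA (path, GENERIC_WRITE, 0, NULL,
                           CREATE_ALWAYS, create_flags, NULL);
   char chunk[1024] = {0};
   ULONGLONG start = GetTickCount64 ();
   for (DWORD i = 0; i < 100 * 1024; i++) { /* 100 MB total */
      DWORD written = 0;
      WriteFile (h, chunk, sizeof (chunk), &written, NULL);
      if (flush_each_write)
         FlushFileBuffers (h); /* explicit flush per write (the pattern seen in procmon) */
   }
   CloseHandle (h);
   return GetTickCount64 () - start;
}

int main (void)
{
   printf ("flush per write: %llu ms\n",
           write_100mb (FILE_ATTRIBUTE_NORMAL, 1, "flush-test-a.bin"));
   printf ("write-through:   %llu ms\n",
           write_100mb (FILE_FLAG_WRITE_THROUGH, 0, "flush-test-b.bin"));
   return 0;
}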

Comment by Andre M [ 23/Mar/20 ]

Thank you, Dima

"please open a new ticket"

Will do

"If we are just waiting for a certain result to be propagated back to the source, HW resource utilization is usually decreased."

Do you mind elaborating on what results the server is waiting for? There are no secondaries in this case, so it cannot be the network; it cannot be the data either, as that is the same as in the standalone case, which leaves only the journal. Reading about how Mongo DB handles the journal, it appears that it does keep clients separate and uses threads to operate on the journal buffer, which implies that it should not be bottlenecked waiting for journal writes. I'm not sure what else there is to wait for.

It is a closed system, and whatever it is waiting for must be observable in some way - long network communications, long disk writes, etc. - and, consequently, can be remedied by reducing those obstacles.

"Regarding the first bullet, this is expected."

Looking at the Mongo C driver design, I see elaborate, performance-conscious design decisions throughout the codebase. Assuming that mongod is as carefully designed, I'm quite puzzled that it would be the official MongoDB stance that a replica set cannot utilize more than 30% of one CPU core and 10% of the available disk write bandwidth because it is bottlenecked on some resource. There must be more to this than just it-is-what-it-is.

I am hoping the answer to the result wait question would shed some light on this puzzle.

Comment by Dmitry Agranat [ 23/Mar/20 ]

Hi cis74633@bell.net,

Regarding the first bullet, this is expected. If we are just waiting for a certain result to be propagated back to the source, HW resource utilization is usually decreased.

Regarding the second bullet, yes, please open a new ticket and we'll be happy to take a look.

Thanks,
Dima

Comment by Andre M [ 22/Mar/20 ]

I see two issues here:

  • Replica set performs 3-4 times slower than a standalone server without any visible evidence that any of the system resources is maxed out (CPU 12%, disk writes 2.3 MB/s)
  • Server memory is leaking at 1.2 MB/s in this test and I think I saw higher rates in other similar tests

This test does only unrelated inserts, so no server-side locking should be going on against any of those documents; under similar conditions with other database servers, I can throw more hardware at whatever resource is maxed out and have them run faster. From the data above it appears that Mongo DB has some internal locking that prevents it from using all available machine resources and performing better.

A memory leak is self-explanatory, I hope. I can create a separate bug for the memory leak, if you'd like.

Comment by Dmitry Agranat [ 22/Mar/20 ]

Hi cis74633@bell.net,

Thank you for this additional information. Apart from the expected difference between a standalone and a replica set, could you please clarify what MongoDB bug you'd like us to investigate?

Thanks,
Dima

Comment by Andre M [ 20/Mar/20 ]

I ran one type of application test against a standalone Mongo DB server and against a single-server replica set. Just to be clear, I don't particularly care about a single-server replica set - it is just used for testing - but I am assuming the performance of an actual replica set with multiple servers will follow suit.

In each test, 2 concurrent threads inserted 100K documents each (i.e. 200K docs in total). No queries or updates per document. The test runs against a real application, so it's not just a specialized test with the Mongo C client pounding away at the database with unrealistic patterns, but rather a real application being tested via JMeter.

The test against a replica set runs for 595 seconds and against a standalone server for 134 seconds. That is 4.4 times difference.

The perfmon chart shows that none of the system resources is in short supply - the Mongo DB service is using just 30% CPU at the start and then drops to about 12% for the replica set (see the remark below the chart), and disk writes are at 23 MB/s. CrystalDiskMark shows that even with a single queue and thread this disk can do random writes at 84 MB/s, and I imagine the oplog and the other additional bits of data the server writes are sequential, so the achievable disk write speed would be an order of magnitude higher.

It is just very hard to believe that a server as unstressed as the perfmon chart below shows is exhibiting the expected behavior. Am I missing something?

A couple of other points to highlight in the chart above. The line for Private Bytes indicates there is a memory leak. You may want to look into that. It's a pretty steep curve and the server will chew through all RAM pretty quickly at this rate.

The drop in the chart counters (i.e. network packets, CPU usage, disk writes) in the replica set test cannot come from the application - it's the same application running the same test in both cases and there is no such change in the pattern in the standalone test, so it must be something in mongod. I restarted the server for each test, and it is possible that this drop occurs at the beginning of every server start-up, but I haven't run an extended test yet to verify. For now I'm assuming the counters that follow the drop are the actual counters I can expect for a long-term run.

Comment by Andre M [ 19/Mar/20 ]

Thank you for a quick investigation on this, Dima.

I do expect a replica set to operate slower than a standalone server, but 3+ times slower seems a bit extreme, especially when there is really not much to improve on the hardware side of things - everything is running with plenty of CPU cores, memory and SSD disks. Can you advise any configuration values I can tweak for a replica set (in general, not just for a single-server replica set) that may visibly improve performance?

Looking at the CPU and disk usage during the test, none of these is saturated. I will look closer into the network traffic, but I don't think it was saturated either, which means there is some kind of polling with a fixed-time wait going on somewhere on the server side. In other words, I totally understand having to write into the oplog, journal, etc, but this would translate into more CPU, disk and network usage that the client would be waiting on, which is expected, but if none of those is saturated, it seems strange.

I will run another test with perfmon set up for all of those counters to confirm (or reject) the thought above. If you could keep the issue open just a bit longer, it would be much appreciated. Thanks!

Comment by Dmitry Agranat [ 19/Mar/20 ]

Hi cis74633@bell.net, thank you for the report.

Even with a single-member replica set, we still have most of the replication components enabled. For example, additional work is required for the oplog, journal and retryable writes when replication is enabled. This is especially true with a write-heavy workload.

Though we have plans to introduce some improvements for a single-member replica set, the overall observations mentioned here are expected.

Thanks,
Dima

Comment by Andre M [ 18/Mar/20 ]

One thing to mention is that about half-way through the application test, the response time decreases. This seems to coincide with the period of decreased CPU usage in the VTune screenshot at the bottom.

Comment by Andre M [ 18/Mar/20 ]

I uploaded a file called mongodb-SERVER-46902.zip. There are two folders in the file, one for each test:

  • standalone-9-49-edt-test-52-sec
  • replica-set-9-53-edt-test-3-min-51-sec

I stopped and started the MongoDB server for every test, copied the log file and diagnostic data after each test and then deleted them, so the test data are independent. It sounds like you are collecting the same kind of data perfmon does, but if you need some additional perfmon counters, let me know and I will capture those as well.

The application test in both cases was identical - the application mostly ran one query and one insert, and every 100 requests did about 100 updates with a query. The standalone test started at 9:49 EDT and ran for 52 seconds, and the replica set test started at 9:53 EDT and ran for 3 minutes 51 seconds.

Average application response times were about 4 ms for the standalone test and about 16 ms for the replica set test. I mention these just for timing scale, as they include a mix of application and database calls and are meaningless on their own.

Comment by Dmitry Agranat [ 18/Mar/20 ]

Hi cis74633@bell.net,

Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?

Please do the above for each test, one for a test with a standalone node and another for a test with a single-server replica set. Please also mention the exact time each test was executed.

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Thanks,
Dima

Comment by Andre M [ 16/Mar/20 ]

Got conflicting edits on the title. It's supposed to say this:

Single-server replica set responds about 3 times slower than a standalone server on the same hardware
