[SERVER-50443] When j:false, performance degradation from 4.0 to 4.2 for 1-node replica sets Created: 21/Aug/20  Updated: 07/Mar/22  Resolved: 07/Mar/22

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: 4.2.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Martin Mlynář Assignee: Kelsey Schubert
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
is caused by SERVER-35113 Stable timestamp does not advance if ... Closed
Related
Operating System: ALL
Steps To Reproduce:

Tests are run using https://github.com/idealo/mongodb-performance-test.git:

java -jar ./latest-version/mongodb-performance-test.jar -m insert -o 10000000 -t 100 -db test -c perf -port 28888 -h mongoserver

When running 4.0.6 standalone server with:
{{docker run -ti -v /srv/mongo/test-mongo:/srv/mongodb/ --net host
mongo:4.0.6 mongod --bind_ip_all --port=28888 --storageEngine=wiredTiger
--dbpath=/srv/mongodb/ --journal --oplogSize=8192
--enableMajorityReadConcern=0}}

I got:
mean rate = 174014.88 calls/second

When I start it as a replica set (no other members added, just start with --replSet and call rs.initiate()):
{{docker run -ti -v /srv/mongo/test-mongo:/srv/mongodb/ --net host
mongo:4.0.6 mongod --bind_ip_all --port=28888 --storageEngine=wiredTiger
--dbpath=/srv/mongodb/ --journal --oplogSize=8192
--enableMajorityReadConcern=0 --replSet wtf}}

I got:
mean rate = 44243.62 calls/second

Huge drop, but fine, RS is enabled. Now let’s try it with 4.2.8:

{{docker run -ti -v /srv/mongo/test-mongo:/srv/mongodb/ --net
host mongo:4.2.8 mongod --bind_ip_all --port=28888
--dbpath=/srv/mongodb/ --enableMajorityReadConcern=0}}

mean rate = 167455.01 calls/second

And the same with replSet enabled:
{{docker run -ti -v /srv/mongo/test-mongo:/srv/mongodb/ --net host
mongo:4.2.8 mongod --bind_ip_all --port=28888 --dbpath=/srv/mongodb/
--enableMajorityReadConcern=0 --replSet wtf}}

mean rate = 10868.81 calls/second

A huge drop, down to about 10k calls/second.
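To put the four runs above side by side, here is a small self-contained Python sketch (the rates are copied verbatim from the measurements above):

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new; negative means a regression."""
    return (new - old) / old * 100.0

# Mean insert rates reported above (calls/second).
rates = {
    "4.0.6 standalone":  174014.88,
    "4.0.6 replica set":  44243.62,
    "4.2.8 standalone":  167455.01,
    "4.2.8 replica set":  10868.81,
}

# 4.0.6: enabling the replica set costs roughly 75% of throughput.
drop_40 = pct_change(rates["4.0.6 standalone"], rates["4.0.6 replica set"])
# 4.2.8: the same step costs over 90%.
drop_42 = pct_change(rates["4.2.8 standalone"], rates["4.2.8 replica set"])
# 4.2.8 replica set vs 4.0.6 replica set: roughly a 75% regression.
drop_rs = pct_change(rates["4.0.6 replica set"], rates["4.2.8 replica set"])

print(round(drop_40, 1), round(drop_42, 1), round(drop_rs, 1))
```

So while standalone throughput is essentially unchanged between the versions, the single-node replica-set configuration regresses by roughly three quarters.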

 

I ran git bisect and identified the commits that drag performance down the most:

b2384a85f62431d88ac01067ae4d1d3dbb4fede5: mean rate = 103095.21 calls/second

4cee07d8a97bb0663e7bfbc3f2e1fbf539140adf is the first bad commit
commit 4cee07d8a97bb0663e7bfbc3f2e1fbf539140adf
Author: Judah Schvimer <judah@mongodb.com>
Date: Mon May 21 18:52:55 2018 -0400

SERVER-35113 Allow single voting primaries to advance stable timestamp even when last applied does not advance

:040000 040000 948822d607a4ae02c4b18c9eac471e123fae892d 66d95e2b3a7fbdfda5aff07b58799dc8af359f3e M jstests
:040000 040000 053e1a2ae25ef2e2599f2f89ad7a580a6509b2b3 cbf9ce44531ec2908fc8f5ee8cc0e569356dfffb M src

This drops the mean rate to 22841.24 calls/second.

 

Second:

bcaa4eb01c: mean rate = 37827.93 calls/second

e4b08de746ef4e472bfdd4790fdb2b89d5a62f1a is the first bad commit
commit e4b08de746ef4e472bfdd4790fdb2b89d5a62f1a
Author: Lingzhi Deng <lingzhi.deng@mongodb.com>
Date: Mon Apr 29 12:09:48 2019 -0400

SERVER-40462: Disallow transactions on shard servers with writeConcernMajorityJournalDefault = false

:040000 040000 e98acae85e9e9e09dc773dd464be513376791c96 bda8f41ecf04f82c70c8c2bc42dcdc3375b9e656 M jstests
:040000 040000 0819f92e0a939abe440863dcc060aa48b7fcb128 752ad757384381fc777490bbca2c4f570c7aeec4 M src

bcde4c4fd3: mean rate = 26577.37 calls/second

 

Third:

86efdd6b047927d3a37577a6033bacc709c7ae89 is the first bad commit
commit 86efdd6b047927d3a37577a6033bacc709c7ae89
Author: Ben Caimano <ben.caimano@mongodb.com>
Date: Mon Dec 16 21:39:41 2019 +0000

SERVER-43945 Expose out of order latch acquisitions in serverStatus

This review does several related things:

  • Modifies the Mutex type to have one Identity object per invocation context
  • Adds a latchAnalysis field to serverStatus
  • Officially turns on CI-only failure for acq level violations
  • Changes a few acq level declarations to be more permissive

:040000 040000 cc0ee8580816f86084bcdab28f8ae92ec9da2bf1 a77afd98d1fa56c79b71de4ec7835e4905a9c664 M jstests
:040000 040000 c9d93da9281647969dea4d505776120c9262a4d9 c2c56acf209149bdede17b97e7c3aadc2c882d1d M src

574b62017cad311690eba012222459ff5e41e8c5: mean rate = 11348.16 calls/second
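Quantifying the per-commit impact from the bisect notes above (the pairing of before/after rates is assumed from the order in which they are reported; the third commit's "good"-side rate is not clearly stated, so it is omitted):

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new; negative means a regression."""
    return (new - old) / old * 100.0

# Rates measured just before and just after each "first bad commit",
# taken from the bisect output above.
bisect_pairs = {
    "SERVER-35113 (4cee07d8a9)": (103095.21, 22841.24),
    "SERVER-40462 (e4b08de746)": (37827.93, 26577.37),
}
impact = {name: round(pct_change(good, bad), 1)
          for name, (good, bad) in bisect_pairs.items()}
print(impact)
```

By this reading, the SERVER-35113 change accounts for the bulk of the slowdown, with SERVER-40462 contributing a further drop of roughly a third.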

Participants:
Linked BF Score: 0

Description

Hi all,

I’m currently running MongoDB 4.0.6 and experimenting with an upgrade to 4.2.8. After some benchmarks I’ve noticed a really strange and huge performance degradation.

Can anyone give me some insight into what’s going on?


Comments
Comment by Kelsey Schubert [ 07/Mar/22 ]

Hi nextsux@gmail.com,

Sorry for the delay getting back to you. We have similar performance tests in our CI infrastructure, which do not show this regression. We also haven't had any other user reports indicating that this is a significant area of concern. Given the age of these releases, even if we were able to identify the root cause of the performance regression you are seeing, a fix would be unlikely, as it might risk destabilizing the branch.

Thanks,
Kelsey

Comment by Eric Sedor [ 21/Aug/20 ]

nextsux@gmail.com, as a quick note, j:true is going to end up implying at least w:1, from a client perspective. See here

Comment by Martin Mlynář [ 21/Aug/20 ]

I've done some more tests:

I'm using these docker-compose files for 4.0 and 4.2:

# cat mongo-4.0/docker-compose.yaml
version: "3.4"
services:
  mongo-rs0-0:
    image: mongo:4.0.6
    network_mode: host
    volumes:
      - /srv/mongo-4.0-rs0-0:/data/
    entrypoint: [
      "/usr/bin/mongod",
      "--dbpath", "/data/",
      "--storageEngine", "wiredTiger",
      "--journal",
      "--bind_ip_all",
      "--port", "28881",
      "--replSet", "wtf",
      "--oplogSize", "8192",
      "--enableMajorityReadConcern", "0"
    ]
  mongo-rs0-1:
    image: mongo:4.0.6
    network_mode: host
    volumes:
      - /srv/mongo-4.0-rs0-1:/data/
    entrypoint: [
      "/usr/bin/mongod",
      "--dbpath", "/data/",
      "--storageEngine", "wiredTiger",
      "--journal",
      "--bind_ip_all",
      "--port", "28882",
      "--replSet", "wtf",
      "--oplogSize", "8192",
      "--enableMajorityReadConcern", "0"
    ]
  mongo-rs0-arb:
    image: mongo:4.0.6
    network_mode: host
    volumes:
      - /srv/mongo-4.0-rs0-arb:/data/
    entrypoint: [
      "/usr/bin/mongod",
      "--dbpath", "/data/",
      "--storageEngine", "wiredTiger",
      "--journal",
      "--bind_ip_all",
      "--port", "28883",
      "--replSet", "wtf",
      "--oplogSize", "1024",
      "--enableMajorityReadConcern", "0"
    ]

# cat 4.2/docker-compose.yaml
version: "3.4"
services:
  mongo-rs0-0:
    image: mongo:4.2.8
    network_mode: host
    volumes:
      - /srv/mongo-4.2-rs0-0:/data/
    entrypoint: [
      "/usr/bin/mongod",
      "--dbpath", "/data/",
      "--storageEngine", "wiredTiger",
      "--journal",
      "--bind_ip_all",
      "--port", "28881",
      "--replSet", "wtf",
      "--oplogSize", "8192",
      "--enableMajorityReadConcern", "0"
    ]
  mongo-rs0-1:
    image: mongo:4.2.8
    network_mode: host
    volumes:
      - /srv/mongo-4.2-rs0-1:/data/
    entrypoint: [
      "/usr/bin/mongod",
      "--dbpath", "/data/",
      "--storageEngine", "wiredTiger",
      "--journal",
      "--bind_ip_all",
      "--port", "28882",
      "--replSet", "wtf",
      "--oplogSize", "8192",
      "--enableMajorityReadConcern", "0"
    ]
  mongo-rs0-arb:
    image: mongo:4.2.8
    network_mode: host
    volumes:
      - /srv/mongo-4.2-rs0-arb:/data/
    entrypoint: [
      "/usr/bin/mongod",
      "--dbpath", "/data/",
      "--storageEngine", "wiredTiger",
      "--journal",
      "--bind_ip_all",
      "--port", "28883",
      "--replSet", "wtf",
      "--oplogSize", "1024",
      "--enableMajorityReadConcern", "0"
    ]

Both clusters are initialized with:

rs.initiate()
rs.add("localhost:28882")
rs.addArb("localhost:28883")

For 4.0 I'm getting:
mean rate = 105534.54 calls/second

For 4.2:
mean rate = 30310.25 calls/second

JOURNALED (tool patched as follows):

- writeConcern(WriteConcern.ACKNOWLEDGED)
+ writeConcern(WriteConcern.JOURNALED)

4.0: mean rate = 24108.72 calls/second
4.2: mean rate = 18993.58 calls/second

UNACKNOWLEDGED (tool patched as follows):

- writeConcern(WriteConcern.ACKNOWLEDGED)
+ writeConcern(WriteConcern.UNACKNOWLEDGED)

4.0: mean rate = 232100.54 calls/second
4.2: starts around mean rate = 215253.48 calls/second and then drops to 51353.42 calls/second after a few seconds

So I do not think it's just a problem with the special case of a 1-node replica set.
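For reference, a quick back-of-the-envelope comparison of the 3-node rates above (numbers copied verbatim; the UNACKNOWLEDGED 4.2 figure uses the steady-state rate) shows the regression is much smaller with JOURNALED, i.e. j:true, than with the j:false write concerns:

```python
def pct_change(old: float, new: float) -> float:
    """Percentage change from old to new; negative means a regression."""
    return (new - old) / old * 100.0

# (4.0 rate, 4.2 rate) per writeConcern, taken from the runs above.
pairs = {
    "ACKNOWLEDGED":   (105534.54, 30310.25),
    "JOURNALED":      (24108.72, 18993.58),
    "UNACKNOWLEDGED": (232100.54, 51353.42),  # 4.2 steady-state rate
}
drops = {name: round(pct_change(v40, v42), 1)
         for name, (v40, v42) in pairs.items()}
print(drops)
```

The roughly 70-plus percent drops for ACKNOWLEDGED and UNACKNOWLEDGED, versus about a fifth for JOURNALED, are consistent with the j:false diagnosis in this ticket's title.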

Comment by Martin Mlynář [ 21/Aug/20 ]

Hi Eric,

first of all, thank you for looking at this issue so quickly! I'll re-run the tests with a master-slave-arbiter replica set later this evening and let you know my results. Unfortunately I'm a little skeptical, as I originally hit this issue with my full setup (2 shards, each consisting of a master, a slave, and an arbiter).

Good catch about w: ACKNOWLEDGED, I had missed that. I'll also try the same set of tests with a patched tool, with writeConcern set to JOURNALED and to UNACKNOWLEDGED (which is closest to my real workload).

Comment by Eric Sedor [ 21/Aug/20 ]

Hi nextsux@gmail.com,

This is a regression specifically for the combination of using a 1-node replica set and performing writes with j:false.

j:false is the default journaling behavior implied by writeConcern.ACKNOWLEDGED, which is used by this test.

You should see improvement by using j:true or testing on a larger replica set. Unfortunately in this case, testing on a 1-node replica set is not representative of what you would see in commonly deployed 3-node replica sets on 4.2.

I'm going to pass this ticket on to an appropriate team for consideration.

Eric

Generated at Thu Feb 08 05:22:40 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.