[SERVER-71503] Big disk IO (write) regression (×8) when setting fCV to 4.4 Created: 18/Nov/22  Updated: 27/Oct/23  Resolved: 03/Jan/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.18
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Adrien Jarthon Assignee: Backlog - Replication Team
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File SERVER-71503-diskIOPS.png     PNG File SERVER-71503-getMores-increse.png     PNG File image-2022-11-18-23-37-51-756.png     PNG File image-2022-11-18-23-40-32-446.png     PNG File image-2022-11-18-23-54-38-636.png     PNG File image-2022-11-22-11-03-08-057.png     PNG File image-2022-12-21-16-29-47-844.png     PNG File mongodb-4.4-secondary-IO-regression.png    
Issue Links:
Related
is related to SERVER-54939 Investigate secondary batching behavi... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

I just tried upgrading my RS running 4.4.18 from fCV 4.2 to fCV 4.4 (yes, it had been running on fCV 4.2 for a while ^^). Unfortunately, bad surprise: I immediately saw the disk IO load (write) on both secondaries jump through the roof (about 8 times, from ~12% to ~90%). The amount of data written only slightly increased though, so it's mostly the number of writes, which I suppose are now smaller. There's also a ×2 increase in CPU usage and a small increase in network usage, probably related. As soon as I downgraded back to fCV 4.2, the IO, CPU and network load went back down to previous levels (times are UTC+1 here):

CPU:

Of course I checked the features which require fCV 4.4, but there are only two listed here and I don't see any likely culprit: https://www.mongodb.com/docs/manual/release-notes/4.4-compatibility/#std-label-4.4-compatibility-enabled

So I suppose there are other changes triggered by this flag. Are you aware of any change that could cause such a huge regression? Maybe there's a configuration default that changes and that I could tune?
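(For what it's worth, one rough way to check whether any exposed server parameter actually changes with the fCV flip is to diff the output of getParameter: '*' before and after; getParameter is the standard admin command, but the diff loop below is just a mongo shell sketch and won't catch behavior changes that aren't surfaced as parameters.)

// capture all server parameters before the fCV change
var before = db.adminCommand({ getParameter: '*' });
// ... run db.adminCommand({ setFeatureCompatibilityVersion: "4.4" }) on the primary ...
var after = db.adminCommand({ getParameter: '*' });
// print any parameter whose value differs, skipping the command-response fields;
// featureCompatibilityVersion itself is expected to show up here
Object.keys(after).forEach(function (k) {
  if (k === 'ok' || k === 'operationTime' || k === '$clusterTime') return;
  if (tojson(before[k]) !== tojson(after[k])) {
    print(k + ': ' + tojson(before[k]) + ' -> ' + tojson(after[k]));
  }
});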

I have the diagnostic.data and logs for these servers if you let me know where I can upload them safely.

I wanted to do this to prepare for the upgrade to 5.0 and 6.0, but I can't really continue with the upgrade or try newer versions if 4.4 already puts me in such a difficult position. Also, sorry if there's already a ticket about this; I did search but couldn't find any.

Thanks!



 Comments   
Comment by Alan Zheng [ 03/Jan/23 ]

Thank you bigbourin@gmail.com for the thorough investigation.

 

We will update our 4.4 to 5.0 upgrade documentation (DOCS-15800) to explain this behavior and the oplogBatchDelayMillis tunable parameter. If customers are concerned about the high I/O with fCV 4.4, turning off Streaming Replication via oplogFetcherUsesExhaust is another option.
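(Side note for later readers: both parameters can be inspected with the standard getParameter command from the mongo shell; on a 4.4 binary the oplogBatchDelayMillis query should return an error since that parameter only exists from 5.0 on, which is a quick way to confirm what a given build supports.)

// streaming replication toggle (exists since 4.4)
db.adminCommand({ getParameter: 1, oplogFetcherUsesExhaust: 1 })
// oplog application batching delay (introduced by SERVER-65723 for 5.0/6.0)
db.adminCommand({ getParameter: 1, oplogBatchDelayMillis: 1 })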

Comment by Adrien Jarthon [ 21/Dec/22 ]

Ok so I just tried this:

  • set fCV to 4.4 to prepare for the upgrade (so starting the intensive IO write phase)
  • upgrade the secondary to 5.0.14 (still the same IO writes, as expected)
  • set oplogBatchDelayMillis to 20ms
  • then try 100ms and 1000ms to see the impact

Here is the result on disk IO % (red) and number of disk writes/s (blue):

So with this tuning it's indeed much better. I still had the huge increase in getMore on the primary though, and also an increase in network traffic caused by replication. While looking for the cause of this, I noticed 4.4 introduces Streaming Replication and a new option, oplogFetcherUsesExhaust, to turn it off if needed. So I tried disabling it with this option (effectively reverting to the previous behavior of polling the oplog in batches). And sure enough this "resolved" the remaining network increase caused by streaming replication (not sure why streaming consumes more; maybe it doesn't use compression as efficiently?).
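(Assuming oplogBatchDelayMillis accepts runtime changes via setParameter — the 20/100/1000 ms sequence above suggests it does — the tuning steps would look roughly like this from the mongo shell on the 5.0 secondary; if it turns out to be startup-only on a given build, it can go in the config file instead, as shown further down in this comment.)

// delay (ms) the secondary waits to accumulate a larger oplog batch before
// applying it (per SERVER-65723); applied locally on the secondary
db.adminCommand({ setParameter: 1, oplogBatchDelayMillis: 20 })
db.adminCommand({ setParameter: 1, oplogBatchDelayMillis: 100 })
db.adminCommand({ setParameter: 1, oplogBatchDelayMillis: 1000 })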

This network usage increase is about 30% in my case; here is the secondary before fCV 4.4 (lower), during the upgrade (higher), and after I set oplogFetcherUsesExhaust to false:

In the end I added this to my config file:

setParameter:
  oplogBatchDelayMillis: 200 # https://jira.mongodb.org/browse/SERVER-71503
  oplogFetcherUsesExhaust: false # https://jira.mongodb.org/browse/SERVER-71503

And very important to note for other users: I didn't test this, but I believe setting oplogFetcherUsesExhaust to false would be enough to prevent the huge IO jump when setting fCV to 4.4 (and, as opposed to oplogBatchDelayMillis, this setting IS available since 4.4). So you can probably do this first to allow the fCV upgrade, then upgrade to 5.0, and then test and decide whether to enable Streaming Replication again if the cost is worth it for your use case.
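(A minimal sketch of what that 4.4-only config could look like, untested as noted above; the oplogBatchDelayMillis line is deliberately omitted since that parameter doesn't exist before 5.0.)

setParameter:
  oplogFetcherUsesExhaust: false # disable Streaming Replication before setting fCV to 4.4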

Comment by Adrien Jarthon [ 20/Dec/22 ]

@Alan I suppose you mean I should not see the regression any more IF I tune this parameter to something other than the default value of 0, e.g. 20 or more, right? Because if I don't, simply upgrading to 5 or 6 is going to show the same problem, since that's the default value.

And so the "regression" is in 4.4 (it's a "feature" officially for people who need lower latency write majority concern) but the workaround is only in 5.0, it means people are forced to set fCV 4.4 first which means experience the huge regression for hours / days BEFORE they can upgrade to 5 and then try to set this value and hope it works (the roolback being more complicated/dangerous at this time as we need to go back to 4.4  binary to be able to set fCV to 4.2 again). All that is if they're aware of it of course, as this is not documented in the fCV 4.4 changes. In my case it might be acceptable because the machines I'm using can handle 8 times the IO write load but what about others users? most of them won't hold 8 times the IO write this well, what can they do if they can't take this regression first?

Also, is it considered normal now that the default configuration generates 4-5 times more IO writes on the secondary than on the primary? (About 2200 writes/sec in the example above while I tested fCV 4.4, versus 500 writes/sec on the primary the whole time.) Why such amplification?
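(One way to put numbers on this from the secondary itself is serverStatus: metrics.repl.apply.batches.num and metrics.repl.apply.ops are standard fields, and the per-minute sampling and ops-per-batch average below are just a sketch; a drop in that average after the fCV flip would be consistent with many more, smaller batches being written out.)

// sample the oplog-application counters on the secondary over one minute
var a = db.serverStatus().metrics.repl.apply;
sleep(60 * 1000);  // wait one minute
var b = db.serverStatus().metrics.repl.apply;
var batches = b.batches.num - a.batches.num;
var ops = b.ops - a.ops;
print('batches/min: ' + batches + ', ops/min: ' + ops +
      ', avg ops/batch: ' + (batches ? (ops / batches).toFixed(1) : 'n/a'));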

Comment by Alan Zheng [ 20/Dec/22 ]

As of now, SERVER-65723 will not be backported to 4.4. We recommend customers upgrade to the latest version of 6.0.

Comment by Driss Tahraoui-Maldague [ 20/Dec/22 ]

Hi Alan! Is SERVER-65723 going to be backported to a future 4.4 version?

Comment by Alan Zheng [ 19/Dec/22 ]

Hi Adrien - Thanks for reporting this issue. A fix for this problem was made in SERVER-65723, which introduced the tunable parameter oplogBatchDelayMillis for versions 5.0 and 6.0. You should not see regressions when you upgrade to these versions. Unfortunately, this parameter was not backported to version 4.4.

Please let us know if this helps!

Comment by Adrien Jarthon [ 14/Dec/22 ]

Hi Yuan,

Thanks for the update!

Comment by Yuan Fang [ 13/Dec/22 ]

Hi bigbourin@gmail.com,

Thank you for uploading the diagnostic data and mongod log. This issue is still under investigation, and I'm passing this ticket to our Replication team for a closer look. I hope to get back to you soon. Thank you so much for your patience!

Regards,
Yuan
 

Comment by Adrien Jarthon [ 22/Nov/22 ]

Hi Yuan, thanks for the portal.

Since during the last occurrence I ran various attempts (upgrade/downgrade/restart/stepDown/etc.) to find the cause, I made another sample today (November 22nd) to remove any unnecessary noise and variables and make it easier for you to diagnose: I ran on fCV 4.4 again for 50 minutes:

MongoDB shell version v4.4.18
connecting to: mongodb://127.0.0.1:27017/?compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("6793e94f-1a1c-499a-b191-7b7f49cc2f9e") }
MongoDB server version: 4.4.18
 
# at ~09:59am UTC+1
rs0:PRIMARY> db.adminCommand( { setFeatureCompatibilityVersion: "4.4" } )
{
    "ok" : 1,
    "$clusterTime" : {
        "clusterTime" : Timestamp(1669107539, 617),
        "signature" : {
            "hash" : BinData(0,"rNHv4oCjKaSOmVpq1eSxqmnBILA="),
            "keyId" : NumberLong("7107894300460974477")
        }
    },
    "operationTime" : Timestamp(1669107539, 585)
}
 
# at ~10:51am UTC+1
rs0:PRIMARY> db.adminCommand( { setFeatureCompatibilityVersion: "4.2" } )
{
    "ok" : 1,
    "$clusterTime" : {
        "clusterTime" : Timestamp(1669110689, 656),
        "signature" : {
            "hash" : BinData(0,"USFHyiDw8k9gEs7+HJb9gsvWMBc="),
            "keyId" : NumberLong("7107894300460974477")
        }
    },
    "operationTime" : Timestamp(1669110689, 636)
}
rs0:PRIMARY> 
bye

The diagnostic.data + logs archives are uploaded to the portal for the 3 nodes (PSS). You can also look at the first occurrence if you want (November 18th ~ 6pm UTC+1).
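(For anyone replaying this window from the diagnostic data: a node's effective fCV at any point can be confirmed with the standard getParameter form, run on each member.)

db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })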

Comment by Yuan Fang [ 21/Nov/22 ]

Hi bigbourin@gmail.com,

Thank you for your report. I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)

Regards,
Yuan

Comment by Adrien Jarthon [ 18/Nov/22 ]

Also, I'm not sure if it's related, but on the primary I did not notice any disk IO change; I did however notice a small CPU increase, and looking at the charts I saw a huge jump in the number of (instrumented, at least) getMore operations:

This number also goes back down when I go back to fCV 4.2, as you can see above. So is fCV 4.4 maybe also changing something in the default batchSize? Or does it start instrumenting some getMore operations which were not instrumented before? The application is always doing the same thing and has a very stable number of op/s, as you can see.
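(A quick way to quantify that jump on the primary, independent of the charts, is to sample the standard opcounters.getmore counter from serverStatus over a fixed interval; the 10-second window below is arbitrary.)

// rough getMore/s on the primary
var g1 = db.serverStatus().opcounters.getmore;
sleep(10 * 1000);  // 10 seconds
var g2 = db.serverStatus().opcounters.getmore;
print('getMore/s: ' + ((g2 - g1) / 10));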
