[SERVER-35219] Regain MongoDB balancer performance with sessions Created: 25/May/18  Updated: 29/Oct/23  Resolved: 20/Mar/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.5
Fix Version/s: 3.6.12, 4.0.8, 4.1.10

Type: Improvement Priority: Major - P3
Reporter: Adun Assignee: Randolph Tan
Resolution: Fixed Votes: 2
Labels: bkp
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File chart-2.png     PNG File chart.png    
Issue Links:
  depends on SERVER-38874 Add ability to observe entering criti... Closed
  related to SERVER-40187 Remove waitsForNewOplog response fiel... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.0, v3.6
Sprint: Sharding 2018-07-16, Sharding 2018-12-17, Sharding 2018-12-31, Sharding 2019-01-14, Sharding 2019-01-28, Sharding 2019-02-11, Sharding 2019-02-25, Sharding 2019-03-11, Sharding 2019-03-25

 Description   
 Original Summary

MongoDB 3.6 Balancer much slower than 3.4

Original Description

I ran a balancer test comparing 3.6 and 3.4, and found that balancing on 3.6 is much slower than on 3.4.

I inserted 100 million documents into both 3.6 and 3.4; each document is 2 KB.

The collection was initialized as below:

db.runCommand({shardCollection: "ycsb.test", key: {_id: "hashed"}, numInitialChunks: 6500}) 

Test Results:

  1. 3.4.15 with MMAPv1 engine:
    1. from 1 shard to 2 shards: took 41866 seconds; after balancing, 3250 chunks per shard.
    2. from 2 shards to 4 shards: took 30630 seconds; after balancing, 1625 chunks per shard.
  2. 3.6.5 with MMAPv1 engine:
    1. from 1 shard to 2 shards: took 90200 seconds; after balancing, 3250 chunks per shard.
    2. from 2 shards to 4 shards: took 44679 seconds; after balancing, 1625 chunks per shard.
  3. 3.4.15 with wiredTiger engine:
    1. from 1 shard to 2 shards: took 35635 seconds; after balancing, 3250 chunks per shard.
    2. from 2 shards to 4 shards: took 10740 seconds; after balancing, 1625 chunks per shard.
  4. 3.6.5 with wiredTiger engine:
    1. from 1 shard to 2 shards: took 49762 seconds; after balancing, 3250 chunks per shard.
    2. from 2 shards to 4 shards: took 18961 seconds; after balancing, 1625 chunks per shard.

MongoDB configuration for MMAPv1 engine:

security:
  authorization: disabled
sharding:
  clusterRole: shardsvr
replication:
  replSetName: rs1
systemLog:
  logAppend: true
  destination: file
  path: /home/adun/3.4/log/mongod.log

processManagement:
  fork: true
  pidFilePath: /home/adun/3.4/log/mongod.pid

net:
  port: 27017
  bindIp: 127.0.0.1,192.168.10.31
  maxIncomingConnections: 65536

storage:
  dbPath: /home/adun/3.4/data
  directoryPerDB: true
  engine: mmapv1

MongoDB configuration for wiredTiger engine:

security:
  authorization: disabled
sharding:
  clusterRole: shardsvr
replication:
  replSetName: rs1
systemLog:
  logAppend: true
  destination: file
  path: /home/adun/3.4/log/mongod.log

processManagement:
  fork: true
  pidFilePath: /home/adun/3.4/log/mongod.pid

net:
  port: 27017
  bindIp: 127.0.0.1,192.168.10.31
  maxIncomingConnections: 65536

storage:
  dbPath: /home/adun/3.4/data
  directoryPerDB: true
  engine: wiredTiger

  wiredTiger:
    engineConfig:
      cacheSizeGB: 32
      directoryForIndexes: true

Other Questions:

  1. If we set numInitialChunks to a small value, such as 100, MongoDB will create/split chunks by itself, but even when we insert the same data and the same number of records, 3.6 creates about 10% more chunks than 3.4 (both MMAPv1 and wiredTiger).


 Comments   
Comment by Githook User [ 22/Mar/19 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-35219 Change the sleep on the destination side into a cond var wait on the donor side of session migration.

(cherry picked from commit 6d774652650dff718a8fa89c2bc845c3b11aa051)
Branch: v4.0
https://github.com/mongodb/mongo/commit/2b576d56ab6ac150ce7b1a5b0f592ffdcca105e9

Comment by Githook User [ 21/Mar/19 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-35219 Change the sleep on the destination side into a cond var wait on the donor side of session migration.

(cherry picked from commit 6d774652650dff718a8fa89c2bc845c3b11aa051)
Branch: v3.6
https://github.com/mongodb/mongo/commit/979d456ccf4e6756e433799a626ca373a493fc8a

Comment by Githook User [ 20/Mar/19 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-35219 Change the sleep on the destination side into a cond var wait on the donor side of session migration.
Branch: master
https://github.com/mongodb/mongo/commit/6d774652650dff718a8fa89c2bc845c3b11aa051

Comment by Kaloian Manassiev [ 14/Jan/19 ]

This plan SGTM, but I think the trick will be in the implementation details.

A minor detail is that enterCriticalSection can also be counted as a producer, since it needs to cause top/pop to return the EOF token.

One thing to be mindful of is that this needs to be backported to 3.6, and I believe the futures code is not there. I am not sure how easy it would be to backport, so it might be simpler to use Notification<void/OpTime> instead (CC redbeard0531).

You will also add unit-tests for the enterCriticalSection scenario, right?
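
For illustration, the Notification<void/OpTime> alternative mentioned above can be approximated with just a mutex and a condition variable. The sketch below is a generic standard-library stand-in for a set-once, waitable value (it is not the server's actual utility class, and it only handles value types, not void):

#include <condition_variable>
#include <mutex>
#include <optional>

// Set-once notification that a waiter can block on; illustrative stand-in only.
template <typename T>
class SimpleNotification {
public:
    void set(T value) {
        std::lock_guard<std::mutex> lk(_mutex);
        _value = std::move(value);
        _cv.notify_all();
    }

    // Blocks until set() has been called, then returns the stored value.
    T get() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [&] { return _value.has_value(); });
        return *_value;
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    std::optional<T> _value;
};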

Comment by Misha Tyulenev [ 10/Jan/19 ]

Capturing an offline discussion with kaloian.manassiev. The identified producer and consumer run on separate threads:

  • The producer is notifyNewWriteOpTime, which comes from the opLogObserver - i.e., any new write to the oplog, coming from multiple threads.
  • The consumer is the _getNextSessionMods command, which runs on the runCommand thread and consumes the buffer of opTimes filled by the producer by fetching the oplog entries matching those opTimes.
  • The MigrationSourceManager, which also runs on its own thread, initiates enterCriticalSection, which blocks all new writes.

The idea is to encapsulate all the producer/consumer processing in an object that has the following API (a sketch follows at the end of this comment):

top()
pop()
push()
enterCriticalSection()

It looks a lot like the producer/consumer queue already implemented in util; the major difference is that it will return a set future, an unset future, or a future set with none to indicate that no more data will be written, as outlined in the previous comments.

top() does a little more work, as it needs to merge the data from the transactions table with the data written to the oplog.

The unset future needs to bubble up all the way, so the caller can wait on it outside the collection lock that is taken for the write conflict retry loop.
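
To make the shape of that API concrete, here is a minimal standard-library sketch of such a queue, assuming a hypothetical SessionOplogEntry payload type. The real design returns set/unset futures so the caller can wait outside the collection lock, and its top() also merges transactions-table data; this sketch instead blocks directly on a condition variable (which is essentially what the eventual fix does on the donor side) to keep it short:

#include <condition_variable>
#include <deque>
#include <mutex>
#include <optional>
#include <string>

// Hypothetical stand-in for the session oplog data produced by
// notifyNewWriteOpTime; not a real server type.
struct SessionOplogEntry {
    std::string payload;
};

// Multi-producer/single-consumer queue along the lines sketched above:
// push() is the producer side, top()/pop() the consumer side, and
// enterCriticalSection() also acts as a producer in that it makes top()
// return an EOF token (here: std::nullopt) once the queue drains.
class SessionOplogQueue {
public:
    void push(SessionOplogEntry entry) {
        std::lock_guard<std::mutex> lk(_mutex);
        _queue.push_back(std::move(entry));
        _cv.notify_one();
    }

    void enterCriticalSection() {
        std::lock_guard<std::mutex> lk(_mutex);
        _noMoreProducers = true;
        _cv.notify_all();
    }

    // Waits on the condition variable until data is available or the critical
    // section has been entered and the queue is drained; std::nullopt is the
    // EOF token.
    std::optional<SessionOplogEntry> top() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [&] { return !_queue.empty() || _noMoreProducers; });
        if (_queue.empty())
            return std::nullopt;
        return _queue.front();
    }

    void pop() {
        std::lock_guard<std::mutex> lk(_mutex);
        if (!_queue.empty())
            _queue.pop_front();
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    std::deque<SessionOplogEntry> _queue;
    bool _noMoreProducers = false;
};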

Comment by Kaloian Manassiev [ 09/Jan/19 ]

> 1. Please clarify how to indicate to waitForMajority on the returned data. opLogResult has the boolean indicating whether it needs to be waited on or not.

The contract of the _getNextSessionMods command (which is what fetches entries from nextSessionMigrationBatch) is that it must not return non-majority-committed session entries.

The easiest would be to wait on the maximum optime from the entries in the populated arrBuilder, but since it is not optimal to re-scan it every time, I'd say there could just be a second output parameter to nextSessionMigrationBatch, which will contain the maximum opTime in that array.

> 2. Not sure what structure you mean here; if it's PromiseAndFuture, it cannot be exposed and is not supposed to be held.

This is the concept that I was imagining. The SessionCatalogSource is essentially a fairly simple multi-producer/single-consumer queue, and its only complication comes from the fact that waiting cannot be done under the collection lock.

So conceptually, its contract becomes "if I can give you data, I will give it; otherwise I will give you a future to wait on outside of the lock so you can call me again". The only time data may not be available is if no retryable writes have run.

Does that make sense?

> 3. I like this idea, but to get the benefit of futurizing, the following changes seem to me in line with future use:

I am not sure I understand that; let's talk in person.

> 5. Depends on how nextSessionMigrationBatch is implemented - it would be great to avoid converting Future<boost::optional<OplogEntry>> into Future<bool>.

The SessionCatalogMigrationSource and the MigrationChunkClonerSourceLegacy have different interfaces. The former gives you one session entry (OplogEntry) at a time, while the latter fills a whole buffer. There is no way to avoid converting them, but I don't see an issue with that either.
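
For illustration, here is a rough sketch of the "second output parameter" idea from the first answer above, using standard-library stand-ins; the OpTime struct and the buildSessionMigrationBatch helper are illustrative, not the server's real types or signatures:

#include <optional>
#include <tuple>
#include <vector>

// Hypothetical stand-in for an oplog optime, comparable by (term, timestamp).
struct OpTime {
    long long term = 0;
    unsigned long long timestamp = 0;
    bool operator<(const OpTime& other) const {
        return std::tie(term, timestamp) < std::tie(other.term, other.timestamp);
    }
};

// Hypothetical session oplog entry carrying its optime; payload omitted.
struct SessionOplogEntry {
    OpTime opTime;
};

// Fill the outgoing batch and report the maximum optime seen, so the caller
// can wait for majority on that single optime instead of re-scanning the
// populated array.
std::optional<OpTime> buildSessionMigrationBatch(
    const std::vector<SessionOplogEntry>& pending,
    std::vector<SessionOplogEntry>* outBatch) {
    std::optional<OpTime> maxOpTime;
    for (const auto& entry : pending) {
        outBatch->push_back(entry);
        if (!maxOpTime || *maxOpTime < entry.opTime) {
            maxOpTime = entry.opTime;
        }
    }
    return maxOpTime;  // caller waits for majority commit of *maxOpTime before replying
}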

Comment by Misha Tyulenev [ 08/Jan/19 ]

Thank you for the proposal, kaloian.manassiev.
Here are my thoughts; could you please clarify?
1. Please clarify how to indicate to waitForMajority on the returned data. opLogResult has the boolean indicating whether it needs to be waited on or not.
2. Not sure what structure you mean here; if it's PromiseAndFuture, it cannot be exposed and is not supposed to be held.
3. I like this idea, but to get the benefit of futurizing, the following changes seem to me in line with future use:

a) MigrationChunkClonerSourceLegacy::nextSessionMigrationBatch has to be changed from the current model of synchronous iteration to:

while (true) {
    auto future = _sessionCatalogSource->fetchNextFutureOplog();
    if (future.isReady()) {
        // add the data to the buffer, or return
    } else {
        return future;
    }
}

b) fetchNextFutureOplog(), or in your suggestion the code that sets _lastReturnedFuture, has to differentiate between the writeOpLog buffer and the sessionOpLog data.
Can it be both possible futures?
4. and 6. Agree, and it eliminates the need for SERVER-38874.
5. Depends on how nextSessionMigrationBatch is implemented - it would be great to avoid converting Future<boost::optional<OplogEntry>> into Future<bool>.
Thanks!

Comment by Kaloian Manassiev [ 07/Jan/19 ]

In terms of implementation, I propose the following:

  1. Get rid of SessionCatalogMigrationSource::OplogResult; since OplogEntry is self-describing and contains its optime, use it directly instead.
  2. Add a boost::optional<SharedFutureAndPromise<OplogResult>> _lastReturnedFuture to SessionCatalogMigrationSource.
  3. Change SessionCatalogMigrationSource::getLastFetchedOplog to return Future<boost::optional<OplogEntry>>, where the return values mean the following:
    • Set future with OplogEntry means "append that entry to the out buffer"
    • Set future with boost::none means "no more session oplog entries need to be migrated" (this can only happen if onCommitStarted was called)
    • Unset future means the caller needs to block waiting to see what to do next
    • Exception thrown means abandon migration
  4. Add a function called SessionCatalogMigrationSource::onCommitStarted, which will be called from MigrationChunkClonerSourceLegacy::commitClone. It will indicate that SessionCatalogMigrationSource::notifyNewWriteOpTime will no longer be called, that SessionCatalogMigrationSource::getLastFetchedOplog must return boost::none instead of unset futures if it doesn't find anything in the buffer, and that it should set the _lastReturnedFuture promise to boost::none.
  5. Change MigrationChunkClonerSourceLegacy::nextSessionMigrationBatch to return Future<bool> where the return values mean the following:
    • Set future with "true" means something was put in the buffer and there is more data, so nextSessionMigrationBatch can be called again and will put more stuff in the buffer
    • Set future with "false" means nothing was available to put in the buffer and no more session oplog entries need to be migrated
    • Unset future means the caller needs to drop locks and block waiting to see what to do next
  6. Change MigrationChunkClonerSourceLegacy::commitClone to call SessionCatalogMigrationSource::onCommitStarted.
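
For illustration, here is a standard-library sketch of how a caller could interpret the three-way contract from item 3; the SessionSourceStandIn type and consumeOne helper are hypothetical, std::future stands in for the server's Future type, and the exception case ("abandon migration") is not handled:

#include <chrono>
#include <future>
#include <optional>
#include <vector>

// Hypothetical payload type; illustrative only.
struct OplogEntry {};

// Stand-in for the source described above. The trivial body just returns a
// ready, empty future so the sketch compiles.
class SessionSourceStandIn {
public:
    std::future<std::optional<OplogEntry>> getLastFetchedOplog() {
        std::promise<std::optional<OplogEntry>> p;
        p.set_value(std::nullopt);
        return p.get_future();
    }
};

// Consumer-side interpretation of the three-way contract:
//   * ready future holding an entry -> append it to the out buffer
//   * ready future holding none     -> no more session entries to migrate
//   * not-yet-ready future          -> caller must drop locks and wait on it
enum class NextAction { kAppended, kDone, kWait };

NextAction consumeOne(SessionSourceStandIn& source,
                      std::vector<OplogEntry>* outBuffer,
                      std::future<std::optional<OplogEntry>>* pendingFuture) {
    auto future = source.getLastFetchedOplog();
    if (future.wait_for(std::chrono::seconds(0)) != std::future_status::ready) {
        *pendingFuture = std::move(future);  // wait on it outside the lock
        return NextAction::kWait;
    }
    auto entry = future.get();
    if (!entry) {
        return NextAction::kDone;  // onCommitStarted was called and the buffer is drained
    }
    outBuffer->push_back(std::move(*entry));
    return NextAction::kAppended;
}
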
Comment by Kaloian Manassiev [ 07/Dec/18 ]

I agree - they can go away in 4.2 and the loop on the recipient side can be simplified since it doesn't need to expect or send them anymore, nor does it need to sleep.

Please don't forget to file a ticket.

Comment by Misha Tyulenev [ 07/Dec/18 ]

Yes, thanks for the detailed description, kaloian.manassiev. I will only add that those parameters must be marked as transitory and should be removed in the version following 4.2.

Comment by Kaloian Manassiev [ 07/Dec/18 ]

Steps 1-3 sound good to me. I made a couple of small clarifications on 2: the command will become blocking, and its protocol will become "call it until it returns no results (EOF) or until it fails with anything other than TimeLimitExceeded" (if the caller specified a maxTimeMS, which I think is prudent to do).

For Step 4 - the (a) and (b) variants are actually both needed; they are not two different solutions as far as I can see. Let me try to clarify what you described to make sure we are on the same page.
The _getNextSessionMods command will change like this:

  • Introduce a new optional boolean parameter called, say, waitForData, which when specified will request the "long-poll" behaviour described in Step 2; if it is set to false or not specified, the command will behave as it does currently and return empty results.
  • Make the _getNextSessionMods command return a boolean waited field (or done as you proposed, which is the inverted value), which will be set if the command blocked waiting for results.
  • The new-version recipient will call _getNextSessionMods with waitForData: true and will sleep here only if the waited field is missing from the response.
  • The new-version donor, when executing the _getNextSessionMods command, will only block if waitForData: true - otherwise it will behave exactly as it does today.

That way the following combinations of partial upgrade will continue to work seamlessly:

  • Old donor, new recipient - the old donor will not include the waited field, so the new recipient will back off like it does today.
  • New donor, old recipient - the old recipient will not specify the waitForData field, so the new donor will return immediately.
  • New donor, new recipient - the recipient will specify waitForData: true and the donor will sleep, returning waited: true, which will cause the recipient to not sleep.

Does this match your understanding?
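
For illustration, here is a rough donor-side sketch of the waitForData / waited handshake described in this comment, using plain standard-library primitives rather than the server's command machinery; the request/response structs and the class below are hypothetical:

#include <chrono>
#include <condition_variable>
#include <mutex>

// Illustrative request/response shapes; not the server's real structures.
struct GetNextSessionModsRequest {
    bool waitForData = false;  // new optional parameter; absent behaves as false
};

struct GetNextSessionModsResponse {
    bool waited = false;  // only a new-version donor sets this field
    // ...session oplog entries omitted...
};

// Donor-side handling: long-poll only when the (new-version) recipient asked
// for it, and report that the waiting already happened.
class SessionModsSourceStandIn {
public:
    void notifyNewData() {
        std::lock_guard<std::mutex> lk(_mutex);
        _hasData = true;
        _cv.notify_all();
    }

    GetNextSessionModsResponse handleRequest(const GetNextSessionModsRequest& req,
                                             std::chrono::milliseconds maxTime) {
        GetNextSessionModsResponse response;
        std::unique_lock<std::mutex> lk(_mutex);
        if (req.waitForData && !_hasData) {
            // New-version donor: block until data arrives or the maxTimeMS-style
            // deadline expires, then tell the recipient it has already waited.
            _cv.wait_for(lk, maxTime, [&] { return _hasData; });
            response.waited = true;
        }
        // Old-version recipients never set waitForData, so they get the old,
        // immediately-returning behaviour (and keep their own back-off sleep).
        // ...fill the response with whatever entries are available...
        return response;
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _hasData = false;
};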

Comment by Misha Tyulenev [ 06/Dec/18 ]

Here is a breakdown of the proposed approach into finer steps, per a discussion with kaloian.manassiev:

1. Sleep no more on the recipient:
Remove the back-off sleep code. Instead, the recipient should wait on the donor to execute the _getNextSessionMods command, which will be made blocking.

On the donor, execution of the _getNextSessionMods command follows this path:

  • It is wrapped in the writeConflictRetry loop
  • It iterates while SessionCatalogMigrationSource::hasMoreOplog() and breaks out of the loop if the size of the result exceeds the limit.

2. Make the _getNextSessionMods command blocking:
In order to wait on the donor, it would be sufficient to return a future that can be waited on in _getNextSessionMods's run method.

while (auto future = autoCloner.getCloner()->nextSessionMigrationBatch()) {
    future->wait();
}

3. This loop should be modified to return the future.

4. Lastly, multi-version shards must support this protocol. There are two ways to solve it:
 a. by signalling that no more waiting is needed by returning 0
 b. by using an extra parameter _done in the command, as in SERVER-32886

Comment by Kaloian Manassiev [ 30/Nov/18 ]

While looking at the retryable writes migration code for an unrelated reason, I noticed this backoff sleep and I am pretty sure this is what is contributing to the sometimes 400ms stall during the migration critical section.

As currently implemented, I think there is likely to be at least one 200 msec stall at transaction commit, because we need to drain the oplog at least once after the recipient enters the commit phase. However, in pathological situations the donor could return an empty batch just before the commit phase is entered; in this case there will be one 200 msec wait before entering the commit phase and one 200 msec wait after entering it.

And I think that in the current implementation, not having any retryable writes running would actually make it worse, because that is when the back-off gets activated, since the batches will always be empty.

renctan, can you confirm my hypothesis above and whether the proposed solution makes sense?

Proposed fix
Remove the back-off sleep from the recipient side, make the _getNextSessionMods command use long-polling instead, and make it the donor's responsibility to decide when end-of-stream is reached (which would be when there are no more oplog entries to return after entering the critical section).

Basically the donor will block the _getNextSessionMods call until there are oplog entries available to return. An empty return from that command should indicate "no more entries to migrate".
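
To illustrate the proposed protocol from the recipient's point of view, here is a minimal sketch with hypothetical helpers; fetchNextSessionModsBatch and applySessionOplogEntries are illustrative stand-ins, not the server's real functions:

#include <vector>

// Hypothetical payload type; illustrative only.
struct SessionOplogEntry {};

// Trivial stand-ins so the sketch compiles; the real versions would issue the
// _getNextSessionMods command against the donor and apply the returned
// entries on the recipient.
std::vector<SessionOplogEntry> fetchNextSessionModsBatch() {
    return {};
}
void applySessionOplogEntries(const std::vector<SessionOplogEntry>&) {}

// Recipient-side loop under the proposed long-poll protocol: no back-off
// sleep is needed, because the donor itself blocks until entries are
// available, and an empty batch means "no more entries to migrate".
void migrateSessionEntries() {
    while (true) {
        auto batch = fetchNextSessionModsBatch();  // donor long-polls internally
        if (batch.empty()) {
            break;  // end-of-stream: donor drained after entering the critical section
        }
        applySessionOplogEntries(batch);
    }
}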

Comment by Sarah Zhou [ 27/Jul/18 ]

Hi stutiredboy@gmail.com

Thanks for the report! The introduction of https://docs.mongodb.com/manual/reference/server-sessions/ and https://docs.mongodb.com/manual/core/retryable-writes/ in 3.6 added overhead to the migration process, which as a result slowed down the balancer. We ran tests comparing moveChunk in 3.4 and 3.6 (the balancer issues moveChunk commands during balancing), and I've attached two graphs from our findings which illustrate specifically in which steps moveChunk became slower. We will continue to investigate this decrease in performance and keep you updated.
