[SERVER-35219] Regain MongoDB balancer performance with sessions Created: 25/May/18 Updated: 29/Oct/23 Resolved: 20/Mar/19
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.6.5 |
| Fix Version/s: | 3.6.12, 4.0.8, 4.1.10 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Adun | Assignee: | Randolph Tan |
| Resolution: | Fixed | Votes: | 2 |
| Labels: | bkp |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.0, v3.6 |
| Sprint: | Sharding 2018-07-16, Sharding 2018-12-17, Sharding 2018-12-31, Sharding 2019-01-14, Sharding 2019-01-28, Sharding 2019-02-11, Sharding 2019-02-25, Sharding 2019-03-11, Sharding 2019-03-25 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 0 |
| Description |
| Comments |
| Comment by Githook User [ 22/Mar/19 ] |
Author: Randolph Tan (renctan) <randolph@10gen.com>
Message: (cherry picked from commit 6d774652650dff718a8fa89c2bc845c3b11aa051)
| Comment by Githook User [ 21/Mar/19 ] |
Author: Randolph Tan (renctan) <randolph@10gen.com>
Message: (cherry picked from commit 6d774652650dff718a8fa89c2bc845c3b11aa051)
| Comment by Githook User [ 20/Mar/19 ] |
Author: Randolph Tan (renctan) <randolph@10gen.com>
Message:
| Comment by Kaloian Manassiev [ 14/Jan/19 ] |
This plan SGTM, but I think the trick will be in the implementation details. A minor detail is that enterCriticalSection can also be counted as a producer, since it needs to cause top/pop to return the EOF token. One thing to be mindful of is that this needs to be backported to 3.6, and I believe the futures code is not there. I am not sure how easy it would be to backport, so it might be simpler to use Notification<void/OpTime> instead (CC redbeard0531). You will also add unit tests for the enterCriticalSection scenario, right?
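Since this comes down to a backport-friendly signalling primitive, here is a minimal sketch of the Notification-style alternative, using standard C++ in place of the server's own Notification type; every class and member name below is hypothetical, not the actual 3.6 code.

```cpp
#include <condition_variable>
#include <mutex>

// Stand-in for a Notification<void>-like primitive: producers call set() once,
// consumers block in wait() until that happens.
class SimpleNotification {
public:
    void set() {
        std::lock_guard<std::mutex> lk(_mutex);
        _set = true;
        _cv.notify_all();
    }

    void wait() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [&] { return _set; });
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _set = false;
};

// In this scheme enterCriticalSection() acts as one more producer: it marks
// the source as finished and calls set() on the outstanding notification, so
// a consumer blocked waiting for data wakes up and then observes the EOF
// token from top()/pop().
```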
| Comment by Misha Tyulenev [ 10/Jan/19 ] |
Capturing an offline discussion with kaloian.manassiev.
The idea is to encapsulate all of the producer/consumer processing in an object with an API along the lines of the sketch below.
It looks a lot like the producer/consumer queue already implemented in util; the major differences are that it can return a future, and that top() does a little more work, since it needs to merge the data from the transactions table with the data written to the oplog. The unset future needs to bubble up all the way, so the caller can wait on it outside the collection lock that is taken for the write-error retry loop.
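A minimal sketch of that contract, under the assumption that futures are available (standard C++ futures stand in for the server's types, and every name here is hypothetical): fetchNextOplog() never blocks under the collection lock; when nothing is ready it hands back a future that the caller waits on after releasing the lock, then calls in again.

```cpp
#include <future>
#include <mutex>
#include <optional>
#include <queue>
#include <string>

using SessionOplogEntry = std::string;  // stand-in for repl::OplogEntry

struct FetchResult {
    std::optional<SessionOplogEntry> entry;           // set if an entry was ready
    std::optional<std::shared_future<void>> waitFor;  // set if the caller must wait, outside the lock
    bool eof = false;                                 // no more entries will ever arrive
};

class SessionCatalogSource {
public:
    // Called under the collection lock; never blocks.
    FetchResult fetchNextOplog() {
        std::lock_guard<std::mutex> lk(_mutex);
        FetchResult result;
        if (!_ready.empty()) {
            result.entry = std::move(_ready.front());
            _ready.pop();
            return result;
        }
        if (_eof) {
            result.eof = true;
            return result;
        }
        // Nothing ready yet: hand out a future to wait on after the lock is released.
        if (!_pendingFuture) {
            _pendingPromise.emplace();
            _pendingFuture = _pendingPromise->get_future().share();
        }
        result.waitFor = _pendingFuture;
        return result;
    }

    // Producer: a new retryable-write oplog entry became available.
    void notifyNewEntry(SessionOplogEntry entry) {
        std::lock_guard<std::mutex> lk(_mutex);
        _ready.push(std::move(entry));
        _signal();
    }

    // Producer: the migration entered the critical section, so the stream ends.
    void onCommitCloneStarted() {
        std::lock_guard<std::mutex> lk(_mutex);
        _eof = true;
        _signal();
    }

private:
    void _signal() {
        if (_pendingPromise) {
            _pendingPromise->set_value();
            _pendingPromise.reset();
            _pendingFuture.reset();
        }
    }

    std::mutex _mutex;
    std::queue<SessionOplogEntry> _ready;
    std::optional<std::promise<void>> _pendingPromise;
    std::optional<std::shared_future<void>> _pendingFuture;
    bool _eof = false;
};
```

A consumer loop would then be: take the lock, call fetchNextOplog(); if waitFor is set, drop the lock, wait on the future, and repeat; if eof is set, stop.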
| Comment by Kaloian Manassiev [ 09/Jan/19 ] |
The contract of the _getNextSessionMods command (which is what fetches entries from nextSessionMigrationBatch) is that it must not return session entries that are not majority committed. The easiest approach would be to wait on the maximum opTime among the entries in the populated arrBuilder, but since re-scanning the array every time is not optimal, there could simply be a second output parameter to nextSessionMigrationBatch that reports the maximum opTime in that array (see the sketch at the end of this comment).
This is the concept that I was imagining. The SessionCatalogSource is essentially a fairly simple multi-producer/single-consumer queue, and its only complication comes from the fact that waiting cannot be done under the collection lock. So conceptually, its contract becomes "if I can give you data, I will; otherwise I will give you a future to wait on outside of the lock, so you can call me again". The only time data may not be available is if there aren't any retryable writes that ran. Does that make sense?
I am not sure I understand that; let's talk in person.
The SessionCatalogMigrationSource and the MigrationChunkClonerSourceLegacy have different interfaces. The former gives you one session entry (OplogEntry) at a time, while the latter fills a whole buffer. There is no way to avoid converting between them, but I don't see an issue with that either.
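A small illustration of the second-output-parameter idea (hypothetical names and types, not the actual nextSessionMigrationBatch signature): the maximum opTime is tracked while the batch is filled, so the caller can wait for that single opTime to become majority committed instead of re-scanning the finished array.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Stand-ins for the server's types; only the shape matters for this sketch.
struct OpTime {
    uint64_t ts = 0;
    int64_t term = -1;
    bool operator<(const OpTime& other) const {
        return ts != other.ts ? ts < other.ts : term < other.term;
    }
};

struct SessionOplogEntry {
    std::string doc;  // stand-in for the BSON oplog document
    OpTime opTime;
};

// Fills 'batch' from 'pending' (up to maxBatchSize entries) and reports the
// newest opTime that went into the batch through 'maxOpTimeInBatch', so the
// caller does not have to re-scan the array afterwards.
void nextSessionMigrationBatch(std::vector<SessionOplogEntry>& pending,
                               size_t maxBatchSize,
                               std::vector<std::string>* batch,
                               OpTime* maxOpTimeInBatch) {
    size_t taken = 0;
    while (taken < pending.size() && batch->size() < maxBatchSize) {
        const auto& entry = pending[taken];
        *maxOpTimeInBatch = std::max(*maxOpTimeInBatch, entry.opTime);
        batch->push_back(entry.doc);
        ++taken;
    }
    pending.erase(pending.begin(), pending.begin() + taken);
    // The caller then waits for *maxOpTimeInBatch to become majority
    // committed (e.g. through the replication coordinator) before returning
    // the batch to the recipient.
}
```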
| Comment by Misha Tyulenev [ 08/Jan/19 ] |
Thank you for the proposal, kaloian.manassiev.
a) MigrationChunkClonerSourceLegacy::nextSessionMigrationBatch has to be changed from the current model of synchronous iteration to one that can hand back a future to wait on.
b) fetchNextFutureOplog(), or in your suggestion the code that sets _lastReturnedFuture, has to differentiate between the write oplog buffer and the session oplog data (see the sketch below).
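Purely as an illustration of point (b), and assuming the two sources are a buffer of newly written oplog entries plus the entries found by scanning the transactions table (all names hypothetical): whichever code produces the next result, or the future to wait on, checks both sources and only answers "wait" when both are drained.

```cpp
#include <optional>
#include <queue>
#include <string>

struct NextResult {
    std::optional<std::string> entry;  // a session oplog entry, if one is ready
    bool needToWait = false;           // caller must wait (outside the lock) and retry
};

class MergedSessionSource {
public:
    // Producer side: entries written to the oplog by in-flight retryable writes.
    void addNewWriteOplogEntry(std::string e) { _newWriteOplogBuffer.push(std::move(e)); }
    // Producer side: entries found while scanning the transactions table.
    void addSessionCatalogEntry(std::string e) { _sessionCatalogScan.push(std::move(e)); }

    NextResult fetchNextOplog() {
        // Prefer entries produced by in-flight retryable writes...
        if (!_newWriteOplogBuffer.empty())
            return takeFrom(_newWriteOplogBuffer);
        // ...then fall back to the transactions-table scan...
        if (!_sessionCatalogScan.empty())
            return takeFrom(_sessionCatalogScan);
        // ...and only when both are empty does the caller get a "wait" answer.
        return {std::nullopt, true};
    }

private:
    static NextResult takeFrom(std::queue<std::string>& q) {
        NextResult r;
        r.entry = std::move(q.front());
        q.pop();
        return r;
    }

    std::queue<std::string> _newWriteOplogBuffer;
    std::queue<std::string> _sessionCatalogScan;
};
```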
| Comment by Kaloian Manassiev [ 07/Jan/19 ] |
In terms of implementation, I propose the following:
| Comment by Kaloian Manassiev [ 07/Dec/18 ] |
I agree - they can go away in 4.2, and the loop on the recipient side can be simplified, since it doesn't need to expect or send them anymore, nor does it need to sleep. Please don't forget to file a ticket.
| Comment by Misha Tyulenev [ 07/Dec/18 ] |
Yes, thanks for the detailed description, kaloian.manassiev. I will only add that those parameters must be marked as transitory and should be removed in the version following 4.2.
| Comment by Kaloian Manassiev [ 07/Dec/18 ] |
Steps 1-3 sound good to me. I made a couple of small clarifications on step 2: the command will become blocking, and its protocol becomes "call it until it returns no results (EOF), or until it fails with anything other than TimeLimitExceeded" (if the caller specified a MaxTimeMS, which I think is prudent to do); see the caller-loop sketch at the end of this comment. For step 4, the (a) and (b) variants are actually both needed; they are not two different solutions as far as I can see. Let me try to clarify what you described to make sure we are on the same page:
That way the following combinations of partial upgrade will continue to work seamlessly:
Does this match your understanding?
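A sketch of the recipient-side loop implied by that protocol, with stand-in stubs instead of the real networking and apply paths (nothing here is the actual server code): keep calling until an empty batch (EOF) comes back, retry on TimeLimitExceeded, and treat any other error as fatal.

```cpp
#include <string>
#include <vector>

// Everything below is hypothetical: the real implementation issues
// _getNextSessionMods over the network and applies the returned oplog entries
// through the session catalog machinery.
enum class ErrorCode { OK, MaxTimeMSExpired, OtherError };

struct CommandResponse {
    ErrorCode code = ErrorCode::OK;
    std::vector<std::string> oplogBatch;  // returned session oplog entries
};

static std::vector<std::string> donorPending = {"entry-1", "entry-2"};

// Stub: pretend the donor returns everything it has, then an empty (EOF) batch.
CommandResponse runGetNextSessionMods(int /*maxTimeMS*/) {
    CommandResponse res;
    res.oplogBatch = donorPending;
    donorPending.clear();
    return res;
}

void applySessionOplogBatch(const std::vector<std::string>&) {}

// The caller-side protocol: no backoff sleep anywhere, because the donor
// blocks instead of returning empty batches early.
bool drainSessionOplogFromDonor() {
    const int kMaxTimeMS = 5000;  // a prudent upper bound, per the discussion
    while (true) {
        CommandResponse res = runGetNextSessionMods(kMaxTimeMS);
        if (res.code == ErrorCode::MaxTimeMSExpired)
            continue;  // donor had nothing yet within maxTimeMS; just call again
        if (res.code != ErrorCode::OK)
            return false;  // any other failure aborts the session transfer
        if (res.oplogBatch.empty())
            return true;  // EOF: no more session entries to migrate
        applySessionOplogBatch(res.oplogBatch);
    }
}
```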
| Comment by Misha Tyulenev [ 06/Dec/18 ] |
Here is the breakdown of the proposed approach into finer steps, per a discussion with kaloian.manassiev.
1. Sleep no more on the recipient: the execution of the _getNextSessionMods command on the donor follows this path:
2. Make the _getNextSessionMods command blocking:
3. This loop should be modified to return the future.
4. Lastly, mixed-version shards should support this protocol. There are two ways to solve it:
| Comment by Kaloian Manassiev [ 30/Nov/18 ] |
While looking at the retryable writes migration code for an unrelated reason, I noticed this backoff sleep, and I am pretty sure it is what contributes to the occasional 400 msec stall during the migration critical section. As currently implemented, I think there is likely to be at least one 200 msec stall at transaction commit, because we need to drain the oplog at least once after the recipient enters the commit phase. However, in pathological situations the donor could return an empty batch just before the commit phase is entered; in that case there will be one 200 msec wait before entering the commit phase and another 200 msec wait after entering it. And I think that in the current implementation, not having any retryable writes running actually makes it worse, because that is exactly when the backoff gets activated, since the batches will always be empty.
renctan, can you confirm my hypothesis above and whether the proposed solution makes sense?
Proposed fix: the donor will block the _getNextSessionMods call until there are oplog entries available to return. An empty return from that command should indicate "no more entries to migrate".
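A donor-side sketch of the proposed fix, using standard C++ primitives and hypothetical names rather than the real command implementation: the call blocks until at least one session oplog entry is available or the source is marked done, and an empty batch after that point means "no more entries to migrate".

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>
#include <vector>

class SessionOplogSource {
public:
    // Producer: a retryable write added a session oplog entry.
    void push(std::string entry) {
        std::lock_guard<std::mutex> lk(_mutex);
        _entries.push(std::move(entry));
        _cv.notify_all();
    }

    // Producer: no more entries will be generated (e.g. critical section entered).
    void markDone() {
        std::lock_guard<std::mutex> lk(_mutex);
        _done = true;
        _cv.notify_all();
    }

    // Blocks up to 'maxWait'. An empty result after markDone() means EOF; an
    // empty result before it means the deadline was hit and the caller may
    // simply call again.
    std::vector<std::string> getNextBatch(std::chrono::milliseconds maxWait) {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait_for(lk, maxWait, [&] { return _done || !_entries.empty(); });
        std::vector<std::string> batch;
        while (!_entries.empty()) {
            batch.push_back(std::move(_entries.front()));
            _entries.pop();
        }
        return batch;
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    std::queue<std::string> _entries;
    bool _done = false;
};
```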
| Comment by Sarah Zhou [ 27/Jul/18 ] |
Thanks for the report! The introduction of https://docs.mongodb.com/manual/reference/server-sessions/ and https://docs.mongodb.com/manual/core/retryable-writes/ in 3.6 added overhead to the migration process, which in turn slowed down the balancer. We ran tests comparing moveChunk in 3.4 and 3.6 (the balancer issues moveChunk commands as part of balancing), and I've attached two graphs from our findings that illustrate in which steps specifically moveChunk got slower. We will continue to investigate this performance decrease and keep you updated.