[SERVER-45565] Ensure ops inside of a >16 MB transaction with a command do not get batched with ops outside of the transaction
Created: 14/Jan/20 | Updated: 29/Oct/23 | Resolved: 04/Mar/20

| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.4.0-rc0 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Maria van Keulen | Assignee: | Andrew Chen |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.4 |
| Sprint: | Execution Team 2020-02-24, Execution Team 2020-03-09 |
| Participants: | |
| Linked BF Score: | 45 |
| Description |
|
This ticket came from a conversation about …
| Comments |
| Comment by Githook User [ 06/Mar/20 ] |
|
Author: Andrew Chen <andrew.chen@10gen.com>
Message: create mode 100644 jstests/noPassthrough/large_txn_correctness.js
(cherry picked from commit 9557a35b779a5f6c1a2453514ec9cc67119b288b)
| Comment by Githook User [ 04/Mar/20 ] |
|
Author: Andrew Chen <andrew.chen@10gen.com>
Message: create mode 100644 jstests/noPassthrough/large_txn_correctness.js
| Comment by Andy Schwerin [ 25/Feb/20 ] |
|
Thanks, matthew.russotto. geert.bosch, I'm convinced. |
| Comment by Matthew Russotto [ 25/Feb/20 ] |
|
I think geert.bosch's proposed solution is probably the simplest, and it is correct at least for createCollection. If we were to allow dropCollection, we'd need to process the entire batch individually (avoiding batching with oplog entries that come before the transaction as well). To be consistent with under-16 MB transactions that contain commands, I think adding multi-operation transaction commits to mustProcessIndividually is probably the better approach; see the sketch below.

As for performance, I don't know how bad the effect of essentially reducing the batch size from 100 MB to 16 MB would be for strings of just-over-16 MB transactions. It might be worth a test.

I think the underlying problem is that the batcher used to be able to know the batch constraints, and with multi-document transactions it no longer can. We could in principle fix this by reading the entire transaction in the batcher, applying the batch limits and rules there, and passing the result to the applier. That is not difficult in principle, but I think it gets complicated in the code, and it is probably not worth it unless the performance issues are significant.
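For illustration, here is a minimal sketch of the boundary rule described above, using hypothetical type and field names rather than the server's actual OplogEntry or batcher API:

```cpp
// Hypothetical sketch only -- these names do not come from the server code
// base. It illustrates the proposed rule: treat the commit of a
// multi-oplog-entry (>16 MB) transaction like a command, so it is never
// grouped with unrelated oplog entries.
struct OplogEntrySketch {
    bool isCommand = false;            // e.g. createCollection inside the txn
    bool isTransactionCommit = false;  // commitTransaction / final applyOps
    bool spansMultipleEntries = false; // true for a >16 MB transaction chain
};

// Entries for which the applier must draw a batch boundary on both sides.
bool mustProcessIndividually(const OplogEntrySketch& entry) {
    if (entry.isCommand) {
        return true;  // commands already get a batch of their own
    }
    // Proposed addition: a multi-operation transaction commit behaves the
    // same way, so its applyOps chain cannot be batched with ops that sit
    // before or after the transaction in the oplog.
    return entry.isTransactionCommit && entry.spansMultipleEntries;
}
```

Under a predicate like this, a command executed inside a >16 MB transaction can never be grouped with ops outside the transaction, which is the correctness property the ticket asks for.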
| Comment by Siyuan Zhou [ 24/Feb/20 ] |
|
matthew.russotto, could you please help evaluate the proposal for correctness and performance concerns? Matthew implemented the batching logic for large transactions.
| Comment by Geert Bosch [ 21/Feb/20 ] |
|
If a customer has a workload with many >16 MB transactions, there is a good chance that multiple commits happen in a single batch, inflating the batch size beyond the intended limits, with a likelihood of substandard performance or even failure on smaller systems. In the failure case, restarting may well retry the exact same scenario, with more failures as a result. So, when large transactions are common, we need to avoid overly large batches to avoid thrashing and crashing.

Even at a high oplog write rate of 250 MB/sec, we'd only have about 15 batches per second, so the effect of the fixed per-batch processing cost on overall performance should be modest. If >16 MB transactions are rare (the overwhelmingly common case, I'd hope), there is no performance implication whatsoever.

In the end, I think that stability and limiting memory pressure in the presence of large transactions trump the potential performance benefit of larger batches. If we want more sophisticated batch logic at some point, that is fine with me, but today doesn't seem to be the right time.
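A quick back-of-the-envelope check of the arithmetic above (the 250 MB/sec rate and the 16 MB per-batch size are the assumed figures from the comment, not measurements):

```cpp
#include <iostream>

int main() {
    // Assumed figures from the comment above, not measurements.
    const double oplogRateMBps = 250.0;  // sustained oplog write rate
    const double txnSizeMB = 16.0;       // batch ends at each just-over-16 MB commit
    // 250 / 16 ~= 15.6 batches per second, so fixed per-batch overhead
    // stays modest even at this high write rate.
    std::cout << "batches/sec ~= " << oplogRateMBps / txnSizeMB << '\n';
    return 0;
}
```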
| Comment by Maria van Keulen [ 20/Feb/20 ] |
|
schwerin, we believe this solution to be the least complex way to resolve the bug described in the ticket. Otherwise, it seems that nontrivial, potentially risky logic would need to be added to the batcher to properly group the operations inside a >16 MB transaction that executes a command.
| Comment by Andy Schwerin [ 12/Feb/20 ] |
|
geert.bosch, while I see how that heuristic might make resolving this bug easier, I don't love adding more limitations on when a batch boundary must be drawn. I also don't know that I agree that a single transaction just over 16 MB is necessarily a reasonable limit on the amount of work for a single replication batch. What's your justification? While a batch of many ~20 MB transactions could write over a gigabyte of data, I don't think that's necessarily a common occurrence.
| Comment by Geert Bosch [ 24/Jan/20 ] |
|
It is probably best to have the batcher always end the batch after a >16 MB transaction commit. This is never too pessimistic, because even a single >16 MB transaction is a reasonable amount of work for one batch. The current heuristics end a batch after either 5000 operations or 100 MB worth of oplog, but even just 60 commits of >16 MB transactions can easily exceed 1 GB of data written.
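To make the numbers concrete: 60 commits, each representing a just-over-16 MB transaction, stay far below the 5000-operation limit yet amount to roughly 60 × 16 MB ≈ 1 GB of applied data. A minimal sketch of how the proposed rule could sit next to the existing limits (hypothetical names, not the server's actual OplogBatcher code):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: the existing 5000-op / 100 MB limits plus the
// proposed rule that a >16 MB transaction commit always closes the batch
// it lands in.
struct EntrySketch {
    std::size_t sizeBytes = 0;
    bool isLargeTxnCommit = false;  // commit of a multi-entry (>16 MB) txn
};

struct BatchLimitsSketch {
    std::size_t maxOps = 5000;
    std::size_t maxBytes = 100 * 1024 * 1024;  // 100 MB of oplog
};

// True when the current batch must be handed to the applier before growing.
bool batchMustEnd(const std::vector<EntrySketch>& batch,
                  std::size_t batchBytes,
                  const BatchLimitsSketch& limits) {
    if (batch.size() >= limits.maxOps || batchBytes >= limits.maxBytes) {
        return true;  // existing heuristics
    }
    // Proposed rule: at most one large-transaction commit per batch, so the
    // data applied per batch stays near 16 MB instead of approaching 1 GB.
    return !batch.empty() && batch.back().isLargeTxnCommit;
}
```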