Major - P3
A new TransactionTooLargeForCache error has been introduced. It indicates that the transaction was rolled back due to cache pressure and is unlikely to complete even if retried, because the cache is too small for it.
The threshold at which this error is triggered can be modified with the transactionTooLargeForCacheThreshold parameter; setting it to 1.0 disables this behaviour. If a transaction accounts for more than 75% (the default) of total dirty cache use and is rolled back, it is assumed to be unlikely to complete. Since the dirty cache limit is 20% of the total cache, the largest transactions may occupy at most 15% of the total size of the storage engine cache.
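As a rough check of the arithmetic above, here is a minimal Python sketch. The function names are hypothetical and this is not the server's implementation; it only restates the 75% threshold and 20% dirty limit quoted in this ticket, and the 256 MB cache size is just an example input.

```python
def max_transaction_cache_fraction(threshold=0.75, dirty_limit=0.20):
    """Fraction of the total cache one transaction may dirty before it
    is considered too large (hypothetical helper, not a server API)."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.0 and 1.0")
    return threshold * dirty_limit


def exceeds_threshold(txn_dirty_bytes, total_dirty_bytes, threshold=0.75):
    """True if the transaction's share of all dirty data is above the
    threshold. Per the description, the server only draws this conclusion
    for a transaction that was rolled back due to cache pressure."""
    if total_dirty_bytes <= 0:
        return False
    # A share can never exceed 1.0, so threshold=1.0 disables the check.
    return txn_dirty_bytes / total_dirty_bytes > threshold


if __name__ == "__main__":
    cache_mb = 256  # example cache size
    limit_mb = cache_mb * max_transaction_cache_fraction()
    print(f"largest transaction: about {limit_mb:.1f} MB of {cache_mb} MB")
```

With the defaults this gives 0.75 x 0.20 = 0.15 of the cache, and passing a threshold of 1.0 to exceeds_threshold never returns True, which matches the "disabled" setting.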
If these conditions are met, a TransactionTooLargeForCache error may now be thrown instead of a TemporarilyUnavailable or WriteConflict error.
Inserting a document that creates a large number of index entries can create a large amount of dirty data in a single transaction, causing it to be canceled and retried indefinitely, resulting in a hang.
For example, on a node with a 256 MB cache, create a text index, then insert a document with a large string to be indexed (or, equivalently, a large number of terms to be indexed):
This will hang after a few documents, with high cache pressure, and with the following emitted repeatedly in the log:
This effectively makes the server inoperable due to cache pressure. If it occurs on secondaries, they will stall because the current batch cannot complete.
This is a regression as these inserts complete successfully (even if somewhat slowly) in 4.2.
I think this is related to SERVER-61454, but I'm opening this as a distinct ticket because
- This is a somewhat different use case as the issue can be reliably created with single inserts.
- I don't think the change described in SERVER-61454 would apply here, as the insert is the only transaction running so delaying retries would have no effect, and the issue is not related to CPU resource starvation as far as I can tell.
- It's not clear to me where the appropriate fix would lie - query layer, retry behavior, storage engine behavior.