[DOCS-13868] Investigate changes in SERVER-48318: Risk of StaleChunkHistory errors in sharded transactions Created: 09/Sep/20  Updated: 13/Nov/23  Resolved: 11/Oct/21

Status: Closed
Project: Documentation
Component/s: manual, Server
Affects Version/s: None
Fix Version/s: 4.7.0, Server_Docs_20231030, Server_Docs_20231106, Server_Docs_20231105, Server_Docs_20231113

Type: Task
Priority: Major - P3
Reporter: Backlog - Core Eng Program Management Team
Assignee: Jason Price
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
documents SERVER-48318 Risk of StaleChunkHistory errors in s... Closed
Participants:
Days since reply: 2 years, 17 weeks, 2 days ago
Epic Link: DOCSP-15042
Story Points: 3

 Description   

Downstream Change Summary

The snapshot history window is now max(minSnapshotHistoryWindowInSeconds, transactionLifetimeLimitSeconds, 10), where 10 seconds is the hardcoded lower bound for the window. See Max's comment at https://jira.mongodb.org/browse/SERVER-48318?focusedCommentId=3364500&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-3364500 for the information that should be included in the documentation for transactionLifetimeLimitSeconds.
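
As a rough illustration of the formula above, here is a minimal Python sketch. The function name and the constant name are hypothetical and are not server identifiers; only the two parameter names and the 10-second floor come from this ticket.

```python
# Hardcoded lower bound from the ticket (seconds). The constant name is hypothetical.
HARD_MIN_WINDOW_SECS = 10


def effective_history_window_secs(min_snapshot_history_window_secs: int,
                                  transaction_lifetime_limit_secs: int) -> int:
    """Return the larger of the two server parameters, never less than 10 seconds."""
    return max(min_snapshot_history_window_secs,
               transaction_lifetime_limit_secs,
               HARD_MIN_WINDOW_SECS)


if __name__ == "__main__":
    # Example values only; they are not claimed to be server defaults.
    print(effective_history_window_secs(5, 60))
    print(effective_history_window_secs(0, 5))
```

With this rule, lowering minSnapshotHistoryWindowInSeconds below transactionLifetimeLimitSeconds does not shrink the window below the transaction lifetime.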

Description of Linked Ticket

While reviewing the changes for SERVER-47785 with renctan, we wondered if the previous version of the code had a bug. Before those changes, ShardingCatalogManager::commitChunkMigration removed all chunk history entries older than 10 seconds whenever it wrote a new entry. Even after the changes, it removes all but one of them.

A new transaction always chooses a recent timestamp, even with readConcern majority. This is the "speculative majority" behavior. But transactions have a default 60-second lifetime, and chunk history only lasts 10 seconds. Could we see the following sequence?

  • Start a sharded transaction
  • Choose transaction read timestamp T
  • 10 seconds pass
  • A chunkMove clears history entries before T for chunk C
  • The transaction continues and targets C
  • ChunkInfo::getShardIdAt tries to read at T, throws StaleChunkHistory error
  • mongos returns error to the client with TransientTransactionError label
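
The sequence above can be sketched with a toy model in Python. This is a simplified illustration under stated assumptions, not the server's implementation: the ChunkHistory class, its methods, and the integer timestamps are all hypothetical, and the pruning rule mirrors the older behavior described in this ticket (drop entries older than the window whenever a new entry is written).

```python
from dataclasses import dataclass, field


class StaleChunkHistory(Exception):
    """Stand-in for the server's StaleChunkHistory error."""


@dataclass
class ChunkHistory:
    # Each entry is (valid_after_secs, shard_id), newest first.
    entries: list = field(default_factory=list)

    def record_migration(self, now: int, shard_id: str, window_secs: int) -> None:
        """Add a new owner entry, then drop entries older than the window."""
        self.entries.insert(0, (now, shard_id))
        cutoff = now - window_secs
        # Always keep the newest entry; prune older ones that fall before the cutoff.
        self.entries = [e for i, e in enumerate(self.entries)
                        if i == 0 or e[0] >= cutoff]

    def shard_id_at(self, ts: int) -> str:
        """Return the shard that owned the chunk at timestamp ts."""
        for valid_after, shard_id in self.entries:
            if valid_after <= ts:
                return shard_id
        raise StaleChunkHistory(f"no history entry covers timestamp {ts}")


def owner_seen_by_transaction(window_secs: int) -> str:
    history = ChunkHistory()
    history.record_migration(now=95, shard_id="shardA", window_secs=window_secs)
    read_ts = 100  # the transaction chooses read timestamp T = 100
    # More than 10 seconds pass, then the chunk moves to another shard.
    history.record_migration(now=111, shard_id="shardB", window_secs=window_secs)
    return history.shard_id_at(read_ts)
```

In this model, calling owner_seen_by_transaction(10) raises StaleChunkHistory because the entry that covered T has been pruned, while owner_seen_by_transaction(60) still finds the old owner.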

Transactions cannot retry StaleChunkHistory (SERVER-39704) and I think this particular case could never be retried, since the history is truly gone.

If the client uses a driver's withTransaction API then TransientTransactionError will compel it to retry the transaction from the start and probably succeed. It can retry for up to 120 seconds. It would have to be unlucky for the sequence above to repeat for that long.
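
The retry behavior described above can be approximated with the following Python sketch. It is not the driver's implementation and does not use any driver API: run_with_retry, the exception class, and the injectable clock are hypothetical names, and only the 120-second limit and the retry-from-the-start behavior come from this ticket.

```python
import time


class TransientTransactionError(Exception):
    """Stand-in for a server error carrying the TransientTransactionError label."""


def run_with_retry(txn_body, time_limit_secs=120.0, clock=time.monotonic):
    """Run txn_body from the start until it succeeds or the time limit passes.

    Returns (result, attempts). Re-raises the last transient error once the
    deadline has been reached.
    """
    deadline = clock() + time_limit_secs
    attempts = 0
    while True:
        attempts += 1
        try:
            return txn_body(), attempts
        except TransientTransactionError:
            if clock() >= deadline:
                raise
            # Otherwise loop and restart the whole transaction body.


def make_flaky_body(failures_before_success):
    """Build a body that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def body():
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientTransactionError("StaleChunkHistory")
        return "committed"

    return body
```

For example, run_with_retry(make_flaky_body(2)) restarts the body twice and succeeds on the third attempt. Each restart picks a fresh read timestamp in a real transaction, which is why a retry would probably succeed.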

However, I think we can reduce the incidence of retries by keeping chunk history for at least transactionLifetimeLimitSeconds.

Scope of changes

Impact to Other Docs

MVP (Work and Date)

Resources (Scope or Design Docs, Invision, etc.)



 Comments   
Comment by Githook User [ 11/Oct/21 ]

Author:

{'name': 'jason-price-mongodb', 'email': 'jshfjghsdfgjsdjh@aolsdjfhkjsdhfkjsdf.com'}

Message: DOCS-13868 chunk errors in sharded transactions
Branch: v5.0
https://github.com/mongodb/docs/commit/981c750e38055d5db0b85fa37973397c8e65011a

Comment by Githook User [ 07/Oct/21 ]

Author:

{'name': 'jason-price-mongodb', 'email': 'jshfjghsdfgjsdjh@aolsdjfhkjsdhfkjsdf.com'}

Message: DOCS-13868 chunk errors in sharded transactions
Branch: master
https://github.com/mongodb/docs/commit/74727ed269285dfe75acc872bc8e66d61c32c382
