[SERVER-33789] Retryable writes shell machinery also retries reads Created: 09/Mar/18 Updated: 29/Oct/23 Resolved: 15/Mar/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Shell |
| Affects Version/s: | None |
| Fix Version/s: | 3.7.4 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Spencer Brody (Inactive) | Assignee: | Max Hirschhorn |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Minor Change | ||||||||
| Operating System: | ALL | ||||||||
| Sprint: | TIG 2018-03-26 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 0 | ||||||||
| Comments |
| Comment by Max Hirschhorn [ 15/Mar/18 ] | ||||||||||||||||||||||
|
The changes from 352cf82 are an incomplete solution but enable running transactions tests with the "find" and "getMore" commands being automatically. I've filed SERVER-33921 as future work for myself to come back to this after pavithra.vetriselvan's changes from | ||||||||||||||||||||||
| Comment by Githook User [ 15/Mar/18 ] | ||||||||||||||||||||||
|
Author: {'email': 'max.hirschhorn@mongodb.com', 'name': 'Max Hirschhorn', 'username': 'visemet'}Message: | ||||||||||||||||||||||
| Comment by Jack Mulrow [ 12/Mar/18 ] | ||||||||||||||||||||||
max.hirschhorn I'm having trouble remembering the exact reason From some code inspection, as of v3.6, "create" was a passthrough command on mongos, which meant it was sent to shards using a ShardConnection, and I think somehow it wasn't updating mongos's ReplicaSetMonitor for the targeted shard which allowed the shell to use up its retries (shard connection is kind of hard to parse though, so this may be wrong). So the real problem was probably legacy code paths on mongos. In master, "create" is now a passthrough for a config server command, which does automatically retry, and I think kaloian.manassiev is getting rid of usages of ShardConnection, so if that was the real problem, we may be able to start always decrementing retry attempts without major code changes. | ||||||||||||||||||||||
| Comment by Max Hirschhorn [ 12/Mar/18 ] | ||||||||||||||||||||||
Maybe we should re-open
I think a quick-fix we could put in to unblock transactions testing would be to add driverSession.getOptions().shouldRetryWrites() to the condition where numRetries is assigned and have the number of retries be zero. I'm assuming that it is sufficient for the transactions testing to start sessions with {retryWrites: false} (the default behavior). spencer, how does that sound to you? I think we'll need to revisit "should this operation be retried?" after pavithra.vetriselvan from | ||||||||||||||||||||||
| Comment by Spencer Brody (Inactive) [ 09/Mar/18 ] | ||||||||||||||||||||||
|
Yeah, the infinite retries is definitely worrisome. There is definitely no guarantee that a connect is closed any time you get a NotMaster error. In fact, if you actually get the NotMaster or NotMasterNoSlaveOk code back (as opposed to say InterruptedDueToReplStateChange) that means that the node is fully stepped down (not in the process of stepping down), and thus the connection close part of stepdown is past, and any new connections should behave as normal. As for the retrying on reads behavior, I see how we got here since it didn't matter in 3.6, but now it does and is getting in the way of testing for transactions and snapshot reads, so it would be great if this could be fixed soon. | ||||||||||||||||||||||
| Comment by Max Hirschhorn [ 09/Mar/18 ] | ||||||||||||||||||||||
|
spencer, the way the mongo shell determines whether it should retry an operation is if a "txnNumber" property is specified in the command request. This isn't going to be compatible with the syntax for multi-statement transactions (both read-only and read/write) in MongoDB 4.0.
The expectation is that after a "not master" error the network connection would eventually be closed, which is why the number of retries isn't decremented. Is that not accurate? https://jira.mongodb.org/browse/SERVER-32804?focusedCommentId=1779462&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1779462 has some additional context for why this decision was made. | ||||||||||||||||||||||
| Comment by Spencer Brody (Inactive) [ 09/Mar/18 ] | ||||||||||||||||||||||
| Comment by Spencer Brody (Inactive) [ 09/Mar/18 ] | ||||||||||||||||||||||
|
Attached a repro that runs a read that will fail with NotMasterNoSlaveOk (a retryable error) through a session. With logRetryAttempts set to true, the test results in logs like the following:
Not only is this wrong because it's retrying a read, not a write, it's also wrong because it seems to retry indefinitely - the logs claim there is only one retry attempt, but it seems to just keep happening forever. I believe this is masking test failures, and making some failures that do show up difficult to diagnose because the test fails later on than where the original failure actually was. |