[SERVER-33789] Retryable writes shell machinery also retries reads Created: 09/Mar/18  Updated: 29/Oct/23  Resolved: 15/Mar/18

Status: Closed
Project: Core Server
Component/s: Shell
Affects Version/s: None
Fix Version/s: 3.7.4

Type: Bug Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File retryReads.js    
Issue Links:
Related
related to SERVER-33921 Revisit how the mongo shell decides w... Backlog
Backwards Compatibility: Minor Change
Operating System: ALL
Sprint: TIG 2018-03-26
Participants:
Linked BF Score: 0

Comments
Comment by Max Hirschhorn [ 15/Mar/18 ]

The changes from 352cf82 are an incomplete solution, but they enable running the transactions tests without the "find" and "getMore" commands being automatically retried. I've filed SERVER-33921 as future work for myself to come back to this after pavithra.vetriselvan's changes from SERVER-33218 land in master.

Comment by Githook User [ 15/Mar/18 ]

Author: {'email': 'max.hirschhorn@mongodb.com', 'name': 'Max Hirschhorn', 'username': 'visemet'}

Message: SERVER-33789 Only retry operations with txnNumber if retryWrites=true.
Branch: master
https://github.com/mongodb/mongo/commit/352cf826069d74ceb6c8a4b4ae2198098c05ba2a

Comment by Jack Mulrow [ 12/Mar/18 ]

> Jack Mulrow, do you recall why the changes from 51b699b as part of SERVER-31194 didn't decrement the counter? Thinking about it now, I would have expected the ReplicaSetMonitor to have caused DBClientRS to do retargeting before running the next command.

max.hirschhorn, I'm having trouble remembering the exact reason SERVER-31194 didn't decrement retries - I'm guessing we punted on finding a number of retries that would let slow build variants pass consistently - but I remember I had a version of SERVER-31198 that always did decrement the counter, and it mostly worked except for some NotMaster errors from the "create" command (here's an old evergreen patch I found). I think between those occasional failures and the discussion you linked to in SERVER-32804, we decided to add back infinite retries.

From some code inspection: as of v3.6, "create" was a passthrough command on mongos, which meant it was sent to shards using a ShardConnection, and I think somehow it wasn't updating mongos's ReplicaSetMonitor for the targeted shard, which allowed the shell to use up its retries (the ShardConnection code is hard to parse, though, so this may be wrong). So the real problem was probably legacy code paths on mongos.

In master, "create" is now a passthrough for a config server command, which does automatically retry, and I think kaloian.manassiev is getting rid of usages of ShardConnection, so if that was the real problem, we may be able to start always decrementing retry attempts without major code changes.
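
For illustration, a retry loop that always decrements might look something like the following sketch. runWithRetries, isRetryableCode, and the code list are illustrative names and values for this discussion, not the shell's actual retry machinery:

// Sketch only: not the mongo shell's actual retry code.
const kRetryableCodes = new Set([10107, 13435, 13436]);  // NotMaster, NotMasterNoSlaveOk, NotMasterOrSecondary

function isRetryableCode(code) {
    return kRetryableCodes.has(code);
}

function runWithRetries(conn, dbName, cmdObj, numRetries) {
    while (true) {
        const res = conn.runCommand(dbName, cmdObj, 0);
        if (res.ok === 1 || !isRetryableCode(res.code) || numRetries <= 0) {
            return res;
        }
        --numRetries;  // always consume an attempt, even on "not master" errors
        print("=-=-=-= Retrying failed response with retryable code: " + res.code +
              ", for command: " + Object.keys(cmdObj)[0] +
              ", retries remaining: " + numRetries);
    }
}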

Comment by Max Hirschhorn [ 12/Mar/18 ]

> Yeah, the infinite retries is definitely worrisome. There is definitely no guarantee that a connection is closed any time you get a NotMaster error. In fact, if you actually get the NotMaster or NotMasterNoSlaveOk code back (as opposed to, say, InterruptedDueToReplStateChange), that means the node has fully stepped down (it is not still in the process of stepping down), so the connection-closing part of stepdown is already past, and any new connections should behave as normal.

Maybe we should re-open SERVER-32804: in the changes from 6d8e6b9 as part of SERVER-31296, I modified DBClientConnection::handleNotMasterResponse() to inform the ReplicaSetMonitor whenever it receives a "not master" error response code (even if the response doesn't contain the string "not master"). jack.mulrow, do you recall why the changes from 51b699b as part of SERVER-31194 didn't decrement the counter? Thinking about it now, I would have expected the ReplicaSetMonitor to have caused DBClientRS to do retargeting before running the next command.

> As for the retrying on reads behavior, I see how we got here since it didn't matter in 3.6, but now it does and is getting in the way of testing for transactions and snapshot reads, so it would be great if this could be fixed soon.

I think a quick fix we could put in to unblock transactions testing would be to add driverSession.getOptions().shouldRetryWrites() to the condition where numRetries is assigned, so that the number of retries is zero otherwise. I'm assuming that it is sufficient for the transactions testing to start sessions with {retryWrites: false} (the default behavior). spencer, how does that sound to you? I think we'll need to revisit "should this operation be retried?" after pavithra.vetriselvan's changes from SERVER-33218 land.
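
Concretely, the guarded assignment might look something like this sketch (assuming a driverSession reference is in scope at that point in the shell code):

// Sketch of the proposed quick fix: only arm the retry machinery when the
// session was explicitly started with retryWrites=true.
let numRetries = (cmdObj.hasOwnProperty("txnNumber") &&
                  driverSession.getOptions().shouldRetryWrites() &&
                  !jsTest.options().skipRetryOnNetworkError)
    ? 1
    : 0;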

Comment by Spencer Brody (Inactive) [ 09/Mar/18 ]

Yeah, the infinite retries is definitely worrisome. There is definitely no guarantee that a connection is closed any time you get a NotMaster error. In fact, if you actually get the NotMaster or NotMasterNoSlaveOk code back (as opposed to, say, InterruptedDueToReplStateChange), that means the node has fully stepped down (it is not still in the process of stepping down), so the connection-closing part of stepdown is already past, and any new connections should behave as normal.

As for the retrying on reads behavior, I see how we got here since it didn't matter in 3.6, but now it does and is getting in the way of testing for transactions and snapshot reads, so it would be great if this could be fixed soon.

Comment by Max Hirschhorn [ 09/Mar/18 ]

spencer, the mongo shell decides whether to retry an operation by checking whether a "txnNumber" property is present in the command request. That isn't going to be compatible with the syntax for multi-statement transactions (both read-only and read/write) in MongoDB 4.0.

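// Note: this retries once whenever the request carries a "txnNumber" (and
// network-error retries aren't disabled), regardless of whether the command
// is a read or a write.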
let numRetries =
    (cmdObj.hasOwnProperty("txnNumber") && !jsTest.options().skipRetryOnNetworkError)
    ? 1
    : 0;

> Not only is this wrong because it's retrying a read, not a write, but it's also wrong because it seems to retry indefinitely - the logs claim there is only one retry attempt remaining, but the retries just keep happening forever.

The expectation is that after a "not master" error the network connection would eventually be closed, which is why the number of retries isn't decremented. Is that not accurate? https://jira.mongodb.org/browse/SERVER-32804?focusedCommentId=1779462&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1779462 has some additional context for why this decision was made.

Comment by Spencer Brody (Inactive) [ 09/Mar/18 ]

max.hirschhorn

Comment by Spencer Brody (Inactive) [ 09/Mar/18 ]

Attached a repro that runs a read through a session; the read fails with NotMasterNoSlaveOk (a retryable error). With logRetryAttempts set to true, the test produces logs like the following:

[js_test:retryReads] 2018-03-09T15:41:35.093-0500 Error: error doing query: failed: network error while attempting to run command 'replSetStepDown' on host '127.0.0.1:10010'
[js_test:retryReads] 2018-03-09T15:41:35.093-0500
[js_test:retryReads] 2018-03-09T15:41:35.093-0500
[js_test:retryReads] 2018-03-09T15:41:35.093-0500 ----
[js_test:retryReads] 2018-03-09T15:41:35.093-0500 About to do query that should fail with not master
[js_test:retryReads] 2018-03-09T15:41:35.094-0500 ----
[js_test:retryReads] 2018-03-09T15:41:35.094-0500
[js_test:retryReads] 2018-03-09T15:41:35.094-0500
[js_test:retryReads] 2018-03-09T15:41:35.106-0500 2018-03-09T15:41:35.106-0500 I NETWORK  [js] trying reconnect to 127.0.0.1:10010 failed
[js_test:retryReads] 2018-03-09T15:41:35.106-0500 d10010| 2018-03-09T15:41:35.106-0500 I NETWORK  [listener] connection accepted from 127.0.0.1:60880 #19 (7 connections now open)
[js_test:retryReads] 2018-03-09T15:41:35.107-0500 d10010| 2018-03-09T15:41:35.107-0500 I NETWORK  [conn19] received client metadata from 127.0.0.1:60880 conn19: { application: { name: "MongoDB Shell" }, driver: { name: "MongoDB Internal Client", version: "0.0.0" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "17.10" } }
[js_test:retryReads] 2018-03-09T15:41:35.107-0500 2018-03-09T15:41:35.107-0500 I NETWORK  [js] reconnect 127.0.0.1:10010 ok
[js_test:retryReads] 2018-03-09T15:41:35.112-0500 =-=-=-= Retrying failed response with retryable code: 13435, for command: find, retries remaining: 1
[js_test:retryReads] 2018-03-09T15:41:35.114-0500 =-=-=-= Retrying failed response with retryable code: 13435, for command: find, retries remaining: 1
[js_test:retryReads] 2018-03-09T15:41:35.117-0500 =-=-=-= Retrying failed response with retryable code: 13435, for command: find, retries remaining: 1
[js_test:retryReads] 2018-03-09T15:41:35.118-0500 d10011| 2018-03-09T15:41:35.118-0500 I STORAGE  [repl writer worker 1] createCollection: test.foo with provided UUID: 47ef5616-6024-4d80-a67f-230958d0c31f
[js_test:retryReads] 2018-03-09T15:41:35.121-0500 =-=-=-= Retrying failed response with retryable code: 13435, for command: find, retries remaining: 1
[js_test:retryReads] 2018-03-09T15:41:35.124-0500 =-=-=-= Retrying failed response with retryable code: 13435, for command: find, retries remaining: 1
[js_test:retryReads] 2018-03-09T15:41:35.127-0500 =-=-=-= Retrying failed response with retryable code: 13435, for command: find, retries remaining: 1
[js_test:retryReads] 2018-03-09T15:41:35.130-0500 =-=-=-= Retrying failed response with retryable code: 13435, for command: find, retries remaining: 1
[js_test:retryReads] 2018-03-09T15:41:35.132-0500 =-=-=-= Retrying failed response with retryable code: 13435, for command: find, retries remaining: 1
...

Not only is this wrong because it's retrying a read, not a write, but it's also wrong because it seems to retry indefinitely - the logs claim there is only one retry attempt remaining, but the retries just keep happening forever.

I believe this is masking test failures, and it makes the failures that do show up harder to diagnose, because the test fails later than the point where the original failure actually occurred.
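
For readers without the attachment, here is a hypothetical outline of the repro described above. The attached retryReads.js is authoritative and may differ; in particular, the TestData.logRetryAttempts switch and the session options are assumptions:

// Hypothetical outline of the repro; see the attached retryReads.js for the real test.
TestData.logRetryAttempts = true;  // assumed to enable the "=-=-=-=" retry log lines

const rst = new ReplSetTest({nodes: 2});
rst.startSet();
rst.initiate();

const primary = rst.getPrimary();
const session = primary.startSession();
const coll = session.getDatabase("test").foo;
assert.writeOK(coll.insert({x: 1}));

// Stepping the primary down severs its connections, so the shell reports a
// network error for replSetStepDown even when the stepdown itself succeeds.
try {
    primary.adminCommand({replSetStepDown: 60, force: true});
} catch (e) {
    print("Expected network error from stepdown: " + e);
}

jsTest.log("About to do query that should fail with not master");
coll.findOne();  // fails with NotMasterNoSlaveOk (13435) and is retried by the shell

rst.stopSet();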
