[SERVER-58407] Resharding components do not retry on FailedToSatisfyReadPreference when targeting remote shard, leading to server crash Created: 09/Jul/21  Updated: 29/Oct/23  Resolved: 22/Oct/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 5.2.0, 5.0.4, 5.1.0-rc2

Type: Bug Priority: Major - P3
Reporter: Blake Oler Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: PM-234-M3, PM-234-T-autocommits
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-57686 We need test coverage that runs resha... Closed
Related
related to SERVER-60495 Retry FailedToSatisfyReadPreference i... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v5.1, v5.0
Sprint: Sharding 2021-11-01
Participants:
Story Points: 1

 Description   

There are multiple places where the ReshardingCoordinatorService, ReshardingRecipientService, and ReshardingDonorService attempt to target the primary of a replica set shard:

Internally, these function calls go through RemoteCommandTargeterRS::findHost() and will throw a FailedToSatisfyReadPreference exception after kDefaultFindHostTimeout (15 seconds) if a primary is unavailable on the remote shard. This exception is caught and leads to an fassert() because, for example, it would be invalid for the participant shards to complete the resharding operation without performing a w:majority write on the config server primary.

The resharding components should instead wait until a primary becomes available on the remote shard to avoid triggering this fassert().
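
To illustrate the proposed behavior, here is a minimal Python model of retrying on FailedToSatisfyReadPreference instead of letting the error propagate. It is not the server's C++ code: the exception class, with_primary_retry(), and make_flaky_sender() are hypothetical names introduced only for this sketch, and the "remote shard" is simulated by a function that fails a fixed number of times.

```python
import time


class FailedToSatisfyReadPreference(Exception):
    """Hypothetical stand-in for the server error of the same name."""


def with_primary_retry(send, *, backoff_secs=0.0, sleep=time.sleep):
    """Call send() until it stops raising FailedToSatisfyReadPreference.

    Returns (result, attempts). Any other exception propagates unchanged,
    so only the "no primary available" case is treated as retryable.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            return send(), attempts
        except FailedToSatisfyReadPreference:
            # No primary yet: wait briefly and target the shard again.
            sleep(backoff_secs)


def make_flaky_sender(failures_before_primary):
    """Simulate a remote shard whose primary appears after N failed attempts."""
    remaining = [failures_before_primary]

    def send():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise FailedToSatisfyReadPreference(
                'Could not find host matching read preference { mode: "primary" }'
            )
        return "ok"

    return send


result, attempts = with_primary_retry(make_flaky_sender(3))
print(result, attempts)
```

The loop has no attempt limit on purpose, since the goal stated above is to wait until a primary becomes available. A real implementation would presumably also need a way to cancel the wait, for example on shutdown or stepdown; that is left out of this sketch.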

[j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.440+00:00"},"s":"I",  "c":"RESHARD",  "id":5279506, "ctx":"ReshardingRecipientService-4","msg":"Transitioned resharding recipient state","attr":{"newState":"applying","oldState":"cloning","namespace":"test1_fsmdb0.fsmcoll0","collectionUUID":{"uuid":{"$uuid":"04cb2914-75ec-4b6c-a4df-f416c22459c7"}},"reshardingUUID":{"uuid":{"$uuid":"245e08b2-5f20-4530-a87e-1acd0faa2db4"}}}}
[j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333227, "ctx":"ReshardingRecipientService-7","msg":"RSM monitoring host in expedited mode until we detect a primary","attr":{"host":"localhost:20002","replicaSet":"config-rs"}}
[j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333227, "ctx":"ReshardingRecipientService-7","msg":"RSM monitoring host in expedited mode until we detect a primary","attr":{"host":"localhost:20000","replicaSet":"config-rs"}}
[j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333218, "ctx":"ReshardingRecipientService-7","msg":"Rescheduling the next replica set monitoring request","attr":{"replicaSet":"config-rs","host":"localhost:20000","delayMillis":0}}
[j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"s":"I",  "c":"-",        "id":4333227, "ctx":"ReshardingRecipientService-7","msg":"RSM monitoring host in expedited mode until we detect a primary","attr":{"host":"localhost:20001","replicaSet":"config-rs"}}
[j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:48.465+00:00"},"s":"I",  "c":"RESHARD",  "id":4956500, "ctx":"ReshardingRecipientService-7","msg":"Resharding operation recipient state machine failed","attr":{"namespace":"test1_fsmdb0.fsmcoll0","reshardingUUID":{"uuid":{"$uuid":"245e08b2-5f20-4530-a87e-1acd0faa2db4"}},"error":"FailedToSatisfyReadPreference: Could not find host matching read preference { mode: \"primary\" } for set config-rs"}}
[j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:48.466+00:00"},"s":"I",  "c":"RESHARD",  "id":5279506, "ctx":"ReshardingRecipientService-7","msg":"Transitioned resharding recipient state","attr":{"newState":"error","oldState":"applying","namespace":"test1_fsmdb0.fsmcoll0","collectionUUID":{"uuid":{"$uuid":"04cb2914-75ec-4b6c-a4df-f416c22459c7"}},"reshardingUUID":{"uuid":{"$uuid":"245e08b2-5f20-4530-a87e-1acd0faa2db4"}}}}
[j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"RESHARD",  "id":5551101, "ctx":"ReshardingRecipientService-5","msg":"Unrecoverable error occurred past the point recipient was prepared to complete the resharding operation","attr":{"error":"FailedToSatisfyReadPreference: Could not find host matching read preference { mode: \"primary\" } for set config-rs"}}
[j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"ASSERT",   "id":23089,   "ctx":"ReshardingRecipientService-5","msg":"Fatal assertion","attr":{"msgid":5551101,"file":"src/mongo/db/s/resharding/resharding_recipient_service.cpp","line":412}}
[j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"ASSERT",   "id":23090,   "ctx":"ReshardingRecipientService-5","msg":"\n\n***aborting after fassert() failure\n\n"}
[j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"ReshardingRecipientService-5","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
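
The timing in the log above can be checked against the 15-second kDefaultFindHostTimeout mentioned in the description. The following Python sketch parses three of the log lines and prints the gaps between them. The lines are trimmed copies of the entries above (only the "t", "id", and "msg" fields are kept), and parse() is a hypothetical helper written for this example.

```python
import json
from datetime import datetime

# Trimmed copies of three log lines from the excerpt above.
LINES = [
    '[j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:33.465+00:00"},"id":4333227,'
    '"msg":"RSM monitoring host in expedited mode until we detect a primary"}',
    '[j0:s0:n1] {"t":{"$date":"2021-10-19T13:17:48.465+00:00"},"id":4956500,'
    '"msg":"Resharding operation recipient state machine failed"}',
    '[j0:s0:n1] {"t":{"$date":"2021-10-19T13:18:03.470+00:00"},"id":5551101,'
    '"msg":"Unrecoverable error occurred past the point recipient was prepared '
    'to complete the resharding operation"}',
]


def parse(line):
    """Split off the '[j0:s0:n1] ' prefix and return (log id, timestamp)."""
    _prefix, _, payload = line.partition("] ")
    doc = json.loads(payload)
    return doc["id"], datetime.fromisoformat(doc["t"]["$date"])


events = [parse(line) for line in LINES]

# Seconds elapsed between each pair of consecutive events.
gaps = [
    (later[1] - earlier[1]).total_seconds()
    for earlier, later in zip(events, events[1:])
]
print(gaps)
```

The first gap is exactly 15 seconds, which matches the findHost() timeout described above. The second gap, between the state machine failure and the fatal assertion, is also about 15 seconds.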

https://logkeeper.mongodb.org/lobster/build/e6cde008d39df341d7eb3b22eff2f45b/test/616ec55854f248304d3ff5b3#bookmarks=0%2C10033%2C10034%2C10035%2C10036%2C10038%2C10039%2C10040%2C10041%2C10054%2C10088%2C593975&f~=100~%28%5C%5Bj0%3As0%3An1%5C%5D%7C%5C%5Bj0%3As1%3An2%5C%5D%29&l=1&shareLine=10033



 Comments   
Comment by Githook User [ 22/Oct/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-58407 Retry on FailedToSatisfyReadPreference in resharding.

(cherry picked from commit 03bec439f7c1ce1d8242de40eea130d9a3518a28)
Branch: v5.1
https://github.com/mongodb/mongo/commit/d9defb747d56367b77a326b81e80bb9091bceefb

Comment by Githook User [ 22/Oct/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-58407 Retry on FailedToSatisfyReadPreference in resharding.

(cherry picked from commit 03bec439f7c1ce1d8242de40eea130d9a3518a28)
Branch: v5.0
https://github.com/mongodb/mongo/commit/698ac509069bf246f497e07e1e6aa0ad8a949ac3

Comment by Githook User [ 22/Oct/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: SERVER-58407 Retry on FailedToSatisfyReadPreference in resharding.
Branch: master
https://github.com/mongodb/mongo/commit/03bec439f7c1ce1d8242de40eea130d9a3518a28

Comment by Max Hirschhorn [ 21/Oct/21 ]

I've gone ahead and updated the ticket description with an example of how the recipient shard not retrying on FailedToSatisfyReadPreference while the config server primary is unavailable leads to the recipient shard crashing.

Comment by Blake Oler [ 26/Jul/21 ]

I don't remember at the moment, but it's most likely stemming from trying to send commands to remote shards. I'll be sure to update once I see it again.

Comment by Max Hirschhorn [ 23/Jul/21 ]

blake.oler, could you clarify from which component you had observed a FailedToSatisfyReadPreference exception?

Similar to SERVER-58389, my preference would be to have resharding wait an indefinite amount of time for the replica set monitor to yield a suitable host rather than retrying. This way the error Status is avoided entirely.
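
As a rough illustration of the difference between the two approaches, the Python model below blocks inside the lookup until a primary is known, so the caller never sees an error when no timeout is given. ReplicaSetMonitorModel and its methods are hypothetical names for this sketch and do not reflect the server's replica set monitor API; the host string is borrowed from the log only as an arbitrary example.

```python
import threading


class ReplicaSetMonitorModel:
    """Toy model of a monitor that callers can wait on for a primary."""

    def __init__(self):
        self._cond = threading.Condition()
        self._primary = None

    def set_primary(self, host):
        """Record a newly discovered primary and wake all waiters."""
        with self._cond:
            self._primary = host
            self._cond.notify_all()

    def find_primary(self, timeout_secs=None):
        """Block until a primary is known.

        With timeout_secs=None the wait is indefinite, so no error is
        ever produced. With a finite timeout, TimeoutError is raised,
        which is the situation a caller would otherwise have to retry.
        """
        with self._cond:
            found = self._cond.wait_for(
                lambda: self._primary is not None, timeout=timeout_secs
            )
            if not found:
                raise TimeoutError("no primary within timeout")
            return self._primary


monitor = ReplicaSetMonitorModel()
# Simulate the primary being elected shortly after the lookup starts.
threading.Timer(0.05, monitor.set_primary, args=["localhost:20001"]).start()
host = monitor.find_primary()  # indefinite wait; returns once a primary exists
print(host)
```

With the indefinite wait there is no error path for the caller to handle, at the cost of needing some other mechanism to interrupt the wait if the operation is aborted.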

Generated at Thu Feb 08 05:44:26 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.