[SERVER-47841] List Indexes command fails with epoch change on migration on sharding_multiversion suite Created: 29/Apr/20  Updated: 29/Oct/23  Resolved: 05/May/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.4.0-rc3
Fix Version/s: 4.7.0

Type: Bug Priority: Major - P3
Reporter: Marcos José Grillo Ramirez Assignee: Kaloian Manassiev
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File sharding_multiversion.patch    
Issue Links:
Depends
is depended on by SERVER-47907 Remove unnecessary shard from index_c... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

1. Enable and run the sharding_multiversion suite. The attached patch can be used.

Sprint: Sharding 2020-05-18
Participants:
Linked BF Score: 0

 Description   

As part of SERVER-47822 the sharding_multiversion suite should be enabled; however, several tests fail with a similar backtrace:

assert: command failed: {
	"ok" : 0,
	"errmsg" : "Data transfer error: migrate failed: StaleConfig{ ns: \"foo.bar\", vReceived: Timestamp(0, 0), vReceivedEpoch: ObjectId('000000000000000000000000'), vWanted: Timestamp(2, 0), vWantedEpoch: ObjectId('5ea989acce5433364888267d'), shardId: \"addshard5-rs1\" }: epoch mismatch detected for foo.bar",
	"code" : 96,
	"codeName" : "OperationFailed",
	"operationTime" : Timestamp(1588169133, 9),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1588169133, 9),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	}
}
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:18:14
_assertCommandWorked@src/mongo/shell/assert.js:663:17
assert.commandWorked@src/mongo/shell/assert.js:755:16
@jstests/sharding/addshard5.js:28:1
@jstests/sharding/addshard5.js:3:2
uncaught exception: Error: command failed: {
	"ok" : 0,
	"errmsg" : "Data transfer error: migrate failed: StaleConfig{ ns: \"foo.bar\", vReceived: Timestamp(0, 0), vReceivedEpoch: ObjectId('000000000000000000000000'), vWanted: Timestamp(2, 0), vWantedEpoch: ObjectId('5ea989acce5433364888267d'), shardId: \"addshard5-rs1\" }: epoch mismatch detected for foo.bar",
	"code" : 96,
	"codeName" : "OperationFailed",
	"operationTime" : Timestamp(1588169133, 9),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1588169133, 9),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	}
} :
_getErrorWithCode@src/mongo/shell/utils.js:25:13
doassert@src/mongo/shell/assert.js:18:14
_assertCommandWorked@src/mongo/shell/assert.js:663:17
assert.commandWorked@src/mongo/shell/assert.js:755:16
@jstests/sharding/addshard5.js:28:1
@jstests/sharding/addshard5.js:3:2
failed to load: jstests/sharding/addshard5.js
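
For reference, the pattern that hits this failure looks roughly like the following. This is a hedged, simplified sketch in the style of addshard5.js, not the verbatim test; the namespace and shard layout are taken from the error above:

// Simplified sketch (not the actual jstests/sharding/addshard5.js) of the failing
// pattern: a sharded collection foo.bar is created and a chunk is moved between
// shards of mixed binary versions. The moveChunk is expected to succeed, but
// returns OperationFailed because the donor's listIndexes answers with StaleConfig.
const st = new ShardingTest({shards: 2});
const mongos = st.s;

assert.commandWorked(mongos.adminCommand({enableSharding: 'foo'}));
assert.commandWorked(mongos.adminCommand({shardCollection: 'foo.bar', key: {_id: 1}}));
assert.commandWorked(mongos.getDB('foo').bar.insert({_id: 1}));

// This is the assert.commandWorked call that throws in the stack trace above.
assert.commandWorked(mongos.adminCommand(
    {moveChunk: 'foo.bar', find: {_id: 1}, to: st.shard0.shardName}));

st.stop();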

This is caused by an assert reached on the donor shard when trying to get the indexes during the migration of a chunk; the stack trace can be seen below:

src/mongo/util/stacktrace_posix.cpp:501:44: mongo::printStackTrace()
src/mongo/util/assert_util.cpp:81:24: mongo::DBException::traceIfNeeded(mongo::DBException const&)
src/mongo/util/assert_util.h:134:22: _ZN5mongo11DBExceptionC4ERKNS_6StatusE
src/mongo/util/assert_util.h:151:66: _ZN5mongo18AssertionExceptionC4ERKNS_6StatusE
src/mongo/util/assert_util.h:182:71: mongo::error_details::ExceptionForImpl<(mongo::ErrorCodes::Error)13388, mongo::ExceptionForCat<(mongo::ErrorCategory)4>, mongo::ExceptionForCat<(mongo::ErrorCategory)5> >::ExceptionForImpl(mongo::Status const&)
build/cached/mongo/base/error_codes.cpp:2419:63: mongo::error_details::throwExceptionForStatus(mongo::Status const&)
src/mongo/util/assert_util.cpp:257:43: mongo::uassertedWithLocation(mongo::Status const&, char const*, unsigned int)
src/mongo/db/s/collection_sharding_runtime.cpp:381:5: operator()
src/mongo/db/s/collection_sharding_runtime.cpp:381:5: mongo::CollectionShardingRuntime::_getMetadataWithVersionCheckAt(mongo::OperationContext*, boost::optional<mongo::LogicalTime> const&) (.cold.951)
src/mongo/db/s/collection_sharding_runtime.cpp:162:5: mongo::CollectionShardingRuntime::checkShardVersionOrThrow(mongo::OperationContext*)
src/mongo/db/db_raii.cpp:353:34: mongo::AutoGetCollectionForReadCommand::AutoGetCollectionForReadCommand(mongo::OperationContext*, mongo::NamespaceStringOrUUID const&, mongo::AutoGetCollection::ViewMode, mongo::Date_t, mongo::AutoStatsTracker::LogMode)
src/mongo/db/commands/list_indexes.cpp:152:94: mongo::(anonymous namespace)::CmdListIndexes::run(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj const&, mongo::BSONObjBuilder&)
src/mongo/db/commands.h:799:19: mongo::BasicCommand::runWithReplyBuilder(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj const&, mongo::rpc::ReplyBuilderInterface*)

The logs on this failed task show the sequence of events:

A shard starts a moveChunk process:

d20771| 2020-04-29T14:05:32.785+00:00 I  MIGRATE  22016   [MoveChunk] "Starting chunk migration donation","attr":{"requestParameters":"ns: foo.bar, [{ _id: MinKey }, { _id: MaxKey }), fromShard: addshard5-rs1, toShard: addshard5-rs0","collectionEpoch":{"$oid":"5ea989acce5433364888267d"}}

The recipient receives the command and starts the migrateThread:

d20770| 2020-04-29T14:05:32.809+00:00 D4 TXN      23984   [migrateThread] "New transaction started","attr":{"txnNumber":0,"lsid":{"uuid":{"$uuid":"16fd9315-5698-4a4b-8275-781d40de4782"}}}
d20770| 2020-04-29T14:05:32.809+00:00 I  MIGRATE  22000   [migrateThread] "Starting receiving end of chunk migration","attr":{"chunkMin":{"_id":{"$minKey":1}},"chunkMax":{"_id":{"$maxKey":1}},"namespace":"foo.bar","fromShard":"addshard5-rs1","epoch":{"$oid":"5ea989acce5433364888267d"},"sessionId":"addshard5-rs1_addshard5-rs0_5ea989ac4beaa7b990cb1c02","migrationId":{"uuid":{"$uuid":"3c3be56a-5b3a-4de8-abf2-beac588042f3"}}}

As part of the process, the recipient requests the indexes from the donor via listIndexes:

d20771| 2020-04-29T14:05:32.970+00:00 I  COMMAND  51803   [conn35] "Slow query","attr":{"type":"command","ns":"foo.bar","command":{"listIndexes":"bar","$clusterTime":{"clusterTime":{"$timestamp":{"t":1588169132,"i":61}},"signature":{"hash":{"$binary":{"base64":"uIQUNHXCtJMnziE9Hq2R82DIcOY=","subType":"0"}},"keyId":6821134443802001430}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1588169132,"i":57}},"t":1}},"$db":"foo"},"numYields":0,"ok":0,"errMsg":"epoch mismatch detected for foo.bar","errName":13388,"errCode":13388,"reslen":550,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":2}},"ReplicationStateTransition":{"acquireCount":{"w":3}},"Global":{"acquireCount":{"r":3}},"Database":{"acquireCount":{"r":3}},"Collection":{"acquireCount":{"r":3}},"Mutex":{"acquireCount":{"r":4}}},"protocol":"op_msg","durationMillis":161}

But this operation fails with an exception:

d20770| 2020-04-29T14:05:32.971+00:00 W  -        23075   [migrateThread] "DBException thrown","attr":{"error":{"code":13388,"codeName":"StaleConfig","errmsg":"epoch mismatch detected for foo.bar","ns":"foo.bar","vReceived":{"$timestamp":{"t":0,"i":0}},"vReceivedEpoch":{"$oid":"000000000000000000000000"},"vWanted":{"$timestamp":{"t":2,"i":0}},"vWantedEpoch":{"$oid":"5ea989acce5433364888267d"},"shardId":"addshard5-rs1"}}
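
For illustration, an equivalent unversioned listIndexes can be issued from the shell against the donor. This is a hedged approximation only: the host is assumed, the port is taken from the logs, and a fresh shell connection would normally succeed; the StaleConfig failure in this ticket only occurs on the recipient's previously versioned connection, as explained in the comments below.

// Hedged approximation of the recipient's listIndexes request. Note that no
// shardVersion field is attached; the command is intentionally unversioned.
// "localhost" is an assumption; 20771 is the donor port from the logs above.
const donor = new Mongo('localhost:20771');
const res = donor.getDB('foo').runCommand({listIndexes: 'bar'});
printjson(res);  // in the failing run the donor answers with code 13388 (StaleConfig)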

The donor shard (the process on port 20771) has version 4.4.0-rc3-11-ge33cb0c:

d20771| 2020-04-29T14:05:22.089+00:00 I  CONTROL  23403   [initandlisten] "Build Info","attr":{"buildInfo":{"version":"4.4.0-rc3-11-ge33cb0c","gitVersion":"e33cb0c81df6901645788f1df484c2d492c08b1f","openSSLVersion":"OpenSSL 1.0.1e-fips 11 Feb 2013","modules":["enterprise"],"allocator":"tcmalloc","environment":{"distmod":"rhel62","distarch":"x86_64","target_arch":"x86_64"}}}

The recipient (the process on port 20770) runs version 4.5.0-979-ga1b1a3d:

d20770| 2020-04-29T14:05:22.085+00:00 I  CONTROL  23403   [initandlisten] "Build Info","attr":{"buildInfo":{"version":"4.5.0-979-ga1b1a3d","gitVersion":"a1b1a3db6286e9337d80602a79c905ee41828e03","openSSLVersion":"OpenSSL 1.0.1e-fips 11 Feb 2013","modules":["enterprise"],"allocator":"tcmalloc","environment":{"distmod":"rhel62","distarch":"x86_64","target_arch":"x86_64"}}}



 Comments   
Comment by Githook User [ 05/May/20 ]

Author:

{'name': 'Kaloian Manassiev', 'email': 'kaloian.manassiev@mongodb.com', 'username': 'kaloianm'}

Message: SERVER-47841 Ensure 4.5 nodes always send SSV with {kNoConnectionVersioning:true}

Branch: master
https://github.com/mongodb/mongo/commit/a439fbcb0574493648a7a92a4e91117ba79086ed

Comment by Kaloian Manassiev [ 05/May/20 ]

jack.mulrow, yes you are correct. Without sending the noConnectionVersioning parameter, it will default to false in the setShardVersion command.

The fix should be to just set it to true by default in SetShardVersionRequest::toBSON().
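
As a rough illustration of what that implies on the wire, the setShardVersion (SSV) request a 4.5 node sends would always carry the flag. This is a hedged sketch of the command document only; field names other than noConnectionVersioning are assumptions based on the discussion in this ticket, with values taken from the logs in the description, and the exact output of SetShardVersionRequest::toBSON() may differ.

// Hedged sketch of an SSV command document as sent by a 4.5 node after the fix.
const ssv = {
    setShardVersion: 'foo.bar',
    shard: 'addshard5-rs1',                                // shardId from the StaleConfig above
    version: Timestamp(2, 0),                              // vWanted from the StaleConfig above
    versionEpoch: ObjectId('5ea989acce5433364888267d'),    // collection epoch from the logs
    // The important part: always sent as true, so a 4.4 (last-stable) shard
    // never attaches ShardedConnectionInfo to the connection.
    noConnectionVersioning: true,
};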

Comment by Jack Mulrow [ 01/May/20 ]

I ran into this issue when debugging BF-17074, so I looked into it and I think it's caused by the setShardVersion changes in SERVER-47426, in particular the removal of the "noConnectionVersioning" parameter.

The listIndexes that triggers the StaleConfig error is intentionally sent by the recipient shard without a shard version (because the targeting doesn't depend on the chunk distribution), but the destination shard treats this as an explicit UNSHARDED shard version for some reason. I reproduced the issue with more logging in index_commands_shard_targeting.js, and it turns out this is because the command was executing on a client that had previously established a ShardedConnectionInfo decoration, which triggered this branch when getting the command's version for the listIndexes shard version check.

The reason the connection has a ShardedConnectionInfo seems to be because of this setShardVersion request (SSV) sent by a donor shard to the recipient at the end of a migration. Prior to SERVER-47426, this request included "noConnectionVersioning:true" (through the SetShardVersionRequest::makeForVersioningNoPersist() helper here), which prevented the donor shard from creating a ShardedConnectionInfo. In master, ShardedConnectionInfo has been removed, so this isn't a problem, but in mixed version clusters this SSV can lead a last-stable binary shard to incorrectly version the connection.

I was able to fix index_commands_shard_targeting.js by adding back "noConnectionVersioning:true" to that SSV in master, which might be enough to fix the failures in this ticket too.
