Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-47841

List Indexes command failes with epoch change on migration on sharding_multiversion suite

    • Type: Icon: Bug Bug
    • Resolution: Fixed
    • Priority: Icon: Major - P3 Major - P3
    • 4.7.0
    • Affects Version/s: 4.4.0-rc3
    • Component/s: Sharding
    • None
    • Fully Compatible
    • ALL
    • Hide

      1. Enable and run the sharding_multiversion suite. The attached patch can be used.

      Show
      1. Enable and run the sharding_multiversion suite. The attached patch can be used.
    • Sharding 2020-05-18
    • 0

      As part of SERVER-47822 the sharding_multiversion suite should be enabled, however, several tests fail with a similar backtrace:

      assert: command failed: {
      	"ok" : 0,
      	"errmsg" : "Data transfer error: migrate failed: StaleConfig{ ns: \"foo.bar\", vReceived: Timestamp(0, 0), vReceivedEpoch: ObjectId('000000000000000000000000'), vWanted: Timestamp(2, 0), vWantedEpoch: ObjectId('5ea989acce5433364888267d'), shardId: \"addshard5-rs1\" }: epoch mismatch detected for foo.bar",
      	"code" : 96,
      	"codeName" : "OperationFailed",
      	"operationTime" : Timestamp(1588169133, 9),
      	"$clusterTime" : {
      		"clusterTime" : Timestamp(1588169133, 9),
      		"signature" : {
      			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      			"keyId" : NumberLong(0)
      		}
      	}
      }
      _getErrorWithCode@src/mongo/shell/utils.js:25:13
      doassert@src/mongo/shell/assert.js:18:14
      _assertCommandWorked@src/mongo/shell/assert.js:663:17
      assert.commandWorked@src/mongo/shell/assert.js:755:16
      @jstests/sharding/addshard5.js:28:1
      @jstests/sharding/addshard5.js:3:2
      uncaught exception: Error: command failed: {
      	"ok" : 0,
      	"errmsg" : "Data transfer error: migrate failed: StaleConfig{ ns: \"foo.bar\", vReceived: Timestamp(0, 0), vReceivedEpoch: ObjectId('000000000000000000000000'), vWanted: Timestamp(2, 0), vWantedEpoch: ObjectId('5ea989acce5433364888267d'), shardId: \"addshard5-rs1\" }: epoch mismatch detected for foo.bar",
      	"code" : 96,
      	"codeName" : "OperationFailed",
      	"operationTime" : Timestamp(1588169133, 9),
      	"$clusterTime" : {
      		"clusterTime" : Timestamp(1588169133, 9),
      		"signature" : {
      			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
      			"keyId" : NumberLong(0)
      		}
      	}
      } :
      _getErrorWithCode@src/mongo/shell/utils.js:25:13
      doassert@src/mongo/shell/assert.js:18:14
      _assertCommandWorked@src/mongo/shell/assert.js:663:17
      assert.commandWorked@src/mongo/shell/assert.js:755:16
      @jstests/sharding/addshard5.js:28:1
      @jstests/sharding/addshard5.js:3:2
      failed to load: jstests/sharding/addshard5.js
      

      This is caused by an assert reached on the donor shard when trying to get the indexes during the migration of a chunk, the stacktrace can be seen below:

      src/mongo/util/stacktrace_posix.cpp:501:44: mongo::printStackTrace()
      src/mongo/util/assert_util.cpp:81:24: mongo::DBException::traceIfNeeded(mongo::DBException const&)
      src/mongo/util/assert_util.h:134:22: _ZN5mongo11DBExceptionC4ERKNS_6StatusE
      src/mongo/util/assert_util.h:151:66: _ZN5mongo18AssertionExceptionC4ERKNS_6StatusE
      src/mongo/util/assert_util.h:182:71: mongo::error_details::ExceptionForImpl<(mongo::ErrorCodes::Error)13388, mongo::ExceptionForCat<(mongo::ErrorCategory)4>, mongo::ExceptionForCat<(mongo::ErrorCategory)5> >::ExceptionForImpl(mongo::Status const&)
      build/cached/mongo/base/error_codes.cpp:2419:63: mongo::error_details::throwExceptionForStatus(mongo::Status const&)
      src/mongo/util/assert_util.cpp:257:43: mongo::uassertedWithLocation(mongo::Status const&, char const*, unsigned int)
      src/mongo/db/s/collection_sharding_runtime.cpp:381:5: operator()
      src/mongo/db/s/collection_sharding_runtime.cpp:381:5: mongo::CollectionShardingRuntime::_getMetadataWithVersionCheckAt(mongo::OperationContext*, boost::optional<mongo::LogicalTime> const&) (.cold.951)
      src/mongo/db/s/collection_sharding_runtime.cpp:162:5: mongo::CollectionShardingRuntime::checkShardVersionOrThrow(mongo::OperationContext*)
      src/mongo/db/db_raii.cpp:353:34: mongo::AutoGetCollectionForReadCommand::AutoGetCollectionForReadCommand(mongo::OperationContext*, mongo::NamespaceStringOrUUID const&, mongo::AutoGetCollection::ViewMode, mongo::Date_t, mongo::AutoStatsTracker::LogMode)
      src/mongo/db/commands/list_indexes.cpp:152:94: mongo::(anonymous namespace)::CmdListIndexes::run(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj const&, mongo::BSONObjBuilder&)
      src/mongo/db/commands.h:799:19: mongo::BasicCommand::runWithReplyBuilder(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj const&, mongo::rpc::ReplyBuilderInterface*)
      

      The logs on this failed task show the sequence of events:

      A shard starts a moveChunk process:

      d20771| 2020-04-29T14:05:32.785+00:00 I  MIGRATE  22016   [MoveChunk] "Starting chunk migration donation","attr":{"requestParameters":"ns: foo.bar, [{ _id: MinKey }, { _id: MaxKey }), fromShard: addshard5-rs1, toShard: addshard5-rs0","collectionEpoch":{"$oid":"5ea989acce5433364888267d"}}
      

      The recipient receives the command and start the migrateThread:

      d20770| 2020-04-29T14:05:32.809+00:00 D4 TXN      23984   [migrateThread] "New transaction started","attr":{"txnNumber":0,"lsid":{"uuid":{"$uuid":"16fd9315-5698-4a4b-8275-781d40de4782"}}}
      d20770| 2020-04-29T14:05:32.809+00:00 I  MIGRATE  22000   [migrateThread] "Starting receiving end of chunk migration","attr":{"chunkMin":{"_id":{"$minKey":1}},"chunkMax":{"_id":{"$maxKey":1}},"namespace":"foo.bar","fromShard":"addshard5-rs1","epoch":{"$oid":"5ea989acce5433364888267d"},"sessionId":"addshard5-rs1_addshard5-rs0_5ea989ac4beaa7b990cb1c02","migrationId":{"uuid":{"$uuid":"3c3be56a-5b3a-4de8-abf2-beac588042f3"}}} d20771| 2020-04-29T14:05:32.785+00:00 I  MIGRATE  22016   [MoveChunk] "Starting chunk migration donation","attr":{"requestParameters":"ns: foo.bar, [{ _id: MinKey }, { _id: MaxKey }), fromShard: addshard5-rs1, toShard: addshard5-rs0","collectionEpoch":{"$oid":"5ea989acce5433364888267d"}}
      

      As part of the process, it requests the index:

      d20771| 2020-04-29T14:05:32.970+00:00 I  COMMAND  51803   [conn35] "Slow query","attr":{"type":"command","ns":"foo.bar","command":{"listIndexes":"bar","$clusterTime":{"clusterTime":{"$timestamp":{"t":1588169132,"i":61}},"signature":{"hash":{"$binary":{"base64":"uIQUNHXCtJMnziE9Hq2R82DIcOY=","subType":"0"}},"keyId":6821134443802001430}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1588169132,"i":57}},"t":1}},"$db":"foo"},"numYields":0,"ok":0,"errMsg":"epoch mismatch detected for foo.bar","errName":13388,"errCode":13388,"reslen":550,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":2}},"ReplicationStateTransition":{"acquireCount":{"w":3}},"Global":{"acquireCount":{"r":3}},"Database":{"acquireCount":{"r":3}},"Collection":{"acquireCount":{"r":3}},"Mutex":{"acquireCount":{"r":4}}},"protocol":"op_msg","durationMillis":161}
      

      But, this operation fails with an exception:

      d20770| 2020-04-29T14:05:32.971+00:00 W  -        23075   [migrateThread] "DBException thrown","attr":{"error":{"code":13388,"codeName":"StaleConfig","errmsg":"epoch mismatch detected for foo.bar","ns":"foo.bar","vReceived":{"$timestamp":{"t":0,"i":0}},"vReceivedEpoch":{"$oid":"000000000000000000000000"},"vWanted":{"$timestamp":{"t":2,"i":0}},"vWantedEpoch":{"$oid":"5ea989acce5433364888267d"},"shardId":"addshard5-rs1"}}
      

      The donor shard (the process on port 20771) has version 4.4.0-rc3-11-ge33cb0c:

      d20771| 2020-04-29T14:05:22.089+00:00 I  CONTROL  23403   [initandlisten] "Build Info","attr":{"buildInfo":{"version":"4.4.0-rc3-11-ge33cb0c","gitVersion":"e33cb0c81df6901645788f1df484c2d492c08b1f","openSSLVersion":"OpenSSL 1.0.1e-fips 11 Feb 2013","modules":["enterprise"],"allocator":"tcmalloc","environment":{"distmod":"rhel62","distarch":"x86_64","target_arch":"x86_64"}}}
      

      The recipient (on port 20770), version 4.5.0-979-ga1b1a3d

      d20770| 2020-04-29T14:05:22.085+00:00 I  CONTROL  23403   [initandlisten] "Build Info","attr":{"buildInfo":{"version":"4.5.0-979-ga1b1a3d","gitVersion":"a1b1a3db6286e9337d80602a79c905ee41828e03","openSSLVersion":"OpenSSL 1.0.1e-fips 11 Feb 2013","modules":["enterprise"],"allocator":"tcmalloc","environment":{"distmod":"rhel62","distarch":"x86_64","target_arch":"x86_64"}}}
      

            Assignee:
            kaloian.manassiev@mongodb.com Kaloian Manassiev
            Reporter:
            marcos.grillo@mongodb.com Marcos José Grillo Ramirez
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated:
              Resolved: