Core Server / SERVER-64145

The transaction is aborted even after stepdown or chunk migration is completed

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None
    • ALL

      Tested on the following versions:

      • 4.2.11-rc1
      • 5.0.6

      OS version: CentOS 7

      Reproduction code:

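      package main
      
      import (
      	"context"
      	"strconv"
      	"time"
      
      	"github.com/sirupsen/logrus"
      	"go.mongodb.org/mongo-driver/bson"
      	"go.mongodb.org/mongo-driver/mongo"
      	// The reporter's own log and mongoutil packages are also required;
      	// their import paths are not given in this report.
      )
      
      const collNum = 4096
      const testDbName = "TestDB"
      
      func main() {
      	log.InitLogrus("./transactions.log")
      
      	mongosUri := "mongodb://xxxx"
      	rs0Primary := "mongodb://xxxx"
      
      	cli, err := connectToMongos(mongosUri)
      	if err != nil {
      		logrus.Error(err)
      		return
      	}
      	defer cli.Disconnect(context.Background())
      
      	// Shard collNum collections (test0 .. test4095) on a hashed "name" key.
      	for i := 0; i < collNum; i++ {
      		coll := "test" + strconv.Itoa(i)
      		err = mongoutil.EnableShard(cli, testDbName, coll, bson.M{"name": "hashed"}, false)
      		if err != nil {
      			logrus.Error(err)
      			return
      		}
      	}
      
      	// Step down the primary of shard rs0 while transactions are running.
      	go doStepDown(rs0Primary, 10*time.Second)
      
      	doTransactions(cli)
      }
      
      // doStepDown waits for delay, then asks the node at uri to step down
      // and stay ineligible for election for 120 seconds.
      func doStepDown(uri string, delay time.Duration) {
      	time.Sleep(delay)
      	cli, err := connectMongo(context.Background(), uri)
      	if err != nil {
      		logrus.Error(err)
      		return
      	}
      	defer cli.Disconnect(context.Background())
      
      	result := cli.Database("admin").RunCommand(context.Background(), bson.M{"replSetStepDown": 120})
      	if result.Err() != nil {
      		logrus.Error(result.Err())
      		return
      	}
      	logrus.Infof("run stepdown command for %s success", uri)
      }
      
      // doTransactions runs transactions through mongos in an endless loop.
      func doTransactions(cli *mongo.Client) {
      	for {
      		WithTransactionExample(cli, testDbName)
      	}
      }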

      The WithTransactionExample function is copied from https://docs.mongodb.com/manual/core/transactions/#transactions-api


      I have a sharded cluster with 4000 sharded collections. After running replSetStepDown on the primary of one shard, many transactions are aborted with errors like the following:

      (StaleConfig) Transaction 6ba3e857-e289-4ab1-a63c-c038a18bfc6c:614 was aborted on statement 1 due to: an error from cluster data placement change :: caused by :: Encountered error from xx.xx.xx.xx:xxxx during a transaction :: caused by :: epoch mismatch detected for xx.xx, the collection may have been dropped and recreated
      The config server's log shows:
      [PeriodicShardedIndexConsistencyChecker] Attempt 0 to check index consistency for millionGroup.g_m_version1 received StaleShardVersion error :: caused by :: StaleConfig{ ns: "millionGroup.g_m_version1", vReceived: Timestamp(1, 3), vReceivedEpoch: ObjectId('6189f321bbcd3f66776bbe8a'), vWanted: Timestamp(0, 0), vWantedEpoch: ObjectId('000000000000000000000000') }: epoch mismatch detected for millionGroup.g_m_version1, the collection may have been dropped and recreated
      

       

      Similarly, after adding a shard to the sharded cluster, many transactions are aborted with errors like the following:

      (StaleConfig) Transaction 324be44f-a3d4-4ee5-9fc4-9bbff6d53ffe:25 was aborted on statement 0 due to: an error from cluster data placement change :: caused by :: Encountered error from xx.xx.xx.xx:xxxx during a transaction :: caused by :: version mismatch detected for xx.xx

      For the latter case, `jstests/sharding/transactions_stale_shard_version_errors.js` explains that a transaction failure is expected behavior after a chunk migration.

      I can work around both problems by running findOne (with read preference PrimaryMode) on each collection after the stepdown or chunk migration and before running any transaction.

      So my questions are:

      1. Is it expected behavior that the first transaction executed on each collection is aborted after the stepdown completes?
      2. Why can't the catalog cache (or something else) on the shard be updated in time, so that the transaction is not aborted because of an epoch/version mismatch?

       

            Assignee:
            max.hirschhorn@mongodb.com Max Hirschhorn
            Reporter:
            beatjean1314@gmail.com beat jean
            Votes:
            0
            Watchers:
            6

              Created:
              Updated:
              Resolved: