Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
None

Operating System:
ALL
Steps To Reproduce:

Tested on the following versions:

- 4.2.11-rc1
- 5.0.6

OS version: CentOS 7

Reproduction code:

package main

const collNum = 4096
const testDbName = "TestDB"

func main() {
	log.InitLogrus("./transactions.log")
	mongosUri := "mongodb://xxxx"
	rs0Primary := "mongodb://xxxx"

	cli, err := connectToMongos(mongosUri)
	if err != nil {
		logrus.Error(err)
		return
	}
	defer cli.Disconnect(context.Background())

	for i := 0; i <= collNum; i++ {
		coll := "test" + strconv.Itoa(i)
		err = mongoutil.EnableShard(cli, testDbName, coll, bson.M{"name": "hashed"}, false)
		if err != nil {
			logrus.Error(err)
			return
		}
	}

	go doStepDown(rs0Primary, 10*time.Second)
	doTransactions(cli)
}

func doStepDown(uri string, delay time.Duration) {
	time.Sleep(delay)
	cli, err := connectMongo(context.Background(), uri)
	if err != nil {
		logrus.Error(err)
		return
	}
	result := cli.Database("admin").RunCommand(context.Background(), bson.M{"replSetStepDown": 120})
	if result.Err() != nil {
		logrus.Error(result.Err())
		return
	}
	logrus.Infof("run stepdown command for %s success", uri)
}

func doTransactions(cli *mongo.Client) {
	for {
		WithTransactionExample(cli, testDbName)
	}
}

The WithTransactionExample function is copied from https://docs.mongodb.com/manual/core/transactions/#transactions-api
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

I have a sharded cluster with 4000 sharded collections. After running a stepdown on one of the shards, many transactions fail with errors like the following:

(StaleConfig) Transaction 6ba3e857-e289-4ab1-a63c-c038a18bfc6c:614 was aborted on statement 1 due to: an error from cluster data placement change :: caused by :: Encountered error from xx.xx.xx.xx:xxxx during a transaction :: caused by :: epoch mismatch detected for xx.xx, the collection may have been dropped and recreated
The config server's log shows:
[PeriodicShardedIndexConsistencyChecker] Attempt 0 to check index consistency for millionGroup.g_m_version1 received StaleShardVersion error :: caused by :: StaleConfig{ ns: "millionGroup.g_m_version1", vReceived: Timestamp(1, 3), vReceivedEpoch: ObjectId('6189f321bbcd3f66776bbe8a'), vWanted: Timestamp(0, 0), vWantedEpoch: ObjectId('000000000000000000000000') }: epoch mismatch detected for millionGroup.g_m_version1, the collection may have been dropped and recreated

Similarly, after adding a shard to the sharded cluster, many transactions fail with errors like the following:

(StaleConfig) Transaction 324be44f-a3d4-4ee5-9fc4-9bbff6d53ffe:25 was aborted on statement 0 due to: an error from cluster data placement change :: caused by :: Encountered error from xx.xx.xx.xx:xxxx during a transaction :: caused by :: version mismatch detected for xx.xx

For the latter case, `jstests/sharding/transactions_stale_shard_version_errors.js` explains that transaction failure is expected behavior after a chunk migration.

I can work around both problems by running findOne (with read preference PrimaryMode) on each collection after the stepdown or chunk migration and before executing the transaction.
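A minimal sketch of that workaround's control flow, using only the Go standard library. The names `probeFunc`, `collectionNames`, `warmUp`, and `errSimulated` are hypothetical and not part of the reproduction code or the MongoDB driver. In a real program the probe would presumably run findOne with a primary read preference through the driver; here it is injected as a function, so the loop can be shown without assuming any driver API.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

// errSimulated is a stand-in failure used only by the demo probe below.
var errSimulated = errors.New("simulated probe failure")

// probeFunc stands in for a primary-read findOne against one collection
// (assumption: the real probe would call the MongoDB driver).
type probeFunc func(coll string) error

// collectionNames mirrors the naming loop in the reproduction code:
// prefix0 through prefix<last>, inclusive.
func collectionNames(prefix string, last int) []string {
	names := make([]string, 0, last+1)
	for i := 0; i <= last; i++ {
		names = append(names, prefix+strconv.Itoa(i))
	}
	return names
}

// warmUp probes every collection once and returns the names whose probe
// failed, so the caller can retry them before starting transactions.
func warmUp(names []string, probe probeFunc) []string {
	var failed []string
	for _, name := range names {
		if err := probe(name); err != nil {
			failed = append(failed, name)
		}
	}
	return failed
}

func main() {
	// Demo probe: fails for one collection to show how failures are reported.
	probe := func(coll string) error {
		if coll == "test2" {
			return errSimulated
		}
		return nil
	}
	failed := warmUp(collectionNames("test", 3), probe)
	fmt.Println("collections to retry:", failed)
}
```

The warm-up would run once after a stepdown or chunk migration is detected, before `doTransactions` resumes. With thousands of collections the loop is sequential and may be slow; whether to parallelize it is left open here.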

So my questions and suggestions are:

1. Is it expected behavior that the first transaction executed on each collection is aborted after the stepdown completes?
2. Why can't the catalog cache (or something else) on the shard be updated in time, so that the transaction is not aborted because of an epoch/version mismatch?

Assignee: Max Hirschhorn
Reporter: beat jean
Participants: beat jean, Chris Kelly, Max Hirschhorn
Votes: 0
Watchers: 6

Created: Mar 03 2022 07:25:41 AM UTC
Updated: Jun 06 2023 03:31:44 PM UTC
Resolved: Jun 06 2023 03:31:44 PM UTC
