[SERVER-47632] Update shard key concurrency workload ignores more moveChunk errors than intended Created: 17/Apr/20  Updated: 29/Oct/23  Resolved: 28/Apr/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.4.0-rc4, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Jack Mulrow Assignee: Gregory Noma
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
is related to SERVER-47631 Refine shard key concurrency workload... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Sharding 2020-05-04
Participants:

 Description   

The fsm workload random_moveChunk_update_shard_key.js allows migrations to fail with certain acceptable errors, but the check that a failed migration's error message contains a certain string is missing a > -1 comparison. Because indexOf() returns -1 (a truthy value) when the string is not present, the workload ignores more errors than intended. The check should be fixed, and any acceptable errors that were only being ignored because of this behavior should instead be ignored explicitly.
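A minimal sketch of the problem, assuming an indexOf()-style message check like the one described above (the error message string and variable names are illustrative, not the actual workload code):

// Buggy form: indexOf() returns -1 when the substring is absent, and -1 is
// truthy in JavaScript, so the error is treated as acceptable even though
// the message does not contain the string.
if (err.message.indexOf("some acceptable moveChunk error text")) {
    // error ignored even when the string is not present
}

// Intended form: only treat the error as acceptable when the substring is
// actually present in the message.
if (err.message.indexOf("some acceptable moveChunk error text") > -1) {
    // error ignored only when the message matches
}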



 Comments   
Comment by Githook User [ 28/Apr/20 ]

Author:

{'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}

Message: SERVER-47632 Explicitly ignore all acceptable moveChunk errors in update shard key concurrency workload

(cherry picked from commit f4800dd5dea55597b8984e547771aee5282f0386)
Branch: v4.4
https://github.com/mongodb/mongo/commit/97436b86d3821dc347b41341141d4998e3f5caf2

Comment by Githook User [ 28/Apr/20 ]

Author:

{'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}

Message: SERVER-47632 Explicitly ignore all acceptable moveChunk errors in update shard key concurrency workload
Branch: master
https://github.com/mongodb/mongo/commit/f4800dd5dea55597b8984e547771aee5282f0386

Comment by Max Hirschhorn [ 25/Apr/20 ]

In examining this failure from Gregory's patch build, I've now realized it is possible for the moveChunk command to fail as a result of a duplicate key error on the recipient when document shard key values are being updated. This is because receiving a chunk only waits for documents with shard key values in that range to have been cleaned up by the range deleter. If the shard key value for a document was updated after the chunk had been moved off the shard, then it is possible for the range deleter to not yet have cleaned up that document when trying to move the range back to the original shard.

[ShardedClusterFixture:job0:shard1:primary] | 2020-04-24T19:58:07.502+00:00 I  SHARDING 22080   [MoveChunk] "About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"ip-10-122-65-30:20004-2020-04-24T19:58:07.502+00:00-5ea344cffb7035f4f1ebd055","server":"ip-10-122-65-30:20004","shard":"shard-rs1","clientAddr":"","time":{"$date":"2020-04-24T19:58:07.502Z"},"what":"moveChunk.from","ns":"test0_fsmdb0.fsmcoll0","details":{"min":{"skey":400.0},"max":{"skey":500.0},"step 1 of 6":0,"step 2 of 6":3,"step 3 of 6":12,"step 4 of 6":1998,"step 5 of 6":13,"step 6 of 6":24,"to":"shard-rs0","from":"shard-rs1","note":"success"}}}
[ShardedClusterFixture:job0:shard1:primary] | 2020-04-24T19:58:07.612+00:00 I  SHARDING 22080   [MoveChunk] "About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"ip-10-122-65-30:20004-2020-04-24T19:58:07.612+00:00-5ea344cffb7035f4f1ebd0e2","server":"ip-10-122-65-30:20004","shard":"shard-rs1","clientAddr":"","time":{"$date":"2020-04-24T19:58:07.612Z"},"what":"moveChunk.from","ns":"test0_fsmdb0.fsmcoll0","details":{"min":{"skey":500.0},"max":{"skey":600.0},"step 1 of 6":0,"step 2 of 6":3,"step 3 of 6":13,"step 4 of 6":6,"step 5 of 6":14,"step 6 of 6":49,"to":"shard-rs0","from":"shard-rs1","note":"success"}}}
...
[fsm_workload_test:random_moveChunk_update_shard_key] 2020-04-24T19:58:07.624+0000 [jsTest] Running update as a retryable write. The document will move across chunks. Original document values -- id: 450, shardKey: 450, counter: 0
[fsm_workload_test:random_moveChunk_update_shard_key] 2020-04-24T19:58:07.624+0000 [jsTest] Intended new document values -- shardKey: 576, counter: 1
...
[ShardedClusterFixture:job0:shard0:primary] | 2020-04-24T19:58:07.624+00:00 D4 TXN      23984   [conn111] "New transaction started","attr":{"txnNumber":13,"lsid":{"uuid":{"$uuid":"59e19a65-4d33-418c-9328-fecd9f77846d"}}}
[ShardedClusterFixture:job0:mongos0] | 2020-04-24T19:58:07.624+00:00 D3 EXECUTOR 22607   [conn40] "Scheduling remote command request","attr":{"request":"RemoteCommand 1452 -- target:[localhost:20001] db:test0_fsmdb0 cmd:{ update: \"fsmcoll0\", bypassDocumentValidation: false, ordered: true, stmtIds: [ 0 ], updates: [ { q: { _id: 450.0, skey: 450.0 }, u: { _id: 450.0, skey: 576.0, counter: 1.0 }, multi: false, upsert: false } ], runtimeConstants: { localNow: new Date(1587758287624), clusterTime: Timestamp(1587758287, 168) }, shardVersion: [ Timestamp(41, 0), ObjectId('5ea344b2fb7035f4f1eba427') ], writeConcern: {}, lsid: { id: UUID(\"59e19a65-4d33-418c-9328-fecd9f77846d\"), uid: BinData(0, E3B0C44298FC1C149AFBF4C8996FB92427AE41E4649B934CA495991B7852B855) }, txnNumber: 13 }"}
[ShardedClusterFixture:job0:mongos0] | 2020-04-24T19:58:07.624+00:00 D2 ASIO     22597   [conn40] "Request finished with response","attr":{"requestId":1452,"isOK":true,"response":"{ n: 1, nModified: 1, opTime: { ts: Timestamp(1587758287, 169), t: 1 }, electionId: ObjectId('7fffffff0000000000000001'), ok: 1.0, readOnly: false, $gleStats: { lastOpTime: { ts: Timestamp(1587758287, 169), t: 1 }, electionId: ObjectId('7fffffff0000000000000001') }, lastCommittedOpTime: Timestamp(1587758287, 165), $configServerState: { opTime: { ts: Timestamp(1587758287, 168), t: 1 } }, $clusterTime: { clusterTime: Timestamp(1587758287, 169), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, operationTime: Timestamp(1587758287, 169) }"}
...
[ShardedClusterFixture:job0:shard0:primary] | 2020-04-24T19:58:07.697+00:00 I  SHARDING 22080   [MoveChunk] "About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"ip-10-122-65-30:20001-2020-04-24T19:58:07.697+00:00-5ea344cfbe81891a8941f55e","server":"ip-10-122-65-30:20001","shard":"shard-rs0","clientAddr":"","time":{"$date":"2020-04-24T19:58:07.697Z"},"what":"moveChunk.start","ns":"test0_fsmdb0.fsmcoll0","details":{"min":{"skey":500.0},"max":{"skey":600.0},"from":"shard-rs0","to":"shard-rs1"}}}
[ShardedClusterFixture:job0:shard1:primary] | 2020-04-24T19:58:07.717+00:00 I  MIGRATE  21999   [chunkInserter] "Batch insertion failed","attr":{"error":"DuplicateKey{ keyPattern: { _id: 1 }, keyValue: { _id: 450.0 } }: Insert of { _id: 450.0, skey: 576.0, counter: 1.0 } failed. :: caused by :: E11000 duplicate key error collection: test0_fsmdb0.fsmcoll0 index: _id_ dup key: { _id: 450.0 }"}
[ShardedClusterFixture:job0:shard1:primary] | 2020-04-24T19:58:07.717+00:00 I  SHARDING 22080   [migrateThread] "About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"ip-10-122-65-30:20004-2020-04-24T19:58:07.717+00:00-5ea344cffb7035f4f1ebd0ff","server":"ip-10-122-65-30:20004","shard":"shard-rs1","clientAddr":"","time":{"$date":"2020-04-24T19:58:07.717Z"},"what":"moveChunk.to","ns":"test0_fsmdb0.fsmcoll0","details":{"min":{"skey":500.0},"max":{"skey":600.0},"step 1 of 6":4,"step 2 of 6":0,"to":"shard-rs1","from":"shard-rs0","note":"aborted"}}}
[ShardedClusterFixture:job0:shard0:primary] | 2020-04-24T19:58:07.719+00:00 I  SHARDING 21993   [MoveChunk] "moveChunk data transfer progress","attr":{"response":{"waited":true,"active":false,"ns":"test0_fsmdb0.fsmcoll0","from":"shard-rs0/localhost:20001,localhost:20002,localhost:20003","fromShardId":"shard-rs0","min":{"skey":500.0},"max":{"skey":600.0},"shardKeyPattern":{"skey":1.0},"state":"fail","errmsg":"migrate failed: Location51008: operation was interrupted","counts":{"cloned":0,"clonedBytes":0,"catchup":0,"steady":0},"ok":1.0,"$gleStats":{"lastOpTime":{"ts":{"$timestamp":{"t":1587758285,"i":350}},"t":1},"electionId":{"$oid":"7fffffff0000000000000001"}},"lastCommittedOpTime":{"$timestamp":{"t":1587758287,"i":180}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1587758287,"i":187}},"t":1}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1587758287,"i":187}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1587758287,"i":180}}},"memoryUsedBytes":0,"docsRemainingToClone":0}
[ShardedClusterFixture:job0:shard0:primary] | 2020-04-24T19:58:07.719+00:00 I  SHARDING 22080   [MoveChunk] "About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"ip-10-122-65-30:20001-2020-04-24T19:58:07.719+00:00-5ea344cfbe81891a8941f57c","server":"ip-10-122-65-30:20001","shard":"shard-rs0","clientAddr":"","time":{"$date":"2020-04-24T19:58:07.719Z"},"what":"moveChunk.error","ns":"test0_fsmdb0.fsmcoll0","details":{"min":{"skey":500.0},"max":{"skey":600.0},"from":"shard-rs0","to":"shard-rs1"}}}
[ShardedClusterFixture:job0:shard0:primary] | 2020-04-24T19:58:07.737+00:00 I  SHARDING 22080   [MoveChunk] "About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"ip-10-122-65-30:20001-2020-04-24T19:58:07.737+00:00-5ea344cfbe81891a8941f592","server":"ip-10-122-65-30:20001","shard":"shard-rs0","clientAddr":"","time":{"$date":"2020-04-24T19:58:07.737Z"},"what":"moveChunk.from","ns":"test0_fsmdb0.fsmcoll0","details":{"min":{"skey":500.0},"max":{"skey":600.0},"step 1 of 6":0,"step 2 of 6":2,"step 3 of 6":13,"to":"shard-rs1","from":"shard-rs0","note":"aborted"}}}
...
[fsm_workload_test:random_moveChunk_update_shard_key] 2020-04-24T19:58:07.764+0000         	"errmsg" : "Data transfer error: migrate failed: Location51008: operation was interrupted",
[fsm_workload_test:random_moveChunk_update_shard_key] 2020-04-24T19:58:07.764+0000         	"codeName" : "OperationFailed",
[fsm_workload_test:random_moveChunk_update_shard_key] 2020-04-24T19:58:07.765+0000         			"skey" : 500
[fsm_workload_test:random_moveChunk_update_shard_key] 2020-04-24T19:58:07.765+0000         			"skey" : 600
[fsm_workload_test:random_moveChunk_update_shard_key] 2020-04-24T19:58:07.765+0000         	"to" : "shard-rs1",
[fsm_workload_test:random_moveChunk_update_shard_key] 2020-04-24T19:58:07.765+0000         	"_waitForDelete" : true

https://logkeeper.mongodb.org/lobster/build/8b5175147eca9145a219027f85c4d202/test/5ea344b2c2ab685b225d24b0#bookmarks=0%2C48508%2C48728%2C48793%2C48794%2C48797%2C48798%2C48804%2C49282%2C49335%2C49340%2C49345%2C49346%2C49388%2C49437%2C49439%2C49452%2C49455%2C49458%2C49459%2C50062&f~=000~59e19a65-4d33-418c-9328-fecd9f77846d&f~=100~%5C%5BMoveChunk%5C%5D.%2AAbout%20to%20log%20metadata%20event&l=1
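
A hedged sketch of how the workload could tolerate this failure explicitly once the indexOf() check is fixed; the helper name and matched strings are illustrative assumptions based on the errmsg/codeName in the log excerpt above, not the actual workload code:

// Returns true if a moveChunk failure is one the workload deliberately
// tolerates. Enumerating the cases explicitly avoids relying on an
// indexOf() result being accidentally truthy.
function isAcceptableMoveChunkError(err) {
    // Recipient aborted the transfer after a duplicate key error: the range
    // deleter may not yet have removed a document whose shard key value was
    // updated after its chunk was moved off that shard.
    if (err.code === ErrorCodes.OperationFailed &&
        err.errmsg.indexOf("Data transfer error") > -1) {
        return true;
    }
    return false;
}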

Comment by Max Hirschhorn [ 23/Apr/20 ]

To whoever picks up this ticket - See this comment on SERVER-47631 for how I'd recommend approaching it.
