[SERVER-47631] Refine shard key concurrency workloads ignore more moveChunk errors than intended Created: 17/Apr/20  Updated: 29/Oct/23  Resolved: 27/Apr/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 4.4.0-rc4, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Jack Mulrow Assignee: Gregory Noma
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
related to SERVER-47632 Update shard key concurrency workload... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Sharding 2020-05-04
Participants:

 Description   

The FSM workload random_moveChunk_refine_collection_shard_key.js and the two workloads that inherit from it allow migrations to fail with certain acceptable errors. However, the check that a failed migration's error message contains a certain string is missing a > -1 comparison, so it evaluates to a truthy value when the string is not present, and the workloads ignore more errors than intended. This check should be fixed, and any acceptable errors that were implicitly ignored because of this behavior should instead be explicitly ignored.
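
A minimal sketch of the pitfall described above, assuming the check uses String.prototype.indexOf directly as a condition (the exact workload code is not reproduced here; the message and substring below are hypothetical). indexOf returns -1 when the substring is absent, and -1 is truthy in JavaScript:

```javascript
// Hypothetical error message and substring, for illustration only.
const errMsg = "some unrelated failure";
const acceptable = "acceptable reason";

// Buggy form: indexOf returns -1 when the substring is absent, and -1 is truthy,
// so this treats an unrelated error as a match.
const buggyMatch = Boolean(errMsg.indexOf(acceptable));

// Intended form: compare the index against -1 explicitly.
const fixedMatch = errMsg.indexOf(acceptable) > -1;

// Equivalent check that avoids the -1 comparison altogether.
const includesMatch = errMsg.includes(acceptable);

console.log(buggyMatch, fixedMatch, includesMatch);  // true false false
```

Note that the buggy form is also wrong in the other direction: when the message starts with the substring, indexOf returns 0, which is falsy.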



 Comments   
Comment by Githook User [ 28/Apr/20 ]

Author:

{'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}

Message: SERVER-47631 Explicitly ignore all acceptable moveChunk errors in refine shard key concurrency workloads

(cherry picked from commit 894c71a61011c309be666fd59e87b25cd5f52681)
Branch: v4.4
https://github.com/mongodb/mongo/commit/72794710b95929bd9ea9dd8b75040bb47eeba327

Comment by Githook User [ 27/Apr/20 ]

Author:

{'name': 'Gregory Noma', 'email': 'gregory.noma@gmail.com', 'username': 'gregorynoma'}

Message: SERVER-47631 Explicitly ignore all acceptable moveChunk errors in refine shard key concurrency workloads
Branch: master
https://github.com/mongodb/mongo/commit/894c71a61011c309be666fd59e87b25cd5f52681

Comment by Max Hirschhorn [ 23/Apr/20 ]

My suggestion on the approach for this ticket would be to

  1. Change the random_moveChunk_refine_collection_shard_key.js FSM workload to use String.prototype.includes and remove all of the comparisons around -1 for readability.
  2. Schedule the burn_in_tests_gen task in a patch build to run the test repeatedly and identify any errors from moveChunk which should be handled explicitly.
    • The random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction.js and random_moveChunk_refine_collection_shard_key_broadcast_update_transaction.js FSM workloads derive from random_moveChunk_refine_collection_shard_key.js and should also be run as part of this process. One trick for this is to make a dummy change to those files so burn_in_tests perceives them as also being modified.

diff --git a/jstests/concurrency/fsm_workloads/random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction.js b/jstests/concurrency/fsm_workloads/random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction.js
index c725e65d58..14c4f439e1 100644
--- a/jstests/concurrency/fsm_workloads/random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction.js
+++ b/jstests/concurrency/fsm_workloads/random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction.js
@@ -19,6 +19,8 @@
  *   requires_sharding,
  *   uses_transactions,
  * ]
+ *
+ * This comment is to force the workload to run in burn_in_tests.
  */
 load('jstests/concurrency/fsm_libs/extend_workload.js');
 load('jstests/concurrency/fsm_workloads/random_moveChunk_refine_collection_shard_key.js');
diff --git a/jstests/concurrency/fsm_workloads/random_moveChunk_refine_collection_shard_key_broadcast_update_transaction.js b/jstests/concurrency/fsm_workloads/random_moveChunk_refine_collection_shard_key_broadcast_update_transaction.js
index 54b3f83f5f..029b13c7ff 100644
--- a/jstests/concurrency/fsm_workloads/random_moveChunk_refine_collection_shard_key_broadcast_update_transaction.js
+++ b/jstests/concurrency/fsm_workloads/random_moveChunk_refine_collection_shard_key_broadcast_update_transaction.js
@@ -19,6 +19,8 @@
  *   requires_sharding,
  *   uses_transactions,
  * ]
+ *
+ * This comment is to force the workload to run in burn_in_tests.
  */
 load('jstests/concurrency/fsm_libs/extend_workload.js');
 load('jstests/concurrency/fsm_workloads/random_moveChunk_refine_collection_shard_key.js');
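
Step 1 of the suggestion above could look roughly like the following. This is only a sketch: the helper name isAcceptableMoveChunkError and the placeholder substrings are hypothetical and are not taken from the actual workload or the fix commit.

```javascript
// Hypothetical helper: returns true only if the error message contains one of
// the explicitly listed acceptable substrings.
function isAcceptableMoveChunkError(errMsg, acceptableSubstrings) {
    return acceptableSubstrings.some(substring => errMsg.includes(substring));
}

// Placeholder list; the workload's real acceptable messages are not reproduced here.
const acceptableSubstrings = [
    "placeholder acceptable message A",
    "placeholder acceptable message B",
];

// Abbreviated form of the LockTimeout message from the log in this ticket.
const lockTimeoutMsg = "Unable to acquire X lock";

console.log(isAcceptableMoveChunkError(lockTimeoutMsg, acceptableSubstrings));  // false
console.log(isAcceptableMoveChunkError(
    "moveChunk failed: placeholder acceptable message A", acceptableSubstrings));  // true
```

With this shape, an error is ignored only if it matches an entry in the list, so any message surfaced by repeated burn_in_tests runs has to be added deliberately.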

SERVER-47631 was filed because a LockTimeout error response from the moveChunk command was being unintentionally ignored. It has likely only been observed on the Enterprise RHEL 6.2 DEBUG Code Coverage build variant, where the extra instrumentation slows down operations.

[fsm_workload_test:random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction] 2020-04-09T11:21:55.189+0000 [tid:3] Ignoring acceptable moveChunk error: Error: command failed: {
[fsm_workload_test:random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction] 2020-04-09T11:21:55.189+0000 	"ok" : 0,
[fsm_workload_test:random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction] 2020-04-09T11:21:55.190+0000 	"errmsg" : "Unable to acquire X lock on '{12738665835288932546: Collection, 1209450789220462786, test21_fsmdb0.fsmcoll0_50}' within 30000ms. opId: 254013, op: MoveChunk, connId: 0.",
[fsm_workload_test:random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction] 2020-04-09T11:21:55.190+0000 	"code" : 24,
[fsm_workload_test:random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction] 2020-04-09T11:21:55.190+0000 	"codeName" : "LockTimeout",
[fsm_workload_test:random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction] 2020-04-09T11:21:55.190+0000 	"operationTime" : Timestamp(1586431315, 9),
[fsm_workload_test:random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction] 2020-04-09T11:21:55.190+0000 	"$clusterTime" : {
[fsm_workload_test:random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction] 2020-04-09T11:21:55.190+0000 		"clusterTime" : Timestamp(1586431315, 9),
[fsm_workload_test:random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction] 2020-04-09T11:21:55.190+0000 		"signature" : {
[fsm_workload_test:random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction] 2020-04-09T11:21:55.190+0000 			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
[fsm_workload_test:random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction] 2020-04-09T11:21:55.190+0000 			"keyId" : NumberLong(0)
[fsm_workload_test:random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction] 2020-04-09T11:21:55.190+0000 		}
[fsm_workload_test:random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction] 2020-04-09T11:21:55.190+0000 	}
[fsm_workload_test:random_moveChunk_refine_collection_shard_key_broadcast_delete_transaction] 2020-04-09T11:21:55.190+0000 } : {
