[SERVER-73967] Update handling of create command in shouldRetryWithNetworkErrorOverride Created: 13/Feb/23  Updated: 24/Aug/23

Status: Blocked
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Major - P3
Reporter: Kaitlin Mahar
Assignee: Backlog - Storage Execution Team
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-60064 Make create command idempotent on mongod Closed
depends on SERVER-76547 Create command on a time-series colle... Closed
Assigned Teams:
Storage Execution
Sprint: Execution Team 2023-05-01
Participants:

 Description   

In jstests/libs/override_methods/network_error_and_txn_override.js, the shouldRetryWithNetworkErrorOverride method contains logic that swallows the error when a create command returns an ok: 0 response with the NamespaceExists error code, and returns ok: 1 instead.
This lets the command be retried safely when a network error makes it unknown whether the original attempt succeeded.
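
For illustration, the swallowing behavior described above could be sketched roughly as follows. This is not the code from network_error_and_txn_override.js; the function name and the response shapes are assumptions made for the example, and only the NamespaceExists code (48) is taken from the log quoted in the comments on this ticket.

```javascript
// Hypothetical sketch only; not the actual override implementation.
const NAMESPACE_EXISTS = 48;  // error code shown for NamespaceExists in the log on this ticket

// Returns the response a test would observe after the override has run.
function swallowNamespaceExistsOnCreate(cmdName, res) {
    if (cmdName === "create" && res.ok === 0 && res.code === NAMESPACE_EXISTS) {
        // An earlier attempt may have succeeded before the network error,
        // so the failure is reported to the caller as success.
        return {ok: 1};
    }
    // Every other response is passed through unchanged.
    return res;
}

const swallowed = swallowNamespaceExistsOnCreate(
    "create", {ok: 0, code: NAMESPACE_EXISTS, codeName: "NamespaceExists"});
const untouched = swallowNamespaceExistsOnCreate(
    "drop", {ok: 0, code: NAMESPACE_EXISTS, codeName: "NamespaceExists"});
```

In this sketch the caller cannot tell a swallowed NamespaceExists apart from a real success, which is the property the rest of this ticket is concerned with.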

With the create command becoming idempotent on mongod as of SERVER-60064, the logic in this file should be updated to no longer override the ok value, since the command should now be safely retryable.

The current behavior causes issues in test cases added in SERVER-60064 that expect NamespaceExists errors in cases where create is re-run with different options from the existing collection/view. The legitimate errors are incorrectly swallowed by this logic, which prevents the tests from being run in stepdown suites.
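
A small simulation may make the problem concrete. Everything below is a hypothetical stand-in: makeFakeCreate is an in-memory imitation of an idempotent create, not the server's implementation, and comparing options via JSON.stringify is a simplification chosen for the sketch.

```javascript
// Illustrative sketch only; all names are hypothetical.
const NAMESPACE_EXISTS = 48;

// In-memory stand-in for an idempotent create command.
function makeFakeCreate() {
    const existing = new Map();  // collection name -> serialized options
    return function create(name, options) {
        const serialized = JSON.stringify(options);
        if (!existing.has(name)) {
            existing.set(name, serialized);
            return {ok: 1};
        }
        if (existing.get(name) === serialized) {
            return {ok: 1};  // identical re-run is treated as a no-op
        }
        // Same name, different options: a legitimate error.
        return {ok: 0, code: NAMESPACE_EXISTS, codeName: "NamespaceExists"};
    };
}

// The override behavior described in this ticket: hide NamespaceExists.
function withSwallowing(res) {
    return (res.ok === 0 && res.code === NAMESPACE_EXISTS) ? {ok: 1} : res;
}

const create = makeFakeCreate();
const first = create("c", {capped: false});
const rerun = create("c", {capped: false});
const conflicting = create("c", {capped: true, size: 4096});
const seenByTest = withSwallowing(conflicting);
```

Here conflicting carries the NamespaceExists error that a test would want to assert on, while seenByTest reports ok: 1, so an assertion expecting the failure could never pass under the override.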

Along with updating the override logic, we should remove the does_not_support_stepdowns tag from jstests/core/views/view_creation.js and jstests/core/ddl/create_collection.js.



 Comments   
Comment by Dianna Hohensee (Inactive) [ 26/Apr/23 ]

This work will need to be rescheduled for post 8.0, assuming SERVER-76547 is completed before then.

Comment by Dianna Hohensee (Inactive) [ 26/Apr/23 ]

I don't think this work can be done yet, because the create command doesn't appear to be idempotent for time-series collections.

Relevant test failure details follow. Note that the original command is a create for a time-series collection, and the error arises because a time-series view already exists on a buckets collection; the error does not recognize that the existing namespace is time-series.

[js_test:timeseries_metric_index_compound] 2023-04-21T18:16:15.043Z assert: command failed: {
[js_test:timeseries_metric_index_compound] 	"ok" : 0,
[js_test:timeseries_metric_index_compound] 	"errmsg" : "namespace test.timeseries_metric_index_compound already exists, but is a view on test.system.buckets.timeseries_metric_index_compound rather than test",
[js_test:timeseries_metric_index_compound] 	"code" : 48,
[js_test:timeseries_metric_index_compound] 	"codeName" : "NamespaceExists",
[js_test:timeseries_metric_index_compound] 	"$clusterTime" : {
[js_test:timeseries_metric_index_compound] 		"clusterTime" : Timestamp(1682100974, 67),
[js_test:timeseries_metric_index_compound] 		"signature" : {
[js_test:timeseries_metric_index_compound] 			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
[js_test:timeseries_metric_index_compound] 			"keyId" : NumberLong(0)
[js_test:timeseries_metric_index_compound] 		}
[js_test:timeseries_metric_index_compound] 	},
[js_test:timeseries_metric_index_compound] 	"operationTime" : Timestamp(1682100974, 67)
[js_test:timeseries_metric_index_compound] } with original command request: {
[js_test:timeseries_metric_index_compound] 	"create" : "timeseries_metric_index_compound",
[js_test:timeseries_metric_index_compound] 	"timeseries" : {
[js_test:timeseries_metric_index_compound] 		"timeField" : "tm",
[js_test:timeseries_metric_index_compound] 		"metaField" : "mm"
[js_test:timeseries_metric_index_compound] 	},
[js_test:timeseries_metric_index_compound] 	"lsid" : {
[js_test:timeseries_metric_index_compound] 		"id" : UUID("4748b47a-6455-4017-a402-59816d1b6ffc")
[js_test:timeseries_metric_index_compound] 	},
[js_test:timeseries_metric_index_compound] 	"$clusterTime" : {
[js_test:timeseries_metric_index_compound] 		"clusterTime" : Timestamp(1682100974, 62),
[js_test:timeseries_metric_index_compound] 		"signature" : {
[js_test:timeseries_metric_index_compound] 			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
[js_test:timeseries_metric_index_compound] 			"keyId" : NumberLong(0)
[js_test:timeseries_metric_index_compound] 		}
[js_test:timeseries_metric_index_compound] 	},
[js_test:timeseries_metric_index_compound] 	"writeConcern" : {
[js_test:timeseries_metric_index_compound] 		"w" : "majority",
[js_test:timeseries_metric_index_compound] 		"wtimeout" : 300321
[js_test:timeseries_metric_index_compound] 	}
[js_test:timeseries_metric_index_compound] } on connection: connection to localhost:21000
[js_test:timeseries_metric_index_compound] _getErrorWithCode@src/mongo/shell/utils.js:24:13
[js_test:timeseries_metric_index_compound] doassert@src/mongo/shell/assert.js:18:14
[js_test:timeseries_metric_index_compound] _assertCommandWorked@src/mongo/shell/assert.js:766:25
[js_test:timeseries_metric_index_compound] assert.commandWorked@src/mongo/shell/assert.js:860:16
[js_test:timeseries_metric_index_compound] testBadIndex@jstests/core/timeseries/timeseries_metric_index_compound.js:179:16
[js_test:timeseries_metric_index_compound] @jstests/core/timeseries/timeseries_metric_index_compound.js:185:17
[js_test:timeseries_metric_index_compound] run@jstests/core/timeseries/libs/timeseries.js:203:15
[js_test:timeseries_metric_index_compound] @jstests/core/timeseries/timeseries_metric_index_compound.js:23:16
[js_test:timeseries_metric_index_compound] @jstests/core/timeseries/timeseries_metric_index_compound.js:206:2

This test failure is possible because the test ran in replica_sets_terminate_primary_jscore_passthrough, where the primary was stepped down, causing an InterruptedDueToReplStateChange error, and the create command was retried.

[js_test:timeseries_metric_index_compound] =-=-=-= Retrying write concern error response with retryable code :: create, CommandID: 472, error: {  "writeConcernError" : {  "code" : 11602,  "codeName" : "InterruptedDueToReplStateChange",  "errmsg" : "operation was interrupted",  "errInfo" : {  "writeConcern" : {  "w" : "majority",  "wtimeout" : 300321,  "provenance" : "clientSupplied" } } },  "ok" : 1,  "topologyVersion" : {  "processId" : ObjectId("6442d2d839878a7c9b4114a1"),  "counter" : NumberLong(7) },  "$clusterTime" : {  "clusterTime" : Timestamp(1682100974, 65),  "signature" : {  "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),  "keyId" : NumberLong(0) } },  "operationTime" : Timestamp(1682100974, 65) }, command: {  "create" : "timeseries_metric_index_compound",  "timeseries" : {  "timeField" : "tm",  "metaField" : "mm" },  "lsid" : {  "id" : UUID("4748b47a-6455-4017-a402-59816d1b6ffc") },  "$clusterTime" : {  "clusterTime" : Timestamp(1682100974, 62),  "signature" : {  "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),  "keyId" : NumberLong(0) } },  "writeConcern" : {  "w" : "majority",  "wtimeout" : 300321 } }
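
The sequence in the two log excerpts can be replayed with a minimal sketch. The retry helper and the scripted responses below are hypothetical and are not the override's actual code; only the error codes (11602 and 48), the code names, and the command fields are copied from the logs above.

```javascript
// Hypothetical sketch only; not the actual retry logic in the override.
const INTERRUPTED_DUE_TO_REPL_STATE_CHANGE = 11602;
const NAMESPACE_EXISTS = 48;

// Re-sends the command while the response carries a retryable write concern error.
function runWithRetry(send, cmd, maxAttempts = 3) {
    let res;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        res = send(cmd);
        const wce = res.writeConcernError;
        if (!wce || wce.code !== INTERRUPTED_DUE_TO_REPL_STATE_CHANGE) {
            return res;  // nothing retryable: hand the response to the caller
        }
    }
    return res;
}

// Returns canned responses in order, repeating the last one if exhausted.
function makeScriptedSend(responses) {
    let i = 0;
    return () => responses[Math.min(i++, responses.length - 1)];
}

const send = makeScriptedSend([
    // First attempt: ok: 1, but interrupted by the stepdown.
    {ok: 1, writeConcernError: {code: 11602, codeName: "InterruptedDueToReplStateChange"}},
    // Retry: the time-series collection now exists, and create is not idempotent for it.
    {ok: 0, code: NAMESPACE_EXISTS, codeName: "NamespaceExists"},
]);

const finalRes = runWithRetry(send, {
    create: "timeseries_metric_index_compound",
    timeseries: {timeField: "tm", metaField: "mm"},
});
```

Under these assumptions finalRes is the NamespaceExists failure, matching what assert.commandWorked reports in the stack trace once the override no longer swallows that error.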

This is the test failure line. And this is the create command error.

Generated at Thu Feb 08 06:26:07 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.