[SERVER-46733] Consider appending TransientTransactionError labels to ConflictingOperationInProgress errors
Created: 09/Mar/20  Updated: 07/Apr/23  Resolved: 07/Apr/23

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Major - P3
Reporter: Maria van Keulen
Assignee: [DO NOT USE] Backlog - Sharding NYC
Resolution: Won't Do
Votes: 0
Labels: ShardedTxn:FutureOptimizations, sharding-common-backlog
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Sharding NYC
Participants:

 Description   

As part of my work for SERVER-44409, I ran into many ConflictingOperationInProgress errors, e.g.:

[fsm_workload_test:CRUD_and_commands] 2020-03-09T18:19:52.516+0000         Foreground jstests/concurrency/fsm_workloads/CRUD_and_commands.js
[fsm_workload_test:CRUD_and_commands] 2020-03-09T18:19:52.516+0000         Error: command failed: {
[fsm_workload_test:CRUD_and_commands] 2020-03-09T18:19:52.516+0000         	"ok" : 0,
[fsm_workload_test:CRUD_and_commands] 2020-03-09T18:19:52.516+0000         	"errmsg" : "unable to initialize targeter for write op for collection test18_fsmdb0.fsmcoll0 :: caused by :: No chunks were found for the collection",
[fsm_workload_test:CRUD_and_commands] 2020-03-09T18:19:52.516+0000         	"code" : 117,
[fsm_workload_test:CRUD_and_commands] 2020-03-09T18:19:52.516+0000         	"codeName" : "ConflictingOperationInProgress",
[fsm_workload_test:CRUD_and_commands] 2020-03-09T18:19:52.516+0000         	"operationTime" : Timestamp(1583777991, 80),
[fsm_workload_test:CRUD_and_commands] 2020-03-09T18:19:52.517+0000         	"$clusterTime" : {
[fsm_workload_test:CRUD_and_commands] 2020-03-09T18:19:52.517+0000         		"clusterTime" : Timestamp(1583777991, 80),
[fsm_workload_test:CRUD_and_commands] 2020-03-09T18:19:52.517+0000         		"signature" : {
[fsm_workload_test:CRUD_and_commands] 2020-03-09T18:19:52.517+0000         			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
[fsm_workload_test:CRUD_and_commands] 2020-03-09T18:19:52.517+0000         			"keyId" : NumberLong(0)
[fsm_workload_test:CRUD_and_commands] 2020-03-09T18:19:52.517+0000         		}
[fsm_workload_test:CRUD_and_commands] 2020-03-09T18:19:52.517+0000         	}
[fsm_workload_test:CRUD_and_commands] 2020-03-09T18:19:52.517+0000         }

I encountered this error both inside and outside of transactions. Per a discussion with jack.mulrow, an aggressive concurrency workload in the sharding suites that runs dropCollection in parallel with CRUD ops can run into this kind of error, even though CRUD ops and dropCollection take conflicting locks.

Can we consider adding a TransientTransactionError label when we encounter this error, to facilitate retrying? Conceptually, this seems like a similar case to the existing TransientTransactionError cases.
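
To illustrate what the label would enable, here is a minimal sketch of a client-side retry loop keyed on the errorLabels array of a failed command response. It is not driver code: CommandError, make_conflict_response, and run_with_transient_retry are hypothetical names invented for this example, and the labeled response is what this ticket proposes, not what the server returns today.

```python
TRANSIENT_LABEL = "TransientTransactionError"
CONFLICTING_OPERATION_IN_PROGRESS = 117  # code seen in the log above


class CommandError(Exception):
    """Hypothetical wrapper around a failed command response document."""

    def __init__(self, response):
        super().__init__(response.get("errmsg", "command failed"))
        self.response = response

    def has_error_label(self, label):
        # Failed responses may carry an "errorLabels" array.
        return label in self.response.get("errorLabels", [])


def make_conflict_response(labeled):
    """Build a response shaped like the one in this ticket.

    labeled=True models the PROPOSED behavior (label attached);
    labeled=False models the behavior reported in the ticket.
    """
    response = {
        "ok": 0,
        "errmsg": "No chunks were found for the collection",
        "code": CONFLICTING_OPERATION_IN_PROGRESS,
        "codeName": "ConflictingOperationInProgress",
    }
    if labeled:
        response["errorLabels"] = [TRANSIENT_LABEL]
    return response


def run_with_transient_retry(txn_body, max_attempts=3):
    """Run txn_body, retrying only on errors labeled as transient."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return txn_body()
        except CommandError as exc:
            out_of_attempts = attempt == max_attempts
            if out_of_attempts or not exc.has_error_label(TRANSIENT_LABEL):
                raise  # unlabeled errors surface to the caller immediately
```

With the label attached, a workload that hits error 117 twice and then succeeds would complete on the third attempt. Without it, the first failure propagates to the caller, which matches the test failures described above.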



 Comments   
Comment by Max Hirschhorn [ 07/Apr/23 ]

An application isn't expected to drop an entire collection while concurrently performing transactions on it. So while it would be fine for the driver to retry the transaction based on a TransientTransactionError label in the server response, this scenario is very unlikely to come up in production. We can re-evaluate if it ever becomes desirable for transactions to automatically retry when an index on a collection is dropped, rather than failing with an error that is not automatically retryable, such as QueryPlanKilled.
