[SERVER-81246] FLE WriteConcernError behavior unclear Created: 20/Sep/23  Updated: 06/Feb/24  Resolved: 05/Jan/24

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.3.0-rc0, 7.0.6

Type: Bug Priority: Major - P3
Reporter: Vishnu Kaushik Assignee: Erwin Pe
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-81280 Handle writeConcernErrors for FLE in ... Blocked
Issue split
split to SERVER-84081 FLE2 write error hides write concern ... Needs Scheduling
Related
related to SERVER-81259 updateOne without shard key does not ... Open
is related to SERVER-78311 mongos does not report writeConcernEr... Closed
Assigned Teams:
Server Security
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.0
Sprint: Security 2023-11-13, Security 2023-11-27, Security 2023-12-11, Security 2023-12-25, Security 2024-01-08
Participants:

 Description   

This seems to happen on both mongos and mongod.

Here is an insert (non-FLE) that encounters a WriteConcernError (WCE). Note that n: 1 because the write went through, and the WCE is reported in the writeConcernError field.

{
	"n" : 1,
	"writeConcernError" : {
		"code" : 100,
		"codeName" : "UnsatisfiableWriteConcern",
		"errmsg" : "UnsatisfiableWriteConcern: Not enough data-bearing nodes; Error details: { writeConcern: { w: 3, wtimeout: 0, provenance: \"clientSupplied\" } } at shard-rs0",
		"errInfo" : {
			
		}
	},
	"ok" : 1,
	"$clusterTime" : {
		"clusterTime" : Timestamp(1695160343, 1),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	},
	"operationTime" : Timestamp(1695160343, 1)
}

When using FLE, the WCE is placed into the writeErrors field instead. I'm not sure how drivers would then interpret the error. Note that the response also reports the write as not having gone through (n: 0).

{
 	"n" : 0,
 	"opTime" : Timestamp(1695160105, 3),
 	"writeErrors" : [
 		{
 			"index" : 0,
 			"code" : 64,
 			"errmsg" : "Write concern error committing internal transaction :: caused by :: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: 2, wtimeout: 2000, provenance: \"clientSupplied\" } }"
 		}
 	],
 	"ok" : 1,
 	"$clusterTime" : {
 		"clusterTime" : Timestamp(1695160105, 7),
 		"signature" : {
 			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
 			"keyId" : NumberLong(0)
 		}
 	},
 	"operationTime" : Timestamp(1695160105, 3)
}
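To make the difference between the two replies concrete, here is a minimal sketch of what a client sees when it inspects each one. The helper `summarize_write_reply` is hypothetical and is not taken from any driver; the two dictionaries are trimmed copies of the replies above.

```python
def summarize_write_reply(reply):
    """Pull out the fields a client would typically inspect in a write reply.

    Hypothetical helper for illustration only; not any driver's actual logic.
    """
    write_errors = reply.get("writeErrors", [])
    wce = reply.get("writeConcernError")
    return {
        "n": reply.get("n", 0),
        "write_error_codes": [err["code"] for err in write_errors],
        "write_concern_error_code": wce["code"] if wce is not None else None,
    }


# Trimmed copies of the two replies above (only the fields that matter here).
non_fle_reply = {
    "n": 1,
    "writeConcernError": {"code": 100, "codeName": "UnsatisfiableWriteConcern"},
    "ok": 1,
}
fle_reply = {
    "n": 0,
    "writeErrors": [{"index": 0, "code": 64}],
    "ok": 1,
}

print(summarize_write_reply(non_fle_reply))
print(summarize_write_reply(fle_reply))
```

A client that only checks writeConcernError finds nothing in the FLE reply; the write concern failure is only visible as an entry in writeErrors.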

This led me to wonder what happens when an actual error, like DuplicateKeyError, shows up along with a WCE. The result is that the WCE is hidden (this is basically the bug from SERVER-78311):

{
 	"n" : 0,
 	"opTime" : Timestamp(1695221129, 11),
 	"writeErrors" : [
 		{
 			"index" : 0,
 			"code" : 11000,
 			"errmsg" : "E11000 duplicate key error collection: bulk_fle.basic index: _id_ dup key: { _id: 1.0 } found value: RecordId(1)",
 			"keyPattern" : {
 				"_id" : 1
 			},
 			"keyValue" : {
 				"_id" : 1
 			},
 			"foundValue" : NumberLong(1),
 			"duplicateRid" : NumberLong(1)
 		}
 	],
 	"ok" : 1,
 	"$clusterTime" : {
 		"clusterTime" : Timestamp(1695221129, 13),
 		"signature" : {
 			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
 			"keyId" : NumberLong(0)
 		}
 	},
 	"operationTime" : Timestamp(1695221129, 11)
}

I found this while looking into the existing behavior in preparation for implementing FLE + bulkWrite + WCE handling on mongos.



 Comments   
Comment by Githook User [ 23/Jan/24 ]

Author: Erwin Pe (erwin.pe@mongodb.com, username: erwee)

Message: SERVER-81246 Fix how FLE2 writes report write concern errors

(cherry picked from commit 8dfda69bd26722b56656d5cfe31772c28818dc8a)

GitOrigin-RevId: cb3b77e0fc9581ca13f110bad6d85fc6d6b899df
Branch: v7.0
https://github.com/mongodb/mongo/commit/89e6525b0e732555336146d44c4fe73e69d4d081

Comment by Githook User [ 04/Jan/24 ]

Author: Erwin Pe (erwin.pe@mongodb.com, username: erwee)

Message: SERVER-81246 Fix how FLE2 writes report write concern errors

GitOrigin-RevId: 8dfda69bd26722b56656d5cfe31772c28818dc8a
Branch: master
https://github.com/mongodb/mongo/commit/7155d43c5a8e3a815857009170cbda565672f408

Comment by Erwin Pe [ 07/Dec/23 ]

Thanks for the clarification, Lingzhi & Vishnu. The fact that the write and the commit went through, yet we report n: 0, is indeed a serious issue, on top of the write concern error being placed in writeErrors. Re-opening this ticket for a closer look.

Comment by Vishnu Kaushik [ 07/Dec/23 ]

Thanks for explaining, Lingzhi.

Here's a series of steps that demonstrates why the current behavior is wrong:
1. Perform an FLE write that gets a WC error.
2. The server responds with n: 0 in today's implementation.
3. Perform a read on the server. The write from Step 1 is visible, even though the server responded with n: 0 in Step 2.
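The steps above can be sketched with a toy model. `ToyServer` and its methods are hypothetical stand-ins, not MongoDB code; the model only assumes what this thread describes, namely that the internal transaction commits before the write concern is found to be unsatisfied.

```python
class ToyServer:
    """Hypothetical stand-in for a server; not MongoDB code."""

    def __init__(self, wc_satisfiable):
        self.docs = []
        self.wc_satisfiable = wc_satisfiable

    def fle_insert_pre_fix(self, doc):
        # The internal transaction commits first, so the document is stored.
        self.docs.append(doc)
        # Only afterwards is the write concern found to be unsatisfied,
        # and the pre-fix reply reports it as a write error with n: 0.
        if not self.wc_satisfiable:
            return {"n": 0, "writeErrors": [{"index": 0, "code": 64}], "ok": 1}
        return {"n": 1, "ok": 1}

    def find(self):
        return list(self.docs)


server = ToyServer(wc_satisfiable=False)
reply = server.fle_insert_pre_fix({"_id": 1})  # Step 1
print(reply["n"])                              # Step 2: reply claims n: 0
print(server.find())                           # Step 3: the write is visible
```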

Comment by Lingzhi Deng [ 07/Dec/23 ]

I don't think the internal transaction API can abort the transaction on writeConcern errors. The internal transaction API acts like a driver: it automatically retries the commit statement on retryable errors (including retryable writeConcern errors). But by that time it is already too late, because the internal transaction API has already issued the "commit". At that point the write is already done, and the only thing we know is that it didn't satisfy the specified writeConcern. So if we return that writeConcern error as a writeError and claim nothing was done, that's not necessarily true: a subsequent read could see the previous write even though it failed writeConcern. I agree with Vishnu that we should return any writeConcern error as a writeConcernError, not a write error.
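A minimal sketch of the reply shape argued for here, assuming the commit has already gone through by the time the write concern failure is known. The function `reply_after_commit` is hypothetical and is not the actual server change; code 64 is the code seen in the FLE reply in the description.

```python
def reply_after_commit(n_written, write_concern_error=None):
    """Build a reply for a write that has already committed.

    Hypothetical helper: the write count is reported as-is, and a write
    concern failure goes into writeConcernError rather than writeErrors.
    """
    reply = {"n": n_written, "ok": 1}
    if write_concern_error is not None:
        reply["writeConcernError"] = write_concern_error
    return reply


wce = {"code": 64, "errmsg": "waiting for replication timed out"}
print(reply_after_commit(1, wce))
```

With this shape, n matches what a subsequent read would observe, and the write concern failure stays distinguishable from a genuine write error.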

Comment by Erwin Pe [ 07/Dec/23 ]

As it is currently designed, Queryable Encryption write operations handle the write concern through the transaction API during the internal transaction commit. The transaction API honors the write concern on commit, but if it's not satisfiable, then it aborts the transaction and returns an UnsatisfiableWriteConcern in the effective commit status. The FLE CRUD code, in turn, adds this error to the `writeErrors` section of the response.

This is very different from how write concern is handled in regular write operations, where it's the service entry point that waits for the write concern and attaches any WCEs to the final command reply. Normally, writes can still succeed (n > 0) with a WCE, but in FLE2, a write concern error is considered a write error because the entire write is always aborted if the WC can't be satisfied.
