[SERVER-31142] Implement failpoint for testing retryable, multi-statement writes
Created: 18/Sep/17  Updated: 27/Oct/23  Resolved: 18/Oct/17

Status: Closed
Project: Core Server
Component/s: Write Ops
Affects Version/s: None
Fix Version/s: None

Type: Task
Priority: Major - P3
Reporter: Jeremy Mikola
Assignee: Jack Mulrow
Resolution: Works as Designed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-29606 Implement failpoint to aid testing re... Closed
Sprint: Sharding 2017-10-02, Sharding 2017-10-23
Participants:

 Description   

SERVER-29606 implemented an onPrimaryTransactionalWrite fail point, but its API does not seem suitable for testing retryable behavior for multi-statement write operations, such as this test from the driver spec:

# description: "BulkWrite succeeds after one network error for each command"
# Fail point will repeat 5 times, but only the first attempt of each
# write command will yield a network error. This will allow for the
# following exchange:
#
#  1. delete fails
#  2. delete is retried and succeeds
#  3. insert fails
#  4. insert is retried and succeeds
#  5. update fails (fail point deactivates)
#  6. update is retried and succeeds

In this test, the driver's bulkWrite() CRUD method is called with an ordered sequence of delete, insert, and update operations. This will be executed as three write commands. The test suite needs the ability to configure a fail point so that only the first attempt of each of these three write commands fails, in order to verify that the driver successfully retries individual write commands once to allow the entire bulkWrite() to succeed.
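
The exchange requested above can be sketched as a toy model in plain JavaScript. This is only an illustration of the desired behavior, not server or driver code: ToyServer, runWrite, and bulkWriteWithRetry are hypothetical names, and the model simply assumes that the first attempt of each write command (identified by its txnNumber) fails with a network error while any later attempt succeeds.

```javascript
// Toy model only (hypothetical names, not server code): the first attempt
// of each write command, identified by txnNumber, fails; retries succeed.
class ToyServer {
  constructor() {
    this.seenTxnNumbers = new Set();
  }

  runWrite(cmd) {
    if (!this.seenTxnNumbers.has(cmd.txnNumber)) {
      // First attempt of this command: simulate the fail point firing.
      this.seenTxnNumbers.add(cmd.txnNumber);
      throw new Error("network error");
    }
    // Any later attempt with the same txnNumber succeeds.
    return { ok: 1 };
  }
}

// Driver-side sketch: run the commands in order, retrying each one once.
function bulkWriteWithRetry(server, commands, log) {
  return commands.map((cmd) => {
    try {
      return server.runWrite(cmd);
    } catch (err) {
      log.push(`${cmd.name} fails`);
      const result = server.runWrite(cmd); // retry reuses the txnNumber
      log.push(`${cmd.name} is retried and succeeds`);
      return result;
    }
  });
}

const log = [];
const results = bulkWriteWithRetry(
  new ToyServer(),
  [
    { name: "delete", txnNumber: 1 },
    { name: "insert", txnNumber: 2 },
    { name: "update", txnNumber: 3 },
  ],
  log
);
console.log(log.join("\n"));
```

The logged lines follow the same order as the six steps in the spec test: each of the three write commands fails once and then succeeds on its single retry.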

Per kaloian.manassiev's comments:

We need to allow a combination of skip + nTimes (right now they are separate), or add a failpoint for the entire command (as opposed to the separate commits that happen within a command).



 Comments   
Comment by Jeremy Mikola [ 18/Oct/17 ]

Closing per my comment in SPEC-966. Drivers can use the fail point as-is to test what we need to.

Comment by Jeremy Mikola [ 11/Oct/17 ]

Not sure what you mean by "tied to transaction IDs"

Sorry, I meant to write "tied to statements." I think my misunderstanding stemmed from overlooking "as opposed to the separate commits that happen within a command" in your original comment (quoted in the issue description).

Please see my comment in SPEC-966. I believe we can use skip to delay the fail point until the last statement in a multi-statement write command. If so, I think we can make do with the current design.
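
As a rough sketch of that idea, the helper below builds a configureFailPoint command document whose skip mode delays the fail point until the last statement. It rests on assumptions not confirmed in this ticket: that the fail point is evaluated once per statement, and that a skip of N lets the first N evaluations pass before the fail point turns on. failOnLastStatement is a hypothetical helper name, and the optional data argument stands in for options such as failBeforeCommitExceptionCode.

```javascript
// Hypothetical helper: build a configureFailPoint command that stays
// inactive for the first (statementCount - 1) statements.
// Assumption: the fail point is evaluated once per statement.
function failOnLastStatement(statementCount, data) {
  if (!Number.isInteger(statementCount) || statementCount < 1) {
    throw new RangeError("statementCount must be a positive integer");
  }
  const cmd = {
    configureFailPoint: "onPrimaryTransactionalWrite",
    // With a single statement there is nothing to skip.
    mode: statementCount === 1 ? "alwaysOn" : { skip: statementCount - 1 },
  };
  if (data !== undefined) {
    cmd.data = data; // fail point options, passed through unchanged
  }
  return cmd;
}

const threeStatements = failOnLastStatement(3);
const oneStatement = failOnLastStatement(1);
console.log(JSON.stringify(threeStatements));
console.log(JSON.stringify(oneStatement));
```

In the mongo shell, the resulting document would be passed to db.adminCommand(), as in the configureFailPoint call in Shane Harvey's shell session.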

Comment by Kaloian Manassiev [ 11/Oct/17 ]

Not sure what you mean by "tied to transaction IDs". It just specifies what code to throw when we try to commit one operation from a batch of writes.

Comment by Jeremy Mikola [ 11/Oct/17 ]

kaloian.manassiev: I think so. shane.harvey explained to me how failBeforeCommitExceptionCode works (I didn't realize it was tied to transaction IDs). Let me make some changes to the retryable write spec tests to take advantage of that option and times and I'll follow up once we confirm we can get by with the existing fail point design.

Comment by Kaloian Manassiev [ 11/Oct/17 ]

Thanks, shane.harvey.

jmikola, does what Shane is describing work for your use case and can we close this ticket?

Comment by Shane Harvey [ 10/Oct/17 ]

The onPrimaryTransactionalWrite fail point can already fail the first attempt of each consecutive write command:

MongoDB shell version v3.4.6
connecting to: mongodb://127.0.0.1:27017
MongoDB server version: 3.5.13-327-ga34daf8
WARNING: shell and server versions do not match
Server has startup warnings:
2017-10-10T11:49:41.549-0700 I CONTROL  [initandlisten]
2017-10-10T11:49:41.549-0700 I CONTROL  [initandlisten] ** NOTE: This is a development version (3.5.13-327-ga34daf8) of MongoDB.
2017-10-10T11:49:41.549-0700 I CONTROL  [initandlisten] **       Not recommended for production.
2017-10-10T11:49:41.549-0700 I CONTROL  [initandlisten]
2017-10-10T11:49:41.549-0700 I CONTROL  [initandlisten] ** WARNING: Access control is not enabled for the database.
2017-10-10T11:49:41.549-0700 I CONTROL  [initandlisten] **          Read and write access to data and configuration is unrestricted.
2017-10-10T11:49:41.549-0700 I CONTROL  [initandlisten]
MongoDB Enterprise dc28deb5-a14d-4bb8-9ac7-c40eeab579dd:PRIMARY> db.test.insert({_id:1})
WriteResult({ "nInserted" : 1 })
MongoDB Enterprise dc28deb5-a14d-4bb8-9ac7-c40eeab579dd:PRIMARY> db.test.insert({_id:2})
WriteResult({ "nInserted" : 1 })
MongoDB Enterprise dc28deb5-a14d-4bb8-9ac7-c40eeab579dd:PRIMARY> var txnCmd = {'delete': 'test', 'txnNumber': NumberLong(1), 'lsid': {'id': BinData(4, "nGmSpxKoTcWrYclQh5xF0A==")}, 'deletes': [{'q': {'_id': 1}, 'limit': 1}], 'ordered': true}
MongoDB Enterprise dc28deb5-a14d-4bb8-9ac7-c40eeab579dd:PRIMARY> db.adminCommand({"configureFailPoint":  "onPrimaryTransactionalWrite", mode: "alwaysOn"})
{ "ok" : 1, "operationTime" : Timestamp(1507661395, 1) }
MongoDB Enterprise dc28deb5-a14d-4bb8-9ac7-c40eeab579dd:PRIMARY> db.test.runCommand(txnCmd);
2017-10-10T11:50:06.937-0700 E QUERY    [thread1] Error: error doing query: failed: network error while attempting to run command 'delete' on host '127.0.0.1:27017'  :
DB.prototype.runCommand@src/mongo/shell/db.js:132:1
DBCollection.prototype._dbCommand@src/mongo/shell/collection.js:173:1
@(shell):1:1
2017-10-10T11:50:06.939-0700 I NETWORK  [thread1] trying reconnect to 127.0.0.1:27017 (127.0.0.1) failed
2017-10-10T11:50:06.941-0700 I NETWORK  [thread1] reconnect 127.0.0.1:27017 (127.0.0.1) ok
MongoDB Enterprise dc28deb5-a14d-4bb8-9ac7-c40eeab579dd:PRIMARY> db.test.runCommand(txnCmd);
{
	"n" : 1,
	"opTime" : {
		"ts" : Timestamp(1507661406, 1),
		"t" : NumberLong(1)
	},
	"electionId" : ObjectId("7fffffff0000000000000001"),
	"ok" : 1,
	"operationTime" : Timestamp(1507661406, 1)
}
MongoDB Enterprise dc28deb5-a14d-4bb8-9ac7-c40eeab579dd:PRIMARY> db.test.insert({_id:1})
WriteResult({ "nInserted" : 1 })
MongoDB Enterprise dc28deb5-a14d-4bb8-9ac7-c40eeab579dd:PRIMARY> txnCmd['txnNumber'] = NumberLong(2)
NumberLong(2)
MongoDB Enterprise dc28deb5-a14d-4bb8-9ac7-c40eeab579dd:PRIMARY> db.test.runCommand(txnCmd);
2017-10-10T11:50:26.353-0700 E QUERY    [thread1] Error: error doing query: failed: network error while attempting to run command 'delete' on host '127.0.0.1:27017'  :
DB.prototype.runCommand@src/mongo/shell/db.js:132:1
DBCollection.prototype._dbCommand@src/mongo/shell/collection.js:173:1
@(shell):1:1
2017-10-10T11:50:26.355-0700 I NETWORK  [thread1] trying reconnect to 127.0.0.1:27017 (127.0.0.1) failed
2017-10-10T11:50:26.355-0700 I NETWORK  [thread1] reconnect 127.0.0.1:27017 (127.0.0.1) ok
MongoDB Enterprise dc28deb5-a14d-4bb8-9ac7-c40eeab579dd:PRIMARY> db.test.runCommand(txnCmd);
{
	"n" : 1,
	"opTime" : {
		"ts" : Timestamp(1507661426, 1),
		"t" : NumberLong(1)
	},
	"electionId" : ObjectId("7fffffff0000000000000001"),
	"ok" : 1,
	"operationTime" : Timestamp(1507661426, 1)
}

I think this ticket can be closed.

Comment by Jeremy Mikola [ 06/Oct/17 ]

kaloian.manassiev: Note that in the variant jeff.yemin shared, we effectively need to interleave skips and failures. I'm not sure that "a combination of skip + nTimes" alone would do the trick. This is why I originally proposed an errorOnFirstAttemptOnly option in SERVER-29606.

Comment by Jeffrey Yemin [ 21/Sep/17 ]

Another variant which is required:

1. insert fails
2. insert is retried and succeeds
3. insert fails
4. insert is retried and succeeds

This lets clients test retries during batch splitting.
