[SERVER-29606] Implement failpoint to aid testing retryable writes Created: 13/Jun/17  Updated: 30/Oct/23  Resolved: 14/Sep/17

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 3.6.0-rc0

Type: New Feature
Priority: Major - P3
Reporter: Jeremy Mikola
Assignee: Kaloian Manassiev
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-31142 Implement failpoint for testing retry... Closed
is related to DOCS-10784 Restore documentation for configureFa... Backlog
is related to SERVER-34057 Update onPrimaryTransactionalWrite fa... Closed
Backwards Compatibility: Fully Compatible
Sprint: Sharding 2017-10-02
Participants:

 Description   

As requested in the retryable writes driver spec, a failpoint that triggers the server to close a socket after applying a write operation will be helpful for testing retryable writes.

Drivers will only attempt to retry writes if they encounter a network error, so having a failpoint return a failure result for the write command would not be sufficient.
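
The distinction above can be illustrated with a minimal sketch of driver-side retry logic. It is not taken from any real driver: NetworkError, runWithRetry, and flakyWrite are hypothetical names, and the single-retry policy is an assumption made for the illustration.

```javascript
// Hypothetical error type standing in for a driver's network error.
class NetworkError extends Error {}

// Run a write once; retry a single time only if it failed with a network error.
function runWithRetry(write) {
  try {
    return write();
  } catch (err) {
    if (err instanceof NetworkError) {
      return write(); // second and final attempt
    }
    throw err; // a failure result from the command is surfaced, not retried
  }
}

// A simulated write whose connection drops on the first attempt only.
let attempts = 0;
const flakyWrite = () => {
  attempts += 1;
  if (attempts === 1) throw new NetworkError("socket closed");
  return { ok: 1, n: 1 };
};

const result = runWithRetry(flakyWrite);
console.log(result, attempts);
```

Because only the NetworkError branch leads to a second attempt, a fail point that merely returned an error document would never exercise the retry path in this model.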



 Comments   
Comment by Ramon Fernandez Marina [ 14/Sep/17 ]

Author: Kaloian Manassiev (kaloianm) <kaloian.manassiev@mongodb.com>

Message: SERVER-29606 Introduce 'onPrimaryTransactionalWrite' failpoint
Branch: master
https://github.com/mongodb/mongo/commit/4a415f8e251a8ca6de382b5c68601fe52b15aaeb

Comment by Ramon Fernandez Marina [ 14/Sep/17 ]

Author: Kaloian Manassiev (kaloianm) <kaloian.manassiev@mongodb.com>

Message: SERVER-29606 Add a 'skip' failpoint option

This option allows the failpoint's effect to be skipped for up to a
certain amount of checks.
Branch: master
https://github.com/mongodb/mongo/commit/b0abd3f0b318e61837c2853a21b0ffde0bf8639f

Comment by A. Jesse Jiryu Davis [ 13/Sep/17 ]

Looks great, thanks!

Comment by Kaloian Manassiev [ 13/Sep/17 ]

For our internal testing we needed a slightly more elaborate failpoint, so instead of disconnectAfterWrite I implemented a more generic one, onPrimaryTransactionalWrite, with the following parameters, which can be specified through the failpoint's data document:

  • closeConnection (bool, default = true): Closes the connection on which the write was executed.
  • failBeforeCommitExceptionCode (int, default = not specified): If set, the specified exception code will be thrown, which will cause the write to not commit; if not specified, the write will be allowed to commit.
Both options can be combined.
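
As a rough sketch, a command document combining the two options might be assembled as follows. The helper buildFailPointCommand is hypothetical, the mode value follows the form used elsewhere in this ticket, and the exception code 1 is only a placeholder.

```javascript
// Hypothetical helper: builds a configureFailPoint command document for
// onPrimaryTransactionalWrite. Options left undefined are omitted so that
// the defaults described above apply.
function buildFailPointCommand({ times = 1, closeConnection, failBeforeCommitExceptionCode } = {}) {
  const data = {};
  if (closeConnection !== undefined) {
    data.closeConnection = closeConnection;
  }
  if (failBeforeCommitExceptionCode !== undefined) {
    data.failBeforeCommitExceptionCode = failBeforeCommitExceptionCode;
  }
  return {
    configureFailPoint: "onPrimaryTransactionalWrite",
    mode: { times },
    data,
  };
}

// Close the connection and prevent the commit; the code 1 is a placeholder.
const cmd = buildFailPointCommand({
  times: 1,
  closeConnection: true,
  failBeforeCommitExceptionCode: 1,
});
console.log(JSON.stringify(cmd, null, 2));

// In the mongo shell, the document would then be sent with:
//   db.adminCommand(cmd);
```

The sketch only constructs the document; sending it assumes a server on which fail points can be configured.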

jmikola, jesse, jeff.yemin: I believe this will suit your needs for driver testing as well, but let me know if you have any problems with it.

Comment by Jeremy Mikola [ 24/Jul/17 ]

In mongodb/specifications#156, jeff.yemin highlighted a need to test retryable behavior from a failed write command in the middle of a split batch (e.g. insertMany() with several 8MB documents split into a series of insert commands). This will require the fail point to support a skip option that defers triggering until the first X commands have passed. For example:

db.runCommand({
    configureFailPoint: "disconnectAfterWrite",
    mode: { times: <integer> },
    data: { skip: 2 }
});
 
db.collection.insertMany([ tenMegabyteDoc, tenMegabyteDoc, tenMegabyteDoc ]);

The insertMany() operation would be split into three insert commands and the fail point will trigger on the third and final command in the sequence. Without a skip option, we can only test retryable behavior for the first insert command in the sequence.
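
The effect of the proposed skip option can be pictured with a toy model. This is not the server's implementation: makeFailPoint and runBatch are hypothetical names, the way skip and times interact is an assumption, and retries are left out for simplicity.

```javascript
// Toy fail point: ignore the first `skip` evaluations, then trigger on the
// next `times` evaluations, then stay off.
function makeFailPoint({ skip = 0, times = 1 }) {
  let skipped = 0;
  let fired = 0;
  return function shouldFail() {
    if (skipped < skip) {
      skipped += 1;
      return false;
    }
    if (fired < times) {
      fired += 1;
      return true;
    }
    return false;
  };
}

// Evaluate the fail point once per insert command in a split batch.
function runBatch(commandCount, options) {
  const shouldFail = makeFailPoint(options);
  return Array.from({ length: commandCount }, () => shouldFail());
}

const withSkip = runBatch(3, { skip: 2, times: 1 });    // [false, false, true]
const withoutSkip = runBatch(3, { skip: 0, times: 1 }); // [true, false, false]
console.log(withSkip, withoutSkip);
```

In this model, skip: 2 moves the failure from the first insert command to the third.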

Is this functionality we can expect for the fail point? If not, I'm open to another proposal that would allow the above test case to be exercised.

Comment by Jeremy Mikola [ 20/Jun/17 ]

We're currently prototyping driver tests for retryable writes in mongodb/specifications#156 and are expecting the following API:

The tests depend on a server fail point, disconnectAfterWrite, which allows us to force a network error after the server executes a write command but before it would return a write result. The fail point may be configured like so:

db.runCommand({
    configureFailPoint: "disconnectAfterWrite",
    mode: { times: <integer> },
    data: { errorOnFirstAttemptOnly: <boolean> }
});

The times option is a generic fail point option and specifies the number of times that the fail point remains on before it deactivates. Each write command will count towards the times limit regardless of whether a network error occurs.

The errorOnFirstAttemptOnly option defaults to false, but may be set to true to have the fail point allow retry attempts to succeed. This is needed to test a bulk write operation consisting of multiple write commands, so that we can expect one network error for each command in the batch and still expect the entire bulk write operation to succeed.
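
The interaction between times and errorOnFirstAttemptOnly can be sketched with a small simulation. It is a model of the behavior described above, not server code: makeDisconnectAfterWrite and bulkWrite are hypothetical, each command is retried at most once, and identifying a retry by a command id is an assumption.

```javascript
// Toy model of the proposed disconnectAfterWrite fail point.
function makeDisconnectAfterWrite({ times, errorOnFirstAttemptOnly = false }) {
  let remaining = times;
  const seen = new Set();
  return function shouldDisconnect(commandId) {
    if (remaining <= 0) return false; // the fail point has deactivated
    remaining -= 1;                   // every write command counts toward times
    const isRetry = seen.has(commandId);
    seen.add(commandId);
    return !(errorOnFirstAttemptOnly && isRetry);
  };
}

// Run each command of a bulk write, retrying once after a network error.
function bulkWrite(commandIds, shouldDisconnect) {
  let networkErrors = 0;
  for (const id of commandIds) {
    if (!shouldDisconnect(id)) continue; // first attempt succeeded
    networkErrors += 1;
    if (shouldDisconnect(id)) {          // the retry failed as well
      networkErrors += 1;
      return { ok: false, networkErrors };
    }
  }
  return { ok: true, networkErrors };
}

const firstOnly = bulkWrite(
  ["insert-1", "insert-2", "insert-3"],
  makeDisconnectAfterWrite({ times: 6, errorOnFirstAttemptOnly: true })
);
const everyAttempt = bulkWrite(
  ["insert-1", "insert-2", "insert-3"],
  makeDisconnectAfterWrite({ times: 6, errorOnFirstAttemptOnly: false })
);
console.log(firstOnly, everyAttempt);
```

With errorOnFirstAttemptOnly set to true, the model sees one network error per command and the bulk write still completes; with the default of false, the retry of the first command fails too and the operation stops.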

Generated at Thu Feb 08 04:21:20 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.