[SERVER-66116] Aborted Read with MongoNotPrimaryException Created: 02/May/22  Updated: 03/Oct/22

Status: Blocked
Project: Core Server
Component/s: None
Affects Version/s: 4.4.9
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Kyle Kingsbury
Assignee: Matthew Russotto
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive 20220502T122637.000-0400.zip    
Issue Links:
Depends
depends on DRIVERS-2327 Propagate Original Error for Write Er... Implementing
depends on SERVER-66479 Create an error label indicating if a... Closed
Related
Operating System: ALL
Steps To Reproduce:

Grab a Jepsen environment with five nodes and https://github.com/jepsen-io/mongodb at da4a3fcef9298b4658db435991a402afe7497f00, then run (e.g.):

lein run test --nodes-file ~/nodes -w list-append -r 1000 --concurrency 3n --max-writes-per-key 16 --read-concern majority --write-concern majority --txn-read-concern snapshot --txn-write-concern majority --time-limit 300 --nemesis partition --test-count 5

 

Sprint: Repl 2022-05-16, Repl 2022-05-30, Repl 2022-06-13, Repl 2022-06-27, Repl 2022-07-11, Repl 2022-08-08, Repl 2022-08-22, Repl 2022-09-05, Repl 2022-09-19, Repl 2022-07-25, Repl 2022-10-03
Participants:

 Description   

It looks like MongoNotPrimaryException (or whatever the protocol response is that triggers this error in the Java driver) might actually be an indefinite error, rather than a definite failure. Consider this pair of operations from a Jepsen list-append test:

 

{:type :fail, :f :txn, :value [[:append 855 3]], :time 36272337272, :process 36, :error :not-primary, :index 56335}
{:type :ok, :f :txn, :value [[:r 855 [3]]], :time 38283284542, :process 42, :index 57897}

In this case both "transactions" are actually single-document operations. The first operation performs a single findAndModify to $push the number 3 onto a list in document 855; that write threw a MongoNotPrimaryException. The second is a read of document 855, which observed that write of 3.

The documentation for MongoNotPrimaryException says that the server "refused to execute... a write operation", which seems fairly plain: the write of 3 must not have happened. Since we go on to read 3, this looks like an aborted read.
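To make the anomaly concrete, here is a minimal sketch of how one might scan a history for this kind of aborted read. It is not Jepsen's actual checker: the two operations above are transcribed by hand into Python dicts, the function name find_aborted_reads is hypothetical, and the sketch assumes each appended element is unique per key, as it is in the list-append workload.

```python
# The two operations quoted above, transcribed by hand into Python dicts.
history = [
    {"type": "fail", "f": "txn", "value": [["append", 855, 3]],
     "process": 36, "error": "not-primary", "index": 56335},
    {"type": "ok", "f": "txn", "value": [["r", 855, [3]]],
     "process": 42, "index": 57897},
]


def find_aborted_reads(history):
    """Hypothetical helper: return (failed_index, read_index, key, element)
    for every element that an ok read observed even though the append that
    wrote it was reported as a definite failure.

    Assumes each element is appended to a given key at most once.
    """
    # Map (key, element) -> index of the failed operation that appended it.
    failed_appends = {}
    for op in history:
        if op["type"] == "fail":
            for f, key, element in op["value"]:
                if f == "append":
                    failed_appends[(key, element)] = op["index"]

    aborted = []
    for op in history:
        if op["type"] != "ok":
            continue
        for f, key, value in op["value"]:
            if f != "r" or value is None:
                continue
            for element in value:
                failed_index = failed_appends.get((key, element))
                if failed_index is not None:
                    aborted.append((failed_index, op["index"], key, element))
    return aborted


print(find_aborted_reads(history))
```

On the two operations above this reports a single anomaly: the read at index 57897 observed element 3 of key 855, which the failed operation at index 56335 tried to append.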

This problem occurs with MongoDB 4.4.9 and Java driver 4.6.0, write concern majority, read concern snapshot/majority, and is reproducible using network partitions.

MongoWriteConcernWithResponseException with a message containing "InterruptedDueToReplStateChange" may do the same thing, but I'm less sure whether this error should be interpreted as definite or not.
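One conservative way for a test client to handle this is sketched below, under stated assumptions. It follows the Jepsen convention that a completion of type fail means the operation definitely did not happen, while info means the outcome is unknown. The function name completion_type and the definite_errors parameter are hypothetical, and the sketch does not reflect the Java driver's API or the actual Jepsen MongoDB client. Any error on an operation that writes is treated as indefinite unless the caller has explicitly listed it as definite.

```python
def completion_type(error, has_writes, definite_errors=frozenset()):
    """Hypothetical helper: choose the completion type for an operation.

    error           -- error keyword such as "not-primary", or None on success
    has_writes      -- True if the operation attempted any write
    definite_errors -- errors the caller has verified mean "not applied";
                       empty by default, so no write error is trusted
    """
    if error is None:
        return "ok"
    # A failed read-only operation has no effects, so recording it as a
    # definite failure is safe. A write is only a definite failure when the
    # error is known to guarantee that it was not applied.
    if not has_writes or error in definite_errors:
        return "fail"
    # Otherwise the write may or may not have taken effect.
    return "info"


print(completion_type("not-primary", has_writes=True))
```

With the default empty definite_errors set, the failed append from the history above would be recorded as info rather than fail, and a later read of its value would no longer count as an aborted read.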



 Comments   
Comment by Judah Schvimer [ 27/Jun/22 ]

The next step on this ticket is to define the drivers spec changes needed to address this issue, based on the error label added in SERVER-66479. Marking this ticket as blocked on DRIVERS-2327. After that is completed with the required changes in the drivers, we will close this issue.

Comment by Cristopher Stauffer [ 17/May/22 ]

Linking SERVER-66479 as required to address this ticket. 

Comment by Cristopher Stauffer [ 13/May/22 ]

aphyr@jepsen.io, thank you for reporting this issue. We were able to reproduce it using your steps. For the scenario you outlined on MongoDB 4.4.9 and Java Driver 4.6.0, we are in fact not reporting correctly whether the error is definite or indefinite. Additionally, we saw that earlier versions of the Java Driver did exhibit the behavior expected by the Jepsen tests. We are going to schedule an update to our driver specification to return an indefinite error in every case where an indefinite error could occur, including the list-append scenario you provided. We will link this ticket to the associated Driver work: DRIVERS-2327. We actively test with Jepsen as part of our regression testing, and we will be reviewing our test matrix to capture this combination in the future.

Comment by Matthew Russotto [ 06/May/22 ]

We are currently continuing to actively investigate this issue.

Comment by Eric Sedor [ 02/May/22 ]

Unfortunately that's right about editing descriptions; I've made that edit and we'll take a look at this. Thanks, Kyle!

Comment by Kyle Kingsbury [ 02/May/22 ]

Argh, is there really no edit button for Jira issues? "single-document reads" should be "single-document operations".

Generated at Thu Feb 08 06:04:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.