[SERVER-29861] Insert should report number of docs inserted or writeConcernError on error code 11602 (InterruptedDueToReplStateChange) Created: 26/Jun/17  Updated: 27/Oct/23  Resolved: 17/Jan/18

Status: Closed
Project: Core Server
Component/s: Write Ops
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Daniel Nichter Assignee: Asya Kamsky
Resolution: Works as Designed Votes: 6
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: test1-data.csv, test1.log, test2-data.csv, test2.log
Issue Links:
Related
is related to DOCS-11076 Document cases where bulk inserts may... Closed
Backwards Compatibility: Fully Compatible
Participants:

 Description   

According to the docs and my understanding of relevant things (e.g. write concerns), I take the following to be true:

Insert should always return the number of documents inserted ("n") that satisfied the write concern. (Where "always" implies that the original MongoDB server is still alive and reachable. If the server, network, etc. dies, then of course there's nothing left to report "n".)

This should hold true especially when the repl set state changes because this is a common and useful event. Apps should expect the primary to go away, but they should also be guaranteed that MongoDB will cleanly terminate pending ops wrt the write concern (total server/network/etc. failure notwithstanding).

If the aforementioned is not true, then perhaps there's no bug and https://docs.mongodb.com/manual/reference/command/insert/ just needs to be updated to reflect the real guarantees and return values. But presuming it's true, I put it to the test and found a problem: insert sometimes does not return the number of docs inserted (or a writeConcernError) on error code 11602 (InterruptedDueToReplStateChange).

Setup:

  • MongoDB 3.4.4
  • 3-node repl set on my laptop
  • journaling filesystem (Mac HFS)
  • mgo driver @3f83fa
  • write concern: w:2, j:true

Experiment:

  • Insert array of 100,000 docs (internally: array of 1,000 docs at a time)
  • While the insert is happening, run rs.stepDown(10,5) on the primary
  • Compare docs written ("n") from mgo vs. "nRemoved" from db.docs.remove({});

Expect:

"n" reported by mgo to equal "nRemoved" from the remove.

Results:

Results vary. First of all, there are 2 possible errors in the client (mgo): an EOF on the network socket (because Mongo kills connections on rs state change), or error code 11602 (InterruptedDueToReplStateChange).

On EOF, there can't be any returned write result (i.e. the documented output) because the connection is closed. I'm not sure if this is a problem in mgo (not doing a final read?) or Mongo (killing the connection before sending a final write result?). Either way, the problem arises on occasion: mgo will report N docs written, but N + some random number of docs were actually written.

The 2nd error condition is my focus. Mongo seems to gracefully kill the write op, but it does not always return the documented output. First, an example of when it works:

{
  "n": 1000,
  "opTime": {
    "ts": {
      "$timestamp": {
        "t": 1498494120,
        "i": 5000
      }
    },
    "t": {
      "$numberLong": "202"
    }
  },
  "electionId": {
    "$oid": "7fffffff00000000000000ca"
  },
  "writeConcernError": {
    "code": 11602,
    "codeName": "InterruptedDueToReplStateChange",
    "errmsg": "operation was interrupted"
  },
  "ok": 1
}

That's a bsondump of the raw Go []byte (I hacked this into my local copy of mgo). That is what I'd expect every time; it matches the documented output (plus some other, undocumented fields).

But sometimes the output is only:

{
  "ok": 0,
  "errmsg": "operation was interrupted",
  "code": 11602,
  "codeName": "InterruptedDueToReplStateChange"
}

And in this particular case, mgo reports 14,000 docs written, but:

rs0:PRIMARY> db.docs.remove({});
WriteResult({ "nRemoved" : 14320 })

The critical question: is MongoDB returning a write result with n=320? If yes, then it's a bug in mgo. I don't think this is the case because I'm writing the raw []byte from deep inside mgo before the data is returned up the call stack. Inside the mgo code here https://github.com/go-mgo/mgo/blob/v2-unstable/socket.go#L625, I added:

// Dump each raw server reply to a file so it can be inspected with bsondump.
// (Requires "fmt" and "io/ioutil" to be imported in socket.go.)
fileNo++
file := fmt.Sprintf("res-%d.bson", fileNo)
ioutil.WriteFile(file, b, 0666)

Then I bsondump the last file written; that's the doc shown above. The full write result arrives most of the time, but sometimes it doesn't, which leads me to believe mgo is not missing a final read on the network connection; rather, Mongo is not sending a write result with n: 320 in the example above.

If that's true, this is a show-stopping problem because it means an app cannot know, even with a write concern, how many docs were actually inserted. Moreover, the app might presume a wrong number (because it trusts the driver which doesn't know either), as in this case: 14,000 reported vs. 14,320 actual.

Happy to provide more details, tweak my experiment, whatever. Time allowing, I'll automate the whole experiment so I can run it 1,000+ times to accurately measure the failure rate. Right now, doing it by hand, the rough estimate is a 10-40% failure rate.



 Comments   
Comment by Ian Whalen (Inactive) [ 26/Apr/19 ]

Switching "Drivers Changes Needed" from "Maybe" to "Not Needed" since this was closed as something other than Fixed.

Comment by Asya Kamsky [ 17/Jan/18 ]

Documentation improvements will be tracked in DOCS-11076

Comment by Daniel Nichter [ 06/Jan/18 ]

Thanks for the in-depth discussion and exploration of this issue. You've convinced me: it's not a bug. I think the salient point is what you said: "only the number of documents that definitely were written and replicated (i.e. satisfied write concern) is returned, but more documents may have been written that are not included in "n" because there is no proof that they were replicated before the connection was lost." I think the manual should say that because when I read this:

"BulkWriteResult.nInserted
The number of documents inserted using the Bulk.insert() method."
https://docs.mongodb.com/manual/reference/method/BulkWriteResult/#BulkWriteResult.nInserted

I take that at face value. My misunderstanding would have been averted if the docs said what you said. Then I would know: on write, I'll get >= n docs definitely written (wrt write concern). Then the question becomes: how, from the app dev point of view, do I handle the ">" case? That depends on the app, docs, indexes, etc.

Wrt the inconsistent return value on error (error code 11602): consistency is great, but inconsistency is not unusual. MySQL can report 1 of 3 different errors when the server goes away. A really well-written app will handle all three. I think with MongoDB, devs will wind up doing the same: learning all possible "lost connection/server went away" errors and handling all of them.

Comment by Asya Kamsky [ 18/Dec/17 ]

It was pointed out to me that the official MongoDB CRUD specification document directs that on a write error the fields that should be returned are ones describing the error, but not "n" indicating how many documents were actually written, which only gets returned when there has not been an error (and an acknowledgement is requested).

This suggests that mgo driver should probably not be reporting "n" at all in the case of a write concern error.

Linked DOCS-11076 should document how to interpret results from the driver in cases of an error.

Comment by Asya Kamsky [ 11/Dec/17 ]

> I presume the intention is for "n" to be correct.

In a distributed system it is not possible to guarantee that "n" is correct when "n" depends on hearing from other parts of the system.

Imagine a scenario with two nodes, where a write of a document is sent to one but must be acknowledged only when it has replicated to the second node in the system. If the primary successfully writes the document but does not receive the acknowledgement from the second node, should it acknowledge to the client that one document was written, or zero? One was written locally, but zero satisfied the "written to both nodes" request (that it knows of). From the point of view of the primary, how many documents were written on the second node? It has no way of knowing.

Since it cannot know that the document was written on the secondary, it must consider zero as the only possible correct response. But it's possible that the document was successfully written on the second node, even though knowledge of that fact never reached the primary.

> Insert should always return the number of documents inserted ("n") that satisfied the write concern.

is consistent with the behavior you observe - only the number of documents that definitely were written and replicated (i.e. satisfied write concern) is returned, but more documents may have been written that are not included in "n" because there is no proof that they were replicated before the connection was lost.

Note that analogous to the two node scenario is the communication between the server and the driver - if the driver does not receive acknowledgement from the server that a document was written it must not return to the application that the document was written, but nevertheless the document may have actually been written successfully.

This behavior is not a bug.

There are many changes we are working on to improve the situation for the application - for example, automatic re-try of document writes by the driver helps because the application no longer needs to know how to re-try a failed write in a way that's idempotent. (Not all drivers support this new server feature yet, but that work is tracked in DRIVERS-396.) There may be other improvements in the server and/or the driver(s) that could reduce the number of scenarios in which the reported number of successfully written documents is lower than the actual number written. But that scenario cannot be completely eliminated unless we returned no "n" at all in cases where we cannot be sure whether more documents than "n" were successfully written.

> the other factor is that error code 11602 is sometimes sent, and sometimes not.

It wasn't clear from the ticket whether there was potentially a server or a driver bug here. It's probably best to investigate that in a separate ticket.

Comment by Daniel Nichter [ 08/Dec/17 ]

The current behavior is not a bug? Without knowing the underlying code, I presume the intention is for "n" to be correct. However, if "n" was not intended to be correct, i.e. to be only an estimate or best effort count, then I could see how it's not a bug.

Also, the other factor is that error code 11602 is sometimes sent, and sometimes not. For this, I'd argue the same: if that's the original code intention, then it's not a bug (although very strange behavior for a database). But if the code intention is to always return this, then sometimes not sending it is a bug.

Comment by Asya Kamsky [ 08/Dec/17 ]

> I think the real issue requires more than doc updates.

Of course, that's why this is still open, but as an improvement request.

One workaround is to use individual document operations rather than batch operations.

Some upcoming features may also help, but they would have to be handled correctly in the driver as well as the server.

Comment by Daniel Nichter [ 07/Dec/17 ]

Hello! Thanks for the reply and looking into this!

I think the real issue requires more than doc updates.

Imho, the goal is this: MongoDB should always report the exact number of documents inserted, satisfying the write concern, under normal operation.

Imho again, anything less than that makes the database virtually unusable, because if an application cannot know, with certainty, how many docs were inserted, app development becomes very difficult. A database is often the app's source of truth; it must not lie to the app wrt the data it has or hasn't stored.

In the test data we see, for example, that Mongo will report 66,000 docs inserted when in reality 66,448 docs were inserted. Depending on what the app and data are used for, that could be a huge and important difference.

If we accept the current behavior (doc updates notwithstanding), apps are forced not to trust the database, and they're forced to have complicated logic to verify and reconcile what the database reports ("n") vs. reality. Moreover, because data is always in flight, it's not logical to presume the app can successfully reconcile the data short of taking the database offline.

Re documentation for "the server to provide more information about the number of documents successfully written when interrupted in the middle of a batch": there are two things here. Re "middle of a batch": imho, this is an implementation detail that should not matter to the user/app. The app does not and should not care how MongoDB reliably stores a bunch of docs wrt the write concern. Second, the "number of docs successfully written" is already documented: it's "n". Well, the docs say "The number of documents inserted.", which a user presumes to mean "successfully written wrt the write concern under normal ops". Again, the user/app doesn't care about the details (under normal operation); they should simply be able to trust that if the database says N docs were written, then exactly N docs have been written.

I think the salient question to agree or disagree on is: Should "n", wrt write concern under normal ops, always be the exact number of docs written? If "no", then doc updates are in order, but (imho) such behavior seriously undermines the usability of the database because apps won't be able to trust it. If "yes", then fixing this bug will be a huge win, imho.

Comment by Asya Kamsky [ 23/Nov/17 ]

The behavior you observe is due to the fact that the writeConcern is only checked at the end of every "batch".

This is why in the driver you see reports based on multiples of 1,000 (which is the size of the batches that larger requests are split into).

Your suggestion that the documentation implies that

"Insert should always return the number of documents inserted ("n") that satisfied the write concern. (Where "always" implies that the original MongoDB server is still alive and reachable. If the server or network or etc. dies, then of course there's nothing left to report "n".)"

suggests that documentation improvements are in order, to make it clearer in what scenarios the number of successfully acknowledged documents will be returned.

Since this is expected behavior, I will convert this issue to an enhancement request for the server to provide more information about the number of documents successfully written when interrupted in the middle of a batch.

Comment by Ian Whalen (Inactive) [ 03/Aug/17 ]

spencer any thoughts from the repl team on the severity/priority of this?

Comment by Daniel Nichter [ 18/Jul/17 ]

Excellent, thanks Mark! This will be a big win for app dev and ops.

Comment by Mark Agarunov [ 17/Jul/17 ]

Hello dnichter,

Thank you for the report and my apologies for the delay in response. I've been able to reproduce the behavior you've described using the detailed examples and code provided so I've placed this ticket in "Need Triage" to be scheduled against our currently planned work.

Thanks,
Mark

Comment by Daniel Nichter [ 04/Jul/17 ]

Re attached files: test1-data.csv, test1.log, test2-data.csv, test2.log

TL;DR: Across two automated runs of 100 tests each, the reported doc count matched reality only 62% and 54% of the time; pass/fail tracks whether error code 11602 is returned.

Details

  • MongoDB 3.4.4
  • 6 node repl set: 3 nodes on US west; 3 nodes on US east
  • Custom write concern called "twoack" that requires 2 US west nodes to ack (primary + one other); see the client-side sketch after the command line below
  • US east nodes are priority zero
  • Bare metal, dedicated, high-end production hardware (only MongoDB running on all 6 servers)
  • Using https://github.com/daniel-nichter/lab/tree/master/server-29861 with command line like:

./server-29861 -username root -password test -auth-db admin -safe-w 0 -safe-wmode twoack -tests 100 mongodb://localhost?replicaSet=rs0 1>data.csv 2>log

Two experiment runs with 100 tests each (1 test = the full cycle: insert, stepDown, recover, remove, count, print data).

Data Fields

The .csv files are the useful data, with fields:

nReported,nActual,secondsBeforeStepDown,errorString,pass

Where

  • nReported = the number of docs written ("n") as reported by the modified, vendored-in copy of [mgo](http://labix.org/mgo). You can see my mods at/between @dn code comments. Basically, it just makes mgo actually report "n" from MongoDB (which is a separate issue for mgo).
  • nActual = the number of docs removed as reported by MongoDB, after the repl set has recovered, by doing db.remove() on the (new) primary. This number is authoritative, how many docs are truly stored in the collection.
  • secondsBeforeStepDown = random wait between starting the insert and doing rs.stepDown on the primary, in [500ms, 2s).
  • errorString = "int" = "operation was interrupted", "eof" = "EOF" (socket), "aok" = all docs inserted before secondsBeforeStepDown (rare given 250k docs are inserted)
  • pass = "true" or "false" depending on if nReported = nActual

The line number of the data file = the test number (important to match results with log).

Results

So the pass/fail rate is easy to calculate from the data:

$ wc -l test1-data.csv 
     100 test1-data.csv
$ grep -c true test1-data.csv 
     62

100 tests and 62 pass=true = 62% pass, 38% fail.

For test2 the results are 54% pass, 46% fail.

Error code 11602

I haven't checked every single "pass" but it seems pass=true whenever Code:11602 is reported and pass=false when it is not. Example: in test1, test 8, the code is reported:

2017/07/03 17:13:03.244208 main.go:160: test 8 of 100
2017/07/03 17:13:03.244253 main.go:168: wait: 734ms
2017/07/03 17:13:03.270757 main.go:276: removing
2017/07/03 17:13:03.272239 main.go:197: wait
2017/07/03 17:13:03.272255 main.go:256: inserting...
2017/07/03 17:13:04.006344 main.go:201: rs.stepDown
2017/07/03 17:13:04.006741 main.go:260: insert err: operation was interrupted (&mgo.LastError{Err:"operation was interrupted", Code:11602, N:68000, Waited:0, FSyncFiles:0, WTimeout:false, UpdatedExisting:true, UpsertedId:interface {}(nil), modified:0, ecases:[]mgo.BulkErrorCase{}})
 
2017/07/03 17:13:20.336066 main.go:160: test 9 of 100
2017/07/03 17:13:20.336103 main.go:168: wait: 779ms
2017/07/03 17:13:20.362863 main.go:276: removing
2017/07/03 17:13:20.363162 main.go:197: wait
2017/07/03 17:13:20.363180 main.go:256: inserting...
2017/07/03 17:13:21.142266 main.go:201: rs.stepDown
2017/07/03 17:13:21.143102 main.go:260: insert err: operation was interrupted (&mgo.LastError{Err:"operation was interrupted", Code:0, N:66000, Waited:0, FSyncFiles:0, WTimeout:false, UpdatedExisting:false, UpsertedId:interface {}(nil), modified:0, ecases:[]mgo.BulkErrorCase{}})

And data for 8 and 9:

68000,68000,0.734,int,true 
66000,66448,0.779,int,false

Very similar test runs: ~66k docs and stepDown at ~0.7s, but 8 gets the error code and 9 does not. So far, this pattern holds for all the tests I've checked.

Conclusion

Root cause is unknown, but the test data proves that MongoDB is inconsistent wrt reporting the number of docs inserted (or a writeConcernError) when the replica set state changes. It also proves that MongoDB doesn't report all the docs inserted (e.g. 66,000 vs. 66,448 above). Presuming the write concern guarantees the writes (i.e. docs aren't lost during a routine repl set state change), the next must-have for production apps is reliably knowing how many docs were inserted (for an ordered insert with a write concern). Without knowing this, apps cannot easily and gracefully recover from routine repl set state changes. Lastly, only insert has been tested; update, delete, and all other write ops should be tested, too.

Comment by Daniel Nichter [ 26/Jun/17 ]

Description is correct now. Thanks Mark.

Comment by Daniel Nichter [ 26/Jun/17 ]

Copy-paste error duplicated the description. It repeats at "According to the docs". I can't edit it. Please delete description from start to second occurrence of "According to the docs ".
