[JAVA-948] WriteConcern.SAFE isn't. Created: 19/Sep/13 Updated: 04/Dec/13 Resolved: 26/Nov/13 |
|
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | Write Operations |
| Affects Version/s: | 2.10.0, 2.10.1, 2.11.0, 2.11.1, 2.11.2, 2.11.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | Michael Lehenbauer | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Running against a mongod 2.0.2 server (but I believe the issue exists with latest mongod as well). |
||
| Description |
|
We recently upgraded from mongo-java-driver 2.9.0 to 2.11.2 and started getting DuplicateKey exceptions in a scenario where we do a remove for a key followed by an insert. All of our operations use WriteConcern.SAFE, so they should be serialized and the DuplicateKey exception should be impossible. I dug into the code and found the following change in https://github.com/mongodb/mongo-java-driver/commit/b4d0a6ee933b268f9ca0b890adc4aed1bf30e097 :
It looks to me like the driver is /not/ sending 'w' with the GLE command when 'w' is 1 (which corresponds to WriteConcern.SAFE or WriteConcern.ACKNOWLEDGED). But I believe it should be. If it doesn't send 'w', mongod does not wait for the previous operation to finish. So we are effectively getting WriteConcern.NORMAL instead. |
| Comments |
| Comment by Jeffrey Yemin [ 26/Nov/13 ] | |||||||||||
|
HI Michael, Yes, the page is accurate. You can't really tell from looking at the mongod code, because w=0 is implemented at the driver level (by simply not sending a getlasterror command after the write. | |||||||||||
| Comment by Michael Lehenbauer [ 20/Sep/13 ] | |||||||||||
|
Hey Jeff, Looks like I owe you an apology. we rolled back to 2.9.3 and are still hitting this issue, so it must not be related to the change I was concerned about. I will continue debugging, but at this point, it's probably safe to assume it's an issue in our code rather than the mongo driver. Feel free to close this issue (Again, very sorry for the false alarm). Out of curiosity, do you know if this page is accurate? http://docs.mongodb.org/manual/core/write-concern/ From the diagrams, it sure looks like a GLE with w=0 does not wait for the write to complete, while a GLE with w=1 does. But from the mongod code, it seems like w=0 and w=1 behave the same? This was the main source of my confusion. Thanks for your time, | |||||||||||
| Comment by Jeffrey Yemin [ 20/Sep/13 ] | |||||||||||
|
You might try logging WriteResult.getN() for the value returned from remove, to see if that suggests anything (it tells you how many documents were removed). | |||||||||||
| Comment by Michael Lehenbauer [ 19/Sep/13 ] | |||||||||||
|
Hrm. Let me poke at this a bit more and get back to you. Perhaps I am indeed missing something. It all seemed to add up, but you're right. Looking at the write_concern code, surprisingly I don't see any obvious action that happens when w=1. I had assumed that the opReplicatedEnough() call at https://github.com/mongodb/mongo/blob/master/src/mongo/db/write_concern.cpp#L163 would wait for something to happen, but it seems to just return true for w=1. Regarding our code, here it is (in scala). It's doing something akin to storing a filesystem in mongo, where the keys are paths to files. val query = this.constructPrefixQueryForPath(path); //And then write new ones. constructPrefixQueryForPath() constructs a regex query for a path like "^foo/bar/" and getDocumentWriteList() constructs a list of DBObjects with ids like "foo/bar/file1", "foo/bar/file2", etc. As you can see, the insert relies on the remove having completed completely. We have locks in place to make sure there are no other similar operations taking place at the same time. And until we upgraded from 2.9.0 we never had any issues. My assumption is that there's a race condition where the insert might actually get executed in mongo before the remove has completed. But maybe something else is going on. If you have any guesses, let me know, else I'll keep digging. | |||||||||||
| Comment by Jeffrey Yemin [ 19/Sep/13 ] | |||||||||||
|
Hi Michael, I believe you're missing something, because I was looking at the same exact code and missed it too at first. There is nothing in there that is actually waiting for the operation to complete. In other words, in the case of "w" of 1, there is nothing in that code path inside the "if e.isOk()" that actually does anything. Initially, I thought that https://github.com/mongodb/mongo/blob/master/src/mongo/db/write_concern.cpp#L115 was waiting for something to happen, but it's not: it's just getting the last timestamp written to the oplog. In fact, by the time this code has executed, everything that is needed to satisfy a "w" of 1 for the previous write has already been completed. This is a consequence of how mongod pulls messages off the wire. In this case, the first message sent on the socket is an OP_REMOVE. mongod processes the remove, then pulls the next request off the socket, which is the OP_QUERY for the "getlasterror" command. For "w" of 1, or a missing "w", this command returns immediately, and so long as your application waits for the response before putting the OP_INSERT on the wire (on the same, or a different, socket), you should not get a duplicate key exception (assuming that the document you inserted has the same _id as the one(s) you removed). Note that if the OP_INSERT followed the OP_REMOVE on the same socket, the "getlasterror" wouldn't actually matter, for the same reason as above: before pulling the OP_INSERT off the wire, the OP_REMOVE has been completed. One other thing I was wondering. Why not just use DBCollection.update, instead of a remove followed by an insert? It might help if you post a code fragment, since at the moment I don't have a theory as to what could be causing this exception. | |||||||||||
| Comment by Michael Lehenbauer [ 19/Sep/13 ] | |||||||||||
|
Thanks for the reply. Unfortunately, it's not easy to reproduce. We do ~1500 mongo ops/sec in production and we are hitting this every few days. If you really need it, I can see if I can come up with a stress test that reproduces the error, but since it's a race condition, I'm not super excited about trying to reproduce it reliably. Can you take a look at the mongod code here: https://github.com/mongodb/mongo/blob/master/src/mongo/db/write_concern.cpp#L87 What I found is that when I use mongo driver 2.9.0, cmdObj['w'] exists and is set to 1, so e.ok() returns true, and it enters the if condition and eventually hits the while(1) loop at https://github.com/mongodb/mongo/blob/master/src/mongo/db/write_concern.cpp#L152 which actually waits for the operation to complete. But with mongo driver 2.11.2, cmdObj['w'] does not exists, so e.ok() returns false, and we skip that whole block of code. So it does /not/ wait for the operation to complete. NOTE: I debugged on mongod 2.0.2, but I believe 2.4 would behave the same. Am I missing something? | |||||||||||
| Comment by Jeffrey Yemin [ 19/Sep/13 ] | |||||||||||
|
Hi Michael, Thanks for the report. I don't think you are correct about the behavior of the getlasterror command, though. If the server is sent this command, the semantics are that the write is acknowledged before the reply is sent, whether or not there is a "w" field present. If there was a bug in the Java driver, it would be around not sending a "getlasterror" command at all, which is controlled by the method WriteConcern.callGetLastError (https://github.com/mongodb/mongo-java-driver/blob/master/src/main/com/mongodb/WriteConcern.java#L407), which returns true even if w == 1. In order to rule out an error in your application, would it be possible for you to provide a test case which demonstrates the problem? For reference, I wrote the following test, which does not throw an exception on either 2.0 or 2.4:
| |||||||||||
| Comment by Michael Lehenbauer [ 19/Sep/13 ] | |||||||||||
|
Looks like the diff got reformatted in an undesirable way. See here for the change I am referring to: https://github.com/mongodb/mongo-java-driver/commit/b4d0a6ee933b268f9ca0b890adc4aed1bf30e097#L6L233 |