[SERVER-20829] RUNNER_DEAD on document delete during update by _id or find by _id Created: 08/Oct/15 Updated: 22/Mar/16 Resolved: 28/Jan/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 2.6.9 |
| Fix Version/s: | 2.6.12 |
| Type: | Bug | Priority: | Minor - P4 |
| Reporter: | James Wahlin | Assignee: | David Storch |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible | ||||
| Operating System: | ALL | ||||
| Backport Completed: | |||||
| Sprint: | QuInt C (11/23/15), Query F (02/01/16) | ||||
| Participants: | |||||
| Description |
|
Given the following scenario:
The update should probably return success with nupdated: 0 rather than error out. Looks like this is an issue for 2.6 only. https://github.com/mongodb/mongo/blob/v2.6/src/mongo/db/query/idhack_runner.cpp#L234-L236 |
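The original reproduction snippet is not preserved in this export, but the expected semantics can be sketched with a toy in-memory "collection" (this is simulation only, not MongoDB driver code; `ToyCollection` and its field names are illustrative):

```python
# Simulated semantics only: a toy in-memory "collection" illustrating the
# expected behavior the ticket asks for. An update by _id whose target has
# been concurrently deleted should report success with nMatched == 0 rather
# than surfacing a RUNNER_DEAD error.

class ToyCollection:
    def __init__(self):
        self.docs = {}  # _id -> document

    def update_by_id(self, _id, fields):
        doc = self.docs.get(_id)
        if doc is None:
            # Document gone (e.g. deleted by a racing thread): succeed,
            # matching nothing, instead of erroring out.
            return {"ok": 1, "nMatched": 0, "nModified": 0}
        doc.update(fields)
        return {"ok": 1, "nMatched": 1, "nModified": 1}

coll = ToyCollection()
coll.docs["a"] = {"_id": "a", "foo": 1}
del coll.docs["a"]  # the concurrent delete wins the race
result = coll.update_by_id("a", {"foo": "bar"})
```

Here `result` reports `ok: 1` with zero documents matched, which is the behavior the ticket proposes in place of the 2.6 error.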
| Comments |
| Comment by Githook User [ 28/Jan/16 ] |
|
Author: David Storch (username: dstorch, email: david.storch@10gen.com)Message: |
| Comment by J Rassi [ 03/Nov/15 ] |
|
james.wahlin, I'm tentatively assigning this ticket to you. Please assign this back to me if you don't think you'll have time to get to this in the next week or two, and feel free to swing by the query team if you want to go over this in person. |
| Comment by David Storch [ 23/Oct/15 ] |
|
jon@appboy.com, Ramon is right that this discussion is best suited for other channels, but I thought I'd still answer your outstanding question. The guarantees under concurrency offered by MongoDB 3.0 are the same regardless of the storage engine configuration. In general, queries against a mongod using WiredTiger can miss results. That said, I believe you are correct that this particular behavior of two contending updates by _id will not manifest on WiredTiger, since we will retry on write conflict. |
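The retry-on-write-conflict behavior David describes for WiredTiger can be sketched as a small simulation (the `WriteConflict` exception and `apply_with_retry` helper here are stand-ins for illustration, not real server or driver APIs):

```python
# Illustration of optimistic concurrency control with transparent retry:
# when a conflict between two operations is detected, the losing operation
# is retried rather than dropped, so both contending updates eventually apply.

class WriteConflict(Exception):
    """Stand-in for a storage-engine write conflict."""

def apply_with_retry(op, max_retries=10):
    for _ in range(max_retries):
        try:
            return op()  # committed without conflict
        except WriteConflict:
            continue     # transparently retry the losing operation
    raise RuntimeError("too many write conflicts")

# Simulate an update that conflicts twice before committing.
state = {"conflicts_left": 2, "doc": {"_id": 1}}

def update_op():
    if state["conflicts_left"] > 0:
        state["conflicts_left"] -= 1
        raise WriteConflict()
    state["doc"]["foo"] = "bar"
    return state["doc"]

doc = apply_with_retry(update_op)
```

After the loop finishes, the update has been applied despite losing the race twice, which is the key difference from the MMAPv1 behavior discussed below.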
| Comment by Ramon Fernandez Marina [ 23/Oct/15 ] |
|
Thanks for your interest in MongoDB internals jon@appboy.com. This ticket is to get the RUNNER_DEAD error reported by James fixed in future versions of the server. For MongoDB-related discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your questions will reach a larger audience. Questions like the ones above involving more discussion would be best posted on the mongodb-user group. If you're interested in MongoDB development you can use the mongodb-dev group instead. See also our Technical Support page if you need additional support resources. Regards, |
| Comment by Jon Hyman [ 19/Oct/15 ] |
|
Thanks, Dave. I'm reading the section on isolation and note these two things, which are claimed to hold true for both MMAPv1 and WiredTiger:

> Non-point-in-time read operations. Suppose a read operation begins at time t1 and starts reading documents. A write operation then commits an update to a document at some later time t2. The reader may see the updated version of the document, and therefore does not see a point-in-time snapshot of the data.

In our situation, we have two processes that are each doing something like this: db.my_collection.update({_id: X}, {foo: "bar"}), and it may be the case that one of these writes does not apply if the other update wins the race and causes the document to move; this is hinted at in the "Dropped results" section. Do you know if this would be the case in WiredTiger as well? My understanding of WT is that due to its optimistic concurrency control, "when the storage engine detects conflicts between two operations, one will incur a write conflict causing MongoDB to transparently retry that operation", which I interpret as saying that both updates will get applied. Is it then just MMAPv1 that can fail to apply whichever write is the loser if a move happens? |
| Comment by David Storch [ 19/Oct/15 ] |
|
jon@appboy.com, in general queries may miss concurrently modified documents. For more details on the guarantees provided by the system under concurrency, please refer to the documentation: http://docs.mongodb.org/manual/faq/concurrency/. Best, |
| Comment by Jon Hyman [ 16/Oct/15 ] |
|
Thanks. So if two threads make an update by id, but one causes a move, the other (if it loses the race) will fail to apply. In any highly concurrent system I would imagine this happens relatively frequently, as I'm seeing in our case. Are you suggesting that the "correct" and expected behavior is for one write to not apply? So we are to assume that any update by id (which in our case is the shard key) can return to the application and just not apply? Are we supposed to be confirming that all writes actually apply? Otherwise it seems like we're going to be prone to silent failures if the driver returns success. |
| Comment by David Storch [ 16/Oct/15 ] |
|
Hi Jon,
It depends on which write you mean, because there may be two writer threads. Say there is one thread doing a big update which causes an MMAP document move, and another thread which is doing a small update-by-id. If the first writer wins the race, is applied to the database, and causes the move to occur, the find-by-id write will not occur. It is failing with RUNNER_DEAD on 2.6, but the expected behavior is that the update-by-id will report success and that it updated zero documents.
This seems like an application-level decision. Drivers are not expected to retry the write in this case, but the application may want to do so.
This ticket is still marked as "Needs Triage", so we have yet to decide whether to schedule this for a 2.6 release. Please watch the ticket for further updates on this. |
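David's point that retrying is an application-level decision can be sketched as a small wrapper (a hypothetical helper, not driver API; `do_update` stands in for whatever function issues the real update and returns its result document):

```python
# Application-level handling sketch: once the fix lands, a racing
# update-by-_id reports success with nMatched == 0 instead of erroring.
# Whether to retry, upsert, or give up on a zero-match result is up to
# the application; neither the server nor the driver retries for you.

def update_with_app_retry(do_update, max_retries=3):
    result = {"nMatched": 0}
    for _ in range(max_retries):
        result = do_update()
        if result.get("nMatched", 0) > 0:
            return result  # the write applied
        # Matched nothing: the document was deleted, or we lost a race
        # with a document move. Loop to retry, per application policy.
    return result

# Simulate an update that loses the race once, then applies.
calls = {"n": 0}

def flaky_update():
    calls["n"] += 1
    return {"nMatched": 0} if calls["n"] == 1 else {"nMatched": 1}

result = update_with_app_retry(flaky_update)
```

The wrapper returns once a retry reports a matched document, mirroring the "application may want to retry" guidance above.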
| Comment by Jon Hyman [ 16/Oct/15 ] |
|
Thanks for responding. To clarify, in the case where a move happens, the write DOES occur, but Mongo just returns a failed error? I'm going to monkey patch the driver for this since we get it often (and it raises an unhandled exception for us). Should I retry the write or just ignore the error? Is this slated to be fixed in 2.6? We can't upgrade to 3.0 at the moment due to issues we're having with the official ruby driver that I'm trying to figure out separately with that team. |
| Comment by David Storch [ 16/Oct/15 ] |
|
Hi jon@appboy.com, Thanks for the feedback. After investigating the issue, it looks like this is a benign mistake in error reporting. The query engine is reporting an error on certain operations involving a find-by-_id when it should be reporting that the operation succeeded but the document was not found. It is indeed the case that this can happen when an update causes the document to be moved inside the MMAP storage engine. The expected behavior in this case is that the operation succeeds but does not find the document. The fix will be to eliminate these spuriously reported errors. Best, |
| Comment by Jon Hyman [ 16/Oct/15 ] |
|
In our highly concurrent environment, we're seeing this a couple of times per day after upgrading from 2.4.10 to 2.6.11. Actually, it may be slightly different because in our situation, my guess is that the document is moving. In this thread, Asya says that it could happen if the document moves: https://groups.google.com/forum/#!searchin/mongodb-user/runner_dead/mongodb-user/dLgg8QBmUcY/RILrOjg8BQAJ For the errors we've seen, I checked that the document was not deleted. |
| Comment by David Storch [ 08/Oct/15 ] |
|
Specifically the problem is that the IDHackRunner is marking itself as killed when the document it is trying to find by _id gets deleted. When this situation occurs on the 3.0 branch or on the master branch, we do not mark the query as killed. Instead, we make a copy of the document. The copy will get returned on the following call to getNext(). https://github.com/mongodb/mongo/blob/r3.0.6/src/mongo/db/exec/idhack.cpp#L201-L207 We should probably do something similar inside the IDHackRunner on the 2.6 branch. |
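The 3.0-branch behavior linked above can be modeled with a toy stage (a Python simulation with illustrative names like `IdHackStage`, not the actual C++ `IDHackStage`): when the target document is deleted mid-query, the stage buffers an owned copy and returns it from the next getNext() call instead of marking itself killed.

```python
# Toy model of the 3.0-style fix: on an invalidation (delete) notification
# for the target _id, copy the document instead of marking the query killed;
# the next getNext() returns the buffered copy.

class IdHackStage:
    def __init__(self, collection, target_id):
        self.collection = collection  # dict: _id -> document
        self.target_id = target_id
        self.buffered = None          # owned copy made on invalidation
        self.killed = False

    def invalidate(self, _id):
        """Called when a document is deleted while the query has yielded."""
        if _id == self.target_id and _id in self.collection:
            self.buffered = dict(self.collection[_id])  # copy, don't die

    def get_next(self):
        if self.buffered is not None:
            doc, self.buffered = self.buffered, None
            return doc
        return self.collection.get(self.target_id)

coll = {"x": {"_id": "x", "v": 1}}
stage = IdHackStage(coll, "x")
stage.invalidate("x")   # delete notification arrives before the fetch
del coll["x"]
doc = stage.get_next()  # returns the buffered copy, not an error
```

The stage survives the delete and hands back the copied document, which is the behavior the ticket suggests porting into the 2.6 IDHackRunner.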