[SERVER-17174] Updates can fail to find document during high concurrency Created: 04/Feb/15  Updated: 27/Oct/16  Resolved: 03/Apr/15

Status: Closed
Project: Core Server
Component/s: Concurrency, Write Ops
Affects Version/s: 2.6.5, 2.6.7, 3.0.0-rc7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ronan Bohan Assignee: Unassigned
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongod.snippet.log     File prep.sh     File repro.js     File repro2.js     File reset.sh     File run-js.sh     File run-js2.sh    
Issue Links:
Duplicate
duplicates SERVER-14703 Snapshot queries can miss records if ... Closed
Related
related to SERVER-12018 Multi-update can update same doc more... Closed
related to SERVER-14766 Indexed queries should not miss docum... Closed
related to SERVER-13992 UpdateTicket should handle record moves Closed
related to SERVER-700 Overwrite index key rather than delet... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

See the attached files:

export MONGO=`which mongo`
export MONGOD=`which mongod`
 
./reset.sh
# it may be necessary to run the following command a few times.
./run-js.sh

Participants:
Case:

 Description   

With a cold start start certain 'update' commands fail when they are expected to succeed.

Scenario can be triggered by cold-starting mongod and having multiple threads/processes update the same document.

Have reproduced this on both OSX and Ubuntu using the attached scripts:

  • reset.sh : starts an empty mongod and inserts 10 documents containing just an _id field
  • run-js.sh: cold starts mongod and sets log level 1 (via prep.sh script). Then it starts 10 subprocesses, each one updating the same document via $addToSet.

The run-js.sh script may need to be run a few times. Each time it targets a single document and tries to blast it with multiple concurrent updates. It will cycle through all 10 documents at a rate of one a second. The fact that it typically doesn't occur during the first run might indicate the problem is exposed more easily when dealing with larger objects.

Under certain circumstances some of these updates fail; the update however should have succeeded and can be manually applied later on.

The common characteristic seems to be:

  • nMatched:0 and nModified:0
  • numYields:1

The nscanned metrics in each case show the document was found. However, the operation yields (numYields:1) after which the result is nMatched:0 and nModified:0, ie the update has not been applied.

These particular operations also seem to hold the 'w' lock for a lot less time than those which succeed. This is probably to be expected given the behavior but is another key indicator of the problem.

I thought this might have been due to document moves, ie this document is found but the operation yields before the update is applied; during this time another update comes in and successfully updates the document, moving it in the process, such that when the yielded operation resumes it cannot find the document. This does not appear to be borne out by the available evidence, i.e., I have seen many circumstances where there are no document moves taking place and yet the problem persists.

Note too: it is not sufficient for the update operation to yield - there are occasions where the operation does yield and the update still completes successfully.

I have also tried to force the document into memory before the update by issuing a 'findOne()' and then the 'update()' but I still see the problem.

FYI I also have a slightly more complex repro written in Java which can switch between the standard Update command, the Bulk Update API and 'findAndModify()' - both Update methods suffer this problem but 'findAndModify' does not appear to. Interestingly the 'findAndModify()' does not yield - so perhaps that's a clue. (I can upload this Java version too if required).

Log file snippet showing problem:

2015-02-04T12:01:22.971+0000 [conn9] update foo.bar query: { _id: 1, c.sb.name: { $ne: "Update::1::18::a158ac61-35d9-45c8-9412-f2733f41585c" } } update: { $addToSet: { c.sb: { name: "Update::1::18::a158ac61-35d9-45c8-9412-f2733f41585c" } } } nscanned:1 nscannedObjects:1 nMatched:0 nModified:0 keyUpdates:0 numYields:1 locks(micros) w:2099 14ms

Counter example showing an update which yields and yet succeeds:

2015-02-04T12:00:03.260+0000 [conn4] update foo.bar query: { _id: 6, c.sb.name: { $ne: "Update::6::221::d2b6ab90-cd2a-4dd5-b246-c06276d33768" } } update: { $addToSet: { c.sb: { name: "Update::6::221::d2b6ab90-cd2a-4dd5-b246-c06276d33768" } } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 numYields:1 locks(micros) w:4004 8ms

I have tested this on MongoDB 2.6.5, 2.6.7 and 3.0.0-rc7 - all are affected.

I have tested my java version on 2.4.12 and it does not appear to be affected, indicating this is a regression.



 Comments   
Comment by Ronan Bohan [ 09/Feb/15 ]

Attaching mongod.snippet.log where you can see the issue occurring at 22:59:41, as follows:

2015-02-07T22:59:41.152+0000 [conn2] update foo.bar query: { _id: 1, c.sb.name: { $ne: "Update::10::0::cae9kl1fsrpsc3di" } } update: { $addToSet: { c.sb: { name: "Update::10::0::cae9kl1fsrpsc3di" } } } nscanned:1 nscannedObjects:1 nMatched:0 nModified:0 keyUpdates:0 numYields:1 locks(micros) w:3457 7ms

ie nMatched:0 and nModified:0 with numYields:1 for an update that is expected to work. There do not appear to be any document moves occurring around this time.

Note: this was captured with MongoDB 2.6.7 running on my Mac.

Comment by Asya Kamsky [ 09/Feb/15 ]

Could you attach a log snippet showing a run with such a failure (at logLevel 1)? Preferably from latest available version, but 2.6 would be ok too.

Comment by Scott Hernandez (Inactive) [ 04/Feb/15 ]

This sounds like expected behavior wrt yielding and concurrency, is there some reason you think this is a bug?

There are many cases where yielding on a document, while a write is happening to that same document, will cause the cursor to move past that document, therefore "missing" it. This can happen due to a move, or index update (due to the fact that an index update is remove+insert in the btree) for example.

Generated at Thu Feb 08 03:43:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.