[SERVER-17303] concurrent findAndModify ops with upsert: true can cause a fatal logOp() rollback Created: 17/Feb/15 Updated: 18/Sep/15 Resolved: 19/Feb/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency, Write Ops |
| Affects Version/s: | 3.0.0-rc8 |
| Fix Version/s: | 3.0.0-rc9, 3.1.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Buzz Moschetti | Assignee: | David Storch |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Operating System: | ALL | ||||||||||||
| Backport Completed: | |||||||||||||
| Steps To Reproduce: | On that box:
It will fail 90% of the time somewhere between 3000 and 7000 turns of the crank. |
| Participants: | |||||||||||||
| Description |
|
This is the big IBM X6 box. High perf SSDs on /data/[1-4]:
Using rc8 with WiredTiger. No special startup options:
Test program starts 32 threads. Each thread randomly looks for a "position" Pn where 0 <= n < 10000, e.g. P433 in the currentPos collection. findAndModify is used to logically reserve the item. A small event record is inserted to the events collection, the fetched item is "copied" to the historicPos collection, and then currentPos is findAndModify()d with updated info. The find-insert-insert-update sequence we'll call a turn of the crank. Trouble starts |
| Comments |
| Comment by Githook User [ 20/Feb/15 ] |
|
Author: {u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}
Message: (cherry picked from commit 30d9e17410a3dec85ca2a148c745a6b8f9a8ecd0) |
| Comment by Githook User [ 19/Feb/15 ] |
|
Author: {u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}
Message: |
| Comment by Daniel Pasette (Inactive) [ 19/Feb/15 ] |
|
resolving as duplicate of |
| Comment by David Storch [ 18/Feb/15 ] |
|
I tracked down the issue. The findAndModify command is implemented in two parts: a query part, followed by an update part.
This problem occurs for {upsert: true} findAndModify operations. First, the query part fails to find a matching document. Then an {upsert: true} update op is issued in order to perform the insert. However, this update operation results in an update to an existing document rather than an insert, which indicates that a concurrent writer inserted a matching document between the query part and the update part of the findAndModify. The implementation reacts to this condition by throwing a WriteConflictException. The intention is that this exception will cause the findAndModify operation to restart from the beginning, rolling back any updates before they commit. However, the update has already issued a logOp(). The result is that we attempt to roll back a logOp(), which is currently invalid. |
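The race described in this comment can be illustrated with a small, self-contained Python model. This is only a sketch under stated assumptions: `ToyStore`, `find_and_modify_upsert`, `FatalLogOpRollback`, and the `rollback_logop_is_fatal` switch are hypothetical names invented for illustration, not server code, and the `concurrent_writer` callback stands in for another thread that inserts a matching document between the query part and the update part. The switch only contrasts the intended restart with the fatal condition; it does not claim to show how the fix was implemented.

```python
class WriteConflict(Exception):
    """Toy analogue of the server's WriteConflictException."""


class FatalLogOpRollback(Exception):
    """Toy analogue of the fatal error hit when a logOp() would be rolled back."""


class ToyStore:
    """Hypothetical in-memory stand-in for a collection plus its oplog."""

    def __init__(self):
        self.docs = {}   # committed documents keyed by _id
        self.oplog = []  # committed oplog entries


def find_and_modify_upsert(store, key, fields, concurrent_writer=None,
                           rollback_logop_is_fatal=True):
    """Model of an {upsert: true} findAndModify; returns (doc, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        unit = {"docs": {}, "logged": False}  # uncommitted write unit
        try:
            # Query part: look for a matching document.
            found = key in store.docs

            # Assumed race window: another writer inserts a match here.
            if concurrent_writer is not None:
                concurrent_writer(store)
                concurrent_writer = None  # the race happens only once

            # Update part: upsert, then issue the logOp() inside the unit.
            inserted = key not in store.docs
            unit["docs"][key] = {**store.docs.get(key, {}), **fields}
            unit["logged"] = True

            # Query saw nothing, yet the upsert updated an existing document.
            if not found and not inserted:
                raise WriteConflict(key)

            # Commit the write unit and its oplog entry together.
            store.docs.update(unit["docs"])
            store.oplog.append((key, dict(fields)))
            return store.docs[key], attempts
        except WriteConflict:
            if unit["logged"] and rollback_logop_is_fatal:
                # The condition reported in this ticket.
                raise FatalLogOpRollback(key)
            # Intended behaviour: discard the unit and restart from the top.
            continue


if __name__ == "__main__":
    def racer(s):
        s.docs["P433"] = {"owner": "t2"}

    print(find_and_modify_upsert(ToyStore(), "P433", {"owner": "t1"},
                                 racer, rollback_logop_is_fatal=False))
```

With the default `rollback_logop_is_fatal=True`, the same racing call raises `FatalLogOpRollback` instead of retrying, which mirrors the failure described above.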
| Comment by Buzz Moschetti [ 18/Feb/15 ] |
|
Yes, it's findAndModify. I have attached the log named rc9.log. Search for Fatal. |
| Comment by David Storch [ 18/Feb/15 ] |
|
Hi buzz.moschetti, can you provide a stack trace from the crash you observed against rc9? Did this happen during a findAndModify operation, as seen for rc8? If so, can you paste the command that hit the assertion? Thanks! |
| Comment by Buzz Moschetti [ 18/Feb/15 ] |
|
Oz provided me with an rc9 build last night: I can reproduce the problem on this build. Just tried it again to double check and it crashed. |
| Comment by David Storch [ 18/Feb/15 ] |
|
Hi buzz.moschetti,
It looks like the fassert() you observed while running this workload was due to
Best, |