[SERVER-17303] concurrent findAndModify ops with upsert: true can cause a fatal logOp() rollback Created: 17/Feb/15  Updated: 18/Sep/15  Resolved: 19/Feb/15

Status: Closed
Project: Core Server
Component/s: Concurrency, Write Ops
Affects Version/s: 3.0.0-rc8
Fix Version/s: 3.0.0-rc9, 3.1.0

Type: Bug Priority: Major - P3
Reporter: Buzz Moschetti Assignee: David Storch
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongo0.log     Text File rc9.log    
Issue Links:
Related
related to SERVER-17372 FindAndModify no longer returns "valu... Closed
is related to SERVER-15192 Make all logOp listeners rollback-safe Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Steps To Reproduce:

On that box:

$ cd /home/buzz
$ sh runt stateEvent --port 37017 -d test5 --numevents 2000000 --posn 100000 --threads 32 --drop --payload 8192

It will fail 90% of the time somewhere after 3000 and 7000 turns of the crank.

Participants:

 Description   

This is the big IBM X6 box. High perf SSDs on /data/[1-4]:

[buzz@IDF-IBM-Test-1 ~]$ uname -a
Linux IDF-IBM-Test-1.10gen.cc 2.6.32-431.el6.x86_64 #1 SMP Fri Nov 22 03:15:09 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[buzz@IDF-IBM-Test-1 ~]$ cat /etc/redhat-release 
CentOS release 6.6 (Final)
[buzz@IDF-IBM-Test-1 ~]$ df
Filesystem           1K-blocks     Used Available Use% Mounted on
/dev/mapper/vg_idfibmtest1-lv_root
                      51606140 11926496  37058204  25% /
tmpfs                396812172        0 396812172   0% /dev/shm
/dev/sda1               495844    96335    373909  21% /boot
/dev/mapper/vg_idfibmtest1-lv_home
                     231167100   832892 218591592   1% /home
/dev/md1             768924576 20997468 708868020   3% /data/1
/dev/md2             768924576 10687204 719178284   2% /data/2
/dev/md3             768924576   201440 729664048   1% /data/3
/dev/md4             768924576  3242900 726622588   1% /data/4

Using rc8 with WiredTiger. No special startup options:

numactl --interleave=all /home/buzz/3.0.0-rc8/bin/mongod --storageEngine=wiredTiger --port 37017 --dbpath /data/4/data0/db0 --logpath /tmp/mongo0.log --fork

Test program starts 32 threads. Each thread randomly looks for a "position" Pn where 0 <= n < 10000, e.g. P433 in the currentPos collection. findAndModify is used to logically reserve the item. A small event record is inserted to the events collection, the fetched item is "copied" to the historicPos collection, and then currentPos is findAndModify()d with updated info. The find-insert-insert-update sequence we'll call a turn of the crank.

Trouble starts



 Comments   
Comment by Githook User [ 20/Feb/15 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-17303 findAndModify upsert calls Collection::insertDocument() directly

(cherry picked from commit 30d9e17410a3dec85ca2a148c745a6b8f9a8ecd0)
Branch: v3.0
https://github.com/mongodb/mongo/commit/3e03e174fdc6e40c5f69a411293b58387113a06e

Comment by Githook User [ 19/Feb/15 ]

Author:

{u'username': u'dstorch', u'name': u'David Storch', u'email': u'david.storch@10gen.com'}

Message: SERVER-17303 findAndModify upsert calls Collection::insertDocument() directly
Branch: master
https://github.com/mongodb/mongo/commit/30d9e17410a3dec85ca2a148c745a6b8f9a8ecd0

Comment by Daniel Pasette (Inactive) [ 19/Feb/15 ]

resolving as duplicate of SERVER-15192, which depends on the rollback-able logOp() change being backported to 3.0.

Comment by David Storch [ 18/Feb/15 ]

I tracked down the issue. The findAndModify command is implemented in two parts:

  1. First, a query operation that sees if there is a matching document.
  2. Second, an update operation that finds the matching document by _id. It updates the document (and possibly saves a copy of the resulting object, if the new flag is set to true). If it is an {upsert: true} findAndModify and the query part does not find a match, than an insert will occur.

This problem occurs for {upsert: true} findAndModify operations. First, the query part fails to find a matching document. Then an {upsert: true} update op is issued in order to perform the insert. However, this update operation results in an update to an existing document, rather than performing an insert. This indicates that a concurrent writer inserted a matching document in between the query part and update part of the findAndModify. The implementation reacts to this condition by throwing a WriteConflictException:

https://github.com/mongodb/mongo/blob/ea5f871b550c1c3a8a5f0cd749fb47570557a067/src/mongo/db/commands/find_and_modify.cpp#L411-L418

The intention is that this exception will cause the findAndModify operation to restart from the beginning, rolling back any updates before they commit. However, the update has already issued a logOp(). The result is that we attempt to roll back a logOp(), which is currently invalid. SERVER-15192 tracks work to make roll back of logOp() possible.

Comment by Buzz Moschetti [ 18/Feb/15 ]

Yes, it's findAndModify. I have attached the log named rc9.log. Search for Fatal.

Comment by David Storch [ 18/Feb/15 ]

Hi buzz.moschetti, can you provide a stack trace from the crash you observed against rc9? Did this happen during a findAndModify operation, as seen for rc8? If so, can you paste the command that hit the assertion? Thanks!

Comment by Buzz Moschetti [ 18/Feb/15 ]

Oz provided me with an rc9 build last night:
2015-02-18T04:57:49.506-0500 I CONTROL [initandlisten] db version v3.0.0-rc9
2015-02-18T04:57:49.506-0500 I CONTROL [initandlisten] git version: ea5f871b550c1c3a8a5f0cd749fb47570557a067 modules: enterprise

I can reproduce the problem on this build. Just tried it again to double check and it crashed.
2015-02-18T04:59:06.135-0500 F REPL [conn17] Rollback of logOp not currently allowed (SERVER-15192)
2015-02-18T04:59:06.135-0500 I - [conn17] Fatal Assertion 18805

Comment by David Storch [ 18/Feb/15 ]

Hi buzz.moschetti,

It looks like the fassert() you observed while running this workload was due to SERVER-17198, fixed in 3.0.0-rc9. Would you be able to re-run the workload against rc9 in order to verify that the problem has been fixed?

Best,
Dave

Generated at Thu Feb 08 03:43:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.