[SERVER-18844] Reacquire the snapshot after commit/abort Created: 05/Jun/15  Updated: 02/Aug/18  Resolved: 21/Apr/16

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Storage
Affects Version/s: 3.3.4
Fix Version/s: 3.2.6, 3.3.5

Type: Task Priority: Critical - P2
Reporter: Igor Canadi Assignee: Kyle Suarez
Resolution: Done Votes: 0
Labels: code-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-22062 Foreground index build may hang 3.0.x... Closed
related to SERVER-23807 Updates should always throw WriteConf... Closed
related to SERVER-16816 Should assertInActiveTxn() for WiredT... Closed
Backwards Compatibility: Fully Compatible
Backport Completed:
Sprint: Integration 10 (02/22/16), Integration 13 (04/22/16)
Participants:

 Description   

If I understand correctly, Mongo's contract is to always reacquire the snapshot (i.e. do saveState + restoreState) after each commit/abort.
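
As a toy illustration of that contract (a sketch only: `ToyStore` and `ToyCursor` are hypothetical names, not MongoDB classes, and the real saveState/restoreState machinery is considerably more involved), a cursor that keeps its old snapshot across a commit keeps reading stale data until it saves and restores its state:

```python
class ToyStore:
    """Hypothetical key-value store with copy-on-commit snapshots."""

    def __init__(self):
        self._committed = {}

    def commit(self, writes):
        # Publish a new committed state; snapshots taken earlier are unaffected.
        self._committed = {**self._committed, **writes}

    def snapshot(self):
        # A snapshot is a private copy of the committed state at this moment.
        return dict(self._committed)


class ToyCursor:
    """Hypothetical cursor that reads only from the snapshot it holds."""

    def __init__(self, store):
        self._store = store
        self._snapshot = store.snapshot()

    def save_state(self):
        # Give up the current snapshot.
        self._snapshot = None

    def restore_state(self):
        # Reacquire a snapshot that reflects everything committed so far.
        self._snapshot = self._store.snapshot()

    def get(self, key):
        if self._snapshot is None:
            raise RuntimeError("cursor used between save_state and restore_state")
        return self._snapshot.get(key)


store = ToyStore()
store.commit({"a": 1})
cursor = ToyCursor(store)

store.commit({"a": 2})       # a commit happens while the cursor is live
stale = cursor.get("a")      # the cursor still reads from its old snapshot

cursor.save_state()
cursor.restore_state()       # reacquire the snapshot after the commit
fresh = cursor.get("a")      # the cursor now sees the committed write
```

In this model `stale` holds the pre-commit value and `fresh` the post-commit one, which is the difference the contract is meant to remove.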

There is (at least) one place in the code where this is not true. To demonstrate the issue I created a patch here: https://github.com/mongodb-partners/mongo/commits/v3.0-failure (on top of the current v3.0 branch)

The patch keeps track of live iterators and makes sure that they're not reused after a commit. Here's the stack trace of the invalidation failure: https://gist.github.com/igorcanadi/c15eb094583054a0918a
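
The idea behind that kind of check can be sketched roughly as follows. This is not the code from the linked patch; `TrackingRecoveryUnit`, `TrackedIterator`, and `IteratorReusedAfterCommit` are hypothetical names used only to show the bookkeeping, under the assumption that every commit or abort invalidates all iterators that have not saved their state:

```python
class IteratorReusedAfterCommit(Exception):
    """Raised when an iterator is advanced after a commit without restoring."""


class TrackingRecoveryUnit:
    """Hypothetical unit of work that remembers which iterators are live."""

    def __init__(self):
        self._live = set()

    def register(self, iterator):
        self._live.add(iterator)

    def unregister(self, iterator):
        self._live.discard(iterator)

    def commit(self):
        # Every iterator that is still live now points at a stale snapshot.
        for iterator in self._live:
            iterator.invalidated = True

    # In this sketch an abort invalidates live iterators the same way.
    abort = commit


class TrackedIterator:
    """Hypothetical iterator that refuses to advance once invalidated."""

    def __init__(self, recovery_unit, items):
        self._ru = recovery_unit
        self._items = list(items)
        self._pos = 0
        self.invalidated = False
        recovery_unit.register(self)

    def save_state(self):
        # A saved iterator is no longer live, so a commit cannot invalidate it.
        self._ru.unregister(self)

    def restore_state(self):
        # Restoring reacquires the snapshot and makes the iterator live again.
        self.invalidated = False
        self._ru.register(self)

    def advance(self):
        if self.invalidated:
            raise IteratorReusedAfterCommit("save_state/restore_state was skipped")
        item = self._items[self._pos]
        self._pos += 1
        return item
```

With this bookkeeping, calling `advance()` after `commit()` raises unless the iterator went through `save_state()` and `restore_state()` first, which turns a silent contract violation into an immediate failure.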

This is where the commit happens: https://github.com/mongodb/mongo/blob/master/src/mongo/db/catalog/index_create.cpp#L272, while the `exec` plan executor has not yielded.

This issue is causing background index build concurrency problems for RocksDB, as evidenced by https://jira.mongodb.org/browse/SERVER-18744. We've encountered this in production.

It would also be good to add some invariants in the code to make sure this contract is respected, since the behavior is a bit tricky.

Let me know if I'm misunderstanding anything.



 Comments   
Comment by Githook User [ 21/Apr/16 ]

Author: Kyle Suarez (ksuarz) <ksuarz@gmail.com>

Message: SERVER-22970 fix update race with background index build

There are two distinct fixes that must be done together:

  • SERVER-23807: update should throw write conflict on unindex
  • SERVER-18844: background index builds should reacquire the snapshot to
    properly detect write conflicts with concurrent updates

Branch: v3.2
https://github.com/mongodb/mongo/commit/be81cc9e83d09d4dc206c59656fc7e51c3e4fc12

Comment by Githook User [ 21/Apr/16 ]

Author: Kyle Suarez (ksuarz) <ksuarz@gmail.com>

Message: SERVER-22970 fix update race with background index build

There are two distinct fixes that must be done together:

  • SERVER-23807: update should throw write conflict on unindex
  • SERVER-18844: background index builds should reacquire the snapshot to
    properly detect write conflicts with concurrent updates

Branch: master
https://github.com/mongodb/mongo/commit/f27b0cac4869fa506c6ed6f0dfc885b9edcd765a

Comment by Githook User [ 22/Oct/15 ]

Author: Igor Canadi (igorcanadi) <icanadi@fb.com>

Message: SERVER-20650 backport recent changes from MongoRocks

This patch includes the following changes:

  • Don't lazily initialize _writeBatch
  • Abandon snapshots in commitAndRestart()
  • Use only 33% of RAM for MongoRocks block cache
  • Decrease block cache shard count to 2^6 from 2^7
  • SERVER-18844 Refresh iterators on commit()
  • SERVER-17062 Allow cursors over unique indexes with duplicates
  • Recompute dataSize and numRecords if numbers don't make sense

Signed-off-by: Ramon Fernandez <ramon@mongodb.com>
Branch: v3.0
https://github.com/mongodb/mongo/commit/2a4e6cd054ad8b1d3c54ca73fc6d852116eb89c5

Comment by Michael Kania [ 23/Jul/15 ]

Any feedback here?

Comment by Michael Kania [ 17/Jul/15 ]

What I've found so far is that operations on the database where the background index is being built are stuck waiting on a lock. db.currentOp() doesn't show anything holding an exclusive lock, yet all other operations on that database have waitingForLock:true. So something seems to hold a database lock without reporting it, and I can't tell what.

Also, the lock contention doesn't start as soon as the background index build starts. It only appears after some arbitrary period of time during the build. I'm trying to see whether a specific operation triggers it, but haven't found one yet.

Comment by Igor Canadi [ 17/Jul/15 ]

We have been running my temporary fix [1] in production, but it's causing some performance issues. Specifically, a bunch of getmore() calls get stuck while we build an index (they take 5 seconds, where they used to take 9ms). Once the index build is done, latencies drop back to normal. Do you have any idea what might cause this? I'm also interested in the tweaks that Mathias was thinking about. Hopefully they will solve our problems.

[1] https://github.com/mongodb-partners/mongo/commit/434888864ef4415921a4f2ad2a184447f394eaa7

Comment by Martin Bligh [ 29/Jun/15 ]

Mathias had a quick look at this and was going to make some small tweaks before applying.

Comment by Igor Canadi [ 23/Jun/15 ]

Hi Martin! Are there any updates on this task? I see that it's marked as "needs triage". Do you think the fix might make it into version 3.0.5?

I would also really appreciate it if you could look at my fix (it's only two lines) and give your opinion on its correctness.

Comment by Igor Canadi [ 08/Jun/15 ]

Here's my temporary fix: https://github.com/mongodb-partners/mongo/commit/434888864ef4415921a4f2ad2a184447f394eaa7

Comment by Igor Canadi [ 05/Jun/15 ]

s/not committed/not yielded
