[SERVER-18844] Reacquire the snapshot after commit/abort Created: 05/Jun/15 Updated: 02/Aug/18 Resolved: 21/Apr/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Index Maintenance, Storage |
| Affects Version/s: | 3.3.4 |
| Fix Version/s: | 3.2.6, 3.3.5 |
| Type: | Task | Priority: | Critical - P2 |
| Reporter: | Igor Canadi | Assignee: | Kyle Suarez |
| Resolution: | Done | Votes: | 0 |
| Labels: | code-only | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Backport Completed: | |
| Sprint: | Integration 10 (02/22/16), Integration 13 (04/22/16) |
| Participants: | |
| Description |
|
If I understand correctly, Mongo's contract is to always reacquire the snapshot (i.e. do saveState + restoreState) after each commit/abort. There is (at least) one place in the code where this is not true. To demonstrate the issue, I created a patch here: https://github.com/mongodb-partners/mongo/commits/v3.0-failure (on top of the current v3.0 branch). The patch keeps track of live iterators and makes sure that they're not reused after a commit. Here's the stack trace of the invalidation failure: https://gist.github.com/igorcanadi/c15eb094583054a0918a This is where the commit happens: https://github.com/mongodb/mongo/blob/master/src/mongo/db/catalog/index_create.cpp#L272, while the `exec` plan executor is not yielding. This issue is causing background index build concurrency problems for RocksDB, as evidenced by https://jira.mongodb.org/browse/SERVER-18744. We've encountered this in production. It would also be good to add some invariants in the code to make sure this contract is respected, since the behavior is a bit tricky. Let me know if I'm misunderstanding anything. |
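The kind of invariant the description asks for can be illustrated with a minimal sketch. Everything below is hypothetical: `ToyRecoveryUnit` and `ToyCursor` are toy stand-ins invented for illustration, not MongoDB's actual classes, and the real patch linked above may work differently. The idea is that each commit/abort ends the current snapshot, and a cursor refuses to be used until it has gone through saveState + restoreState again.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical stand-in for a storage-engine recovery unit (not MongoDB's class).
class ToyRecoveryUnit {
public:
    // Every commit or abort ends the current snapshot, so advance the epoch.
    void commit() { ++_snapshotEpoch; }
    void abort() { ++_snapshotEpoch; }
    std::uint64_t snapshotEpoch() const { return _snapshotEpoch; }

private:
    std::uint64_t _snapshotEpoch = 0;
};

// Hypothetical cursor that remembers which snapshot it is positioned in.
class ToyCursor {
public:
    explicit ToyCursor(const ToyRecoveryUnit& ru) : _ru(ru), _epoch(ru.snapshotEpoch()) {}

    // Detach from the current snapshot before a commit/abort or yield.
    void saveState() { _saved = true; }

    // Reattach to whatever snapshot is current now.
    void restoreState() {
        assert(_saved && "restoreState called without a matching saveState");
        _epoch = _ru.snapshotEpoch();
        _saved = false;
    }

    // True only if the cursor is attached to the snapshot that is still current.
    bool isUsable() const { return !_saved && _epoch == _ru.snapshotEpoch(); }

    // The invariant: using a cursor across a commit/abort without
    // saveState + restoreState is reported instead of silently allowed.
    int next() {
        if (!isUsable()) {
            throw std::logic_error("cursor used after commit/abort without restoreState");
        }
        return _position++;
    }

private:
    const ToyRecoveryUnit& _ru;
    std::uint64_t _epoch;
    bool _saved = false;
    int _position = 0;
};
```

In this toy model, a loop that commits between reads would call `saveState()` before each `commit()` and `restoreState()` after it. Skipping that pair makes the following `next()` throw, which is roughly the class of failure the live-iterator tracking patch is described as catching.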
| Comments |
| Comment by Githook User [ 21/Apr/16 ] |
|
Author: Kyle Suarez (ksuarz) <ksuarz@gmail.com>
Message: There are two distinct fixes that must be done together:
Branch: v3.2 |
| Comment by Githook User [ 21/Apr/16 ] |
|
Author: Kyle Suarez (ksuarz) <ksuarz@gmail.com>
Message: There are two distinct fixes that must be done together:
Branch: master |
| Comment by Githook User [ 22/Oct/15 ] |
|
Author: Igor Canadi (igorcanadi) <icanadi@fb.com>
Message: This patch includes the following changes:
Signed-off-by: Ramon Fernandez <ramon@mongodb.com> |
| Comment by Michael Kania [ 23/Jul/15 ] |
|
any feedback here? |
| Comment by Michael Kania [ 17/Jul/15 ] |
|
What I've been able to find so far is that operations on the database building the background index are stuck waiting on a lock. Looking at db.currentOp() doesn't show anything holding an exclusive lock. All other operations on that database have waitingForLock:true. So something seems to be lying about holding a database lock, and I can't tell where. Also, lock contention isn't triggered immediately when the background index build starts. It only happens after some arbitrary period of time during the index build. I'm trying to see if a specific operation triggers this, but haven't found it yet. |
| Comment by Igor Canadi [ 17/Jul/15 ] |
|
We have been running my temporary fix [1] in production, but it's causing some performance issues. Specifically, a bunch of getmore() calls get stuck when we build an index (they take 5 seconds, while they used to take 9ms). Once the index build is done, latencies drop back to normal. Do you have any idea what might cause this? I'm also interested in the tweaks that Mathias was thinking about. Hopefully the tweaks will solve our problems. [1] https://github.com/mongodb-partners/mongo/commit/434888864ef4415921a4f2ad2a184447f394eaa7 |
| Comment by Martin Bligh [ 29/Jun/15 ] |
|
Mathias had a quick look at this and was going to make some small tweaks before applying. |
| Comment by Igor Canadi [ 23/Jun/15 ] |
|
Hi Martin! Are there any updates for this task? I see that it's marked as "needs triage". Do you think the fix might make it into version 3.0.5? I would also really appreciate it if you could look at my fix (it's only two lines) and give your opinion on its correctness. |
| Comment by Igor Canadi [ 08/Jun/15 ] |
|
Here's my temporary fix: https://github.com/mongodb-partners/mongo/commit/434888864ef4415921a4f2ad2a184447f394eaa7 |
| Comment by Igor Canadi [ 05/Jun/15 ] |
|
s/not committed/not yielded |