[SERVER-15580] Evaluating candidate query plans with concurrent writes on same collection may crash mongod
Created: 09/Oct/14 | Updated: 26/Sep/17 | Resolved: 31/Oct/14
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | 2.6.3 |
| Fix Version/s: | 2.6.6, 2.8.0-rc0 |
| Type: | Bug |
| Priority: | Critical - P2 |
| Reporter: | Kalle Varisvirta |
| Assignee: | J Rassi |
| Resolution: | Done |
| Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Backport Completed: | |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
Our MongoDB deployment (3 servers in a replica set) was running normally until the primary suddenly segfaulted. This is what's in the log (just the tail of it):
If we caused this somehow, any advice on how to avoid it would be hugely useful.
| Comments |
| Comment by Ramon Fernandez Marina [ 05/Nov/14 ] |
mike@meshfire.com, a 2.6.6 release is currently scheduled for December. Please tune in to mongodb-announce to receive the release announcement when 2.6.6 is published.
| Comment by Michael Templeman [ 04/Nov/14 ] |
Is there a timeline for the release of 2.6.6? In the meantime we are hinting all of our potentially long-running queries. Our logs indicated that a few long-running queries may be causing this problem.
| Comment by Githook User [ 31/Oct/14 ] |
Author: Jason Rassi (username: jrassi, email: rassi@10gen.com)
Message:
| Comment by Githook User [ 31/Oct/14 ] |
Author: Jason Rassi (username: jrassi, email: rassi@10gen.com)
Message: (cherry picked from commit e1ef3e6371f59c6a6396e78b2a4c4d9a975e9396)
| Comment by Kalle Varisvirta [ 10/Oct/14 ] |
Thank you! We'll try hinting the long queries next.
| Comment by J Rassi [ 10/Oct/14 ] |
Kalle: I forgot to mention that the second stack trace you posted indeed points to the same root cause.
| Comment by J Rassi [ 10/Oct/14 ] |
The root cause of this bug is that documents can be added to the working set's "flagged documents" list while the KEEP_MUTATIONS stage is in the middle of iterating through that list. This is a regression introduced in version 2.6.0 of the server. The conditions to reproduce this are:
At step 3 above, KeepMutationsStage::work() saves a const_iterator referring into _workingSet::_flagged, which is of type std::unordered_set<size_t>. At step 4, MultiPlanRunner::invalidate() calls ws->flagForReview(), which invokes insert() on the unordered_set. If the insert triggers a rehash of the set, then KeepMutationsStage's saved iterator is invalidated. At step 5, KeepMutationsStage::work() dereferences the invalidated iterator, which is undefined behavior. I can reproduce this with the following shell snippet:
When running this against the 2.6.5 Linux-64 mongod debug binary, the server outputs the following:
In my assessment, the only workaround available for this issue is to add hints to any long-running queries, which will disable use of the MultiPlanRunner. Kalle, I would advise that you add a hint to your long-running count queries, since the stack traces you posted indicated that both crashes happened during execution of a count. If this does not fix the issue for you, then you will need to downgrade to version 2.4.11 until version 2.6.6 is released. Setting this ticket's fix version to 2.6.6, 2.7.8.
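The iterator invalidation described in the root-cause analysis above can be illustrated in isolation. The following is a minimal sketch, not the server's code and not necessarily the approach taken in the actual fix: FlaggedSet, insertCausesRehash, countRehashes, and snapshotFlagged are hypothetical names introduced here. It detects a rehash by watching bucket_count() rather than dereferencing an invalidated iterator, since doing the latter is undefined behavior. It also shows one possible way to avoid the hazard, which is to iterate over a copy of the set.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the working set's "flagged documents" list.
using FlaggedSet = std::unordered_set<std::size_t>;

// Inserts `id` and reports whether the insert changed the bucket count,
// i.e. whether a rehash happened. A rehash invalidates every iterator
// that was obtained from the set before the insert.
bool insertCausesRehash(FlaggedSet& flagged, std::size_t id) {
    const std::size_t bucketsBefore = flagged.bucket_count();
    flagged.insert(id);
    return flagged.bucket_count() != bucketsBefore;
}

// Counts how many of `n` consecutive inserts into an empty set rehash it.
// Any iterator saved across one of those inserts would be left dangling.
std::size_t countRehashes(std::size_t n) {
    FlaggedSet flagged;
    std::size_t rehashes = 0;
    for (std::size_t id = 0; id < n; ++id) {
        if (insertCausesRehash(flagged, id)) {
            ++rehashes;
        }
    }
    return rehashes;
}

// One possible safe pattern (an assumption, not the confirmed fix):
// copy the flagged ids into a vector and iterate over the copy, so that
// later inserts into the set cannot affect the iteration state.
std::vector<std::size_t> snapshotFlagged(const FlaggedSet& flagged) {
    std::vector<std::size_t> snapshot(flagged.begin(), flagged.end());
    assert(snapshot.size() == flagged.size());
    return snapshot;
}
```

Calling countRehashes(1000) returns a nonzero count on common standard library implementations, because the set has to grow its bucket array several times on the way to 1000 elements. A stage that holds an iterator across any one of those inserts would hit the failure described above, whereas a vector returned by snapshotFlagged stays valid regardless of what is inserted into the set afterwards.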
| Comment by Kalle Varisvirta [ 10/Oct/14 ] |
The same machine crashed again today:
Not sure whether this has the same cause.
| Comment by Thomas Rueckstiess [ 09/Oct/14 ] |
Hi Kalle, Thanks for reporting this problem. We've found a potential concurrency bug that may in rare cases cause the issue you're seeing. We will investigate further and update the ticket once we know more. Regards,