[SERVER-14969] Dropping index during active aggregation operation can crash server Created: 20/Aug/14 Updated: 11/Jul/16 Resolved: 25/Aug/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework, MapReduce, Querying |
| Affects Version/s: | 2.6.4 |
| Fix Version/s: | 2.6.5, 2.7.6 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Roy | Assignee: | J Rassi |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Ubuntu 12.04.5 |
| Issue Links: | |
| Operating System: | Linux |
| Backport Completed: | |
| Steps To Reproduce: | It happened during an index drop, not sure if that's the reason, or something happening concurrently (e.g. index drop & index query?) |
| Participants: | |
| Description |
| Comments |
| Comment by Githook User [ 25/Aug/14 ] |
|
Author: Jason Rassi (username: jrassi, email: rassi@10gen.com)
Message: (cherry picked from commit 71e2312d2ad6d418ea223d6e003a065122c926d8) |
| Comment by Githook User [ 25/Aug/14 ] |
|
Author: Jason Rassi (username: jrassi, email: rassi@10gen.com)
Message: |
| Comment by J Rassi [ 25/Aug/14 ] |
|
The "killing agg executor doesn't kill underlying executor" part of this ticket is actually separate from (and has a separate cause from) the "invalidateAll doesn't kill agg executors" part of this ticket. Broke out the former into a new ticket at |
| Comment by J Rassi [ 21/Aug/14 ] |
|
The issue can be caused by running the dropIndexes command at the same time as an aggregate command. My reproducer follows. With it, I can reproduce the issue in master (2.7.6-pre) and 2.6.4, but not 2.4.11.
The sequence of events that causes the crash is:
1) the aggregation operation acquires a read lock, gets a PipelineRunner with a document source pipeline stage (which will use an index), and releases the read lock,
2) the dropIndexes operation acquires a write lock, drops the index, and releases the write lock, and
3) the aggregation acquires a read lock (in the document source stage pipeline), the document source pipeline stage attempts to read from the underlying runner, and the crash occurs (since the index has gone away).

As it affects the 2.6 branch, the issue stems from the fact that aggregation cursors aren't properly cleaned up in CollectionCursorCache::invalidateAll(), which is called from the "drop index" machinery. See collection_cursor_cache.cpp:321. A typical aggregation operation has a "user-facing" cursor (which has isAggCursor set to true, and is registered with the collection cursor cache), and an underlying cursor that points into the collection or index being scanned for the document source pipeline stage (which has isAggCursor set to false, and is not registered with the collection cursor cache). invalidateAll() needs to invoke kill() on the PipelineRunner associated with the registered aggregation cursor, which invokes kill() on the underlying runner. However, as currently written, invalidateAll() does not call kill() on the runner for aggregation cursors; this is incorrect.

The issue is more complicated in 2.7.6-pre, however. The Runner abstraction has been removed and replaced with PlanExecutor, and the PlanExecutor stage tree is not notified of kill() operations. It is still the case that the "user-facing" cursor is registered with the collection cursor cache, but even if kill() is invoked on the associated PlanExecutor, the API doesn't allow for the kill to be propagated down to the underlying executor; kill() on a PlanExecutor merely sets the "_killed" flag. The underlying executor needs to be told about the kill, because the parent executor may be in the middle of a getNext() call when the invalidate happens (note that executors with a PipelineProxyStage root execute under no lock; the locking is performed by DocumentSourceCursor when interacting with the underlying executor).

Here's a stack trace for the issue in 2.7.6-pre:
Assigning to hari.khalsa@10gen.com for triage. cc redbeard0531. |
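The kill propagation discussed in this comment can be illustrated with a small, single-threaded C++ model. This is only a sketch under stated assumptions: every name in it (Index, Executor, IndexScanExecutor, PipelineExecutor, CursorCache, dropDuringAggregationIsSafe) is a hypothetical stand-in rather than the server's real classes or the committed fix, locking is omitted entirely, and the crash is modeled as an exception thrown when a dropped index is read.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model of an index; "dropped" stands in for freed storage.
struct Index {
    std::vector<std::string> keys;
    bool dropped = false;
};

// Hypothetical stand-in for a Runner / PlanExecutor (not the real API).
class Executor {
public:
    virtual ~Executor() {}
    virtual void kill() = 0;
    virtual bool killed() const = 0;
    // Writes the next key to *out and returns true, or returns false at EOF / when killed.
    virtual bool getNext(std::string* out) = 0;
};

// Underlying executor: scans the index, is NOT registered with the cache.
class IndexScanExecutor : public Executor {
public:
    explicit IndexScanExecutor(std::shared_ptr<Index> index)
        : index_(std::move(index)), pos_(0), killed_(false) {}
    void kill() override { killed_ = true; }
    bool killed() const override { return killed_; }
    bool getNext(std::string* out) override {
        if (killed_) return false;  // a killed executor never touches the index
        if (index_->dropped) throw std::logic_error("read from dropped index");  // models the crash
        if (pos_ >= index_->keys.size()) return false;
        *out = index_->keys[pos_++];
        return true;
    }
private:
    std::shared_ptr<Index> index_;
    std::size_t pos_;
    bool killed_;
};

// "User-facing" aggregation executor: registered with the cache, wraps the scan.
class PipelineExecutor : public Executor {
public:
    explicit PipelineExecutor(std::shared_ptr<Executor> child)
        : child_(std::move(child)), killed_(false) {}
    void kill() override {
        killed_ = true;
        child_->kill();  // the propagation step: setting only the flag is not enough
    }
    bool killed() const override { return killed_; }
    // Does not consult its own flag, modeling a getNext() already in progress.
    bool getNext(std::string* out) override { return child_->getNext(out); }
private:
    std::shared_ptr<Executor> child_;
    bool killed_;
};

// Hypothetical cursor cache: invalidateAll() kills every registered executor,
// aggregation executors included.
class CursorCache {
public:
    void registerExecutor(std::shared_ptr<Executor> e) { registered_.push_back(std::move(e)); }
    void invalidateAll() {
        for (std::size_t i = 0; i < registered_.size(); ++i) registered_[i]->kill();
        registered_.clear();
    }
private:
    std::vector<std::shared_ptr<Executor> > registered_;
};

// Replays the three-step interleaving from this comment in a single thread.
bool dropDuringAggregationIsSafe() {
    std::shared_ptr<Index> index = std::make_shared<Index>();
    index->keys.push_back("a");
    index->keys.push_back("b");
    std::shared_ptr<IndexScanExecutor> scan = std::make_shared<IndexScanExecutor>(index);
    std::shared_ptr<PipelineExecutor> agg = std::make_shared<PipelineExecutor>(scan);
    CursorCache cache;
    cache.registerExecutor(agg);  // only the user-facing executor is registered

    std::string doc;
    bool gotFirst = agg->getNext(&doc);  // step 1: aggregation reads via the index
    assert(gotFirst && doc == "a");

    cache.invalidateAll();               // step 2: drop-index machinery invalidates...
    index->dropped = true;               // ...and the index goes away

    bool gotMore = agg->getNext(&doc);   // step 3: aggregation resumes
    return !gotMore && scan->killed();
}
```

In this model, calling dropDuringAggregationIsSafe() from a main() returns true because the kill reaches the index scan before step 3. Removing the child_->kill() line, or skipping aggregation executors in invalidateAll(), makes step 3 throw, which corresponds to the two separate failure modes described in this ticket.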
| Comment by Roy [ 20/Aug/14 ] |
|
Hi, |
| Comment by J Rassi [ 20/Aug/14 ] |
|
Hi Roy,

We are able to reproduce this issue. The information I requested earlier is no longer needed.

Thanks again for the report. Please continue to watch this ticket for updates on when a fix may be available, and for possible workaround information.

~ Jason Rassi |
| Comment by J Rassi [ 20/Aug/14 ] |
|
Demangled stack trace with file/line info: |
| Comment by J Rassi [ 20/Aug/14 ] |
|
I also noticed that the stack trace pasted from the log snippet indicates that the server crashed during a run of the "aggregate" command, not the "mapReduce" command (the log does show that the same database thread that was running "aggregate" at 10:05 was running "mapReduce" at 8:02). So I'll amend my second request for information as follows: do you know the full invocation of the /aggregate/ command that was being run on this secondary at the time? |
| Comment by J Rassi [ 20/Aug/14 ] |
|
Hi Roy,

I'd like to gather additional information to further diagnose this issue:

Thanks for reporting the issue.

~ Jason Rassi |