[SERVER-28579] Data race involving capped collection truncation and PlanExecutor kill notifications
Created: 31/Mar/17 | Updated: 06/Dec/17 | Resolved: 18/Apr/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying, Storage |
| Affects Version/s: | None |
| Fix Version/s: | 3.5.7 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | David Storch | Assignee: | David Storch |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Query 2017-05-08 |
| Participants: |
| Description |
|
There are several events which can cause all active query plan executors to be marked as killed:
Generally, these events involve the acquisition of a MODE_X lock on the collection, which means that any active queries must have yielded all locks. After obtaining exclusive access to the collection, we iterate the list of registered PlanExecutors and mark them as killed:

https://github.com/mongodb/mongo/blob/r3.5.4/src/mongo/db/catalog/cursor_manager.cpp#L333-L339

This writes to PlanExecutor::_killReason. Whenever a PlanExecutor is used, it first consults PlanExecutor::_killReason. If the kill reason is set, an error is propagated to the caller. This means that the client will receive the appropriate error if, for example, the query's collection is dropped during its execution.

The path for capped collection truncation, however, only requires a MODE_IX lock:

https://github.com/mongodb/mongo/blob/r3.5.4/src/mongo/db/catalog/collection.cpp#L923

This means that the thread calling Collection::cappedTruncateAfter() can be writing to PlanExecutor::_killReason at the same time that the PlanExecutor is reading it! |
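To make the lock-mode argument concrete, here is a minimal sketch of the standard multi-granularity lock compatibility matrix. It is not the server's code: `SketchLockMode`, `compatible()` and `killCanRaceWithQuery()` are hypothetical names, and the sketch assumes that a running query holds the collection in MODE_IS while it reads its kill reason.

```cpp
#include <cassert>

// Hypothetical enum borrowing the mode names used in this ticket; it is not
// MongoDB's actual lock mode type.
enum class SketchLockMode { IS = 0, IX = 1, S = 2, X = 3 };

// Standard multi-granularity compatibility matrix: returns true when two
// different threads may hold modes `a` and `b` on the same resource at once.
inline bool compatible(SketchLockMode a, SketchLockMode b) {
    static const bool kTable[4][4] = {
        //         IS     IX     S      X
        /* IS */ {true,  true,  true,  false},
        /* IX */ {true,  true,  false, false},
        /* S  */ {true,  false, true,  false},
        /* X  */ {false, false, false, false},
    };
    const int row = static_cast<int>(a);
    const int col = static_cast<int>(b);
    assert(row >= 0 && row < 4 && col >= 0 && col < 4);
    return kTable[row][col];
}

// Assumption: a running query holds the collection in MODE_IS while it reads
// its kill reason. The kill write can overlap with that read exactly when the
// killer's mode is compatible with the query's mode.
inline bool killCanRaceWithQuery(SketchLockMode killerMode,
                                 SketchLockMode queryMode = SketchLockMode::IS) {
    return compatible(queryMode, killerMode);
}
```

Under these assumptions, `killCanRaceWithQuery(SketchLockMode::IX)` is true and `killCanRaceWithQuery(SketchLockMode::X)` is false. Because the kill reason is described here as a field that one thread writes and another reads, an overlap of that kind is a data race unless the field is otherwise synchronized.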
| Comments |
| Comment by Githook User [ 18/Apr/17 ] |
|
Author: David Storch (dstorch) <david.storch@10gen.com> Message: |
| Comment by David Storch [ 14/Apr/17 ] |
|
It looks like the only callers are already holding at least a MODE_X lock on the collection being truncated. Therefore, Milkie's assessment that the bug is not impactful sounds correct. It also means that we can strengthen the contract for Collection::cappedTruncateAfter() to require that callers hold the collection lock in at least MODE_X. |
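A strengthened contract of this kind could be sketched as a precondition check at the top of the truncation routine. Everything below is hypothetical: `HeldMode`, `SketchCappedCollection` and `truncateAfter()` are illustrative stand-ins, the caller passes in the mode it holds instead of it being looked up from a lock manager, and the sketch throws where a server would more likely abort on an invariant failure.

```cpp
#include <stdexcept>

// Hypothetical lock modes a caller might hold on the collection.
enum class HeldMode { None, IS, IX, S, X };

// Hypothetical stand-in for a capped collection whose documents are
// identified by increasing integer ids.
class SketchCappedCollection {
public:
    explicit SketchCappedCollection(int lastId) : _lastId(lastId) {}

    // Drops every document after `endId`.
    // Contract: the caller must hold the collection lock in MODE_X, so no
    // query can be reading its kill reason while executors are marked killed.
    void truncateAfter(HeldMode held, int endId) {
        if (held != HeldMode::X) {
            throw std::logic_error(
                "truncateAfter requires the collection lock in MODE_X");
        }
        if (endId < _lastId) {
            _lastId = endId;
        }
    }

    int lastId() const { return _lastId; }

private:
    int _lastId;
};
```

With this check in place, a call made under `HeldMode::IX` is rejected before any state changes, while a call under `HeldMode::X` proceeds.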
| Comment by Eric Milkie [ 01/Apr/17 ] |
|
Is it correct that we only call cappedTruncateAfter() as part of rollback? If so, there can be no readers of the oplog collection when this happens, so at least the current scope of this bug is not impactful. |