[SERVER-15038] Multiple background index builds may not interrupt cleanly for commands, on secondaries Created: 26/Aug/14 Updated: 02/Mar/15 Resolved: 28/Aug/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.6.4 |
| Fix Version/s: | 2.6.5 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Edu Herraiz | Assignee: | Eric Milkie |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
| Issue Links: |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: |
| Description |
|
Issue Status as of Aug 28, 2014

ISSUE SUMMARY
In some rare cases, this issue may crash secondary nodes.

USER IMPACT
If a secondary node crashes, it needs to be restarted. If the secondary has become stale, it needs to be resynchronized by removing its data and performing an initial sync.

WORKAROUNDS

AFFECTED VERSIONS

FIX VERSION

RESOLUTION DETAILS

Original description
After the bugfix in 2.6.4 for the related bug https://jira.mongodb.org/browse/SERVER-14494, we updated to 2.6.4, but we are still reproducing the error:
|
| Comments |
| Comment by Eric Milkie [ 28/Aug/14 ] |
|
The following are workarounds for this issue.
| Comment by Githook User [ 28/Aug/14 ] |
|
Author: Eric Milkie (milkie) <milkie@10gen.com>
Message: Scenario: On a secondary, multiple background index builds are running on the same collection. Those builds are now yielding, and the replication applier attempts to apply a command that requires the index builds to terminate, such as a drop collection command. The index builds' curops are flagged as killed, and then the command proceeds. The drop command asserts with BgOpInProg, which causes the replication applier to yield, wait for the index build operations to recover from yield and clean themselves up, and then run the drop command again, expecting it to succeed this time.

When the first index build recovers from yield, it finds its cursor in the cursor cache just fine (since the drop command hasn't yet run successfully; it's yielding in the applier code). It then processes one or more documents to index. Eventually, it gets around to yielding again, at which point it notices that it has been interrupted and begins to clean itself up. Part of the cleanup code is to call dropIndex() to remove itself from the catalog. This function calls invalidateAll() on the cursor cache, which some might say is way overkill, as it invalidates all cursors in the collection, not just ones that are currently pointing at the index being dropped. The cursor invalidation includes the cursor that the second index build is using to build its index.

The fix is, after an index build yields and regains its lock, to check for interrupt before recovering its cursor.
| Comment by Eric Milkie [ 27/Aug/14 ] |
|
Scenario: On a secondary, multiple background index builds are running on the same collection. Those builds are now yielding, and the replication applier attempts to apply a command that requires the index builds to terminate, such as a drop collection command. The index builds' curops are flagged as killed, and then the command proceeds. The drop command asserts with BgOpInProg, which causes the replication applier to yield, wait for the index build operations to recover from yield and clean themselves up, and then run the drop command again, expecting it to succeed this time.

When the first index build recovers from yield, it finds its cursor in the cursor cache just fine (since the drop command hasn't yet run successfully; it's yielding in the applier code). It then processes one or more documents to index. Eventually, it gets around to yielding again, at which point it notices that it has been interrupted and begins to clean itself up. Part of the cleanup code is to call dropIndex() to remove itself from the catalog. This function calls invalidateAll() on the cursor cache, which some might say is way overkill, as it invalidates all cursors in the collection, not just ones that are currently pointing at the index being dropped. The cursor invalidation includes the cursor that the second index build is using to build its index.

The fix is, after an index build yields and regains its lock, to check for interrupt before recovering its cursor.
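The ordering problem described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not MongoDB source code: `CursorCache`, `IndexBuild`, `resume_after_yield`, and `run_scenario` are hypothetical names invented for illustration, and the "cursor lost" outcome merely stands in for whatever failure the second index build hits once its cursor has been invalidated.

```python
from dataclasses import dataclass, field


@dataclass
class CursorCache:
    """Toy stand-in for a per-collection cursor cache (hypothetical)."""
    cursors: set = field(default_factory=set)

    def invalidate_all(self) -> None:
        # Mirrors the behaviour described in the comment: every cursor on
        # the collection is dropped, not just the one for a single index.
        self.cursors.clear()


@dataclass
class IndexBuild:
    """Toy stand-in for one background index build (hypothetical)."""
    name: str
    killed: bool = False


def resume_after_yield(build: IndexBuild, cache: CursorCache,
                       check_interrupt_first: bool) -> str:
    """Model one build regaining its lock after a yield."""
    if check_interrupt_first and build.killed:
        # Fixed ordering: notice the kill before touching the cursor.
        cache.invalidate_all()  # cleanup, analogous to dropIndex()
        return "cleaned up"
    if build.name not in cache.cursors:
        # The cursor was invalidated by another build's cleanup.
        return "cursor lost"
    # (documents would be indexed here)
    if build.killed:
        # Old ordering: the kill is only noticed at the next yield.
        cache.invalidate_all()  # cleanup, analogous to dropIndex()
        return "cleaned up"
    return "yielded"


def run_scenario(check_interrupt_first: bool) -> list:
    """Two yielded builds on one collection, both flagged as killed."""
    cache = CursorCache({"build_a", "build_b"})
    builds = [IndexBuild("build_a"), IndexBuild("build_b")]
    for build in builds:
        build.killed = True  # e.g. a drop collection command arrived
    return [resume_after_yield(b, cache, check_interrupt_first)
            for b in builds]


if __name__ == "__main__":
    print("old ordering:  ", run_scenario(check_interrupt_first=False))
    print("fixed ordering:", run_scenario(check_interrupt_first=True))
```

In this model, the old ordering lets the first build's cleanup invalidate the second build's cursor before the second build has noticed it was killed, while the fixed ordering lets both builds clean up without attempting cursor recovery.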
| Comment by Ramon Fernandez Marina [ 27/Aug/14 ] |
|
Thanks for the additional information, eduherraiz. After some investigation we've found an issue that we believe is the cause of the problem. We are working on a fix now. Regards,
| Comment by Edu Herraiz [ 27/Aug/14 ] |
|
This is the log for the last 5 days on lohap2.
| Comment by Edu Herraiz [ 27/Aug/14 ] |
|
Ramon Fernandez, I'm quite sure this node was running the 2.6.4 binary; it's just a replica set with 2 nodes and 1 arbiter. The problem was reproduced twice after the upgrade (the fix requires a restart).
| Comment by Ramon Fernandez Marina [ 26/Aug/14 ] |
|
eduherraiz, do you have full logs for the affected node? Could it be that the mongod binary was upgraded to 2.6.4 while this node was still running an earlier version? I'm asking because I have not yet been able to reproduce this problem. What I see is:
Can you please upload full logs for host lohap2? Thanks,