[SERVER-15038] Multiple background index builds may not interrupt cleanly for commands, on secondaries Created: 26/Aug/14  Updated: 02/Mar/15  Resolved: 28/Aug/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.6.4
Fix Version/s: 2.6.5

Type: Bug Priority: Major - P3
Reporter: Edu Herraiz Assignee: Eric Milkie
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongod.log    
Issue Links:
Related
related to SERVER-14494 Dropping collection during active bac... Closed
is related to SERVER-15871 Segmentation fault in secondary rebui... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL

 Description   
Issue Status as of Aug 28, 2014

ISSUE SUMMARY
On a replica set, when a secondary is building multiple background indices on a given collection and that collection is dropped, the index builds may not be interrupted properly and the secondary may fail with an assertion error as follows:

2014-08-26T11:18:13.384-0400 [repl writer worker 1] Assertion: 17348:cannot dropAllIndexes when index builds in progress

In some rare cases, this issue may crash secondary nodes.

USER IMPACT
The affected collection and indices may not be successfully dropped on secondary nodes, so if one of these nodes later becomes primary, the collection and indices may still be present.

If a secondary node crashes, it needs to be restarted. If the secondary has become stale, it needs to be resynchronized by removing its data and performing an initial sync.

WORKAROUNDS
The workaround is to avoid running multiple background index builds concurrently with collection drops (a pymongo sketch follows this list):

  • by making the indexes foreground builds, or
  • by synchronizing the collection drop with index build completion on secondaries through the use of write concern on the background index builds
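
To make these options concrete, here is a minimal pymongo sketch of both workarounds. It is an illustration only: the connection string, the write concern value (w=2, matching the two data-bearing members of the reporter's replica set), and the timeout are assumptions, and it uses a current pymongo API; drivers contemporary with 2.6 expressed write concern differently (via getLastError), but the idea is the same, i.e. do not issue the drop until the secondaries have acknowledged the index build.

# Hedged sketch of the two workarounds; not an official recipe.
from pymongo import MongoClient, ASCENDING
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://host1,host2/?replicaSet=rs0")   # hypothetical hosts
coll = client["542-110-6347"]["keyword_adgroup"]                 # names taken from the log

# Workaround 1: build the index in the foreground (no background: true).
# A foreground build replicates as a foreground build, so it cannot be
# interleaved with a later replicated drop on the secondary.
coll.create_index([("adgroup", ASCENDING)])

# Workaround 2: keep the background build, but attach a write concern so the
# client blocks until both data-bearing members have acknowledged the build,
# and only issue the drop after that acknowledgement.
acked = coll.with_options(write_concern=WriteConcern(w=2, wtimeout=600000))
acked.create_index([("campaign", ASCENDING)], background=True)
acked.drop()   # safe to replicate now: no index build is still in flight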

AFFECTED VERSIONS
MongoDB 2.6 production releases up to 2.6.4 are affected by this issue.

FIX VERSION
The fix is included in the 2.6.5 production release.

RESOLUTION DETAILS
After an index build yields and then regains its lock, it now checks for interruption before recovering its cursor.
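
To illustrate why the ordering matters, the following toy Python model (not mongod source; all class and variable names are invented) reproduces the interaction described in the commit message quoted in the comments below: two background builds on one collection share a cursor cache, a replicated drop flags both as killed, and the first build's cleanup invalidates the second build's cursor. Checking the kill flag before recovering the cursor lets both builds interrupt cleanly.

# Toy model of the interrupted-index-build ordering; purely illustrative.
class Interrupted(Exception):
    """Raised when a build notices it has been flagged as killed."""


class IndexBuild:
    def __init__(self, name, cursor_cache):
        self.name = name
        self.cursor_cache = cursor_cache
        self.kill_pending = False        # set by the replicated drop command
        cursor_cache.add(name)           # register this build's cursor

    def resume_after_yield(self, check_interrupt_first):
        # SERVER-15038 fix: look at the kill flag *before* touching the cursor
        # cache, so a cursor invalidated by another build's cleanup is never needed.
        if check_interrupt_first and self.kill_pending:
            raise Interrupted(f"{self.name}: interrupted before cursor recovery")
        if self.name not in self.cursor_cache:
            # Old behavior: the cursor was invalidated by the other build's
            # cleanup, so this build aborts in a way the drop did not expect.
            raise RuntimeError(f"{self.name}: cursor vanished, unexpected abort")
        if self.kill_pending:
            # Cleanup path: dropping our own index invalidates *all* cursors
            # on the collection, including the other build's cursor.
            self.cursor_cache.clear()
            raise Interrupted(f"{self.name}: interrupted after cursor recovery")


def run(check_interrupt_first):
    cache = set()
    builds = [IndexBuild("adgroup_1", cache), IndexBuild("campaign_1", cache)]
    for b in builds:
        b.kill_pending = True            # the drop command flags both builds
    for b in builds:
        try:
            b.resume_after_yield(check_interrupt_first)
            print(f"{b.name}: finished normally")
        except Interrupted as exc:
            print(f"clean interrupt -> {exc}")
        except RuntimeError as exc:
            print(f"BUG -> {exc}")


run(check_interrupt_first=False)   # pre-fix ordering: second build aborts badly
run(check_interrupt_first=True)    # fixed ordering: both builds interrupt cleanly

The actual fix is the equivalent reordering in the v2.6 index build code; see the commit linked in the comments.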

Original description

After the bugfix in 2.6.4 for the related issue https://jira.mongodb.org/browse/SERVER-14494, we updated to 2.6.4, but we are still reproducing the error:

2014-08-25T05:02:21.430+0200 [repl index builder 35135] build index on: 542-110-6347.keyword_adgroup properties: { v: 1, key: { adgroup: 1 }, name: "adgroup_1", ns: "542-110-6347.keyword_adgroup", background: true }
2014-08-25T05:02:21.430+0200 [repl index builder 35135]          building index in background
2014-08-25T05:02:21.537+0200 [repl writer worker 12] CMD: drop 542-110-6347.keyword_adgroup
2014-08-25T05:02:21.537+0200 [repl writer worker 12] halting index build: { adgroup: 1 }
2014-08-25T05:02:21.537+0200 [repl writer worker 12] halting index build: { campaign: 1 }
2014-08-25T05:02:21.537+0200 [repl writer worker 12] SEVERE: Invalid access at address: 0x1c00736e3a
2014-08-25T05:02:21.542+0200 [repl writer worker 12] SEVERE: Got signal: 11 (Segmentation fault).
Backtrace:0x11e6111 0x11e54ee 0x11e55df 0x7f5367a15340 0x8d9077 0xb898a5 0xa2eaa9 0xa31380 0xa2889a 0xa2a651 0xa2bea6 0xe50dd3 0xeb9819 0xeb9d60 0x117b6ee 0x122aaa9 0x7f5367a0d182 0x7f5366d1230d 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11e6111]
 /usr/bin/mongod() [0x11e54ee]
 /usr/bin/mongod() [0x11e55df]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x10340) [0x7f5367a15340]
 /usr/bin/mongod(_ZN5mongo12IndexCatalog23killMatchingIndexBuildsERKNS0_17IndexKillCriteriaE+0x227) [0x8d9077]
 /usr/bin/mongod(_ZN5mongo12IndexBuilder23killMatchingIndexBuildsEPNS_10CollectionERKNS_12IndexCatalog17IndexKillCriteriaE+0x15) [0xb898a5]
 /usr/bin/mongod(_ZN5mongo7CmdDrop15stopIndexBuildsEPNS_8DatabaseERKNS_7BSONObjE+0xe9) [0xa2eaa9]
 /usr/bin/mongod(_ZN5mongo7CmdDrop3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x1c0) [0xa31380]
 /usr/bin/mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x3a) [0xa2889a]
 /usr/bin/mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x19b1) [0xa2a651]
 /usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x6c6) [0xa2bea6]
 /usr/bin/mongod(_ZN5mongo21applyOperation_inlockERKNS_7BSONObjEbb+0x973) [0xe50dd3]
 /usr/bin/mongod(_ZN5mongo7replset8SyncTail9syncApplyERKNS_7BSONObjEb+0x749) [0xeb9819]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x50) [0xeb9d60]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x19e) [0x117b6ee]
 /usr/bin/mongod() [0x122aaa9]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7f5367a0d182]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5366d1230d]
2014-08-25T08:34:21.426+0200 ***** SERVER RESTARTED *****
2014-08-25T08:34:21.429+0200 [initandlisten] MongoDB starting : pid=1422 port=27017 dbpath=/var/lib/mongodb 64-bit host=lohap2
2014-08-25T08:34:21.429+0200 [initandlisten] db version v2.6.4



 Comments   
Comment by Eric Milkie [ 28/Aug/14 ]

The following are workarounds for this issue.
Avoid the situation of multiple background index builds concurrent with collection drops:
1. by making the indexes foreground builds, or
2. by synchronizing collection drop with index build completion on secondaries through the use of write concern on the background index builds

Comment by Githook User [ 28/Aug/14 ]

Author: Eric Milkie (milkie) <milkie@10gen.com>

Message: SERVER-15038 coerce correct behavior when multiple bg index builds are interrupted

Scenario: On a secondary, multiple background index builds are running on the same collection. Those builds are now yielding, and the replication applier attempts to apply a command that requires the index builds to terminate, such as a drop collection command. Such index builds' curops are flagged as killed, and then the command proceeds. The drop command asserts with BgOpInProg, which causes the replication applier to yield, wait for the index build operations to recover from yield and clean themselves up, and then run the drop command again, expecting it to succeed this time.

When the first index build recovers from yield, it finds its cursor in the cursor cache just fine (since the drop command hasn't yet run successfully; it's yielding in the applier code). It then processes one or more documents to index. Eventually, it gets around to yielding again, at which point it notices that it has been interrupted and begins to clean itself up. Part of the cleanup code is to call dropIndex() to remove itself from the catalog. This function calls invalidateAll() on the cursor cache, which some might say is way overkill, as it invalidates all cursors in the collection, not just ones that are currently pointing at the index being dropped. The cursor invalidation includes the cursor that the second index build is using to build its index.
When the second index build recovers from yield, it doesn't find its cursor in the cache, because it has been invalidated. This causes the index build to abort in a way that the drop command was not expecting. The drop command then does not actually succeed on the secondary, leaving the collection in place.

The fix is, after an index build yields and regains its lock, to check for interrupt before recovering our cursor.
Branch: v2.6
https://github.com/mongodb/mongo/commit/099ed9309ff37161ef7574d5b506db97f95a575d

Comment by Eric Milkie [ 27/Aug/14 ]

Scenario: On a secondary, multiple background index builds are running on the same collection. Those builds are now yielding, and the replication applier attempts to apply a command that requires the index builds to terminate, such as a drop collection command. Such index builds' curops are flagged as killed, and then the command proceeds. The drop command asserts with BgOpInProg, which causes the replication applier to yield, wait for the index build operations to recover from yield and clean themselves up, and then run the drop command again, expecting it to succeed this time.

When the first index build recovers from yield, it finds its cursor in the cursor cache just fine (since the drop command hasn't yet run successfully; it's yielding in the applier code). It then processes one or more documents to index. Eventually, it gets around to yielding again, at which point it notices that it has been interrupted and begins to clean itself up. Part of the cleanup code is to call dropIndex() to remove itself from the catalog. This function calls invalidateAll() on the cursor cache, which some might say is way overkill, as it invalidates all cursors in the collection, not just ones that are currently pointing at the index being dropped. The cursor invalidation includes the cursor that the second index build is using to build its index.
When the second index build recovers from yield, it doesn't find its cursor in the cache, because it has been invalidated. This causes the index build to abort in a way that the drop command was not expecting. The drop command then does not actually succeed on the secondary, leaving the collection in place.

The fix is, after an index build yields and regains its lock, to check for interrupt before recovering our cursor.

Comment by Ramon Fernandez Marina [ 27/Aug/14 ]

Thanks for the additional information eduherraiz. After some investigation we've found an issue that we believe is the cause of the problem. Working on a fix now.

Regards,
Ramón.

Comment by Edu Herraiz [ 27/Aug/14 ]

This is the log for the last 5 days in lohap2.

Comment by Edu Herraiz [ 27/Aug/14 ]

Ramon Fernandez, I'm very sure this node was running the 2.6.4 binary; it's just a replica set with 2 nodes and 1 arbiter. The problem has reproduced twice since the upgrade (and applying the fix requires a restart).

Comment by Ramon Fernandez Marina [ 26/Aug/14 ]

eduherraiz, do you have full logs for the affected node? Could it be that the mongod binary was upgraded to 2.6.4 while this node was still running an earlier version? The reason I'm asking is because I have not yet been able to reproduce this problem. What I see is:

2014-08-26T11:18:13.384-0400 [repl writer worker 1] CMD: drop test.foo
2014-08-26T11:18:13.384-0400 [repl writer worker 1] halting index build: { x: 1.0 }
2014-08-26T11:18:13.384-0400 [repl writer worker 1] halted 1 index build(s)
2014-08-26T11:18:13.384-0400 [repl writer worker 1] Assertion: 17348:cannot dropAllIndexes when index builds in progress

Can you please upload full logs for host lohap2?

Thanks,
Ramón.
