[SERVER-19128] Fatal assertion during secondary index build Created: 25/Jun/15 Updated: 19/Jan/16 Resolved: 25/Sep/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Index Maintenance |
| Affects Version/s: | 3.1.5 |
| Fix Version/s: | 3.0.9, 3.1.9 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | Mathias Stearn |
| Resolution: | Done | Votes: | 1 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Completed: | |
| Sprint: | QuInt A (10/12/15) |
| Participants: | |
| Linked BF Score: | 0 |
| Description |
|
During an evergreen run, the FSM replication+sharding suite failed with the following fatal assertion:
The failing run is here and this is the actual run log. |
| Comments |
| Comment by Githook User [ 18/Jan/16 ] | ||||||||||||||||||||
|
Author: Mathias Stearn (RedBeard0531) <redbeard0531@gmail.com>. Message: This reverts commit 90a684ad25a86deff16f80e11e257c6ac6611683, restoring | ||||||||||||||||||||
| Comment by Githook User [ 13/Jan/16 ] | ||||||||||||||||||||
|
Author: Spencer Jackson (spencerjackson) <spencer.jackson@mongodb.com>. Message: Revert " This reverts commit 1d26b77d115eb39f03dffbdbaccf10e696cd4fe3. | ||||||||||||||||||||
| Comment by Githook User [ 12/Jan/16 ] | ||||||||||||||||||||
|
Author: Mathias Stearn (RedBeard0531) <mathias@10gen.com>. Message: (cherry picked from commit 795a8ebd80a9f91fc1484cfdc33b6609d0bc4a35) | ||||||||||||||||||||
| Comment by Githook User [ 12/Jan/16 ] | ||||||||||||||||||||
|
Author: Eric Milkie (milkie) <milkie@10gen.com>. Message: (cherry picked from commit 7494d0c5a65d3e7128aa8a8857ce78dd7aea1ee6) | ||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 26/Sep/15 ] | ||||||||||||||||||||
|
Adding backport request as requested by Michael Kania above. | ||||||||||||||||||||
| Comment by Githook User [ 25/Sep/15 ] | ||||||||||||||||||||
|
Author: Mathias Stearn (RedBeard0531) <mathias@10gen.com>. Message: | ||||||||||||||||||||
| Comment by Igor Canadi [ 14/Aug/15 ] | ||||||||||||||||||||
|
facepalm How about now: https://github.com/mongodb-partners/mongo/commit/f33ee2a2d91f4b532431e5c430ea317122f80849 | ||||||||||||||||||||
| Comment by Kamran K. [ 13/Aug/15 ] | ||||||||||||||||||||
|
igor, that patch looks flawed to me:
| ||||||||||||||||||||
| Comment by Igor Canadi [ 13/Aug/15 ] | ||||||||||||||||||||
|
Thanks Ramon! Really appreciate the help! BTW this is how we're trying to fix the bug currently: https://github.com/mongodb-partners/mongo/commit/01f13629260139342e6ebadbe6aea9aac4a92e57. YOLO | ||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 13/Aug/15 ] | ||||||||||||||||||||
|
Thanks for the additional info dynamike and igor. I had missed Eric's comment above on what's causing this issue. We're looking into a fix for this dev cycle which hopefully can be backported to v3.0. | ||||||||||||||||||||
| Comment by Kamran K. [ 13/Aug/15 ] | ||||||||||||||||||||
|
We enabled concurrency tests for replica sets and sharded replica sets during the 3.1.x cycle (see: We haven't yet backported those changes because the concurrency tests are undergoing a period of stabilization to allow them to run more reliably on resource-constrained test hosts via Evergreen. | ||||||||||||||||||||
| Comment by Igor Canadi [ 13/Aug/15 ] | ||||||||||||||||||||
|
Looks like the failing test is not run on the 3.0 branch. In fact, none of the replication concurrency tests are: https://github.com/mongodb/mongo/blob/v3.0/jstests/concurrency/fsm_all_replication.js#L29 Ramon, any idea why that is? | ||||||||||||||||||||
| Comment by Michael Kania [ 13/Aug/15 ] | ||||||||||||||||||||
|
Attached. This is 3.0.5 with RocksDB. Igor Canadi is looking into whether this is potentially linked to something on the RocksDB side. | ||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 13/Aug/15 ] | ||||||||||||||||||||
|
Hey dynamike, can you please post the stack trace / logs you're getting? We need to make sure you're running into this exact problem – and if you are then yes, a backport should definitely be considered. Also, which storage engine are you seeing this fassert() with? Thanks, | ||||||||||||||||||||
| Comment by Michael Kania [ 13/Aug/15 ] | ||||||||||||||||||||
|
This also appears to impact 3.0.5. Can we make the fix a backport? | ||||||||||||||||||||
| Comment by Githook User [ 28/Jul/15 ] | ||||||||||||||||||||
|
Author: Kamran Khan (kkmongo) <kamran.khan@mongodb.com>. Message: The workload currently triggers a fatal assertion and will be re-enabled | ||||||||||||||||||||
| Comment by Githook User [ 27/Jul/15 ] | ||||||||||||||||||||
|
Author: Kamran Khan (kkmongo) <kamran.khan@mongodb.com>. Message: The workload currently triggers a fatal assertion and will be re-enabled | ||||||||||||||||||||
| Comment by Eric Milkie [ 17/Jul/15 ] | ||||||||||||||||||||
|
Note that this only happens on secondaries because indexes there are built using index_builder.cpp. Index builds on primary and standalone nodes are generally handled by equivalent code in create_indexes.cpp, which handles WriteConflictExceptions (WCEs) differently. | ||||||||||||||||||||
| Comment by Eric Milkie [ 17/Jul/15 ] | ||||||||||||||||||||
|
I think I figured out the issue: if an index build hits a WCE, the recovery ends up dropping the index and rebuilding the entire thing from scratch. By dropping the incomplete index, we end up invalidating all cursors on the collection, which means any other background index builds on that same collection end up failing too. | ||||||||||||||||||||
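The failure mode described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not the server code: the `Collection`, `Cursor`, and `simulate` names are hypothetical, and the rule that dropping an index kills every open cursor on the collection is taken directly from the comment above.

```python
from enum import Enum


class ExecState(Enum):
    ADVANCED = "ADVANCED"  # a document was returned, more may follow
    IS_EOF = "IS_EOF"      # the scan finished normally
    DEAD = "DEAD"          # the cursor was invalidated mid-scan


class Cursor:
    """A scan over a collection's documents that can be killed from outside."""

    def __init__(self, docs):
        self._docs = list(docs)
        self._pos = 0
        self.killed = False

    def next_state(self):
        if self.killed:
            return ExecState.DEAD
        if self._pos >= len(self._docs):
            return ExecState.IS_EOF
        self._pos += 1
        return ExecState.ADVANCED


class Collection:
    """Hypothetical collection that tracks its indexes and open cursors."""

    def __init__(self, docs):
        self.docs = list(docs)
        self.indexes = set()
        self.cursors = []

    def open_cursor(self):
        cursor = Cursor(self.docs)
        self.cursors.append(cursor)
        return cursor

    def drop_index(self, name):
        # Assumption from the comment above: dropping any index
        # invalidates every open cursor on the collection.
        self.indexes.discard(name)
        for cursor in self.cursors:
            cursor.killed = True
        self.cursors.clear()


def simulate():
    coll = Collection(docs=[{"_id": i} for i in range(4)])
    coll.indexes.update({"a_1", "b_1"})  # two in-progress background builds
    build_a = coll.open_cursor()
    build_b = coll.open_cursor()

    # Both builds make some progress.
    build_a.next_state()
    build_b.next_state()

    # Build A hits a write conflict; its recovery drops the partial index.
    coll.drop_index("a_1")

    # Build B's cursor was invalidated, so it never reaches IS_EOF.
    return build_b.next_state()
```

In this model the second build ends in `ExecState.DEAD` even though nothing went wrong in its own scan, which mirrors why an unrelated background build on the same collection would fail.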
| Comment by Kamran K. [ 17/Jul/15 ] | ||||||||||||||||||||
|
I don't see the "bad status from index build" message in a recent failure log:
| ||||||||||||||||||||
| Comment by Eric Milkie [ 15/Jul/15 ] | ||||||||||||||||||||
|
Waiting for this to fail again in MCI. | ||||||||||||||||||||
| Comment by Githook User [ 09/Jul/15 ] | ||||||||||||||||||||
|
Author: Eric Milkie (milkie) <milkie@10gen.com>. Message: add debugging output for | ||||||||||||||||||||
| Comment by Eric Milkie [ 08/Jul/15 ] | ||||||||||||||||||||
|
I think this can only happen when running multiple reindex commands in parallel. It cannot happen if you use explicit dropIndexes or createIndexes commands. | ||||||||||||||||||||
| Comment by Eric Milkie [ 08/Jul/15 ] | ||||||||||||||||||||
|
This is apparently happening because another index for the same collection is being dropped while a background index build is proceeding. There should be code to prevent simultaneous index builds and drops on the same collection on secondaries, but the logic must be wrong somehow. I'm looking into that. | ||||||||||||||||||||
| Comment by Eric Milkie [ 06/Jul/15 ] | ||||||||||||||||||||
|
I think the first step should be determining what value the executor is returning, if not IS_EOF. Is it FAILURE, DEAD, or something else? I don't know why we would get any of those statuses while running a background index build on a secondary. |
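The diagnostic step proposed here, reporting the executor's terminal state instead of only asserting that it is IS_EOF, might look roughly like the sketch below. It is a simplified stand-in rather than the actual index build code: `drain` and `check_index_build` are hypothetical helpers, the state names IS_EOF, FAILURE, and DEAD come from the comment, and ADVANCED is assumed to be the state returned while documents are still being produced.

```python
def drain(states):
    """Consume executor states until a non-ADVANCED one appears.

    Returns (documents_scanned, terminal_state).
    """
    scanned = 0
    for state in states:
        if state != "ADVANCED":
            return scanned, state
        scanned += 1
    raise RuntimeError("executor stopped without a terminal state")


def check_index_build(states):
    """Return a diagnostic message if the scan did not end with IS_EOF."""
    scanned, final = drain(states)
    if final != "IS_EOF":
        # Report the actual state so the log shows FAILURE vs. DEAD
        # rather than only the fact that an assertion fired.
        return f"index build scan ended with {final} after {scanned} documents"
    return None
```

For example, `check_index_build(["ADVANCED", "ADVANCED", "DEAD"])` produces a message naming DEAD, while a scan that ends in IS_EOF returns `None`.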