[SERVER-19128] Fatal assertion during secondary index build Created: 25/Jun/15  Updated: 19/Jan/16  Resolved: 25/Sep/15

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: 3.1.5
Fix Version/s: 3.0.9, 3.1.9

Type: Bug Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Mathias Stearn
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File fassert_bk_index    
Issue Links:
Duplicate
is duplicated by SERVER-20725 Mongo 3.0.6 secondary crashes when wh... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Sprint: QuInt A (10/12/15)
Participants:
Linked BF Score: 0

 Description   

During an Evergreen run, the FSM replication+sharding suite failed with the following fatal assertion:

[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.558+0000  m31101| 2015-06-25T05:37:56.558+0000 I INDEX    [repl index builder 283] build index on: db66.reindex_background_0 properties: { v: 1, key: { integer: 1.0 }, name: "integer_1", ns: "db66.reindex_background_0", background: true }
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.577+0000  m31102| 2015-06-25T05:37:56.576+0000 E INDEX    [repl index builder 278] IndexBuilder could not build index: Location28550 Unable to complete index build as the collection is no longer readable
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.577+0000  m31102| 2015-06-25T05:37:56.576+0000 I -        [repl index builder 278] Fatal Assertion 28555
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.602+0000  m31102| 2015-06-25T05:37:56.601+0000 I CONTROL  [repl index builder 278]
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.602+0000  m31102|  0x1060af6 0x1018068 0x100571d 0xb5d2f6 0x100a09a 0x7fcf4d05fbf0 0x7fcf4c87c182 0x7fcf4c5a947d
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.602+0000  m31102| ----- BEGIN BACKTRACE -----
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.604+0000  m31102| {"backtrace":[{"b":"400000","o":"C60AF6"},{"b":"400000","o":"C18068"},{"b":"400000","o":"C0571D"},{"b":"400000","o":"75D2F6"},{"b":"400000","o":"C0A09A"},{"b":"7FCF4CFAE000","o":"B1BF0"},{"b":"7FCF4C874000","o":"8182"},{"b":"7FCF4C4AF000","o":"FA47D"}],"processInfo":{ "mongodbVersion" : "3.1.5-pre-", "gitVersion" : "2e7eb5ed6f33fe179f6a58d2ab8d5b12e11030aa", "compiledModules" : [ "rocksdb" ], "uname" : { "sysname" : "Linux", "release" : "3.13.0-24-generic", "version" : "#46-Ubuntu SMP Thu Apr 10 19:11:08 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "0B8F75910268F6A1873E9D9D538A66C0C5566E91" }, { "b" : "7FFF5CAC7000", "elfType" : 3, "buildId" : "6755FAD2CADACDF1667E5B57FF1EDFC28DD1C976" }, { "b" : "7FCF4D8CE000", "path" : "/lib/x86_64-linux-gnu/libz.so.1", "elfType" : 3, "buildId" : "7EE9BC42787A0F9A793BDBFCB0671FD12C52BC2A" }, { "b" : "7FCF4D6BE000", "path" : "/lib/x86_64-linux-gnu/libbz2.so.1.0", "elfType" : 3, "buildId" : "E1031DDBFFE20367E874B7093EEC0C8D9F3B43F6" }, { "b" : "7FCF4D4B6000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7FCF4D2B2000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7FCF4CFAE000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "19EFDDAB11B3BF5C71570078C59F91CF6592CE9E" }, { "b" : "7FCF4CCA8000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1D76B71E905CB867B27CEF230FCB20F01A3178F5" }, { "b" : "7FCF4CA92000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "8D0AA71411580EE6C08809695C3984769F25725B" }, { "b" : "7FCF4C874000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9318E8AF0BFBE444731BB0461202EF57F7C39542" }, { "b" : "7FCF4C4AF000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "30C94DC66A1FE95180C3D68D2B89E576D5AE213C" }, { "b" : "7FCF4DAE7000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.604+0000  m31102|  mongod(mongo::printStackTrace(std::ostream&)+0x26) [0x1060af6]
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.604+0000  m31102|  mongod(mongo::logContext(char const*)+0x138) [0x1018068]
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.604+0000  m31102|  mongod(mongo::fassertFailed(int)+0x4D) [0x100571d]
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.604+0000  m31102|  mongod(mongo::IndexBuilder::run()+0x616) [0xb5d2f6]
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.604+0000  m31102|  mongod(mongo::BackgroundJob::jobBody()+0x14A) [0x100a09a]
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.604+0000  m31102|  libstdc++.so.6(+0xB1BF0) [0x7fcf4d05fbf0]
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.604+0000  m31102|  libpthread.so.0(+0x8182) [0x7fcf4c87c182]
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.604+0000  m31102|  libc.so.6(clone+0x6D) [0x7fcf4c5a947d]
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.605+0000  m31102| -----  END BACKTRACE  -----
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.605+0000  m31102| 2015-06-25T05:37:56.601+0000 I -        [repl index builder 278]
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.605+0000  m31102|
[js_test:fsm_all_sharded_replication] 2015-06-25T05:37:56.605+0000  m31102| ***aborting after fassert() failure

The failing run is here and this is the actual run log.
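For readers following the backtrace: the frames point at the tail end of mongo::IndexBuilder::run(), where a non-OK status from the build (here Location28550, "Unable to complete index build as the collection is no longer readable") is treated as fatal. A rough sketch of that shape, reconstructed from the log above; names other than Status, error() and fassertFailed() are placeholders, not the actual index_builder.cpp source:

    // Approximate failure path suggested by the backtrace (illustrative only).
    Status status = buildIndexOnSecondary();   // placeholder for the build that returned Location28550
    if (!status.isOK()) {
        error() << "IndexBuilder could not build index: " << status.toString();
        fassertFailed(28555);                  // the "Fatal Assertion 28555" seen in the log
    }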



 Comments   
Comment by Githook User [ 18/Jan/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'redbeard0531@gmail.com'}

Message: SERVER-19128 Clean up interaction of background index builds and cursor killing

This reverts commit 90a684ad25a86deff16f80e11e257c6ac6611683, restoring
1d26b77d115eb39f03dffbdbaccf10e696cd4fe3.
Branch: v3.0
https://github.com/mongodb/mongo/commit/c44a3ecdffd5c98fdd2daa6e46f52c50ff594576

Comment by Githook User [ 13/Jan/16 ]

Author:

{u'username': u'spencerjackson', u'name': u'Spencer Jackson', u'email': u'spencer.jackson@mongodb.com'}

Message: Revert "SERVER-19128 Clean up interaction of background index builds and cursor killing"

This reverts commit 1d26b77d115eb39f03dffbdbaccf10e696cd4fe3.
Branch: v3.0
https://github.com/mongodb/mongo/commit/90a684ad25a86deff16f80e11e257c6ac6611683

Comment by Githook User [ 12/Jan/16 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19128 Clean up interaction of background index builds and cursor killing

(cherry picked from commit 795a8ebd80a9f91fc1484cfdc33b6609d0bc4a35)
Branch: v3.0
https://github.com/mongodb/mongo/commit/1d26b77d115eb39f03dffbdbaccf10e696cd4fe3

Comment by Githook User [ 12/Jan/16 ]

Author:

{u'username': u'milkie', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}

Message: SERVER-19128 add debugging output

(cherry picked from commit 7494d0c5a65d3e7128aa8a8857ce78dd7aea1ee6)
Branch: v3.0
https://github.com/mongodb/mongo/commit/3a9c197fa16a558d270dc7c3ccde088e4deb67e3

Comment by Ramon Fernandez Marina [ 26/Sep/15 ]

Adding backport request as requested by Michael Kania above.

Comment by Githook User [ 25/Sep/15 ]

Author:

{u'username': u'RedBeard0531', u'name': u'Mathias Stearn', u'email': u'mathias@10gen.com'}

Message: SERVER-19128 Clean up interaction of background index builds and cursor killing
Branch: master
https://github.com/mongodb/mongo/commit/795a8ebd80a9f91fc1484cfdc33b6609d0bc4a35

Comment by Igor Canadi [ 14/Aug/15 ]

facepalm

How about now: https://github.com/mongodb-partners/mongo/commit/f33ee2a2d91f4b532431e5c430ea317122f80849

Comment by Kamran K. [ 13/Aug/15 ]

igor, that patch looks flawed to me:

212    if (status.code() != ErrorCodes::WriteConflict)
213        return status;
214
215    if (status.code() == ErrorCodes::WriteConflict) {
216        LOG(2) << "WriteConflictException while creating index in IndexBuilder, retrying.";
 
--> Lines 217+ seem unreachable by virtue of the conditional on line 212
 
217    } else if (status.code() == 28550) {
218        LOG(2) << "Status 28550 when building an index in IndexBuilder, retrying";
219    } else {
220        return status;
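For clarity, a minimal sketch of the control flow the patch presumably intended, reordering the checks so the 28550 branch is actually reachable (illustrative only; the revised commit Igor links in his 14/Aug comment above is the authoritative fix):

    // Treat both WriteConflict and code 28550 as retryable; return anything else.
    if (status.code() == ErrorCodes::WriteConflict) {
        LOG(2) << "WriteConflictException while creating index in IndexBuilder, retrying.";
    } else if (status.code() == 28550) {
        LOG(2) << "Status 28550 when building an index in IndexBuilder, retrying";
    } else {
        return status;
    }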

Comment by Igor Canadi [ 13/Aug/15 ]

Thanks Ramon! Really appreciate the help!

BTW this is how we're trying to fix the bug currently: https://github.com/mongodb-partners/mongo/commit/01f13629260139342e6ebadbe6aea9aac4a92e57. YOLO

Comment by Ramon Fernandez Marina [ 13/Aug/15 ]

Thanks for the additional info dynamike and igor. I had missed Eric's comment above on what's causing this issue. We're looking into a fix for this dev cycle which hopefully can be backported to v3.0.

Comment by Kamran K. [ 13/Aug/15 ]

We enabled concurrency tests for replica sets and sharded replica sets during the 3.1.x cycle (see: SERVER-18495 and SERVER-18797).

We haven't yet backported those changes because the concurrency tests are undergoing a period of stabilization to allow them to run more reliably on resource-constrained test hosts via Evergreen.

Comment by Igor Canadi [ 13/Aug/15 ]

Looks like the test that's failing is not run on the 3.0 branch. In fact, none of the replication concurrency tests are: https://github.com/mongodb/mongo/blob/v3.0/jstests/concurrency/fsm_all_replication.js#L29

Ramon, any idea why that is?

Comment by Michael Kania [ 13/Aug/15 ]

Attached. This is 3.0.5 with RocksDB. Igor Canadi is looking into whether this is potentially linked to something on the RocksDB side.

Comment by Ramon Fernandez Marina [ 13/Aug/15 ]

Hey dynamike, can you please post the stack trace / logs you're getting? We need to make sure you're running into this exact problem – and if you are then yes, a backport should definitely be considered. Also, which storage engine are you seeing this fassert() with?

Thanks,
Ramón.

Comment by Michael Kania [ 13/Aug/15 ]

This also appears to impact 3.0.5. Can the fix be backported?

Comment by Githook User [ 28/Jul/15 ]

Author:

{u'username': u'kkmongo', u'name': u'Kamran Khan', u'email': u'kamran.khan@mongodb.com'}

Message: SERVER-19128 Disable the reindex/bgindex FSM workload for replicated clusters

The workload currently triggers a fatal assertion and will be re-enabled
when the server bug is fixed.
Branch: master
https://github.com/mongodb/mongo/commit/6fcca346efc8817c557ac45041700318efb863d7

Comment by Githook User [ 27/Jul/15 ]

Author:

{u'username': u'kkmongo', u'name': u'Kamran Khan', u'email': u'kamran.khan@mongodb.com'}

Message: SERVER-19128 Disable the reindex/bgindex FSM workload for replicated clusters

The workload currently triggers a fatal assertion and will be re-enabled
when the server bug is fixed.
Branch: master
https://github.com/mongodb/mongo/commit/6fcca346efc8817c557ac45041700318efb863d7

Comment by Eric Milkie [ 17/Jul/15 ]

Note that this only happens on secondaries because indexes there are built using index_builder.cpp. Index builds on primaries and standalone nodes are generally handled by equivalent code in create_indexes.cpp, which handles WCEs (WriteConflictExceptions) in a different way.

Comment by Eric Milkie [ 17/Jul/15 ]

I think I figured out the issue: if an index build hits a WCE, the recovery ends up dropping the index and rebuilding the entire thing from scratch. By dropping the incomplete index, we end up invalidating all cursors on the collection, which means any other background index builds on that same collection end up failing too.
Arguably, there is no need to invalidate cursors when dropping incomplete or hidden indexes, since there is no possible way that any cursors could be using them. However, we might be able to fix this by handling the WCE in a better way: retrying the operation rather than restarting the entire index build.
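A minimal, self-contained sketch of the "retry the conflicting unit of work instead of restarting the whole build" idea; generic C++ only, not the server's actual WriteConflictException machinery:

    #include <stdexcept>

    // Stand-in for the server's write-conflict exception type (assumption).
    struct WriteConflictException : std::runtime_error {
        WriteConflictException() : std::runtime_error("write conflict") {}
    };

    // Retry only the unit of work that conflicted. Dropping the half-built index
    // and restarting the whole build is what invalidates the other builders'
    // cursors in the scenario described above.
    template <typename UnitOfWork>
    void retryOnWriteConflict(UnitOfWork work) {
        while (true) {
            try {
                work();
                return;
            } catch (const WriteConflictException&) {
                // back off / yield as appropriate, then retry just this unit
            }
        }
    }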

Comment by Kamran K. [ 17/Jul/15 ]

I don't see the "bad status from index build" message in a recent failure log:

[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.746+0000  m31202| 2015-07-16T19:55:13.737+0000 E INDEX    [repl index builder 258] IndexBuilder could not build index: Location28550 Unable to complete index build as the collection is no longer readable
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.747+0000  m31202| 2015-07-16T19:55:13.737+0000 I -        [repl index builder 258] Fatal Assertion 28555
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.748+0000  m31202| 2015-07-16T19:55:13.739+0000 I CONTROL  [repl index builder 258]
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.748+0000  m31202|  0x146ba52 0x141dff4 0x14084b5 0xed31ac 0x140d267 0x19b62f1 0x7fffff270dba 0x7fffff2710d0
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.748+0000  m31202| ----- BEGIN BACKTRACE -----
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.750+0000  m31202| {"backtrace":[{"b":"400000","o":"106BA52"},{"b":"400000","o":"101DFF4"},{"b":"400000","o":"10084B5"},{"b":"400000","o":"AD31AC"},{"b":"400000","o":"100D267"},{"b":"400000","o":"15B62F1"},{"b":"7FFFFF160000","o":"110DBA"},{"b":"7FFFFF160000","o":"1110D0"}],"processInfo":{ "mongodbVersion" : "3.1.6-pre-", "gitVersion" : "518116041128c6e334806141c1e69a1221e8939a", "compiledModules" : [] }}
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.751+0000  m31202|  mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x146ba52]
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.751+0000  m31202|  mongod(_ZN5mongo10logContextEPKc+0x134) [0x141dff4]
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.752+0000  m31202|  mongod(_ZN5mongo13fassertFailedEi+0x55) [0x14084b5]
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.752+0000  m31202|  mongod(_ZN5mongo12IndexBuilder3runEv+0x6FC) [0xed31ac]
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.752+0000  m31202|  mongod(_ZN5mongo13BackgroundJob7jobBodyEv+0x157) [0x140d267]
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.753+0000  m31202|  mongod(execute_native_thread_routine+0x21) [0x19b62f1]
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.754+0000  m31202|  libc.so.1(_thrp_setup+0x8A) [0x7fffff270dba]
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.754+0000  m31202|  libc.so.1(_lwp_start+0x0) [0x7fffff2710d0]
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.755+0000  m31202| -----  END BACKTRACE  -----
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.756+0000  m31202| 2015-07-16T19:55:13.739+0000 I -        [repl index builder 258]
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.756+0000  m31202|
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.756+0000  m31202| ***aborting after fassert() failure
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.757+0000  m31202|
[js_test:fsm_all_sharded_replication_legacy_config_servers] 2015-07-16T19:55:13.757+0000  m31202|

https://evergreen.mongodb.com/task/mongodb_mongo_master_solaris_64_bit_concurrency_sharded_WT_518116041128c6e334806141c1e69a1221e8939a_15_07_16_18_07_44

Comment by Eric Milkie [ 15/Jul/15 ]

Waiting for this to fail again in MCI.

Comment by Githook User [ 09/Jul/15 ]

Author:

{u'username': u'milkie', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}

Message: add debugging output for SERVER-19128
Branch: master
https://github.com/mongodb/mongo/commit/7494d0c5a65d3e7128aa8a8857ce78dd7aea1ee6

Comment by Eric Milkie [ 08/Jul/15 ]

I think this can only happen when running multiple reindex commands in parallel. It cannot happen if you use explicit dropIndexes or createIndexes commands.

Comment by Eric Milkie [ 08/Jul/15 ]

This is apparently happening because another index on the same collection is being dropped while a background index build is in progress. There should be code to prevent simultaneous index builds and drops on the same collection on secondaries, but the logic must be wrong somehow. I'm looking into that.

Comment by Eric Milkie [ 06/Jul/15 ]

I think the first step should be determining what value the executor is returning, if not IS_EOF. Is it FAILURE, DEAD, or something else? I don't know why we would get any of those statuses while running a background index build on a secondary.
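Purely as an illustration of the diagnostic being proposed (the executor state names come from the comment above; the PlanExecutor scoping and surrounding calls are assumptions, and the real change is the debugging output Eric subsequently added, see the 09/Jul githook comment above):

    // Hypothetical debugging hook, not the actual patch: report which state the
    // background build's executor ended with when it is not the expected IS_EOF.
    switch (state) {  // 'state' assumed to come from the executor's getNext() loop
        case PlanExecutor::IS_EOF:
            break;  // normal completion of the collection scan
        case PlanExecutor::FAILURE:
        case PlanExecutor::DEAD:
            // Either of these would explain the build failing on a secondary.
            warning() << "background index build ended with executor state "
                      << static_cast<int>(state);
            break;
        default:
            break;
    }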
