[SERVER-17923] Creating/dropping multiple background indexes on the same collection can cause fatal error on secondaries
Created: 08/Apr/15  Updated: 04/Feb/16  Resolved: 04/Jun/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.1
Fix Version/s: 3.0.4, 3.1.4

Type: Bug
Priority: Critical - P2
Reporter: Andrew Ryder (Inactive)
Assignee: Eric Milkie
Resolution: Done
Votes: 1
Labels: UT
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: restart.log (text file)
Issue Links:
Duplicate
is duplicated by SERVER-18762 Mongo 3.0 crashes while replicating m... Closed
is duplicated by SERVER-19065 dropIndexes() produces "Assertion: 17... Closed
Related
related to SERVER-20010 Segfault while dropping an index that... Closed
Backwards Compatibility: Minor Change
Operating System: ALL
Backport Completed:
Participants:
Linked BF Score: 0

 Description   
Issue Status as of Jun 09, 2015

ISSUE SUMMARY
On a MongoDB replica set, when a secondary node is running multiple background index builds on a given collection, metadata changes to that same collection may lead to a fatal error on the secondary node.

Metadata changes that may trigger this behavior include renaming and dropping the collection, and dropping the database that contains the collection.

USER IMPACT
If enough secondary nodes hit the error and shut down that the replica set no longer has a majority of voting nodes operational, the set loses write availability.
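
As a rough illustration of the availability condition above, the following Python sketch checks whether a replica set still has enough voting members to elect or keep a primary. It assumes the usual rule that a strict majority of voting members must be operational; the function name is hypothetical and the model ignores everything except the member counts.

```python
def has_write_availability(voting_members: int, healthy_members: int) -> bool:
    """Return True if a strict majority of voting members is still operational.

    Simplified model (an assumption for illustration): only member counts
    matter, and every healthy member can reach every other healthy member.
    """
    if voting_members <= 0:
        raise ValueError("voting_members must be positive")
    if not 0 <= healthy_members <= voting_members:
        raise ValueError("healthy_members must be between 0 and voting_members")
    majority = voting_members // 2 + 1
    return healthy_members >= majority
```

For example, under this model a three-member set stays writable after one secondary shuts down, but not after two.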

WORKAROUNDS
Avoid collection creation, drop, and rename operations while building indexes in the background on that same collection.

AFFECTED VERSIONS
MongoDB 3.0.0 through 3.0.3.

FIX VERSION
The fix is included in the 3.0.4 production release.
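
The affected range and fix version stated above can be turned into a small check, for example when auditing a fleet of servers. This is a minimal sketch with hypothetical function names; it covers only the 3.0.0 through 3.0.3 range given in this ticket and makes no claim about 3.1.x development builds.

```python
def parse_version(version: str) -> tuple:
    """Parse a dotted version string such as "3.0.1" into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_affected(version: str) -> bool:
    """Return True if the version falls in the range this ticket lists as affected."""
    parsed = parse_version(version)
    # Affected: 3.0.0 through 3.0.3 inclusive; fixed in 3.0.4.
    return (3, 0, 0) <= parsed < (3, 0, 4)
```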

Original description

Creating and dropping indexes with varying options on the same collection from multiple clients can cause secondaries to fassert while applying the oplog. So far, no problem has been observed on the primary.

Tested using 3.0.1 Enterprise. Known to occur on Ubuntu 12.01 and Windows 8.

Attached is the script used in each shell session. The "test.ts" collection had 250K small documents structured as {_id: ObjectId, server: int, cpu: int}; however, neither the structure nor the quantity of documents seems to matter, as other variations also trigger the fault. Background indexing appears to be a crucial requirement. The fault was originally observed on a sharded cluster with operations performed via a mongos, but a basic replica set is all that is needed.
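
The reporter's script is not reproduced in this ticket. The Python sketch below is only a guess at the general shape of such a workload: each client repeatedly builds and drops a background index on the same collection. The function name and the second key pattern are hypothetical; the index name "test_index" and the first key pattern are taken from the log further down. The `collection` argument is assumed to behave like a PyMongo `Collection` (it needs `create_index` and `drop_index`).

```python
import itertools

# The first pattern appears in the ticket's log; the second is a
# hypothetical variation added for illustration.
KEY_PATTERNS = [
    [("_id", 1), ("server", 1), ("cpu", 1)],
    [("server", 1), ("cpu", 1)],
]


def churn_indexes(collection, rounds, index_name="test_index"):
    """Repeatedly build and drop a background index; return the number of failed calls.

    Failures are counted rather than raised because clients racing on the
    same index name are expected to collide (for example "index not found").
    """
    failures = 0
    patterns = itertools.cycle(KEY_PATTERNS)
    for _ in range(rounds):
        keys = next(patterns)
        try:
            collection.create_index(keys, name=index_name, background=True)
        except Exception:
            failures += 1
        try:
            collection.drop_index(index_name)
        except Exception:
            failures += 1
    return failures
```

To approximate the scenario described here, one would presumably run this from several processes at once against the primary of a test replica set, passing the "test.ts" collection. Do not run it against a production deployment on an affected version.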

Sometimes the secondaries can be restarted, recover, and rejoin normally. Sometimes they fassert again on every restart until they are resynced. Both outcomes were observed in consecutive runs, with no known difference other than timing to explain them.

Also attached is log output of an example restart (on Windows) where the secondary could not recover.



 Comments   
Comment by Githook User [ 05/Jun/15 ]

Author: Eric Milkie (milkie) <milkie@10gen.com>

Message: SERVER-17923 prohibit database/collection actions when bg index is running
Branch: v3.0
https://github.com/mongodb/mongo/commit/b1d6f667a771fb5768978e89ae32cc862b74904f

Comment by Githook User [ 04/Jun/15 ]

Author: Eric Milkie (milkie) <milkie@10gen.com>

Message: SERVER-17923 prohibit database/collection actions when bg index is running
Branch: master
https://github.com/mongodb/mongo/commit/bc44c26302d2f377bc0cefb4e8fbffa247d18db6

Comment by Ramon Fernandez Marina [ 08/Apr/15 ]

Full log with stack trace:

2015-04-08T14:36:36.038-0400 I COMMAND  [repl writer worker 15] CMD: dropIndexes test.ts
2015-04-08T14:36:36.038-0400 I INDEX    [repl writer worker 15] halting index build: { _id: 1.0, server: 1.0, cpu: 1.0 }
2015-04-08T14:36:36.039-0400 I INDEX    [repl writer worker 15] halted 1 index build(s)
2015-04-08T14:36:36.039-0400 W REPL     [repl writer worker 15] repl Failed command { deleteIndexes: "ts", index: "test_index" } on test with status UnknownError index not found with name [test_index] during oplog application
2015-04-08T14:36:36.041-0400 E INDEX    [repl index builder 172] IndexBuilder could not build index: Interrupted operation was interrupted
2015-04-08T14:36:36.041-0400 E INDEX    [repl index builder 173] IndexBuilder could not build index: Location28550 Unable to complete index build as the collection is no longer readable
2015-04-08T14:36:36.041-0400 I -        [repl index builder 173] Fatal Assertion 28555
2015-04-08T14:36:36.044-0400 I CONTROL  [repl index builder 173] 
 0x102a81d79 0x102a355a0 0x102a22c36 0x1025cf4f2 0x102a24839 0x102ab57a1 0x7fff8dba8268 0x7fff8dba81e5 0x7fff8dba641d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"102310000","o":"771D79"},{"b":"102310000","o":"7255A0"},{"b":"102310000","o":"712C36"},{"b":"102310000","o":"2BF4F2"},{"b":"102310000","o":"714839"},{"b":"102310000","o":"7A57A1"},{"b":"7FFF8DBA5000","o":"3268"},{"b":"7FFF8DBA5000","o":"31E5"},{"b":"7FFF8DBA5000","o":"141D"}],"processInfo":{ "mongodbVersion" : "3.0.1", "gitVersion" : "534b5a3f9d10f00cd27737fbcd951032248b5952", "uname" : { "sysname" : "Darwin", "release" : "14.1.0", "version" : "Darwin Kernel Version 14.1.0: Thu Feb 26 19:26:47 PST 2015; root:xnu-2782.10.73~1/RELEASE_X86_64", "machine" : "x86_64" }, "somap" : [ { "path" : "/usr/local/bin/mongod", "machType" : 2, "b" : "102310000", "buildId" : "BCCE8C3AD8BB31D490CB01C153674428" }, { "path" : "/usr/lib/libSystem.B.dylib", "machType" : 6, "b" : "7FFF8D4F3000", "buildId" : "90B107BCFF7432CCB1CF4E02F544D957" }, { "path" : "/usr/lib/libc++.1.dylib", "machType" : 6, "b" : "7FFF8E969000", "buildId" : "1B9530FD989B3174BB1CBDC159501710" }, { "path" : "/usr/lib/system/libcache.dylib", "machType" : 6, "b" : "7FFF8F2B1000", "buildId" : "45E9A2E799C436B2BEE30C4E11614AD1" }, { "path" : "/usr/lib/system/libcommonCrypto.dylib", "machType" : 6, "b" : "7FFF8B135000", "buildId" : "D381EBC669D831D380845A80A32CB748" }, { "path" : "/usr/lib/system/libcompiler_rt.dylib", "machType" : 6, "b" : "7FFF92525000", "buildId" : "BF8FC133EE103DA69B9092039E28678F" }, { "path" : "/usr/lib/system/libcopyfile.dylib", "machType" : 6, "b" : "7FFF8CB14000", "buildId" : "0C68D3A6ACDD3EF3991ACC82C32AB836" }, { "path" : "/usr/lib/system/libcorecrypto.dylib", "machType" : 6, "b" : "7FFF95CF4000", "buildId" : "E178980139853949B7366B3378873301" }, { "path" : "/usr/lib/system/libdispatch.dylib", "machType" : 6, "b" : "7FFF8D9C9000", "buildId" : "502CF32B669B3709886208188225E4F0" }, { "path" : "/usr/lib/system/libdyld.dylib", "machType" : 6, "b" : "7FFF92C8A000", "buildId" : "4E33E416F1D83598B8CC6863E2ECD0E6" }, { "path" : "/usr/lib/system/libkeymgr.dylib", "machType" : 6, "b" 
: "7FFF977DB000", "buildId" : "77845842DE703CC5BD01C3D14227CED5" }, { "path" : "/usr/lib/system/liblaunch.dylib", "machType" : 6, "b" : "7FFF8B64E000", "buildId" : "DFCDEBDF82473DC79879E7E497DDA4B4" }, { "path" : "/usr/lib/system/libmacho.dylib", "machType" : 6, "b" : "7FFF8F1F1000", "buildId" : "126CA2EDDE91308F8881B9DAEC3C63B6" }, { "path" : "/usr/lib/system/libquarantine.dylib", "machType" : 6, "b" : "7FFF8C107000", "buildId" : "DC0416272D92361CBABFA869A5C72293" }, { "path" : "/usr/lib/system/libremovefile.dylib", "machType" : 6, "b" : "7FFF93127000", "buildId" : "3485B5F46CE83C628DFD8736ED6E8531" }, { "path" : "/usr/lib/system/libsystem_asl.dylib", "machType" : 6, "b" : "7FFF95D6B000", "buildId" : "F153AC5B0542356E88C820A62CA704E2" }, { "path" : "/usr/lib/system/libsystem_blocks.dylib", "machType" : 6, "b" : "7FFF92CD7000", "buildId" : "9615D10AFCA73BE4AA1A1B195DACE1A1" }, { "path" : "/usr/lib/system/libsystem_c.dylib", "machType" : 6, "b" : "7FFF94055000", "buildId" : "199ED5EB77A13D43AA5181779CE0A742" }, { "path" : "/usr/lib/system/libsystem_configuration.dylib", "machType" : 6, "b" : "7FFF94F1D000", "buildId" : "5E14864E089A3D8485A4980B776427A8" }, { "path" : "/usr/lib/system/libsystem_coreservices.dylib", "machType" : 6, "b" : "7FFF8BF17000", "buildId" : "41B7C5785A5331C8A96FC73E030B0938" }, { "path" : "/usr/lib/system/libsystem_coretls.dylib", "machType" : 6, "b" : "7FFF8AD7A000", "buildId" : "3EAED90A7AA0323CA52BE16477981D59" }, { "path" : "/usr/lib/system/libsystem_dnssd.dylib", "machType" : 6, "b" : "7FFF8FE6F000", "buildId" : "62B70ECAE40D3C63896E7F00EC386DDB" }, { "path" : "/usr/lib/system/libsystem_info.dylib", "machType" : 6, "b" : "7FFF959E1000", "buildId" : "B85A85D585303A93B0C34DEC41F79478" }, { "path" : "/usr/lib/system/libsystem_kernel.dylib", "machType" : 6, "b" : "7FFF8BC5C000", "buildId" : "97CD7ACDEA0C3434BEFCFCD013D6BB73" }, { "path" : "/usr/lib/system/libsystem_m.dylib", "machType" : 6, "b" : "7FFF9202C000", "buildId" : 
"1E12AB456D9636D0A226F24D9FB0D9D6" }, { "path" : "/usr/lib/system/libsystem_malloc.dylib", "machType" : 6, "b" : "7FFF92FDE000", "buildId" : "19BCC25757173502A71F95D65AFA861B" }, { "path" : "/usr/lib/system/libsystem_network.dylib", "machType" : 6, "b" : "7FFF923F5000", "buildId" : "2EC3A005473F3C36A665F88B5BACC7F0" }, { "path" : "/usr/lib/system/libsystem_networkextension.dylib", "machType" : 6, "b" : "7FFF92F00000", "buildId" : "29AB225BD7FB30ED960065D44B9A9442" }, { "path" : "/usr/lib/system/libsystem_notify.dylib", "machType" : 6, "b" : "7FFF8AD8C000", "buildId" : "61147800F3203DAA850CBADF33855F29" }, { "path" : "/usr/lib/system/libsystem_platform.dylib", "machType" : 6, "b" : "7FFF8D77E000", "buildId" : "64E34079D7123D669CE2418624A5C040" }, { "path" : "/usr/lib/system/libsystem_pthread.dylib", "machType" : 6, "b" : "7FFF8DBA5000", "buildId" : "3103AA7F3BAE3673964947FFD7E15C97" }, { "path" : "/usr/lib/system/libsystem_sandbox.dylib", "machType" : 6, "b" : "7FFF929EF000", "buildId" : "95312E09DA28324AA084F3E574D0210E" }, { "path" : "/usr/lib/system/libsystem_secinit.dylib", "machType" : 6, "b" : "7FFF8E9BE000", "buildId" : "581DAD0F6B633A48B63B917AF799ABAA" }, { "path" : "/usr/lib/system/libsystem_stats.dylib", "machType" : 6, "b" : "7FFF95438000", "buildId" : "9B8CCF24DDDB399A92374BEC225D2E8C" }, { "path" : "/usr/lib/system/libsystem_trace.dylib", "machType" : 6, "b" : "7FFF929DF000", "buildId" : "A9E6B7D8C3273742AC5486C94218B1DF" }, { "path" : "/usr/lib/system/libunc.dylib", "machType" : 6, "b" : "7FFF96DBB000", "buildId" : "5676F7EAC1DF329FB006D2C3022B7D70" }, { "path" : "/usr/lib/system/libunwind.dylib", "machType" : 6, "b" : "7FFF90C9C000", "buildId" : "BE7E51A0B6EA3A549CCA9D88F683A6D6" }, { "path" : "/usr/lib/system/libxpc.dylib", "machType" : 6, "b" : "7FFF961D0000", "buildId" : "876216DCD5D3381E8AF949AE464E5107" }, { "path" : "/usr/lib/libobjc.A.dylib", "machType" : 6, "b" : "7FFF8C236000", "buildId" : "759E155DBC423D4E869B6F57D477177C" }, { "path" : 
"/usr/lib/libauto.dylib", "machType" : 6, "b" : "7FFF961F9000", "buildId" : "A260789BD4D8316A9490254767B8A5F1" }, { "path" : "/usr/lib/libc++abi.dylib", "machType" : 6, "b" : "7FFF96C75000", "buildId" : "88A22A0F87C63002BFBAAC0F2808B8B9" }, { "path" : "/usr/lib/libDiagnosticMessagesClient.dylib", "machType" : 6, "b" : "7FFF8A9EC000", "buildId" : "2EE8E4365CDC34C599595BA218D507FB" } ] }}
 mongod(_ZN5mongo15printStackTraceERNSt3__113basic_ostreamIcNS0_11char_traitsIcEEEE+0x39) [0x102a81d79]
 mongod(_ZN5mongo10logContextEPKc+0x100) [0x102a355a0]
 mongod(_ZN5mongo13fassertFailedEi+0xD6) [0x102a22c36]
 mongod(_ZN5mongo12IndexBuilder3runEv+0x702) [0x1025cf4f2]
 mongod(_ZN5mongo13BackgroundJob7jobBodyEv+0x169) [0x102a24839]
 mongod(_ZN5boost12_GLOBAL__N_112thread_proxyEPv+0xB1) [0x102ab57a1]
 libsystem_pthread.dylib(_pthread_body+0x83) [0x7fff8dba8268]
 libsystem_pthread.dylib(_pthread_body+0x0) [0x7fff8dba81e5]
 libsystem_pthread.dylib(thread_start+0xD) [0x7fff8dba641d]
-----  END BACKTRACE  -----
2015-04-08T14:36:36.044-0400 I -        [repl index builder 173] 
 
***aborting after fassert() failure
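
When checking secondaries for this failure, the "Fatal Assertion" line is the one to look for. The following Python sketch scans log lines in the format shown above and reports the thread name and assertion code for each hit. The function name and regular expression are illustrative assumptions based only on the log excerpt in this ticket, not on a documented log format.

```python
import re

# Matches e.g. "... I -        [repl index builder 173] Fatal Assertion 28555"
FASSERT_RE = re.compile(r"\[(?P<thread>[^\]]+)\]\s+Fatal Assertion (?P<code>\d+)")


def find_fatal_assertions(lines):
    """Return a list of (thread_name, assertion_code) for each fatal assertion line."""
    hits = []
    for line in lines:
        match = FASSERT_RE.search(line)
        if match:
            hits.append((match.group("thread"), int(match.group("code"))))
    return hits
```

A hit with code 28555 on a "repl index builder" thread would be consistent with the failure reported in this ticket.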
