[SERVER-46814] Invariant on "Commit quorum is missing for index build" Created: 11/Mar/20  Updated: 06/Dec/22  Resolved: 12/Mar/20

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Luke Pearson Assignee: Backlog - Storage Execution Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-46557 Commit quorum value can get lost when... Closed
Related
related to SERVER-46557 Commit quorum value can get lost when... Closed
related to SERVER-46603 disallow empty collection index build... Closed
Assigned Teams:
Storage Execution
Operating System: ALL
Participants:

 Comments   
Comment by Suganthi Mani [ 12/Mar/20 ]

Did a quick look at the logs. I see 2 problems. The test got hung even before the invariant failure.

Problem #1
Node1(old primary) used indexBuildsCoordinator thread pool(two phase index protocol) for building index 'JSON_1' thinking that's a non-empty collection. But, then secondaries thought it's an empty collection and used single phase index protocol, committed the index build without waiting for primary to send commitIndexBuild oplog entry from primary. So, when node1 then stepped down and on becoming secondary, it tries to vote, but then it keeps receiving "no such index build" from current primary. As a result test hangs. This bug is fixed by SERVER-46603

// Index build  that got orphaned.
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T05:55:27.464+0000 I  INDEX    20384   [IndexBuildsCoordinatorMongod-3] "index build: starting on {ns} properties: {descriptor} using method: {method}","attr":{"ns":"test.update_affects_indexes","descriptor":{"v":2,"key":{"JSON":1},"name":"JSON_1"},"method":"Hybrid"}
// Indicates that index build scanned zero record
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T05:55:27.466+0000 I  INDEX    20391   [IndexBuildsCoordinatorMongod-3] "index build: collection scan done. scanned {n} total records in {t_seconds} seconds","attr":{"n":0,"t_seconds":0}
// Node 1 waits for the votes from secondary
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T05:55:27.475+0000 I  STORAGE  3856203 [IndexBuildsCoordinatorMongod-3] "Index build waiting for next action before completing final phase: {buildUUID}","attr":{"buildUUID":{"uuid":{"$uuid":"0780ab91-42e3-4930-848d-7e1d8bee0803"}}}
// Secondaries able to commit without getting commit oplog from primary
[ShardedClusterFixture:job0:shard1:node2] | 2020-03-10T05:55:27.471+0000 I  INDEX    20345   [ReplWriterWorker-10] "index build: done building index {indexName} on ns {nss}","attr":{"indexName":"JSON_1","nss":"test.update_affects_indexes"}
[ShardedClusterFixture:job0:shard1:node0] | 2020-03-10T05:55:27.478+0000 I  INDEX    20345   [ReplWriterWorker-12] "index build: done building index {indexName} on ns {nss}","attr":{"indexName":"JSON_1","nss":"test.update_affects_indexes"}
// We keep getting this message
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:32:24.755+0000 I  STORAGE  3856202 [IndexBuildsCoordinatorMongod-5] "'voteCommitIndexBuild' command failed with response : {response}","attr":{"response":{"operationTime":{"$timestamp":{"t":1583821938,"i":1}},"ok":0,"errmsg":"No index build with UUID: 0780ab91-42e3-4930-848d-7e1d8bee0803","code":4,"codeName":"NoSuchKey","$gleStats":{"lastOpTime":{"$timestamp":{"t":0,"i":0}},"electionId":{"$oid":"7fffffff000000000000002d"}},"lastCommittedOpTime":{"$timestamp":{"t":1583821938,"i":1}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1583821944,"i":2}},"t":46}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1583821944,"i":2}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}}}}

Problem #2
Invariant failure happened because we start persist the commit quorum value only on receiving first vote. The node 1, after stepping down, it rolled back, which made the index build to abort and restart. And, as a result, the commit quorum value is lost. This bug should be addressed by SERVER-46557

Comment by Eric Milkie [ 11/Mar/20 ]

The test failed because a node hit an invariant after stepping up to primary:

ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.825+0000 D2 REPL_HB  4615618 [ReplCoord-21] "Scheduling heartbeat to {target} at {dateToISOStringUTC_when}","attr":{"target":"localhost:20006","dateToISOStringUTC_when":"2020-03-10T06:34:06.825Z"}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.825+0000 I  NETWORK  20121   [ReplCoordExtern-2] "reconnect {} failed {errmsg}","attr":{"":"localhost:20008 failed","errmsg":""}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.825+0000 I  NETWORK  20127   [ReplCoordExtern-2] "DBClientCursor::init call() failed"
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.825+0000 I  REPL     21274   [ReplCoordExtern-2] "Error returned from oplog query (no more query restarts left): {status}","attr":{"status":"SocketException: Error while getting the next batch in the oplog fetcher :: caused by :: socket exception [CONNECT_ERROR] server [network error while attempting to run command 'isMaster' on host 'localhost:20008' ]"}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.825+0000 I  REPL     21094   [BackgroundSync] "Replication producer stopped after oplog fetcher finished returning a batch from our sync source.  Abandoning this batch of oplog entries and re-evaluating our sync source."
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.829+0000 I  SHARDING 21904   [OplogApplier-0] "The ChunkSplitter has started and will accept autosplit tasks."
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.829+0000 I  TXN      22454   [OplogApplier-0] "Waiting for coordinator tasks from previous term to complete"
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.829+0000 I  MIGRATION 22028   [OplogApplier-0] "Starting pending deletion submission thread."
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.829+0000 I  MIGRATION 22037   [OplogApplier-0] "Starting migration coordinator stepup recovery thread."
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.830+0000 I  STORAGE  20657   [OplogApplier-0] "IndexBuildsCoordinator::onStepUp - this node is stepping up to primary"
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.830+0000 I  STORAGE  20650   [OplogApplier-0] "{logPrefix}active index builds: {indexBuilds_size}","attr":{"logPrefix":"IndexBuildsCoordinator::onStepUp - ","indexBuilds_size":1}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.830+0000 I  STORAGE  20651   [OplogApplier-0] "{logPrefix}{replState_buildUUID}: collection: {replState_collectionUUID}; indexes: {replState_indexNames_size} [{indexNamesStr}]; method: {IndexBuildProtocol_kTwoPhase_replState_protocol_two_phase_single_phase}","attr":{"logPrefix":"IndexBuildsCoordinator::onStepUp - ","replState_buildUUID":{"uuid":{"$uuid":"0780ab91-42e3-4930-848d-7e1d8bee0803"}},"replState_collectionUUID":{"uuid":{"$uuid":"1a8132cb-9f57-47fa-bcf8-62d1016ba446"}},"replState_indexNames_size":1,"indexNamesStr":"JSON_1","IndexBuildProtocol_kTwoPhase_replState_protocol_two_phase_single_phase":"two phase"}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.831+0000 F  -        23081   [OplogApplier-0] "Invariant failure {expr} {msg} {file} {line}","attr":{"expr":"replState->commitQuorum","msg":"Commit quorum is missing for index build: 0780ab91-42e3-4930-848d-7e1d8bee0803","file":"src/mongo/db/index_builds_coordinator_mongod.cpp","line":368}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.831+0000 F  -        23082   [OplogApplier-0] "\n\n***aborting after invariant() failure\n\n"
[ShardedClusterFixture:job0:shard1:node1] {"t":{"$date":"2020-03-10T06:34:04.831+0000"},"s":"F", "c":"-",       "id":0,"ctx":"OplogApplier-0","msg":"{}","attr":{"message":"Got signal: 6 (Aborted)."}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.004+0000 I  CONTROL  31431   [OplogApplier-0] "BACKTRACE: {bt}","attr":{"bt":{"backtrace":[{"a":"7F069540A081","b":"7F06927D8000","o":"2C32081","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.575","s+":"1E1"},{"a":"7F069540B779","b":"7F06927D8000","o":"2C33779","s":"_ZN5mongo15printStackTraceEv","s+":"29"},{"a":"7F0695408CA6","b":"7F06927D8000","o":"2C30CA6","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP7siginfoPv","s+":"66"},{"a":"7F068F94C7E0","b":"7F068F93D000","o":"F7E0","s":"_L_unlock_16","s+":"2D"},{"a":"7F068F5DB4F5","b":"7F068F5A9000","o":"324F5","s":"gsignal","s+":"35"},{"a":"7F068F5DCCD5","b":"7F068F5A9000","o":"33CD5","s":"abort","s+":"175"},{"a":"7F06936C0CFE","b":"7F06927D8000","o":"EE8CFE","s":"_ZN5mongo22invariantFailedWithMsgEPKcRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES1_j","s+":"25E"},{"a":"7F06934CB4A9","b":"7F06927D8000","o":"CF34A9","s":"_ZN5mongo28IndexBuildsCoordinatorMongod32_signalIfCommitQuorumIsSatisfiedEPNS_16OperationContextESt10shared_ptrINS_19ReplIndexBuildStateEE.cold.1155","s+":"207"},{"a":"7F069416AE3F","b":"7F06927D8000","o":"1992E3F","s":"_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mongo19ReplIndexBuildStateEEEZNS1_22IndexBuildsCoordinator8onStepUpEPNS1_16OperationContextEEUlS3_E0_E9_M_invokeERKSt9_Any_dataOS3_","s+":"12F"},{"a":"7F0694170F24","b":"7F06927D8000","o":"1998F24","s":"_ZN5mongo12_GLOBAL__N_117forEachIndexBuildERKSt6vectorISt10shared_ptrINS_19ReplIndexBuildStateEESaIS4_EENS_10StringDataESt8functionIFvS4_EE","s+":"4D4"},{"a":"7F0694171432","b":"7F06927D8000","o":"1999432","s":"_ZN5mongo22IndexBuildsCoordinator8onStepUpEPNS_16OperationContextE","s+":"132"},{"a":"7F0693837095","b":"7F06927D8000","o":"105F095","s":"_ZN5mongo4repl39ReplicationCoordinatorExternalStateImpl21onTransitionToPrimaryEPNS_16OperationContextE","s+":"265"},{"a":"7F06938717C5","b":"7F06927D8000","o":"10997C5","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl19signalDrainCompleteEPNS_16OperationContextEx","s+":"405"},{"a":"7F06938F2D2E","b":"7F06927D8000","o":"111AD2E","s":"_ZN5mongo4repl16OplogApplierImpl4_runEPNS0_11OplogBufferE","s+":"8EE"},{"a":"7F0693943682","b":"7F06927D8000","o":"116B682","s":"_ZZN5mongo15unique_functionIFvRKNS_8executor12TaskExecutor12CallbackArgsEEE8makeImplIZNS_4repl12OplogApplier7startupEvEUlS5_E_EEDaOT_EN12SpecificImpl4callES5_","s+":"F2"},{"a":"7F0694E14D33","b":"7F06927D8000","o":"263CD33","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE","s+":"113"},{"a":"7F0694E15142","b":"7F06927D8000","o":"263D142","s":"_ZZN5mongo15unique_functionIFvNS_6StatusEEE8makeImplIZNS_8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPNSt7__cxx114listISt10shared_ptrINS6_13CallbackStateEESaISB_EEERKSt14_List_iteratorISB_ESI_St11unique_lockINS_12latch_detail5LatchEEEUlT_E1_EEDaOSN_EN12SpecificImpl4callEOS1_","s+":"A2"},{"a":"7F0694FA6E91","b":"7F06927D8000","o":"27CEE91","s":"_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockINS_12latch_detail5LatchEE","s+":"141"},{"a":"7F0694FA966C","b":"7F06927D8000","o":"27D166C","s":"_ZN5mongo10ThreadPool13_consumeTasksEv","s+":"8C"},{"a":"7F0694FAA79E","b":"7F06927D8000","o":"27D279E","s":"_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE","s+":"EE"},{"a":"7F0694FAAA20","b":"7F06927D8000","o":"27D2A20","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_10ThreadPool25_startWorkerThread_inlockEvEUlvE2_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","s+":"60"},{"a":"7F06955A3A9F","b":"7F06927D8000","o":"2DCBA9F","s":"execute_native_thread_routine","s+":"F"},{"a":"7F068F944AA1","b":"7F068F93D000","o":"7AA1","s":"start_thread","s+":"D1"},{"a":"7F068F691C4D","b":"7F068F5A9000","o":"E8C4D","s":"clone","s+":"6D"}],"processInfo":{"mongodbVersion":"4.3.4-99-g6596d5a-patch-5e67217fe3c3315e1b02d2a7","gitVersion":"6596d5a40adfe367cb408c09f90327ff64badaec","compiledModules":["enterprise"],"uname":{"sysname":"Linux","release":"2.6.32-220.el6.x86_64","version":"#1 SMP Wed Nov 9 08:03:13 EST 2011","machine":"x86_64"},"somap":[{"b":"7F06927D8000","elfType":3,"buildId":"F58B5437E1E79678928548F36B48A8BA32FB0C1E"}]}}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.005+0000 I  CONTROL  31427   [OplogApplier-0] "  Frame: {frame}","attr":{"frame":{"a":"7F069540A081","b":"7F06927D8000","o":"2C32081","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.575","s+":"1E1"}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.005+0000 I  CONTROL  31427   [OplogApplier-0] "  Frame: {frame}","attr":{"frame":{"a":"7F069540B779","b":"7F06927D8000","o":"2C33779","s":"_ZN5mongo15printStackTraceEv","s+":"29"}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.005+0000 I  CONTROL  31427   [OplogApplier-0] "  Frame: {frame}","attr":{"frame":{"a":"7F0695408CA6","b":"7F06927D8000","o":"2C30CA6","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP7siginfoPv","s+":"66"}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.005+0000 I  CONTROL  31427   [OplogApplier-0] "  Frame: {frame}","attr":{"frame":{"a":"7F068F94C7E0","b":"7F068F93D000","o":"F7E0","s":"_L_unlock_16","s+":"2D"}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.005+0000 I  CONTROL  31427   [OplogApplier-0] "  Frame: {frame}","attr":{"frame":{"a":"7F068F5DB4F5","b":"7F068F5A9000","o":"324F5","s":"gsignal","s+":"35"}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.005+0000 I  CONTROL  31427   [OplogApplier-0] "  Frame: {frame}","attr":{"frame":{"a":"7F068F5DCCD5","b":"7F068F5A9000","o":"33CD5","s":"abort","s+":"175"}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.005+0000 I  CONTROL  31427   [OplogApplier-0] "  Frame: {frame}","attr":{"frame":{"a":"7F06936C0CFE","b":"7F06927D8000","o":"EE8CFE","s":"_ZN5mongo22invariantFailedWithMsgEPKcRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES1_j","s+":"25E"}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I  CONTROL  31427   [OplogApplier-0] "  Frame: {frame}","attr":{"frame":{"a":"7F06934CB4A9","b":"7F06927D8000","o":"CF34A9","s":"_ZN5mongo28IndexBuildsCoordinatorMongod32_signalIfCommitQuorumIsSatisfiedEPNS_16OperationContextESt10shared_ptrINS_19ReplIndexBuildStateEE.cold.1155","s+":"207"}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I  CONTROL  31427   [OplogApplier-0] "  Frame: {frame}","attr":{"frame":{"a":"7F069416AE3F","b":"7F06927D8000","o":"1992E3F","s":"_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mongo19ReplIndexBuildStateEEEZNS1_22IndexBuildsCoordinator8onStepUpEPNS1_16OperationContextEEUlS3_E0_E9_M_invokeERKSt9_Any_dataOS3_","s+":"12F"}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I  CONTROL  31427   [OplogApplier-0] "  Frame: {frame}","attr":{"frame":{"a":"7F0694170F24","b":"7F06927D8000","o":"1998F24","s":"_ZN5mongo12_GLOBAL__N_117forEachIndexBuildERKSt6vectorISt10shared_ptrINS_19ReplIndexBuildStateEESaIS4_EENS_10StringDataESt8functionIFvS4_EE","s+":"4D4"}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I  CONTROL  31427   [OplogApplier-0] "  Frame: {frame}","attr":{"frame":{"a":"7F0694171432","b":"7F06927D8000","o":"1999432","s":"_ZN5mongo22IndexBuildsCoordinator8onStepUpEPNS_16OperationContextE","s+":"132"}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I  CONTROL  31427   [OplogApplier-0] "  Frame: {frame}","attr":{"frame":{"a":"7F0693837095","b":"7F06927D8000","o":"105F095","s":"_ZN5mongo4repl39ReplicationCoordinatorExternalStateImpl21onTransitionToPrimaryEPNS_16OperationContextE","s+":"265"}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I  CONTROL  31427   [OplogApplier-0] "  Frame: {frame}","attr":{"frame":{"a":"7F06938717C5","b":"7F06927D8000","o":"10997C5","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl19signalDrainCompleteEPNS_16OperationContextEx","s+":"405"}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I  CONTROL  31427   [OplogApplier-0] "  Frame: {frame}","attr":{"frame":{"a":"7F06938F2D2E","b":"7F06927D8000","o":"111AD2E","s":"_ZN5mongo4repl16OplogApplierImpl4_runEPNS0_11OplogBufferE","s+":"8EE"}}
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I  CONTROL  31427   [OplogApplier-0] "  Frame: {frame}","attr":{"frame":{"a":"7F0693943682","b":"7F06927D8000","o":"116B682","s":"_ZZN5mongo15unique_functionIFvRKNS_8executor12TaskExecutor12CallbackArgsEEE8makeImplIZNS_4repl12OplogApplier7startupEvEUlS5_E_EEDaOT_EN12SpecificImpl4callES5_","s+":"F2"}}

Generated at Thu Feb 08 05:12:31 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.