|
Did a quick look at the logs. I see 2 problems. The test got hung even before the invariant failure.
Problem #1
Node1(old primary) used indexBuildsCoordinator thread pool(two phase index protocol) for building index 'JSON_1' thinking that's a non-empty collection. But, then secondaries thought it's an empty collection and used single phase index protocol, committed the index build without waiting for primary to send commitIndexBuild oplog entry from primary. So, when node1 then stepped down and on becoming secondary, it tries to vote, but then it keeps receiving "no such index build" from current primary. As a result test hangs. This bug is fixed by SERVER-46603
// Index build that got orphaned.
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T05:55:27.464+0000 I INDEX 20384 [IndexBuildsCoordinatorMongod-3] "index build: starting on {ns} properties: {descriptor} using method: {method}","attr":{"ns":"test.update_affects_indexes","descriptor":{"v":2,"key":{"JSON":1},"name":"JSON_1"},"method":"Hybrid"}
|
// Indicates that index build scanned zero record
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T05:55:27.466+0000 I INDEX 20391 [IndexBuildsCoordinatorMongod-3] "index build: collection scan done. scanned {n} total records in {t_seconds} seconds","attr":{"n":0,"t_seconds":0}
|
// Node 1 waits for the votes from secondary
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T05:55:27.475+0000 I STORAGE 3856203 [IndexBuildsCoordinatorMongod-3] "Index build waiting for next action before completing final phase: {buildUUID}","attr":{"buildUUID":{"uuid":{"$uuid":"0780ab91-42e3-4930-848d-7e1d8bee0803"}}}
|
// Secondaries able to commit without getting commit oplog from primary
|
[ShardedClusterFixture:job0:shard1:node2] | 2020-03-10T05:55:27.471+0000 I INDEX 20345 [ReplWriterWorker-10] "index build: done building index {indexName} on ns {nss}","attr":{"indexName":"JSON_1","nss":"test.update_affects_indexes"}
|
[ShardedClusterFixture:job0:shard1:node0] | 2020-03-10T05:55:27.478+0000 I INDEX 20345 [ReplWriterWorker-12] "index build: done building index {indexName} on ns {nss}","attr":{"indexName":"JSON_1","nss":"test.update_affects_indexes"}
|
// We keep getting this message
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:32:24.755+0000 I STORAGE 3856202 [IndexBuildsCoordinatorMongod-5] "'voteCommitIndexBuild' command failed with response : {response}","attr":{"response":{"operationTime":{"$timestamp":{"t":1583821938,"i":1}},"ok":0,"errmsg":"No index build with UUID: 0780ab91-42e3-4930-848d-7e1d8bee0803","code":4,"codeName":"NoSuchKey","$gleStats":{"lastOpTime":{"$timestamp":{"t":0,"i":0}},"electionId":{"$oid":"7fffffff000000000000002d"}},"lastCommittedOpTime":{"$timestamp":{"t":1583821938,"i":1}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1583821944,"i":2}},"t":46}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1583821944,"i":2}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}}}}
|
Problem #2
Invariant failure happened because we start persist the commit quorum value only on receiving first vote. The node 1, after stepping down, it rolled back, which made the index build to abort and restart. And, as a result, the commit quorum value is lost. This bug should be addressed by SERVER-46557
|
|
The test failed because a node hit an invariant after stepping up to primary:
ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.825+0000 D2 REPL_HB 4615618 [ReplCoord-21] "Scheduling heartbeat to {target} at {dateToISOStringUTC_when}","attr":{"target":"localhost:20006","dateToISOStringUTC_when":"2020-03-10T06:34:06.825Z"}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.825+0000 I NETWORK 20121 [ReplCoordExtern-2] "reconnect {} failed {errmsg}","attr":{"":"localhost:20008 failed","errmsg":""}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.825+0000 I NETWORK 20127 [ReplCoordExtern-2] "DBClientCursor::init call() failed"
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.825+0000 I REPL 21274 [ReplCoordExtern-2] "Error returned from oplog query (no more query restarts left): {status}","attr":{"status":"SocketException: Error while getting the next batch in the oplog fetcher :: caused by :: socket exception [CONNECT_ERROR] server [network error while attempting to run command 'isMaster' on host 'localhost:20008' ]"}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.825+0000 I REPL 21094 [BackgroundSync] "Replication producer stopped after oplog fetcher finished returning a batch from our sync source. Abandoning this batch of oplog entries and re-evaluating our sync source."
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.829+0000 I SHARDING 21904 [OplogApplier-0] "The ChunkSplitter has started and will accept autosplit tasks."
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.829+0000 I TXN 22454 [OplogApplier-0] "Waiting for coordinator tasks from previous term to complete"
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.829+0000 I MIGRATION 22028 [OplogApplier-0] "Starting pending deletion submission thread."
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.829+0000 I MIGRATION 22037 [OplogApplier-0] "Starting migration coordinator stepup recovery thread."
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.830+0000 I STORAGE 20657 [OplogApplier-0] "IndexBuildsCoordinator::onStepUp - this node is stepping up to primary"
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.830+0000 I STORAGE 20650 [OplogApplier-0] "{logPrefix}active index builds: {indexBuilds_size}","attr":{"logPrefix":"IndexBuildsCoordinator::onStepUp - ","indexBuilds_size":1}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.830+0000 I STORAGE 20651 [OplogApplier-0] "{logPrefix}{replState_buildUUID}: collection: {replState_collectionUUID}; indexes: {replState_indexNames_size} [{indexNamesStr}]; method: {IndexBuildProtocol_kTwoPhase_replState_protocol_two_phase_single_phase}","attr":{"logPrefix":"IndexBuildsCoordinator::onStepUp - ","replState_buildUUID":{"uuid":{"$uuid":"0780ab91-42e3-4930-848d-7e1d8bee0803"}},"replState_collectionUUID":{"uuid":{"$uuid":"1a8132cb-9f57-47fa-bcf8-62d1016ba446"}},"replState_indexNames_size":1,"indexNamesStr":"JSON_1","IndexBuildProtocol_kTwoPhase_replState_protocol_two_phase_single_phase":"two phase"}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.831+0000 F - 23081 [OplogApplier-0] "Invariant failure {expr} {msg} {file} {line}","attr":{"expr":"replState->commitQuorum","msg":"Commit quorum is missing for index build: 0780ab91-42e3-4930-848d-7e1d8bee0803","file":"src/mongo/db/index_builds_coordinator_mongod.cpp","line":368}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:04.831+0000 F - 23082 [OplogApplier-0] "\n\n***aborting after invariant() failure\n\n"
|
[ShardedClusterFixture:job0:shard1:node1] {"t":{"$date":"2020-03-10T06:34:04.831+0000"},"s":"F", "c":"-", "id":0,"ctx":"OplogApplier-0","msg":"{}","attr":{"message":"Got signal: 6 (Aborted)."}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.004+0000 I CONTROL 31431 [OplogApplier-0] "BACKTRACE: {bt}","attr":{"bt":{"backtrace":[{"a":"7F069540A081","b":"7F06927D8000","o":"2C32081","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.575","s+":"1E1"},{"a":"7F069540B779","b":"7F06927D8000","o":"2C33779","s":"_ZN5mongo15printStackTraceEv","s+":"29"},{"a":"7F0695408CA6","b":"7F06927D8000","o":"2C30CA6","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP7siginfoPv","s+":"66"},{"a":"7F068F94C7E0","b":"7F068F93D000","o":"F7E0","s":"_L_unlock_16","s+":"2D"},{"a":"7F068F5DB4F5","b":"7F068F5A9000","o":"324F5","s":"gsignal","s+":"35"},{"a":"7F068F5DCCD5","b":"7F068F5A9000","o":"33CD5","s":"abort","s+":"175"},{"a":"7F06936C0CFE","b":"7F06927D8000","o":"EE8CFE","s":"_ZN5mongo22invariantFailedWithMsgEPKcRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES1_j","s+":"25E"},{"a":"7F06934CB4A9","b":"7F06927D8000","o":"CF34A9","s":"_ZN5mongo28IndexBuildsCoordinatorMongod32_signalIfCommitQuorumIsSatisfiedEPNS_16OperationContextESt10shared_ptrINS_19ReplIndexBuildStateEE.cold.1155","s+":"207"},{"a":"7F069416AE3F","b":"7F06927D8000","o":"1992E3F","s":"_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mongo19ReplIndexBuildStateEEEZNS1_22IndexBuildsCoordinator8onStepUpEPNS1_16OperationContextEEUlS3_E0_E9_M_invokeERKSt9_Any_dataOS3_","s+":"12F"},{"a":"7F0694170F24","b":"7F06927D8000","o":"1998F24","s":"_ZN5mongo12_GLOBAL__N_117forEachIndexBuildERKSt6vectorISt10shared_ptrINS_19ReplIndexBuildStateEESaIS4_EENS_10StringDataESt8functionIFvS4_EE","s+":"4D4"},{"a":"7F0694171432","b":"7F06927D8000","o":"1999432","s":"_ZN5mongo22IndexBuildsCoordinator8onStepUpEPNS_16OperationContextE","s+":"132"},{"a":"7F0693837095","b":"7F06927D8000","o":"105F095","s":"_ZN5mongo4repl39ReplicationCoordinatorExternalStateImpl21onTransitionToPrimaryEPNS_16OperationContextE","s+":"265"},{"a":"7F06938717C5","b":"7F06927D8000","o":"10997C5","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl19signalDrainCompleteEPNS_16OperationContextEx","s+":"405"},{"a":"7F06938F2D2E","b":"7F06927D8000","o":"111AD2E","s":"_ZN5mongo4repl16OplogApplierImpl4_runEPNS0_11OplogBufferE","s+":"8EE"},{"a":"7F0693943682","b":"7F06927D8000","o":"116B682","s":"_ZZN5mongo15unique_functionIFvRKNS_8executor12TaskExecutor12CallbackArgsEEE8makeImplIZNS_4repl12OplogApplier7startupEvEUlS5_E_EEDaOT_EN12SpecificImpl4callES5_","s+":"F2"},{"a":"7F0694E14D33","b":"7F06927D8000","o":"263CD33","s":"_ZN5mongo8executor22ThreadPoolTaskExecutor11runCallbackESt10shared_ptrINS1_13CallbackStateEE","s+":"113"},{"a":"7F0694E15142","b":"7F06927D8000","o":"263D142","s":"_ZZN5mongo15unique_functionIFvNS_6StatusEEE8makeImplIZNS_8executor22ThreadPoolTaskExecutor23scheduleIntoPool_inlockEPNSt7__cxx114listISt10shared_ptrINS6_13CallbackStateEESaISB_EEERKSt14_List_iteratorISB_ESI_St11unique_lockINS_12latch_detail5LatchEEEUlT_E1_EEDaOSN_EN12SpecificImpl4callEOS1_","s+":"A2"},{"a":"7F0694FA6E91","b":"7F06927D8000","o":"27CEE91","s":"_ZN5mongo10ThreadPool10_doOneTaskEPSt11unique_lockINS_12latch_detail5LatchEE","s+":"141"},{"a":"7F0694FA966C","b":"7F06927D8000","o":"27D166C","s":"_ZN5mongo10ThreadPool13_consumeTasksEv","s+":"8C"},{"a":"7F0694FAA79E","b":"7F06927D8000","o":"27D279E","s":"_ZN5mongo10ThreadPool17_workerThreadBodyEPS0_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE","s+":"EE"},{"a":"7F0694FAAA20","b":"7F06927D8000","o":"27D2A20","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_10ThreadPool25_startWorkerThread_inlockEvEUlvE2_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","s+":"60"},{"a":"7F06955A3A9F","b":"7F06927D8000","o":"2DCBA9F","s":"execute_native_thread_routine","s+":"F"},{"a":"7F068F944AA1","b":"7F068F93D000","o":"7AA1","s":"start_thread","s+":"D1"},{"a":"7F068F691C4D","b":"7F068F5A9000","o":"E8C4D","s":"clone","s+":"6D"}],"processInfo":{"mongodbVersion":"4.3.4-99-g6596d5a-patch-5e67217fe3c3315e1b02d2a7","gitVersion":"6596d5a40adfe367cb408c09f90327ff64badaec","compiledModules":["enterprise"],"uname":{"sysname":"Linux","release":"2.6.32-220.el6.x86_64","version":"#1 SMP Wed Nov 9 08:03:13 EST 2011","machine":"x86_64"},"somap":[{"b":"7F06927D8000","elfType":3,"buildId":"F58B5437E1E79678928548F36B48A8BA32FB0C1E"}]}}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.005+0000 I CONTROL 31427 [OplogApplier-0] " Frame: {frame}","attr":{"frame":{"a":"7F069540A081","b":"7F06927D8000","o":"2C32081","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.575","s+":"1E1"}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.005+0000 I CONTROL 31427 [OplogApplier-0] " Frame: {frame}","attr":{"frame":{"a":"7F069540B779","b":"7F06927D8000","o":"2C33779","s":"_ZN5mongo15printStackTraceEv","s+":"29"}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.005+0000 I CONTROL 31427 [OplogApplier-0] " Frame: {frame}","attr":{"frame":{"a":"7F0695408CA6","b":"7F06927D8000","o":"2C30CA6","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP7siginfoPv","s+":"66"}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.005+0000 I CONTROL 31427 [OplogApplier-0] " Frame: {frame}","attr":{"frame":{"a":"7F068F94C7E0","b":"7F068F93D000","o":"F7E0","s":"_L_unlock_16","s+":"2D"}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.005+0000 I CONTROL 31427 [OplogApplier-0] " Frame: {frame}","attr":{"frame":{"a":"7F068F5DB4F5","b":"7F068F5A9000","o":"324F5","s":"gsignal","s+":"35"}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.005+0000 I CONTROL 31427 [OplogApplier-0] " Frame: {frame}","attr":{"frame":{"a":"7F068F5DCCD5","b":"7F068F5A9000","o":"33CD5","s":"abort","s+":"175"}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.005+0000 I CONTROL 31427 [OplogApplier-0] " Frame: {frame}","attr":{"frame":{"a":"7F06936C0CFE","b":"7F06927D8000","o":"EE8CFE","s":"_ZN5mongo22invariantFailedWithMsgEPKcRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES1_j","s+":"25E"}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I CONTROL 31427 [OplogApplier-0] " Frame: {frame}","attr":{"frame":{"a":"7F06934CB4A9","b":"7F06927D8000","o":"CF34A9","s":"_ZN5mongo28IndexBuildsCoordinatorMongod32_signalIfCommitQuorumIsSatisfiedEPNS_16OperationContextESt10shared_ptrINS_19ReplIndexBuildStateEE.cold.1155","s+":"207"}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I CONTROL 31427 [OplogApplier-0] " Frame: {frame}","attr":{"frame":{"a":"7F069416AE3F","b":"7F06927D8000","o":"1992E3F","s":"_ZNSt17_Function_handlerIFvSt10shared_ptrIN5mongo19ReplIndexBuildStateEEEZNS1_22IndexBuildsCoordinator8onStepUpEPNS1_16OperationContextEEUlS3_E0_E9_M_invokeERKSt9_Any_dataOS3_","s+":"12F"}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I CONTROL 31427 [OplogApplier-0] " Frame: {frame}","attr":{"frame":{"a":"7F0694170F24","b":"7F06927D8000","o":"1998F24","s":"_ZN5mongo12_GLOBAL__N_117forEachIndexBuildERKSt6vectorISt10shared_ptrINS_19ReplIndexBuildStateEESaIS4_EENS_10StringDataESt8functionIFvS4_EE","s+":"4D4"}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I CONTROL 31427 [OplogApplier-0] " Frame: {frame}","attr":{"frame":{"a":"7F0694171432","b":"7F06927D8000","o":"1999432","s":"_ZN5mongo22IndexBuildsCoordinator8onStepUpEPNS_16OperationContextE","s+":"132"}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I CONTROL 31427 [OplogApplier-0] " Frame: {frame}","attr":{"frame":{"a":"7F0693837095","b":"7F06927D8000","o":"105F095","s":"_ZN5mongo4repl39ReplicationCoordinatorExternalStateImpl21onTransitionToPrimaryEPNS_16OperationContextE","s+":"265"}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I CONTROL 31427 [OplogApplier-0] " Frame: {frame}","attr":{"frame":{"a":"7F06938717C5","b":"7F06927D8000","o":"10997C5","s":"_ZN5mongo4repl26ReplicationCoordinatorImpl19signalDrainCompleteEPNS_16OperationContextEx","s+":"405"}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I CONTROL 31427 [OplogApplier-0] " Frame: {frame}","attr":{"frame":{"a":"7F06938F2D2E","b":"7F06927D8000","o":"111AD2E","s":"_ZN5mongo4repl16OplogApplierImpl4_runEPNS0_11OplogBufferE","s+":"8EE"}}
|
[ShardedClusterFixture:job0:shard1:node1] | 2020-03-10T06:34:05.006+0000 I CONTROL 31427 [OplogApplier-0] " Frame: {frame}","attr":{"frame":{"a":"7F0693943682","b":"7F06927D8000","o":"116B682","s":"_ZZN5mongo15unique_functionIFvRKNS_8executor12TaskExecutor12CallbackArgsEEE8makeImplIZNS_4repl12OplogApplier7startupEvEUlS5_E_EEDaOT_EN12SpecificImpl4callES5_","s+":"F2"}}
|
|