[SERVER-55676] [SBE][replica_sets] Tests trigger fatal assertion: NotPrimaryOrSecondary: Oplog collection reads are not allowed while in the rollback or startup state Created: 31/Mar/21  Updated: 29/Oct/23  Resolved: 14/Apr/21

Status: Closed
Project: Core Server
Component/s: Query Execution, Replication
Affects Version/s: None
Fix Version/s: 5.0.0-rc0

Type: Task Priority: Major - P3
Reporter: Kyle Suarez Assignee: Mindaugas Malinauskas
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-56086 [SBE][ese] Hang in decrypt_tool.js Closed
is duplicated by SERVER-56122 [SBE] assert.soon() timeout in rollba... Closed
is duplicated by SERVER-55669 [SBE][replica_sets] Hang because roll... Closed
is duplicated by SERVER-55671 [SBE][replica_sets] System hangs duri... Closed
Related
related to SERVER-55009 Enable replica_sets suites against SB... Closed
Backwards Compatibility: Fully Compatible
Sprint: Query Execution 2021-04-19
Participants:

 Description   

In these tests either the ReplCoordExtern-1/2, BackgroundSync or ReplWriterWorker thread hits this fatal assertion. Most of them have this failure with a stack trace involving TransactionHistoryIterator:

[ReplWriterWorker-0] "Fatal assertion","attr":{"msgid":31145,"error":"NotPrimaryOrSecondary: Oplog collection reads are not allowed while in the rollback or startup state.","file":"src/mongo/db/transaction_history_iterator.cpp","line":139}                               
[ReplWriterWorker-0] "\n\n***aborting after fassert() failure\n\n"
[ReplWriterWorker-0] "Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}
[ReplWriterWorker-0] "BACKTRACE","attr":{"bt":{"backtrace":[{"a":"7F3C396B0572","b":"7F3C394BD000","o":"1F3572","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.630","s+":"212"},{"a":"7F3C396B1FA9","b":
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C396B0572","b":"7F3C394BD000","o":"1F3572","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.630","s+":"212"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C396B1FA9","b":"7F3C394BD000","o":"1F4FA9","s":"_ZN5mongo15printStackTraceEv","s+":"29"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C396AD4DF","b":"7F3C394BD000","o":"1F04DF","s":"abruptQuit","s+":"6F"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C37ACBD80","b":"7F3C37AB9000","o":"12D80","s":"funlockfile","s+":"50"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C373AA93F","b":"7F3C37373000","o":"3793F","s":"gsignal","s+":"10F"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C37394C95","b":"7F3C37373000","o":"21C95","s":"abort","s+":"127"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C395DCCA2","b":"7F3C394BD000","o":"11FCA2","s":"_ZN5mongo35fassertFailedWithStatusWithLocationEiRKNS_6StatusEPKcj","s+":"178"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C351E652A","b":"7F3C351B6000","o":"3052A","s":"_ZN5mongo26TransactionHistoryIterator17nextFatalOnErrorsEPNS_16OperationContextE.cold.702","s+":"30"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C338609CA","b":"7F3C33816000","o":"4A9CA","s":"_ZN5mongo40_readTransactionOperationsFromOplogChainEPNS_16OperationContextERKNS_4repl10OplogEntryERKSt6vectorIPS3_SaIS7_EEb","s+":"16A"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C338617B2","b":"7F3C33816000","o":"4B7B2","s":"_ZN5mongo39readTransactionOperationsFromOplogChainEPNS_16OperationContextERKNS_4repl10OplogEntryERKSt6vectorIPS3_SaIS7_EE","s+":"32"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C338621D7","b":"7F3C33816000","o":"4C1D7","s":"_ZN5mongo22applyCommitTransactionEPNS_16OperationContextERKNS_4repl10OplogEntryENS2_16OplogApplication4ModeE","s+":"147"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C3384818C","b":"7F3C33816000","o":"3218C","s":"_ZNSt17_Function_handlerIFN5mongo6StatusEPNS0_16OperationContextERKNS0_4repl10OplogEntryENS4_16OplogApplication4ModeEENS4_12_GLOBAL__N_1UlS3_S7_S9_E22_EE9_M_invokeERKSt9
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C33854751","b":"7F3C33816000","o":"3E751","s":"_ZN5mongo4repl19applyCommand_inlockEPNS_16OperationContextERKNS0_10OplogEntryENS0_16OplogApplication4ModeE","s+":"B51"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C355AB360","b":"7F3C35581000","o":"2A360","s":"_ZN5mongo4repl17OplogApplierUtils37applyOplogEntryOrGroupedInsertsCommonEPNS_16OperationContextERKNS0_26OplogEntryOrGroupedInsertsENS0_16OplogApplication4ModeESt8functio
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C3559DD0A","b":"7F3C35581000","o":"1CD0A","s":"_ZN5mongo4repl31applyOplogEntryOrGroupedInsertsEPNS_16OperationContextERKNS0_26OplogEntryOrGroupedInsertsENS0_16OplogApplication4ModeE","s+":"BA"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C355A07EC","b":"7F3C35581000","o":"1F7EC","s":"_ZNSt17_Function_handlerIFN5mongo6StatusEPNS0_16OperationContextERKNS0_4repl26OplogEntryOrGroupedInsertsENS4_16OplogApplication4ModeEEPSA_E9_M_invokeERKSt9_Any_dataOS3_S
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C355A9CBE","b":"7F3C35581000","o":"28CBE","s":"_ZN5mongo4repl17OplogApplierUtils21applyOplogBatchCommonEPNS_16OperationContextEPSt6vectorIPKNS0_10OplogEntryESaIS7_EENS0_16OplogApplication4ModeEbSt8functionIFNS_6Statu
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C3559D8E6","b":"7F3C35581000","o":"1C8E6","s":"_ZN5mongo4repl16OplogApplierImpl24applyOplogBatchPerWorkerEPNS_16OperationContextEPSt6vectorIPKNS0_10OplogEntryESaIS7_EEPS4_INS_16MultikeyPathInfoESaISB_EE","s+":"F6"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C3559C2B5","b":"7F3C35581000","o":"1B2B5","s":"_ZZN5mongo15unique_functionIFvNS_6StatusEEE8makeImplIZNS_4repl16OplogApplierImpl16_applyOplogBatchEPNS_16OperationContextESt6vectorINS5_10OplogEntryESaISA_EEEUlT_E3_EEDa
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C314C6AA2","b":"7F3C314B5000","o":"11AA2","s":"_ZN5mongo10ThreadPool4Impl10_doOneTaskEPSt11unique_lockINS_12latch_detail5LatchEE","s+":"122"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C314C6D43","b":"7F3C314B5000","o":"11D43","s":"_ZN5mongo10ThreadPool4Impl13_consumeTasksEv","s+":"83"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C314C7835","b":"7F3C314B5000","o":"12835","s":"_ZN5mongo10ThreadPool4Impl17_workerThreadBodyERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE","s+":"B5"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C314C7B50","b":"7F3C314B5000","o":"12B50","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_10ThreadPool4Impl25_startWorkerThread_inlockEvEUlvE2_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C328F63CF","b":"7F3C328E9000","o":"D3CF","s":"execute_native_thread_routine","s+":"F"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C37AC12DE","b":"7F3C37AB9000","o":"82DE","s":"start_thread","s+":"FE"}}
[ReplWriterWorker-0] "Frame","attr":{"frame":{"a":"7F3C3746FA63","b":"7F3C37373000","o":"FCA63","s":"clone","s+":"43"}}

Sometimes it happens during background sync while rolling back to recover to a checkpoint, or during initial sync, or when oplog batches are being applied.

The full list of tests failing:

  • initial_sync_test_fixture_test.js
  • prepare_failover_rollback_commit.js
  • initial_sync_replicates_prepare_received_during_another_initial_sync.js
  • rollback_recovery_commit_transaction_before_stable_timestamp.js
  • commit_transaction_initial_sync_data_already_applied.js
  • initial_sync_fetch_from_oldest_active_transaction_timestamp.js
  • reconstruct_prepared_transactions_initial_sync_index_build.js
  • rollback_reconstructs_transactions_prepared_before_stable.js
  • initial_sync_fetch_from_oldest_active_transaction_timestamp_no_oplog_application.js
  • initial_sync_reset_oldest_timestamp_after_failed_attempt.js
  • initial_sync_commit_prepared_transaction.js
  • recover_committed_aborted_prepared_transactions.js
  • initial_sync_preserves_active_txns.js
  • reconstruct_prepared_transactions_initial_sync_no_oplog_application.js
  • reconstruct_prepared_transactions_initial_sync_change_oldest_active_txn_timestamp.js
  • reconstruct_prepared_transactions_initial_sync.js
  • reconstruct_prepared_transactions_initial_sync_on_oplog_seed.js
  • initial_sync_update_missing_doc_with_prepare.js
  • recover_prepared_txn_with_multikey_write_initial_sync.js

Tests observed failing with the same symptom in concurrency passthrough suites:

  • jstests/concurrency/fsm_workloads/multi_statement_transaction_atomicity_isolation_repeated_reads.js in concurrency_sharded_multi_stmt_txn_kill_primary (task)
  • jstests/concurrency/fsm_workloads/update_check_index.js in concurrency_sharded_multi_stmt_txn_kill_primary and concurrency_sharded_multi_stmt_txn_with_stepdowns (task, task)
  • jstests/concurrency/fsm_workloads/update_multifield_multiupdate.js in concurrency_sharded_multi_stmt_txn_with_stepdowns (task)
  • jstests/concurrency/fsm_workloads/multi_statement_transaction_simple_majority_writes.js in concurrency_sharded_multi_stmt_txn_with_stepdowns (task)
  • jstests/concurrency/fsm_workloads/multi_statement_transaction_kill_sessions_atomicity_isolation.js in concurrency_sharded_with_stepdowns (task)

The same asseer was also observed in initial_sync_fuzzer (task).



 Comments   
Comment by Githook User [ 14/Apr/21 ]

Author:

{'name': 'Mindaugas Malinauskas', 'email': 'mindaugas.malinauskas@mongodb.com'}

Message: SERVER-55676 [SBE][replica_sets] Tests trigger fatal assertion: NotPrimaryOrSecondary: Oplog collection reads are not allowed while in the rollback or startup state
Branch: master
https://github.com/mongodb/mongo/commit/4c166a537d7442cc58eea6df09d1721fa75d7b9b

Comment by Anton Korshunov [ 14/Apr/21 ]

mindaugas.malinauskas The same assert has cropped up in initial_sync_fuzzer. I updated the description, but we should try to re-enable this suite as part of this fix.

Comment by Justin Seyster [ 12/Apr/21 ]

This exception also appears to the root cause for SERVER-55669. I'll include more details in that ticket.

Comment by David Storch [ 12/Apr/21 ]

mindaugas.malinauskas, I'm actually seeing this crop up in a number of places in the concurrency suites, so I'm now maintaining a list of occurrences in the description of this ticket. See above for the list of test name, task name, and Evergreen link triples.

Comment by David Storch [ 12/Apr/21 ]

This also manifested as a timeout in jstests/concurrency/fsm_workloads/update_check_index.js when running under the concurrency_sharded_multi_stmt_txn_kill_primary suite. Task link.

Comment by David Storch [ 12/Apr/21 ]

I observed this failure crop up the concurrency_sharded_multi_stmt_txn_kill_primary task as well. (We don't run this suite with SBE enabled yet, but we saw it in the experimental patch build run by anton.korshunov which enabled SBE across the board for all test suites.) Here is the link to the failing task. The test that caused the failure was jstests/concurrency/fsm_workloads/multi_statement_transaction_atomicity_isolation_repeated_reads.js. I haven't investigated how frequent this failure is in the concurrency suite, but it does appear to fail in a non-deterministic fashion.

Comment by Kyle Suarez [ 01/Apr/21 ]

Exact fassert spot is here: https://github.com/mongodb/mongo/blob/201317dfba066ec91f090e43a297719a25823c6b/src/mongo/db/repl/replication_coordinator_impl.cpp#L2920

Generated at Thu Feb 08 05:37:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.