ValidateCollections hook can fail with NotPrimaryOrSecondary when a node is mid-rollback after stepdown


    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Storage Execution
    • ALL

      In stepdown suites with catchUpTimeoutMillis: 0, a node may enter ROLLBACK state after a ContinuousStepdown cycle. _await_primaries() (stepdown.py:325) only confirms that a primary exists before handing control to subsequent hooks; it does not wait for all nodes to exit ROLLBACK. validate_node() (validate.py:187) then connects to each node with directConnection=true and calls list_database_names(), which fails with NotPrimaryOrSecondary (13436) on a mid-rollback node. The bare except: treats this error as a validation failure.
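
One possible mitigation, sketched under assumptions: before validating a node, poll it until it reports PRIMARY or SECONDARY. This is not the existing resmoke code. The function name wait_until_readable and the get_status callable are hypothetical; get_status is assumed to return that node's replSetGetStatus document, whose myState field holds the member state code (1 = PRIMARY, 2 = SECONDARY, 9 = ROLLBACK).

```python
import time

# Replica set member state codes as reported in replSetGetStatus "myState".
PRIMARY, SECONDARY, ROLLBACK = 1, 2, 9
READABLE_STATES = {PRIMARY, SECONDARY}


def wait_until_readable(get_status, timeout_secs=60.0, interval_secs=0.5,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll one node until it reports PRIMARY or SECONDARY.

    get_status is a hypothetical zero-argument callable that returns the
    node's replSetGetStatus document as a dict. sleep and clock are
    injectable so the loop can be exercised without a real cluster.
    Returns the final state code, or raises TimeoutError.
    """
    deadline = clock() + timeout_secs
    while True:
        state = get_status().get("myState")
        if state in READABLE_STATES:
            return state
        if clock() >= deadline:
            raise TimeoutError(
                f"node still in state {state} after {timeout_secs}s")
        sleep(interval_secs)
```

With pymongo, get_status could be something like lambda: client.admin.command("replSetGetStatus") on the direct connection to the node. Whether this wait belongs in _await_primaries() or in validate_node() is left open here.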

      Because rollback is uncommon in steady state, this failure should otherwise be rare, but I think this is the mechanism behind the linked BF (which is closed because it 

    • DevProd Test Infra 2026-05-19
    • 200

          Assignee:
          Unassigned
          Reporter:
          Malik Endsley
          Votes:
          0
          Watchers:
          3

            Created:
            Updated: