Add logging and metrics to make it easier to diagnose participant hangs from resharding shards auth changes


    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Cluster Scalability
    • ClusterScalability 22Jun-6Jul

      Background

      Participant hangs can arise from several hard-to-observe conditions:

      • Interference from stale coordinator commands
      • A latent bug in which a participant state transition is not durable when the coordinator believes it is, resulting in promise emplacements being lost across failovers
      • Participant being unavailable or repeatedly failing commands in a way that makes resharding appear stuck rather than failed

      Without observability that exposes these conditions, it is difficult to determine the cause of a hang quickly.

      Mitigation

      Logs

       

      Log | Why it helps
      Entry and exit logs for coordinator commands (with resharding UUID, lsid, txnNum, and participant phase) | Reconstructs the command flow to determine the cause of a hang
      Coordinator command retry attempts, with the reason for each retry | Makes retries visible and helps identify transient failures
      A log each time a participant starts waiting for a promise emplaced by the coordinator, and again when that promise is resolved | Quickly identifies which unresolved promise is blocking the participant
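
      The three kinds of log above could be shaped roughly as follows. This is only an illustrative Python sketch, not the server implementation: the logger name, logged_command, log_retry, and TrackedPromise are hypothetical names introduced here, and the message text is an assumption. Only the logged fields (resharding UUID, lsid, txnNum, participant phase, retry reason) come from this ticket.

```python
import logging
import threading
import time
from contextlib import contextmanager

# Hypothetical logger name, used only for this sketch.
logger = logging.getLogger("resharding.participant")


@contextmanager
def logged_command(command, resharding_uuid, lsid, txn_num, phase):
    """Emit matching entry and exit logs that carry the same identifying fields."""
    ctx = (
        f"command={command} reshardingUUID={resharding_uuid} "
        f"lsid={lsid} txnNum={txn_num} participantPhase={phase}"
    )
    logger.info("coordinator command entry %s", ctx)
    start = time.monotonic()
    outcome = "ok"
    try:
        yield
    except Exception as exc:
        outcome = f"error:{exc!r}"
        raise
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000
        logger.info(
            "coordinator command exit %s outcome=%s elapsedMs=%.1f",
            ctx, outcome, elapsed_ms,
        )


def log_retry(command, attempt, reason):
    """Make each retry visible together with the reason that triggered it."""
    logger.warning(
        "coordinator command retry command=%s attempt=%d reason=%s",
        command, attempt, reason,
    )


class TrackedPromise:
    """A promise that logs when a waiter starts blocking and when it is resolved."""

    def __init__(self, name):
        self.name = name
        self._event = threading.Event()

    def emplace(self):
        logger.info("promise resolved name=%s", self.name)
        self._event.set()

    def wait(self, timeout=None):
        logger.info("participant waiting on promise name=%s", self.name)
        resolved = self._event.wait(timeout)
        if not resolved:
            logger.warning("promise still unresolved name=%s", self.name)
        return resolved
```

      With logs of this shape, a hang investigation can pair each entry with its exit by resharding UUID and txnNum, and any "waiting on promise" line without a later "promise resolved" line names the blocking promise.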

      FTDC Metrics

       

      Metric | Why it helps
      Counter for coordinator retries, broken down by command type | Makes retries visible and shows which repeated transient failure is causing a hang
      Time spent waiting on promises to be emplaced by coordinator commands, broken down by command type | Shows in FTDC where participants spend time blocked waiting for a signal from the coordinator
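
      As a rough illustration of the two metrics above, the following Python sketch keeps a per-command retry counter and a per-command accumulated promise wait time. The class ParticipantMetrics, its method names, and the snapshot field names are hypothetical and do not reflect the actual FTDC or server status interfaces. Only the two metrics and their breakdown by command type come from this ticket.

```python
import time
from collections import Counter, defaultdict
from contextlib import contextmanager


class ParticipantMetrics:
    """Hypothetical in-memory metrics, broken down by coordinator command type."""

    def __init__(self):
        self.retries_by_command = Counter()
        self.promise_wait_ms_by_command = defaultdict(float)

    def record_retry(self, command):
        # One increment per retry attempt of the given command type.
        self.retries_by_command[command] += 1

    @contextmanager
    def time_promise_wait(self, command):
        # Accumulate wall-clock time spent blocked, even if the wait raises.
        start = time.monotonic()
        try:
            yield
        finally:
            elapsed_ms = (time.monotonic() - start) * 1000
            self.promise_wait_ms_by_command[command] += elapsed_ms

    def snapshot(self):
        # Shape assumed for illustration; a periodic collector could sample it.
        return {
            "coordinatorRetries": dict(self.retries_by_command),
            "promiseWaitMillis": dict(self.promise_wait_ms_by_command),
        }
```

      Sampling a snapshot like this periodically would show a retry counter that keeps climbing for one command type, or a wait time that keeps growing, both of which point at the command a stuck participant is blocked on.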

       

            Assignee:
            Wenqin Ye
            Reporter:
            Wenqin Ye
