  Core Server / SERVER-92557

Add better diagnostics to identify cases of lost condition variable signal in oplog applier thread pool

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Replication

      In SERVER-56054 we changed the waits performed by oplog applier thread pool threads so that they wake up after hitting the maxIdleThreadAge timeout, if not sooner. This helps mitigate a glibc bug that can cause a lost condition variable signal.

      Currently, if a user encounters such an issue, it is difficult to diagnose from FTDC and logs alone. Additionally, we don't have a definitive list of all such bugs or the exact glibc versions they affect on different Linux distributions, so it is hard to say for certain whether this problem is what a user faced.

      We should look into diagnostics we could add (a serverStatus metric, log messages) that would more definitively identify cases where there was work to do, yet oplog applier threads were only woken up by hitting maxIdleThreadAge.
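
One possible shape for such a diagnostic, as a minimal sketch only: a worker does a timed wait without a predicate, and when the wait ends by timeout while the queue is non-empty, it bumps a dedicated counter. Every name here (WakeupStats, DiagnosticTaskQueue, runDemo, the counter names) is hypothetical and not taken from the server's ThreadPool; the notify flag exists purely to simulate a lost signal. The counter can have false positives when a notify races with the timeout, so it would be a hint rather than proof.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>

// All names are hypothetical; this is not MongoDB's ThreadPool.
struct WakeupStats {
    std::uint64_t notified = 0;                // wait ended before the timeout
    std::uint64_t idleTimeout = 0;             // timed out, queue still empty
    std::uint64_t timeoutWithPendingWork = 0;  // timed out, yet work was queued
};

class DiagnosticTaskQueue {
public:
    explicit DiagnosticTaskQueue(std::chrono::milliseconds maxIdleThreadAge)
        : _maxIdleThreadAge(maxIdleThreadAge) {}

    // notify=false simulates a condition variable signal that was lost.
    void schedule(std::function<void()> task, bool notify = true) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _tasks.push_back(std::move(task));
        }
        if (notify) _cv.notify_one();
    }

    // Runs at most one task, waiting up to maxIdleThreadAge for one to arrive.
    // Returns true if a task ran.
    bool runOne() {
        std::unique_lock<std::mutex> lk(_mutex);
        if (_tasks.empty()) {
            ++_waiters;
            // No predicate overload: we need to know why the wait ended.
            const std::cv_status status = _cv.wait_for(lk, _maxIdleThreadAge);
            --_waiters;
            assert(lk.owns_lock());
            if (status == std::cv_status::no_timeout) {
                ++_stats.notified;  // also counts spurious wakeups
            } else if (_tasks.empty()) {
                ++_stats.idleTimeout;
            } else {
                ++_stats.timeoutWithPendingWork;  // the suspicious case
            }
            if (_tasks.empty()) return false;
        }
        std::function<void()> task = std::move(_tasks.front());
        _tasks.pop_front();
        lk.unlock();
        task();
        return true;
    }

    int waiters() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _waiters;
    }

    WakeupStats stats() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return _stats;
    }

private:
    mutable std::mutex _mutex;
    std::condition_variable _cv;
    std::deque<std::function<void()>> _tasks;
    const std::chrono::milliseconds _maxIdleThreadAge;
    int _waiters = 0;
    WakeupStats _stats;
};

// Demo: a worker starts waiting, then one task is queued with or without a
// notify. Without it, the worker only finds the task when the timeout expires.
WakeupStats runDemo(bool notify) {
    DiagnosticTaskQueue queue(std::chrono::milliseconds(300));
    bool ran = false;
    std::thread worker([&] { while (!queue.runOne()) {} });
    while (queue.waiters() == 0) std::this_thread::yield();
    queue.schedule([&] { ran = true; }, notify);
    worker.join();
    assert(ran);
    return queue.stats();
}
```

Calling runDemo(false) should leave timeoutWithPendingWork at 1, while runDemo(true) should leave it at 0 with notified at 1. In a real pool, a counter like this could be surfaced through serverStatus or a log line, so that a steadily growing value hints at lost signals.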

            Assignee: Unassigned
            Reporter: Kaitlin Mahar (kaitlin.mahar@mongodb.com)
            Votes: 0
            Watchers: 10
