Critical Outage in MongoDB 7.0.21 Sharded Cluster - Time Monotonicity Violation (Error Code 6493100)

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None


      MongoDB Version: 7.0.21
      Deployment Type: Sharded Cluster (Config RS + Shards + Mongos)

      Description

      We experienced a complete outage in our MongoDB 7.0.21 sharded cluster environment. All cluster components went down simultaneously, including:

      Shard replica sets

      Config server replica set

      Mongos routers

      This resulted in full application downtime.

      Error Observed

      While reviewing the logs, we identified a Tripwire assertion related to a Time Monotonicity Violation, originating from the ReadThroughCache / ShardRegistry metadata refresh layer.

      Error 1:

      {"t":{"$date":"2026-02-13T11:25:44.936+05:30"},"s":"E","c":"ASSERT","id":4457000,"ctx":"ShardRegistry-18921","msg":"Tripwire assertion","attr":{"error":{"code":6493100,"codeName":"Location6493100","errmsg":"Time monotonicity violation: lookup time { topologyTime: Timestamp(1750759199, 2), rsmIncrement: 40, forceReloadIncrement: 44506 } which is less than the earliest expected timeInStore { topologyTime: Timestamp(1750882566, 2), rsmIncrement: 40, forceReloadIncrement: 44506 }."},"location":"{fileName:\"src/mongo/util/read_through_cache.h\", line:549, functionName:\"operator()\"}"}}
      Error 2:

      {"t":{"$date":"2026-02-13T16:42:35.167+05:30"},"s":"F","c":"CONTROL","id":6384300,"ctx":"ShardRegistry-0","msg":"Writing fatal message","attr":{"message":"DBException::toString(): Location6493100: Time monotonicity violation: lookup time { topologyTime: Timestamp(1750759199, 2), rsmIncrement: 6, forceReloadIncrement: 5 } which is less than the earliest expected timeInStore { topologyTime: Timestamp(1750882566, 2), rsmIncrement: 6, forceReloadIncrement: 5 }.\nActual exception type: mongo::error_details::throwExceptionForStatus(mongo::Status const&)::NonspecificAssertionException\n\n"}}

      Impact:

      Full cluster outage

      All mongos routers unavailable

      Shard nodes became non-operational

      Application downtime observed

      Initial Findings

      From our analysis:

      The error is tied to topologyTime, which represents shard topology metadata stored in the config.shards collection.

      The system detected a regression where the lookup metadata time was older than the cached/expected metadata time.

      This triggered MongoDB's internal Tripwire safety assertion, resulting in process termination.
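
      To make the failing invariant concrete, the following is a minimal sketch in plain Node.js of the check the assertion appears to enforce, based solely on the fields visible in the log output. The function names (compareCacheTime, checkMonotonicity) are illustrative; the actual implementation is C++ code in src/mongo/util/read_through_cache.h.

      ```javascript
      // Illustrative sketch of the cache time-monotonicity invariant.
      // A cache "time" is the tuple (topologyTime, rsmIncrement,
      // forceReloadIncrement), compared lexicographically; topologyTime
      // itself is a BSON Timestamp ordered by (t, i).

      function compareTopologyTime(a, b) {
        // BSON Timestamps order by seconds first, then increment.
        return a.t - b.t || a.i - b.i;
      }

      function compareCacheTime(a, b) {
        return (
          compareTopologyTime(a.topologyTime, b.topologyTime) ||
          a.rsmIncrement - b.rsmIncrement ||
          a.forceReloadIncrement - b.forceReloadIncrement
        );
      }

      // A fresh lookup must never return a time older than the earliest
      // time the cache already expects; if it does, the server fasserts.
      function checkMonotonicity(lookupTime, earliestTimeInStore) {
        if (compareCacheTime(lookupTime, earliestTimeInStore) < 0) {
          throw new Error(
            "Location6493100: Time monotonicity violation: lookup time " +
            JSON.stringify(lookupTime) +
            " is less than the earliest expected timeInStore " +
            JSON.stringify(earliestTimeInStore)
          );
        }
      }

      // Values taken from Error 1 in the logs above:
      const lookup = {
        topologyTime: { t: 1750759199, i: 2 },
        rsmIncrement: 40,
        forceReloadIncrement: 44506,
      };
      const expected = {
        topologyTime: { t: 1750882566, i: 2 },
        rsmIncrement: 40,
        forceReloadIncrement: 44506,
      };
      ```

      Feeding the logged values into checkMonotonicity(lookup, expected) throws, because the lookup's topologyTime (1750759199) is behind the expected one (1750882566) — exactly the condition that tripped the assertion.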

      Recovery Attempts & Current Status

      We restarted all cluster components:

      Shard servers

      Config servers

      Mongos routers

      However, even after restarting all servers, the cluster did not recover and the same issue persisted.

      Config Metadata Validation

      We verified shard topology metadata on the config primary.

      Observation

      Only one version of topologyTime is visible in the config.shards collection.

      No historical or conflicting versions are present.

      Output
      csReplSet [direct: primary] config> db.shards.find()
      [
        {
          _id: 'a',
          host: '...',
          state: 1,
          topologyTime: Timestamp({ t: 1750759145, i: 2 })
        },
        {
          _id: 'b',
          host: '...',
          state: 1,
          topologyTime: Timestamp({ t: 1750759177, i: 1 })
        },
        {
          _id: 'c',
          host: '...',
          state: 1,
          topologyTime: Timestamp({ t: 1750759199, i: 2 })
        }
      ]

      This indicates that only the current topology metadata is present, and we do not see any older topologyTime values stored in the collection.
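
      Notably, comparing the output above against the error message suggests the cached timeInStore was ahead of anything persisted in config.shards. The following diagnostic sketch (plain Node.js, using the documents copied from the output above rather than a live connection) makes that comparison explicit:

      ```javascript
      // Shard documents copied from the db.shards.find() output above.
      const shards = [
        { _id: 'a', topologyTime: { t: 1750759145, i: 2 } },
        { _id: 'b', topologyTime: { t: 1750759177, i: 1 } },
        { _id: 'c', topologyTime: { t: 1750759199, i: 2 } },
      ];

      // BSON Timestamps order by seconds (t) first, then increment (i).
      const byTime = (a, b) => a.t - b.t || a.i - b.i;

      // Largest topologyTime actually persisted in config.shards.
      const maxPersisted = shards
        .map((s) => s.topologyTime)
        .reduce((acc, ts) => (byTime(ts, acc) > 0 ? ts : acc));

      // timeInStore reported by the assertion in the logs.
      const expectedInCache = { t: 1750882566, i: 2 };

      console.log(maxPersisted);                            // { t: 1750759199, i: 2 }
      console.log(byTime(maxPersisted, expectedInCache) < 0); // true
      ```

      The newest persisted topologyTime (1750759199) is still older than the timeInStore the ShardRegistry cache expected (1750882566), which is consistent with the assertion: some in-memory cache state was ahead of everything on disk in config.shards.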

      Additional Observation

      Even after restarting all cluster components, the issue persisted. Additionally, validation of the config.shards collection shows only the current topologyTime values, with no evidence of older or conflicting versions stored in the metadata.

      Given this, it appears the system may have encountered an internal defect or an unexpected edge condition related to topologyTime handling or ShardRegistry cache refresh logic.

      Assistance Required

      We request assistance in identifying the root cause of this incident and restoring the cluster to a healthy state.

      Specifically, we would like to understand what internal conditions (such as replication behavior, elections, metadata refresh cycles, or cache synchronization mechanisms) could lead to this state.

      As this issue resulted in a complete production outage, we request your urgent investigation, guidance, and support to identify the root cause and prevent recurrence. 

            Assignee:
            Unassigned
            Reporter:
            Anand Amarnath C