Critical Outage in MongoDB 7.0.21 Sharded Cluster - Time Monotonicity Violation (Error Code 6493100)

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None


      MongoDB Version: 7.0.21
      Deployment Type: Sharded Cluster (Config RS + Shards + Mongos)

      Description

      We experienced a complete outage in our MongoDB 7.0.21 sharded cluster environment. All cluster components went down simultaneously, including:

      Shard replica sets

      Config server replica set

      Mongos routers

      This resulted in full application downtime.

      Error Observed

      While reviewing the logs, we identified a Tripwire assertion related to a Time Monotonicity Violation, originating from the ReadThroughCache / ShardRegistry metadata refresh layer.

      Error 1:

      {"t":{"$date":"2026-02-13T11:25:44.936+05:30"},"s":"E","c":"ASSERT","id":4457000,"ctx":"ShardRegistry-18921","msg":"Tripwire assertion","attr":{"error":{"code":6493100,"codeName":"Location6493100","errmsg":"Time monotonicity violation: lookup time { topologyTime: Timestamp(1750759199, 2), rsmIncrement: 40, forceReloadIncrement: 44506 } which is less than the earliest expected timeInStore { topologyTime: Timestamp(1750882566, 2), rsmIncrement: 40, forceReloadIncrement: 44506 }."},"location":"{fileName:\"src/mongo/util/read_through_cache.h\", line:549, functionName:\"operator()\"}"}}
      Error 2:

      {"t":{"$date":"2026-02-13T16:42:35.167+05:30"},"s":"F","c":"CONTROL","id":6384300,"ctx":"ShardRegistry-0","msg":"Writing fatal message","attr":{"message":"DBException::toString(): Location6493100: Time monotonicity violation: lookup time { topologyTime: Timestamp(1750759199, 2), rsmIncrement: 6, forceReloadIncrement: 5 } which is less than the earliest expected timeInStore { topologyTime: Timestamp(1750882566, 2), rsmIncrement: 6, forceReloadIncrement: 5 }.\nActual exception type: mongo::error_details::throwExceptionForStatus(mongo::Status const&)::NonspecificAssertionException\n\n"}}

      Impact:

      Full cluster outage

      All mongos routers unavailable

      Shard nodes became non-operational

      Application downtime observed

      Initial Findings

      From our analysis:

      The error is tied to topologyTime, which represents shard topology metadata stored in the config.shards collection.

      The system detected a regression where the lookup metadata time was older than the cached/expected metadata time.

      This triggered MongoDB's internal Tripwire safety assertion, resulting in process termination.
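
      To make the failing invariant concrete, the following is a minimal sketch in plain Node.js of the check the assertion appears to enforce, based solely on the fields visible in the log output. The function names (compareCacheTime, checkMonotonicity) are illustrative; the actual implementation is C++ code in src/mongo/util/read_through_cache.h.

      ```javascript
      // Illustrative sketch of the cache time-monotonicity invariant.
      // A cache "time" is the tuple (topologyTime, rsmIncrement,
      // forceReloadIncrement), compared lexicographically; topologyTime
      // itself is a BSON Timestamp ordered by (t, i).

      function compareTopologyTime(a, b) {
        // BSON Timestamps order by seconds first, then increment.
        return a.t - b.t || a.i - b.i;
      }

      function compareCacheTime(a, b) {
        return (
          compareTopologyTime(a.topologyTime, b.topologyTime) ||
          a.rsmIncrement - b.rsmIncrement ||
          a.forceReloadIncrement - b.forceReloadIncrement
        );
      }

      // A fresh lookup must never return a time older than the earliest
      // time the cache already expects; if it does, the server fasserts.
      function checkMonotonicity(lookupTime, earliestTimeInStore) {
        if (compareCacheTime(lookupTime, earliestTimeInStore) < 0) {
          throw new Error(
            "Location6493100: Time monotonicity violation: lookup time " +
            JSON.stringify(lookupTime) +
            " is less than the earliest expected timeInStore " +
            JSON.stringify(earliestTimeInStore)
          );
        }
      }

      // Values taken from Error 1 in the logs above:
      const lookup = {
        topologyTime: { t: 1750759199, i: 2 },
        rsmIncrement: 40,
        forceReloadIncrement: 44506,
      };
      const expected = {
        topologyTime: { t: 1750882566, i: 2 },
        rsmIncrement: 40,
        forceReloadIncrement: 44506,
      };
      ```

      Feeding the logged values into checkMonotonicity(lookup, expected) throws, because the lookup's topologyTime (1750759199) is behind the expected one (1750882566) — exactly the condition that tripped the assertion.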

      Recovery Attempts & Current Status

      We restarted all cluster components:

      Shard servers

      Config servers

      Mongos routers

      However, even after restarting all servers, the cluster did not recover and the same issue persisted.

      Config Metadata Validation

      We verified shard topology metadata on the config primary.

      Observation

      Only one version of topologyTime is visible in the config.shards collection.

      No historical or conflicting versions are present.

      Output
      csReplSet [direct: primary] config> db.shards.find()
      [
        {
          _id: 'a',
          host: '...',
          state: 1,
          topologyTime: Timestamp({ t: 1750759145, i: 2 })
        },
        {
          _id: 'b',
          host: '...',
          state: 1,
          topologyTime: Timestamp({ t: 1750759177, i: 1 })
        },
        {
          _id: 'c',
          host: '...',
          state: 1,
          topologyTime: Timestamp({ t: 1750759199, i: 2 })
        }
      ]

      This indicates that only the current topology metadata is present, and we do not see any older topologyTime values stored in the collection.
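
      Notably, comparing the output above against the error message suggests the cached timeInStore was ahead of anything persisted in config.shards. The following diagnostic sketch (plain Node.js, using the documents copied from the output above rather than a live connection) makes that comparison explicit:

      ```javascript
      // Shard documents copied from the db.shards.find() output above.
      const shards = [
        { _id: 'a', topologyTime: { t: 1750759145, i: 2 } },
        { _id: 'b', topologyTime: { t: 1750759177, i: 1 } },
        { _id: 'c', topologyTime: { t: 1750759199, i: 2 } },
      ];

      // BSON Timestamps order by seconds (t) first, then increment (i).
      const byTime = (a, b) => a.t - b.t || a.i - b.i;

      // Largest topologyTime actually persisted in config.shards.
      const maxPersisted = shards
        .map((s) => s.topologyTime)
        .reduce((acc, ts) => (byTime(ts, acc) > 0 ? ts : acc));

      // timeInStore reported by the assertion in the logs.
      const expectedInCache = { t: 1750882566, i: 2 };

      console.log(maxPersisted);                            // { t: 1750759199, i: 2 }
      console.log(byTime(maxPersisted, expectedInCache) < 0); // true
      ```

      The newest persisted topologyTime (1750759199) is still older than the timeInStore the ShardRegistry cache expected (1750882566), which is consistent with the assertion: some in-memory cache state was ahead of everything on disk in config.shards.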

      Additional Observation

      Even after restarting all cluster components, the issue persisted. Additionally, validation of the config.shards collection shows only the current topologyTime values, with no evidence of older or conflicting versions stored in the metadata.

      Given this, it appears the system may have encountered an internal defect or an unexpected edge condition related to topologyTime handling or ShardRegistry cache refresh logic.

      Assistance Required

      We request assistance in identifying the root cause of this incident and restoring the cluster to a healthy state.

      Specifically, we would like to understand what internal conditions (such as replication behavior, elections, metadata refresh cycles, or cache synchronization mechanisms) could lead to this state.

      As this issue resulted in a complete production outage, we request your urgent investigation, guidance, and support to identify the root cause and prevent recurrence. 

            Assignee:
            Unassigned
            Reporter:
            Anand Amarnath C