Investigate promoting storex-owned debug log lines to warnings/errors


    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Storage Execution
    • Storage Execution 2026-05-11

      Background

      SERVER-123553 promoted a LOGV2_DEBUG log line in src/mongo/db/index/index_access_method.cpp to LOGV2_WARNING. The original log reported a "Suppressed key generation error" – an error being silently swallowed whose only signal lived at debug level, effectively invisible in production.

      As a follow-up, Claude audited all storex-owned code (timeseries, bucket_catalog, collection_write_path, bsoncolumn, external_sorter, index_builds, key_string, storage_engine_integration, ttl, tracking_allocators) for other LOGV2_DEBUG call sites fitting the same pattern: an abnormal or suppressed condition that operators need to see in production but is currently hidden behind debug verbosity.

      This ticket tracks the candidates found. Each can be evaluated / promoted independently; feel free to spin off sub-tickets per module if preferred.

      Strong candidates (clear SERVER-123553 pattern)

      1. bucket_compression.cpp:382 – swallowed compression exception

      File: src/mongo/db/timeseries/bucket_compression.cpp
      Current: LOGV2_DEBUG(5857800, 1, "Exception when compressing timeseries bucket, leaving it uncompressed", "error"_attr = exceptionToStatus())
      Suggested: LOGV2_WARNING
      Why: The outermost catch (...) inside compressBucket() absorbs any exception and returns an empty CompressionResult. Callers (e.g. compressUncompressedBucketOnReopen in timeseries_write_ops_internal.cpp) then uassert with a generic BucketCompressionFailure message that discards the original exception's type and text. Only the debug log carries the real exceptionToStatus().
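The swallowed-exception pattern above can be sketched in a self-contained form. This is an illustrative reimplementation, not the real compressBucket(); CompressionResult, compressOrFallBack, and g_lastLoggedError are hypothetical names. It shows why the log call inside the catch handler is the only place the original error survives:

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical sketch: an outer catch (...) converts any exception into an
// "empty" result, so the only record of the real failure is the log call
// inside the handler. If that log is debug-level, production operators never
// see the original error text.
struct CompressionResult {
    std::optional<std::string> compressed;  // empty => compression was skipped
};

inline std::string g_lastLoggedError;  // stand-in for the log sink

CompressionResult compressOrFallBack(const std::string& bucket) {
    try {
        if (bucket.empty())
            throw std::runtime_error("cannot compress empty bucket");
        return {"<compressed:" + bucket + ">"};
    } catch (const std::exception& ex) {
        // In the real code this is LOGV2_DEBUG(5857800, ...). Promoting it to
        // warning is the only way the exception's text becomes visible,
        // because the caller only ever sees the empty CompressionResult.
        g_lastLoggedError = ex.what();
        return {};  // swallowed: caller cannot recover the failure reason
    }
}
```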

      2. ttl_monitor.cpp:221 – silent skip of registered TTL index

      File: src/mongo/db/ttl/ttl_monitor.cpp
      Current: LOGV2_DEBUG(22535, 1, "index not found; skipping ttl job", "index"_attr = spec)
      Suggested: LOGV2_WARNING (or LOGV2_ERROR to match siblings)
      Why: In getValidTTLIndex, the two sibling "skipping TTL job" branches log at LOGV2_ERROR (22541, 6909100) and increment ttlInvalidTTLIndexSkips. The catalog-lookup-miss branch alone logs at debug with no metric bump, so a registered TTL index can silently stop running if its catalog entry disappears between isIndexReady and findIndexByName.
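The fix for this candidate is about sibling-branch consistency: every "skipping TTL job" branch should both log visibly and bump the skip counter. A minimal sketch, assuming hypothetical names (ttlInvalidTTLIndexSkips is real per the ticket; SkipReason and recordTTLIndexSkip are illustrative):

```cpp
#include <atomic>
#include <string>

// Sketch of the side effect the catalog-lookup-miss branch is missing today:
// it should increment the same skip counter its LOGV2_ERROR siblings do.
std::atomic<long long> ttlInvalidTTLIndexSkips{0};

enum class SkipReason { NotTTL, BadExpireAfterSeconds, IndexNotFoundInCatalog };

void recordTTLIndexSkip(SkipReason reason, const std::string& indexName) {
    // In the real code this would pair with a LOGV2_WARNING/LOGV2_ERROR call;
    // here we only model the metric bump so all skip branches are symmetric.
    (void)reason;
    (void)indexName;
    ttlInvalidTTLIndexSkips.fetch_add(1);
}
```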

      3. storage_stats.cpp:324 – LockTimeout/MaxTimeMSExpired silently swallowed

      File: src/mongo/db/stats/storage_stats.cpp
      Current: LOGV2_DEBUG(3088801, 2, "Failed to retrieve storage statistics", logAttrs(nss), "error"_attr = ex), followed by return Status::OK()
      Suggested: Split the two catch handlers. Leave MaxTimeMSExpired at debug (user-configurable, expected). Raise the LockTimeout case with waitForLock=true to LOGV2_WARNING.
      Why: The failed() lambda returns Status::OK() on LockTimeout/MaxTimeMSExpired, so callers see "success" with a result BSON missing every storage field. Current level is debug-2, even quieter than typical level-1 suppressed cases.
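The suggested split can be expressed as a severity-selection rule. This is a hedged sketch with illustrative names (ErrorCode, Severity, severityFor are not the real server types), just to pin down the proposed behavior:

```cpp
// Hypothetical sketch of the proposed handler split: route the two swallowed
// error codes to different log severities instead of one debug-2 handler.
enum class ErrorCode { LockTimeout, MaxTimeMSExpired };
enum class Severity { Debug2, Warning };

Severity severityFor(ErrorCode code, bool waitForLock) {
    // MaxTimeMSExpired is user-configured and expected: keep it quiet.
    if (code == ErrorCode::MaxTimeMSExpired)
        return Severity::Debug2;
    // A lock timeout despite waitForLock=true means we waited the full
    // deadline and still returned a stats-less "success": surface it.
    if (code == ErrorCode::LockTimeout && waitForLock)
        return Severity::Warning;
    return Severity::Debug2;
}
```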

      Secondary candidates (worth promoting)

      4. checkpointer.cpp:144 – slow checkpoint (>=30s)

      File: src/mongo/db/storage/checkpointer.cpp
      Current: LOGV2_DEBUG(22308, 1, "Checkpoint was slow to complete", "secondsElapsed"_attr = secondsElapsed)
      Suggested: LOGV2_WARNING
      Why: Gate already fires only on a 30-second+ checkpoint – a genuine performance anomaly (cache pressure, IO stall, disk contention) that operators shouldn't have to raise verbosity to see.

      5. multi_index_block.cpp:1297 – abort due to mixed-schema data

      File: src/mongo/db/index_builds/multi_index_block.cpp
      Current: LOGV2_DEBUG(6057701, 1, "Aborting index build commit due to the earlier detection of mixed-schema data", ...)
      Suggested: LOGV2_WARNING
      Why: The earlier detection log at line 989 is plain LOGV2(6057700, "Detected mixed-schema data ...") (info). Leaving the terminal "we are failing because of it" log at debug-1 inverts the severities: operators see the detection but not the abort. Returns CannotCreateIndex.

      6. oplog_truncation.cpp:86 and :91 – silent non-truncation

      File: src/mongo/db/storage/oplog_truncation.cpp
      Current:

      LOGV2_DEBUG(5140900, 0, "Will not truncate entire oplog");
      LOGV2_DEBUG(5140901, 0, "Cannot truncate as there are no oplog entries after the truncate marker but before the truncate-up-to point", ...);
      

      Suggested: LOGV2_WARNING (or at minimum LOGV2 / info)
      Why: Both conditions cause reclaimOplog to silently return false, meaning an oplog-truncation attempt did nothing. The sibling error path already logs at LOGV2_WARNING(8841101). Silent non-truncation is operationally significant (oplog can grow unbounded); 5140901 in particular indicates mayTruncateUpTo is pinning too early – an "oplog stall / pinned" condition.

      Borderline / worth discussion

      7. storage_engine_impl.cpp:769 – drop reported success but ident still exists

      File: src/mongo/db/storage/storage_engine_impl.cpp
      Current: LOGV2_DEBUG(11440001, 1, "Internal ident already exists on disk after drop attempt; reusing existing ident", "ident"_attr = ident)
      Suggested: LOGV2_WARNING, possibly with rate-limiting.
      Why: Reached after a dropIdent() reported success but the subsequent createInternal() still throws ObjectAlreadyExists. Surrounding comment explicitly references SERVER-114575 ("a layered drop can report success while not dropping the table"). If this fires more than transiently it signals real inconsistency. Medium confidence because it may be noisy under known layered-drop conditions.

      8. index_builds_coordinator_mongod.cpp:198 – vote-command retry loop is silent

      File: src/mongo/db/index_builds/index_builds_coordinator_mongod.cpp
      Current: LOGV2_DEBUG(4666400, 1, "Failed to run index build vote command.", ...) inside while (needToVote()) with infinite exponential-backoff retry.
      Suggested: Either add a logSeveritySuppressor (see existing usage at index_builds_coordinator.cpp:2678) so sustained failures become visible while transient ones stay quiet, or promote directly to LOGV2_WARNING.
      Why: A persistently unreachable primary leaves an index-build vote silently stuck – operators see nothing about the stall.
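The logSeveritySuppressor option can be sketched as follows. This is an illustrative reimplementation of the idea (the real helper lives in the server codebase; SeveritySuppressor and its interface here are hypothetical): transient retry failures stay at debug, but once failures persist past a threshold the same call site logs at warning.

```cpp
#include <cstdint>

// Hedged sketch: escalate log severity only after sustained failures, so a
// persistently unreachable primary becomes visible while a one-off network
// blip stays quiet.
class SeveritySuppressor {
public:
    explicit SeveritySuppressor(uint32_t quietAttempts) : _quiet(quietAttempts) {}

    // Returns true when the current failure should be logged at warning
    // rather than debug.
    bool shouldWarn() { return ++_consecutiveFailures > _quiet; }

    // Call on a successful vote so the next failure starts quiet again.
    void reset() { _consecutiveFailures = 0; }

private:
    uint32_t _quiet;
    uint32_t _consecutiveFailures = 0;
};
```

The retry loop would consult shouldWarn() on each failed vote attempt and reset() after a successful one, keeping the infinite-backoff loop itself unchanged.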

      Scope covered

      • src/mongo/db/timeseries/ (incl. bucket_catalog)
      • src/mongo/db/index_builds/ + index/index_access_method*, index_key_validate*, index_repair*, index_spec_validate*, commit-quorum/vote commands
      • src/mongo/db/storage/ (excluding wiredtiger/, devnull/, key_string/)
      • Storage-integration ancillary: action_duration_metrics*, record_id*, server_recovery*, mongod_options_storage*, dbhelpers*, import*, commands/fsync*, commands/dbcommands*, commands/resize_oplog*, op_observer/batched_write*, stats/storage_stats*
      • src/mongo/db/ttl/, src/mongo/db/sorter/, src/mongo/bson/column/, src/mongo/util/tracking/, src/mongo/db/storage/key_string/, src/mongo/db/collection_crud/, src/mongo/db/replicated_fast_count/

      Over 130 LOGV2_DEBUG/LOGV2_INFO sites were reviewed; most were either normal progress/tracing or already had a WARNING/ERROR sibling on the genuine error branch.

      Notes for implementation

      • Log IDs should be kept (same ID, new severity) to preserve log-id continuity, per the same approach as SERVER-123553.
      • A few of these (#6, #7) may benefit from message rephrasing as SERVER-123553 did (adding context like "on secondary" / "during oplog application") – worth considering per site.

            Assignee:
            Stephanie Eristoff
            Reporter:
            Stephanie Eristoff