[SERVER-49688] TTLMonitor Fatal assertion 29001 Bad Value - Invalid Argument in wt record store Created: 17/Jul/20  Updated: 22/Jun/22  Resolved: 20/Oct/20

Status: Closed
Project: Core Server
Component/s: Stability, TTL
Affects Version/s: 4.2.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Chad Kreimendahl Assignee: Eric Milkie
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian Stretch VM
16G memory.
2 full member + 1 arbiter RS
Snappy compression
6G Cache size
6.2G total collective database sizes
1G Oplog
internalQueryExecMaxBlockingSortBytes: 1073741824


Attachments: Text File 427-crashlog.txt    
Issue Links:
Related
Operating System: ALL
Steps To Reproduce:

Have TTL Indexes? Wait until they run?

Sprint: Execution Team 2020-08-24, Execution Team 2020-09-21, Execution Team 2020-10-05, Execution Team 2020-11-02
Participants:

 Description   

Hard crash on 4.2.7 during a TTL run.  Of note. These TTL indexes ran fine for several weeks, without any issue. There are a bunch of them we have. Unsure which specific one caused this.

 

2020-07-06T12:00:28.885-0500 E STORAGE [TTLMonitor] WiredTiger error (22) [1594054828:882101][54946:0x7fc3a47cc700], WT_SESSION.timestamp_transaction: __wt_txn_set_commit_timestamp, 683: commit timestamp (1594054828, 1937) is less than the oldest timestamp (1594054828, 1941): Invalid argument Raw: [1594054828:882101][54946:0x7fc3a47cc700], WT_SESSION.timestamp_transaction: __wt_txn_set_commit_timestamp, 683: commit timestamp (1594054828, 1937) is less than the oldest timestamp (1594054828, 1941): Invalid argument 2020-07-06T12:00:28.885-0500 F - [TTLMonitor] Fatal assertion 39001 BadValue: timestamp_transaction 22: Invalid argument at src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp 1323



 Comments   
Comment by Eric Milkie [ 20/Oct/20 ]

Unfortunately, after some attempts we were unable to reproduce this issue, and careful code inspection did not otherwise reveal the way in which this error was produced. If it happens again, please let us know so that we can reopen this ticket and continue the investigation with further failure data.

Comment by Chad Kreimendahl [ 11/Aug/20 ]

It is disabled. This is the only environment in which that is true, as it's the only environment in which we use a PSA architecture (it's an integration testing system, where we create copies of high-load stuff and validate mongodb updates, where errors like this one would prevent our upgrade process (from 3.6 to 4.2 hopefully soon). We would not use this config in production, as there's no use for arbiters. Other than for failover, there's also nearly no use for the secondaries. They're a security blanket for us more than a destination for queries. less that 0.001% of our queries can be performed on data that's not up-to-date.

 

(excluding some authorization items from config for obvious reasons)

 

systemLog:
 destination: file
 logAppend: true
 verbosity: 0
 path: logpath
processManagement:
 fork: false
 pidFilePath: pidpath
net:
 bindIp: x.x.x.x
 port: 27017
 maxIncomingConnections: 51200
 wireObjectCheck: true
 ipv6: false
 unixDomainSocket:
  enabled: false
 tls:
  mode: preferTLS
  disabledProtocols: TLS1_0,TLS1_1
  certificateKeyFile: pemfile
 compression:
  compressors: "snappy"
security:
 authorization: disabled
storage:
 dbPath: pathofdatabases
 directoryPerDB: true
 engine: wiredTiger
 wiredTiger:
  engineConfig:
   cacheSizeGB: 6
   journalCompressor: snappy
   directoryForIndexes: true
  collectionConfig:
   blockCompressor: snappy
  indexConfig:
   prefixCompression: true
 journal:
  enabled: true
  commitIntervalMs: 50
setParameter:
 internalQueryExecMaxBlockingSortBytes: 1073741824
operationProfiling:
 slowOpThresholdMs: 100
 mode: slowOp
replication:
 oplogSizeMB: 1024
 replSetName: SomeClusterName
 enableMajorityReadConcern: false

Comment by Daniel Gottlieb (Inactive) [ 11/Aug/20 ]

Hi chad@onspring.com, can you share the configuration that instance was running with? With luck, the configuration information should be available when the node restarts and is repeated at the top of a file when performing a MongoDB log rotation, e.g:

2020-08-11T16:51:54.104-0400 I  CONTROL  [initandlisten] options: { net: { port: 20011 }, replication: { enableMajorityReadConcern: false }, storage: { directoryPerDB: true } }

I'm mostly curious about whether majority read concern is disabled.

Comment by Chad Kreimendahl [ 11/Aug/20 ]

It has not. I always prefer when things are repeatable or have some pattern. This one does not have one easily identifiable.

The previous 6 hours of logs contains just permutations of this message:

2020-07-06T11:58:27.692-0500 I NETWORK [conn73116] received client metadata from 10.x.x.x:51666 conn73116: { driver: { name: "NetworkInterfaceTL", version: "4.2.7" }, os: { type: "Linux", name: "PRETTY_NAME="someclient"", architecture: "x86_64", version: "somekernel" } }

And the lines in question:

2020-07-06T11:58:41.351-0500 I  STORAGE  [WT-OplogTruncaterThread-local.oplog.rs] WiredTiger record store oplog truncation finished in: 2ms
2020-07-06T11:58:54.383-0500 I  NETWORK  [conn73117] received client metadata from ...
2020-07-06T11:58:56.389-0500 I  NETWORK  [conn73118] received client metadata from ...
2020-07-06T11:59:20.308-0500 I  STORAGE  [WT-OplogTruncaterThread-local.oplog.rs] WiredTiger record store oplog truncation finished in: 1ms
2020-07-06T11:59:53.348-0500 I  STORAGE  [WT-OplogTruncaterThread-local.oplog.rs] WiredTiger record store oplog truncation finished in: 6ms
2020-07-06T12:00:28.885-0500 E  STORAGE  [TTLMonitor] WiredTiger error (22) [1594054828:882101][54946:0x7fc3a47cc700], WT_SESSION.timestamp_transaction: __wt_txn_set_commit_timestamp, 683: commit timestamp (1594054828, 1937) is less than the oldest timestamp (1594054828, 1941): Invalid argument Raw: [1594054828:882101][54946:0x7fc3a47cc700], WT_SESSION.timestamp_transaction: __wt_txn_set_commit_timestamp, 683: commit timestamp (1594054828, 1937) is less than the oldest timestamp (1594054828, 1941): Invalid argument
2020-07-06T12:00:28.885-0500 F  -        [TTLMonitor] Fatal assertion 39001 BadValue: timestamp_transaction 22: Invalid argument at src/mongo/db/storage/wiredtiger/wiredtiger_record_store.cpp 1323

Comment by Eric Milkie [ 11/Aug/20 ]

And it hasn't happened again since July 17?

Comment by Chad Kreimendahl [ 11/Aug/20 ]

I'll dig for it, but it may be gone at this point. I went through the 200 lines of logs just prior, and all of them were normal things you'd expect to see happening all the time in our db. All things that had happened hundreds to millions of times that day, already.

Comment by Eric Milkie [ 10/Aug/20 ]

Hi Chad, Do you have the full log from the database server that crashed? I'm interested in what the server was doing prior to hitting the problem.

Comment by Gregory Wlodarek [ 03/Aug/20 ]

We'll investigate this in the next sprint.

Generated at Thu Feb 08 05:20:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.