[SERVER-57174] Tightening timestamp order within a single transaction Created: 25/May/21  Updated: 17/Jun/21  Resolved: 17/Jun/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Chenhao Qu Assignee: Gregory Noma
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to WT-7569 Fix wrongly squash an out of order ti... Closed
Sprint: Execution Team 2021-06-14, Execution Team 2021-06-28
Participants:

 Description   

We hit a bug in WT-7569 where WiredTiger doesn't handle an out-of-order timestamp edge case correctly.

The edge case is that mongodb performs multiple updates on the same key within the same transaction, but with different timestamps, and some of those timestamps are out of order, e.g.,

U@20 -> U@15 -> U@20 -> U@10

The buggy WiredTiger code assumes that there are no out-of-order timestamp updates within a single transaction, but that assumption is apparently wrong. The WiredTiger team has a fix for this problem, but we also suspect there may be something wrong in the mongodb code.
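For illustration, a minimal sketch (hypothetical table name and helper macro, not taken from the server code) of how a chain with out-of-order timestamps on a single key can be produced through the public WiredTiger API. Note that every timestamp here stays at or above the first commit timestamp, so the existing check quoted below does not reject the sequence:

#include <wiredtiger.h>

/* Sketch only: propagate any WiredTiger error to the caller. */
#define CHECK(call)                 \
    do {                            \
        int __ret = (call);         \
        if (__ret != 0)             \
            return (__ret);         \
    } while (0)

/* Assumes `session` is an open WT_SESSION and "table:example" was created with
 * key_format=S,value_format=S. Timestamps in config strings are hexadecimal. */
static int
same_key_out_of_order(WT_SESSION *session)
{
    WT_CURSOR *cursor;

    CHECK(session->open_cursor(session, "table:example", NULL, NULL, &cursor));
    CHECK(session->begin_transaction(session, NULL));

    /* U@10: the first commit timestamp for this transaction (0xa == 10). */
    CHECK(session->timestamp_transaction(session, "commit_timestamp=a"));
    cursor->set_key(cursor, "key");
    cursor->set_value(cursor, "v1");
    CHECK(cursor->update(cursor));

    /* U@20 on the same key (0x14 == 20). */
    CHECK(session->timestamp_transaction(session, "commit_timestamp=14"));
    cursor->set_key(cursor, "key");
    cursor->set_value(cursor, "v2");
    CHECK(cursor->update(cursor));

    /* U@15 on the same key (0xf == 15): older than the previous update's timestamp, but
     * still newer than the first commit timestamp, so it slips past the existing check. */
    CHECK(session->timestamp_transaction(session, "commit_timestamp=f"));
    cursor->set_key(cursor, "key");
    cursor->set_value(cursor, "v3");
    CHECK(cursor->update(cursor));

    return (session->commit_transaction(session, NULL));
}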

In WiredTiger, we have a check to ensure the commit timestamp of each update is not smaller than the first commit timestamp.

        /*
         * Compare against the commit timestamp of the current transaction. Return an error if the
         * given timestamp is older than the first commit timestamp.
         */
        if (F_ISSET(txn, WT_TXN_HAS_TS_COMMIT) && commit_ts < txn->first_commit_timestamp)
            WT_RET_MSG(session, EINVAL,
              "commit timestamp %s older than the first commit timestamp %s for this transaction",
              __wt_timestamp_to_string(commit_ts, ts_string[0]),
              __wt_timestamp_to_string(txn->first_commit_timestamp, ts_string[1]));

We did an experiment to tighten the check so that commit timestamps must be strictly in order, and our mongodb patch build reported a lot of failures, indicating that mongodb performs out-of-order timestamp updates quite frequently within a single transaction.
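For reference, a sketch of what that tightened check would look like (assuming WT_TXN's commit_timestamp field holds the most recently set commit timestamp; this is not the actual experimental patch):

        /*
         * Sketch of the stricter experimental check: compare against the most recently set commit
         * timestamp instead of only the first one, so any out-of-order timestamp within the
         * transaction is rejected.
         */
        if (F_ISSET(txn, WT_TXN_HAS_TS_COMMIT) && commit_ts < txn->commit_timestamp)
            WT_RET_MSG(session, EINVAL,
              "commit timestamp %s older than the previous commit timestamp %s for this transaction",
              __wt_timestamp_to_string(commit_ts, ts_string[0]),
              __wt_timestamp_to_string(txn->commit_timestamp, ts_string[1]));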

Can we explore tightening timestamp usage within a single transaction?



 Comments   
Comment by Chenhao Qu [ 16/Jun/21 ]

gregory.noma Yes, WiredTiger handles it safely once WT-7569 is merged.

Comment by Daniel Gottlieb (Inactive) [ 15/Jun/21 ]

I believe I was able to repro this with a plain old vectored insert:

(function() {
let replTest = new ReplSetTest({nodes: 1});
replTest.startSet();
replTest.initiate();
 
let db = replTest.getPrimary().getDB("test");
assert.commandWorked(db.createCollection("foo"));
assert.commandWorked(db.foo.createIndexes([{a: 1}]));
assert.commandWorked(db.foo.createIndexes([{b: 1}]));
assert.commandWorked(db.foo.insert([{_id: 1, a: 1, b: [1, 2]}, {_id: 2, a: [3, 4], b: 5}]));
 
replTest.stopSet();
})();

The MDB transaction this creates (pardon my charming typos):

[conn1] DBG. Beginning trasnaction
[conn1] DBG. Indexing records. Index: { _id: 1 }
[conn1] DBG. Setting timestamp: Timestamp(1623787675, 2)
[conn1] DBG. Setting timestamp: Timestamp(1623787675, 3)
[conn1] DBG. Indexing records. Index: { a: 1.0 }
[conn1] DBG. Setting timestamp: Timestamp(1623787675, 2)
[conn1] DBG. Setting timestamp: Timestamp(1623787675, 3)
[conn1] Index set to multi key, clearing query plan cache{"namespace":"test.foo","keyPattern":{"a":1}}
[conn1] DBG. Indexing records. Index: { b: 1.0 }
[conn1] DBG. Setting timestamp: Timestamp(1623787675, 2)
[conn1] Index set to multi key, clearing query plan cache{"namespace":"test.foo","keyPattern":{"b":1}}
[conn1] DBG. Setting timestamp: Timestamp(1623787675, 3)
[conn1] DBG. Committing trasnaction
[conn1] DBG. Beginning trasnaction
[conn1] DBG. Rollinac back trasnaction

Our RollbackTest helper probably never found this because I believe it only lets us roll back at the granularity of a user command, i.e., it cannot attempt to roll back half of a multi-document vectored insert.

We probably also haven't observed this because it doesn't lead to data corruption (i.e., a collection containing a multikey document without the appropriate indexes being marked multikey).

With vectored inserts we index records for each individual index in insertion/timestamp order, so the multikey updates are always "appropriately" timestamped. This results in something more akin to "ghost timestamps", where we may roll back fewer changes, leaving behind some benign multikey state after the accompanying multikey documents have been rolled back.
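A rough sketch of the ordering this implies (hypothetical types and helper declarations, not actual server code), which is why the log above alternates timestamps 2, 3, 2, 3 once per index:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the server's types and helpers; sketch only. */
typedef struct { const char *key_pattern; } index_spec_t;
typedef struct { uint64_t commit_ts; } doc_t;

extern void set_commit_timestamp(uint64_t ts);                      /* "Setting timestamp: ..." */
extern void insert_index_keys(const index_spec_t *, const doc_t *); /* may flip multikey */

/* Index-major order: for each index, walk the documents in insertion order and reset the
 * commit timestamp to that document's timestamp before generating its index keys. */
void
index_records(const index_spec_t *indexes, size_t ni, const doc_t *docs, size_t nd)
{
    for (size_t i = 0; i < ni; ++i)
        for (size_t d = 0; d < nd; ++d) {
            set_commit_timestamp(docs[d].commit_ts);
            insert_index_keys(&indexes[i], &docs[d]);
        }
}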

Comment by Daniel Gottlieb (Inactive) [ 29/May/21 ]

I found a failure with data files, which let me correlate the durable_timestamps in WT with the oplog. These turn out to be from regular inserts:

test> oplog.findOne({ts:Timestamp(1621606702, 22)})
{
	"op" : "i",
	"ns" : "tenantMigrationTenantId_test.distinct_multikey",
	"ui" : UUID("5a441857-cd83-4b5f-9f96-c6d24f7a6d88"),
	"o" : {
		"_id" : ObjectId("60a7c12bab358a91ab1e5d83"),
		"a" : {
			"b" : {
				"c" : [ ]
			}
		}
	},
	"ts" : Timestamp(1621606702, 22),
	"t" : NumberLong(1),
	"v" : NumberLong(2),
	"wall" : ISODate("2021-05-21T14:18:22.512Z"),
	"fromTenantMigration" : UUID("ed0795ca-43bd-4295-81e5-b24815795b4c")
}
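For reference, MongoDB packs a Timestamp(secs, inc) into the 64-bit timestamps stored by WiredTiger as (secs << 32) | inc, which is how raw durable_ts/start_ts values (like the ones in the gdb output in the next comment) map back to oplog timestamps like the one above. A minimal sketch of the conversion:

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Split a raw 64-bit WiredTiger timestamp back into MongoDB's Timestamp(seconds, increment). */
static void
print_as_mongo_timestamp(uint64_t wt_ts)
{
    printf("Timestamp(%" PRIu64 ", %" PRIu64 ")\n", wt_ts >> 32, wt_ts & 0xffffffffULL);
}

int
main(void)
{
    print_as_mongo_timestamp(6963901763766190111ULL); /* a durable_ts from the gdb output below */
    return (0);
}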

lingzhi.deng, I believe this isn't only problematic for WT, but also a rollback bug waiting to happen, where multikey state on a migrated collection after rollback may not be consistent with the data. This is assuming tenant migrations survive elections.

cc samy.lanka

Comment by Daniel Gottlieb (Inactive) [ 29/May/21 ]

All of the failures related to that BF are on tenant migration variants. Here is a bson object pointed to by an update from within WT:

(gdb) ptype out_of_order_ts_updates.list
type = struct __wt_update {
    volatile uint64_t txnid;
    wt_timestamp_t durable_ts;
    wt_timestamp_t start_ts;
    wt_timestamp_t prev_durable_ts;
    WT_UPDATE *next;
    uint32_t size;
    uint8_t type;
    volatile uint8_t prepare_state;
    uint8_t flags;
    uint8_t data[];
} *[20]
(gdb) print out_of_order_ts_updates.list
$4 = {0x7fa43b7a3020, 0x0 <repeats 19 times>}
(gdb) print out_of_order_ts_updates.list[0]
$5 = (WT_UPDATE *) 0x7fa43b7a3020
(gdb) print *out_of_order_ts_updates.list[0]
$6 = (WT_UPDATE ?) {
  txnid = 2064, 
  durable_ts = 6963901763766190111, 
  start_ts = 6963901763766190111, 
  prev_durable_ts = 0, 
  next = 0x7fa414be74a0, 
  size = 1023, 
  type = 3 '\003', 
  prepare_state = 0 '\000', 
  flags = 0 '\000', 
  data = 0x7fa43b7a304f "\377\003" <--- bson object?
}
 
 
(gdb) python print(next(bson.decode_iter(gdb.selected_inferior().read_memory(0x7fa43b7a304f, 5000).tobytes())))
{'md': {'ns': 'tenantMigrationTenantId_test.distinct_multikey', 'options': {'uuid': UUID('1e3abf33-bc5d-42c9-89d4-ba7b772e5bf6')}, 'indexes': [{'spec': {'v': 2, 'key': {'_id': 1}, 'name': '_id_'}, 'ready': True, 'multikey': False, 'multikeyPaths': {'_id': b'\x00'}, 'head': 0, 'backgroundSecondary': False}, {'spec': {'v': 2, 'key': {'a.b.c': 1.0}, 'name': 'a.b.c_1'}, 'ready': True, 'multikey': True, 'multikeyPaths': {'a.b.c': b'\x01\x01\x01'}, 'head': 0, 'backgroundSecondary': False}, {'spec': {'v': 2, 'key': {'a.b': 1.0}, 'name': 'a.b_1'}, 'ready': True, 'multikey': True, 'multikeyPaths': {'a.b': b'\x00\x01'}, 'head': 0, 'backgroundSecondary': False}, {'spec': {'v': 2, 'key': {'a.b.0': 1.0}, 'name': 'a.b.0_1'}, 'ready': True, 'multikey': False, 'multikeyPaths': {'a.b.0': b'\x00\x00\x00'}, 'head': 0, 'backgroundSecondary': False}]}, 'idxIdent': {'_id_': 'index-283-7836790310555739620', 'a.b.c_1': 'index-284-7836790310555739620', 'a.b_1': 'index-285-7836790310555739620', 'a.b.0_1': 'index-286-7836790310555739620'}, 'ns': 'tenantMigrationTenantId_test.distinct_multikey', 'ident': 'collection-282-7836790310555739620'}

And what I think is the previous update in the chain (the pointer variable is actually named WT_UPDATE.next):

(gdb) print *(*out_of_order_ts_updates.list[0]).next
$12 = (WT_UPDATE ?) {
  txnid = 2064, 
  durable_ts = 6963901763766190115, 
  start_ts = 6963901763766190115, 
  prev_durable_ts = 0, 
  next = 0x7fa41480cb20, 
  size = 1023, 
  type = 3 '\003', 
  prepare_state = 0 '\000', 
  flags = 0 '\000', 
  data = 0x7fa414be74cf "\377\003"
}
(gdb) python print(next(bson.decode_iter(gdb.selected_inferior().read_memory(0x7fa414be74cf, 5000).tobytes())))
{'md': {'ns': 'tenantMigrationTenantId_test.distinct_multikey', 'options': {'uuid': UUID('1e3abf33-bc5d-42c9-89d4-ba7b772e5bf6')}, 'indexes': [{'spec': {'v': 2, 'key': {'_id': 1}, 'name': '_id_'}, 'ready': True, 'multikey': False, 'multikeyPaths': {'_id': b'\x00'}, 'head': 0, 'backgroundSecondary': False}, {'spec': {'v': 2, 'key': {'a.b.c': 1.0}, 'name': 'a.b.c_1'}, 'ready': True, 'multikey': True, 'multikeyPaths': {'a.b.c': b'\x01\x01\x01'}, 'head': 0, 'backgroundSecondary': False}, {'spec': {'v': 2, 'key': {'a.b': 1.0}, 'name': 'a.b_1'}, 'ready': True, 'multikey': False, 'multikeyPaths': {'a.b': b'\x00\x00'}, 'head': 0, 'backgroundSecondary': False}, {'spec': {'v': 2, 'key': {'a.b.0': 1.0}, 'name': 'a.b.0_1'}, 'ready': True, 'multikey': False, 'multikeyPaths': {'a.b.0': b'\x00\x00\x00'}, 'head': 0, 'backgroundSecondary': False}]}, 'idxIdent': {'_id_': 'index-283-7836790310555739620', 'a.b.c_1': 'index-284-7836790310555739620', 'a.b_1': 'index-285-7836790310555739620', 'a.b.0_1': 'index-286-7836790310555739620'}, 'ns': 'tenantMigrationTenantId_test.distinct_multikey', 'ident': 'collection-282-7836790310555739620'}

My eyeball diff of those two shows this index being flipped from multikey: false -> true:

              {'spec': {'v': 2, 'key': {'a.b': 1.0}, 'name': 'a.b_1'},
               'ready': True,
               'multikey': False,
               'multikeyPaths': {'a.b': b'\x00\x00'},
               'head': 0,
               'backgroundSecondary': False},
...
              {'spec': {'v': 2, 'key': {'a.b': 1.0}, 'name': 'a.b_1'},
               'ready': True,
               'multikey': True,
               'multikeyPaths': {'a.b': b'\x00\x01'},
               'head': 0,
               'backgroundSecondary': False},

lingzhi.deng, how does tenant migration flip multikey in the catalog? Is it understood that this happens in the same WUOW as other catalog writes that have different timestamps associated with them?

Comment by Chenhao Qu [ 29/May/21 ]

daniel.gottlieb BF-21248 (WT-7569) crashed because of an out-of-order timestamp update on the same key from the same transaction. Can you get any information from the core dump? I'll try to come up with a patch to reproduce it.

Comment by Gregory Noma [ 29/May/21 ]

In terms of the out-of-order timestamp (on different keys) scenario that Dan mentioned, this is what CollectionImpl::insertDocuments does, which is used in several places including batched user inserts. So it is something that we do quite often.

Comment by Daniel Gottlieb (Inactive) [ 29/May/21 ]

We do have at least one intentional use case (at least back on 4.0, and I expect it still exists) of:

BeginTxn
SetTimestamp 10
Insert Record(A)
SetTimestamp 20
Insert Record(B)
SetTimestamp 10
Insert Indexes(A)
SetTimestamp 20
Insert Indexes(B)
Commit

I'm not aware of any circumstance where this creates an individual update chain on a document with out-of-order timestamps, so I don't think the weaker assertion (MDB will reset a WT transaction's commit_timestamp) is sufficient to demonstrate the stronger assertion (MDB generates update chains with the same txnid but out-of-order timestamps).

The best way forward, chenhao.qu, is to provide us with a patch that crashes the server when we hit exactly this condition. With that, a server engineer can reproduce the problem and narrow down what's going on.
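A rough sketch of the kind of diagnostic patch meant here (hypothetical, not actual WiredTiger code), using only the WT_UPDATE fields visible in the gdb output above: crash as soon as the same transaction adds an update to a key whose newest existing update has a later timestamp:

#include <stdlib.h>

#include "wt_internal.h" /* WiredTiger-internal header, for WT_UPDATE */

/*
 * Hypothetical diagnostic, sketch only: `new_upd` is the update about to be prepended to a key's
 * update chain and `chain_head` is the current newest update on that chain.
 */
static void
check_same_txn_timestamp_order(const WT_UPDATE *new_upd, const WT_UPDATE *chain_head)
{
    if (chain_head != NULL && chain_head->txnid == new_upd->txnid &&
      new_upd->start_ts < chain_head->start_ts)
        abort(); /* Same transaction, same key, out-of-order timestamps: dump core here. */
}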

Comment by Chenhao Qu [ 28/May/21 ]

gregory.noma The issue in WT-7569 is caused by multiple updates on the same key within the same transaction. The check in the patch build is for out-of-order timestamps within the same transaction, so both cases are happening. I think that even when the updates are not on the same key, they should be in timestamp order, since updates within the same transaction are done by a single thread.

Comment by Gregory Noma [ 28/May/21 ]

chenhao.qu In the description you mention "multiple updates on the same key within the same transaction," but in your patch build it looks like the check is just for out-of-order timestamps within the same transaction, even if those updates are on different keys. If my understanding is correct, which scenario is the one of concern: out-of-order timestamps for the same key within a transaction, or out-of-order timestamps within a transaction regardless of the keys? Or let me know if I'm misunderstanding something.

Comment by Louis Williams [ 25/May/21 ]

This appears to be a problem with time-series inserts based on the stack trace in this comment.
