[SERVER-64125] MDB server sets the commit/durable timestamps equal to the stable timestamp Created: 02/Mar/22  Updated: 29/Oct/23  Resolved: 01/Apr/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Keith Bostic (Inactive) Assignee: Jordi Olivares Provencio
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on WT-7712 commit and durable timestamps should ... Closed
is depended on by WT-8902 MDB server sets the commit/durable ti... Closed
Problem/Incident
causes SERVER-66139 Stuck on oplog visibility when insert... Closed
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution Team 2022-03-21, Execution Team 2022-04-04
Participants:
Linked BF Score: 35

 Description   

Generally, it is not correct to allow a transaction to commit with a commit timestamp at the stable timestamp, or a prepared transaction to commit with a durable timestamp at the stable timestamp. This can confuse checkpoint as to whether the newly committed transaction should be included in the checkpoint and can potentially lead to data inconsistencies.

With the merge of WT-7712, WiredTiger standalone builds fail attempts to set the commit or durable timestamps to the stable timestamp, but MDB Server builds are allowed to do so.

We do not know of any actual MDB Server problems in this area, but it would be good to fix any place where this happens and make the WiredTiger standalone behavior apply to all builds, to avoid introducing problems in the future.
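
To make the rule concrete, here is a minimal sketch of the check described above. It models timestamps as plain integers, and the names (`validate_commit`, `TimestampError`) are hypothetical, not WiredTiger's API. It also assumes timestamps older than stable are rejected; this ticket only concerns the equality case.

```python
# Illustrative model only: names are hypothetical, timestamps are plain ints.

class TimestampError(ValueError):
    """Raised when a commit or durable timestamp is not after stable."""


def validate_commit(stable_ts, commit_ts, durable_ts=None):
    """Check the timestamps of a commit against the stable timestamp.

    For an ordinary transaction (durable_ts is None) the commit timestamp
    must be strictly after stable. For a prepared transaction it is the
    durable timestamp that must be strictly after stable.
    """
    if durable_ts is None:
        if commit_ts <= stable_ts:
            raise TimestampError(
                f"commit timestamp {commit_ts} is not after stable {stable_ts}")
    elif durable_ts <= stable_ts:
        raise TimestampError(
            f"durable timestamp {durable_ts} is not after stable {stable_ts}")
```

With stable at 10, `validate_commit(10, 11)` passes, while `validate_commit(10, 10)` and `validate_commit(10, 9, durable_ts=10)` raise.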



 Comments   
Comment by Githook User [ 01/Apr/22 ]

Author: Jordi Olivares Provencio <jordi.olivares-provencio@mongodb.com> (jordiolivares)

Message: SERVER-64125 Avoid committing at the stable timestamp
Branch: master
https://github.com/mongodb/mongo/commit/c2de464c661ec8fc3198373932d0e3aab7fde971

Comment by Daniel Gottlieb (Inactive) [ 14/Mar/22 ]

Thanks for running that experiment again Keith.

I took a look. I only counted 3 unique failures, so I'll list them here:

  1. One is a disk_wiredtiger repair test. We timestamp writes when writing to the oplog (for historical reasons in 3.6 to make sure oplog entries were correctly made visible for a read_timestamp). We can remove that now as all oplog reads use MDB logic for visibility. Not sure if that's the only fallout in this test as repair/recovery are the cases I'd expect to see trip this assertion.
  2. storage_timestamp_tests. Same timestamp call when writing to the oplog. Being a unittest, I'm not worried about this being a blocker.
  3. The only real complaint was when we reconstruct prepared transactions that flip multikey. More detail:

A disclaimer: everything about how multikey is tracked is considered a wart. So while describing the behavior, please don't believe there's some virtue to how we do things. We got here because of constraints dating back to WT's adoption and haven't made the effort to schema our way out of this problem, as it's a tech debt problem that carries on-disk format changes (i.e., upgrade/downgrade) and large risk.

Multikey is an index state in which (most simply) an index contains multiple keys pointing to the same MDB document/record. When an index is not multikey, query can assume any document returned from an index is unique. But if an index is multikey, query must maintain a set of returned records to avoid double-returning the same document.
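
As a rough illustration of that deduplication (not the server's query code), the sketch below scans a list of `(key, record_id)` index entries. The function name `scan_index` and the data layout are assumptions made for the example.

```python
# Illustrative only: an index is modeled as a list of (key, record_id) pairs.

def scan_index(index_entries, multikey):
    """Yield record ids in index order.

    When the index is multikey, one record can appear under several keys,
    so already-returned record ids are tracked and skipped.
    """
    seen = set()
    for _key, record_id in index_entries:
        if multikey:
            if record_id in seen:
                continue
            seen.add(record_id)
        yield record_id


# Record 1 holds an array field [1, 2], so it is indexed under two keys.
entries = [(1, 1), (2, 1), (2, 2)]
deduped = list(scan_index(entries, multikey=True))
naive = list(scan_index(entries, multikey=False))
```

Here `deduped` contains record 1 once, whereas `naive` returns it twice, which is the double-return the multikey flag exists to prevent.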

When a client inserts the first document that "flips multikey", we do that write on the _mdb_catalog document. Unfortunately, that document tracks a bunch of detail, not limited to the specific index, meaning that preparing a change to that document can cause contention for things that are logically unrelated. Thus, for things like flipping multikey in a prepared transaction, we make an effort to change that document in a separate transaction and commit it prior to the data write that it reflects. The system remains correct so long as multikey is set to true at or before the point where multikey documents exist.

When reconstructing prepared transactions, we can find that multikey needs to be set (I'm not entirely sure why this happens for txns in a prepare state at the stable timestamp; understanding that may provide us with more outs of this situation). Right now, when reconstructing, we can write at the stable timestamp. I think it's just as safe here to write at stable + 1.
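
The "stable + 1" idea can be sketched as a small helper. This is only a model: timestamps are plain integers rather than MDB's (seconds, increment) pairs, and `multikey_write_timestamp` is a hypothetical name, not a function in the server.

```python
# Illustrative only: timestamps are ints and the helper name is hypothetical.

def multikey_write_timestamp(stable_ts, requested_ts):
    """Pick a timestamp for the side multikey write during reconstruction.

    If the requested timestamp is at or before stable, bump it to
    stable + 1 so the write never commits at the stable timestamp.
    """
    return max(requested_ts, stable_ts + 1)
```

With stable at 10, a requested timestamp of 10 becomes 11, and a requested timestamp of 15 is left unchanged.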

That said, I do have a non-sequitur concern about these recovery writes where the multikey write has a timestamp larger than the prepare timestamp. It means that after the transaction commits, the transaction could choose a visible (commit_)timestamp smaller than the multikey write. This isn't strictly a problem today where multikey is read from a modern in-memory state (multikey doesn't "go backwards"). But this would be wrong in a versioned catalog world where multikey state is derived from the reader's snapshot.
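
To illustrate that concern, here is a toy model of a hypothetical versioned catalog in which a reader at `read_ts` sees a write only if the write's timestamp is at or before `read_ts`. The function names and integer timestamps are assumptions for the example.

```python
# Illustrative only: a toy visibility model, not the server's catalog.

def visible_state(read_ts, multikey_ts, commit_ts):
    """Return (multikey_visible, data_visible) for a reader at read_ts."""
    return (read_ts >= multikey_ts, read_ts >= commit_ts)


def is_consistent(read_ts, multikey_ts, commit_ts):
    """A snapshot is consistent unless it sees the data without the flag."""
    multikey_visible, data_visible = visible_state(
        read_ts, multikey_ts, commit_ts)
    return multikey_visible or not data_visible


# Prepared txn commits at 8; the recovery multikey write landed at 11.
broken = is_consistent(read_ts=9, multikey_ts=11, commit_ts=8)
fine = is_consistent(read_ts=12, multikey_ts=11, commit_ts=8)
```

A reader at timestamp 9 would see multikey documents while the index still reads as non-multikey (`broken` is `False`), which is the case that would be wrong if multikey state were derived from the reader's snapshot.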

Comment by Keith Bostic (Inactive) [ 11/Mar/22 ]

Now that WT-7712 has been merged into master (and a few other 5.3 inspired odds and ends have settled down), I wanted to rerun the SERVER-64125 patch test. It looks like there are roughly 9 unique test failures caused by updating WiredTiger to reject commits equal to the stable timestamp (assuming tests failing on different platforms are failing in the same way).

https://spruce.mongodb.com/version/622ba9f932f41735e6af12a8/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC

cc: geert.bosch, louis.williams, daniel.gottlieb, alexander.gorrod

Comment by Keith Bostic (Inactive) [ 02/Mar/22 ]

The specific changes to reverse are here and here.
