[SERVER-32206] Catalog change to declare an index as multikey must be timestamped. Created: 07/Dec/17  Updated: 30/Oct/23  Resolved: 02/Feb/18

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 3.7.2

Type: Task Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive) Assignee: Judah Schvimer
Resolution: Fixed Votes: 0
Labels: rollback-functional
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-32284 awaitReplication can hang when the op... Closed
is depended on by SERVER-30809 Investigating remaining writes to the... Closed
Problem/Incident
causes SERVER-33106 Triggering an exception in BtreeKeyGe... Closed
Related
related to SERVER-33675 move multi key tracking from multiSyn... Closed
is related to SERVER-29213 Have KVWiredTigerEngine implement Sto... Closed
Backwards Compatibility: Fully Compatible
Sprint: Repl 2018-01-15, Repl 2018-01-29, Repl 2018-02-12
Participants:

 Description   

There are two cases to resolve. The first is easier. The second is a little trickier:

When a document inserted into or updated in a collection requires multiple entries in an index (e.g., the value of an indexed field is an array), the index's "multikey" field must be set to true.

This update is currently done in a side-transaction to avoid write conflicts, but a side-transaction cannot inherit a timestamp from the insert/update request.

Proposed solutions can be classified as:

  1. Remove the side transaction. Solutions here require varying degrees of effort to still serialize updates to the catalog document, preventing a worst-case scenario where progress slows down (or stops?).
  2. Assign the write a timestamp at least as early as any of the updates in the request that requires this catalog change.
    • A slightly cleverer variant: introduce an error case when an insert requires the multikey field to be changed to true. Top-level handlers would translate this error into setting multikey to true (in a "before?-transaction") and then performing the insert. An explicit timestamp would still need to be chosen, but less plumbing might be required to make the timestamp and transaction meet.
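The error-translation variant can be sketched as a toy model. All names here (NeedsMultikey, handle_insert, the "a" field check) are illustrative assumptions, not the real server API:

```python
# Hypothetical sketch: the insert path raises an error instead of flipping
# the multikey flag itself; the top-level handler performs the catalog
# change in its own timestamped write, then retries the insert.

class NeedsMultikey(Exception):
    """Raised when an insert would require index.multikey = True."""

class Index:
    def __init__(self):
        self.multikey = False
        self.multikey_ts = None   # timestamp of the catalog change

def insert(index, doc):
    if isinstance(doc.get("a"), list) and not index.multikey:
        raise NeedsMultikey       # defer the catalog write to the caller
    return "inserted"

def handle_insert(index, doc, ts):
    try:
        return insert(index, doc)
    except NeedsMultikey:
        # "before-transaction": timestamp the catalog change no later than
        # the insert that requires it, then retry the insert.
        index.multikey = True
        index.multikey_ts = ts
        return insert(index, doc)
```

The point of the shape is that the multikey write happens at a level where a timestamp is already in hand, rather than deep inside the index code.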


 Comments   
Comment by Githook User [ 02/Feb/18 ]

Author:

{'email': 'judah@mongodb.com', 'name': 'Judah Schvimer', 'username': 'judahschvimer'}

Message: SERVER-32206 timestamp catalog change to declare index multikey
Branch: master
https://github.com/mongodb/mongo/commit/b2a7398e663ef090a651a93bedfc6d107a64cf33

Comment by Judah Schvimer [ 11/Jan/18 ]

Upon discussion with daniel.gottlieb, we will remove the side transaction.

On a primary, we can simply assign this write the same timestamp as the index creation, insert, or update that caused this index to become multikey. This is because if two operations concurrently try to change the index to be multikey, they will conflict and the loser will simply get a higher timestamp and go into the oplog second with a later optime.
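That primary-side argument can be sketched with a toy model, assuming server-style write-conflict retries. The Catalog class, the conflict rule, and the timestamp values are illustrative, not storage-engine internals:

```python
# Two operations race to set multikey. The loser conflicts, retries with a
# fresh (higher) timestamp, and therefore also logs later in the oplog, so
# the catalog change keeps the earliest triggering timestamp.

class WriteConflict(Exception):
    pass

class Catalog:
    def __init__(self):
        self.multikey_ts = None   # timestamp of the committed catalog change
        self._uncommitted = None
    def set_multikey(self, ts):
        if self._uncommitted is not None:
            raise WriteConflict   # a concurrent writer holds an uncommitted change
        if self.multikey_ts is None:
            self._uncommitted = ts
    def commit(self):
        if self._uncommitted is not None:
            self.multikey_ts = self._uncommitted
            self._uncommitted = None

next_ts = iter([1000, 2000, 3000])   # monotonically increasing optimes

catalog = Catalog()
# Op A wins the race; the multikey change inherits its insert timestamp.
ts_a = next(next_ts)
catalog.set_multikey(ts_a)

# Op B attempts while A is uncommitted, conflicts, and must retry later.
ts_b = next(next_ts)
try:
    catalog.set_multikey(ts_b)
except WriteConflict:
    catalog.commit()                 # A commits first...
    ts_b = next(next_ts)             # ...B retries with a later optime
    catalog.set_multikey(ts_b)       # no-op: multikey is already set
    catalog.commit()
```

B's retry is a no-op for the catalog, and its later optime keeps oplog order consistent with the commit order.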

On a secondary, writes must get the timestamp of their oplog entry, and the multikey change must occur at the timestamp of the earliest write that makes the index multikey. Secondaries only serialize writes by document, not by collection. If two inserts/updates that both make the index multikey are applied out of order, setting multikey at each insert's own timestamp would record the change at the later operation's timestamp, which would be wrong. Index creations are applied serially with CRUD ops, so multikey index commits cannot conflict with them.

To prevent this we can do one of two things:
1. On secondaries we can abort any WT transaction that tries to set an index to be multikey in a batch of oplog entries. We then would reapply that batch serially. This would require either propagating down to the IndexCatalogEntry the fact that we're applying an entry on a secondary in a batch, or propagating up the fact that multikey is being set and making the decision at a higher level.
2. As Dan mentioned above, introduce an error case when inserts or updates require the multikey field to be changed to true. SyncTail could then see this error and set the multikey field to true. The insert/update timestamp could either be combined with the error information and we could do the write at the minimum of these timestamps, or we could simply do the write at the first timestamp in the batch. Since the batch cannot include any DDL operations, it should be safe to execute this write at an earlier timestamp after the batch completes.
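Option 2 can be sketched as a toy batch applier. The entry shape, the "a" field check, and the dict-based index are illustrative assumptions:

```python
# Apply a batch in (possibly out-of-order) parallel order, collect "needs
# multikey" signals instead of writing the flag inline, then set the flag
# once at the minimum of the offending timestamps after the batch.

def apply_batch(entries, index):
    needs_multikey = []
    for ts, doc in entries:              # secondaries parallelize by document
        if isinstance(doc.get("a"), list) and not index["multikey"]:
            needs_multikey.append(ts)    # defer the catalog write
        # ... apply the insert itself at ts ...
    if needs_multikey:
        # Safe because a batch contains no DDL: the earliest offending
        # timestamp is a valid time for the catalog change.
        index["multikey"] = True
        index["multikey_ts"] = min(needs_multikey)

index = {"multikey": False, "multikey_ts": None}
# The TS-2000 entry is applied before the TS-1000 entry.
apply_batch([(2000, {"a": [3]}), (1000, {"a": [1, 2]})], index)
```

Even though the TS-2000 insert is applied first, the catalog change lands at TS 1000, the earliest write that made the index multikey.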

Comment by Daniel Gottlieb (Inactive) [ 08/Dec/17 ]

For completeness: one edge case is out of scope here, since recover-to-a-timestamp is designed to solve it.

I believe secondaries applying operations in parallel can result in the following sequence. Consider two operations, one at Timestamp 1000 and the other at Timestamp 2000 that both require setting multikey to true:

 TS: 1000                                         TS: 2000
Insert A                                         Insert B
                                                 Set Multikey to true
                                                 Commit
In-memory observation that MultiKey does not need to be set
Commit

Then suppose the secondary recovers to timestamp 1500 and then rolls the oplog forward[1]. Assuming the multikey write is only causally related to timestamp 2000, the catalog may incorrectly believe multikey to be false despite Insert A still existing. It is a goal of the recover-to-a-stable-timestamp project to preserve this information. Specifically, it will save all multikey-true values (and their MultikeyPaths) and restore them to indexes that still exist at recovery time. To avoid losing information in the face of a crash, that data can be made durable before recovering to the stable timestamp.

[1] On second thought, this wouldn't be a problem today. Storage can only recover to a "stable timestamp", and secondaries only submit oplog batch boundaries as candidates for becoming stable. I believe this prevents the scenario from incorrectly losing this information. However, reading at a timestamp inside the batch could return a multikey value of false despite still being able to observe write A. I'm not sure whether that poses a problem for point-in-time reads on secondaries. milkie
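The timeline in this comment can be replayed as a tiny model (the write list and the 1500 recovery point are just the values from the example above; nothing here is storage-engine code):

```python
# The multikey write is causally tied only to TS 2000, so recovering to
# TS 1500 discards it while Insert A (TS 1000) survives.

writes = [
    (1000, "insert A (array value, needs multikey)"),
    (2000, "insert B (array value, needs multikey)"),
    (2000, "set multikey = true"),
]

def recover_to(stable_ts, writes):
    # Discard every write after the stable timestamp.
    return [w for ts, w in writes if ts <= stable_ts]

surviving = recover_to(1500, writes)
```

Insert A survives the recovery, but the catalog's multikey flag does not, which is exactly the inconsistency the recover-to-a-stable-timestamp project must repair.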

Comment by James Wahlin [ 07/Dec/17 ]

As discussed, we write multikey path information to the database as part of the MetaData, which is a timestamped write.

Comment by Eric Milkie [ 07/Dec/17 ]

I don't think MultiKeyPaths is written to the database – it's only populated in memory.

Comment by James Wahlin [ 07/Dec/17 ]

Do we need to timestamp updates to MultikeyPaths as well or are they updated in a different manner?

Generated at Thu Feb 08 04:29:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.