[SERVER-35282] Reading from unreplicated collections in a transaction has undefined behavior Created: 30/May/18 Updated: 27/Oct/23 Resolved: 07/Jun/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Daniel Gottlieb (Inactive) | Assignee: | Spencer Brody (Inactive) |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||
| Backport Requested: |
v4.0
|
||||||||||||||||||||
| Sprint: | Repl 2018-06-18 | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Linked BF Score: | 0 | ||||||||||||||||||||
| Description |
|
Consider a replica set primary with top of oplog = logical clock = commit point = T. A writer that creates an unreplicated collection (in a KVStorageEngine such as WT) takes the following steps:
A reader starting a new transaction and acquiring a collection takes the following steps:
If the reader opens the transaction (R1) before the writer commits (W3) and acquires locks (R2) after the writer releases (W4), the reader will observe the in-memory catalog believing the collection exists, but the storage engine will not (due to the concurrent transactions). In debug builds, this typically manifests as a crash when trying to access any index. It's unclear how this can manifest with arbitrary queries/sorts/aggregations being performed with the query code consuming this state where the in-memory catalog is inconsistent with the storage engine's recovery unit snapshot. |
| Comments |
| Comment by Spencer Brody (Inactive) [ 07/Jun/18 ] |
|
While this issue theoretically still exists for any unreplicated collection read in a transaction, we believe that there are no unreplicated collections remaining that it is possible to read in a transaction (since we disallow accessing system collections or collections in the admin, config, and local databases), now that |
| Comment by Spencer Brody (Inactive) [ 31/May/18 ] |
|
max.hirschhorn and I just went through all uses of UnreplicatedWritesBlock, and I believe the only collections that could actually be affected by this issue (and aren't already disallowed in transactions by virtue of being system collections or living in the config or local dbs) is mapReduce temp collections. MapReduce writes to collection names that are valid for users to write to and would otherwise be replicated, but all the writes to that collection happen in an UnreplicatedWritesBlock. I'm not sure if there's really any way for the transaction system to know that these collections are special, since neither their name nor anything in the catalog about them are special, it's just the way the system writes to them that is unique. I think the only way to fix this might be to change the collection name that mapReduce writes to to either start with 'system' or be in the local database. Doing that would also have the advantage of fixing |
| Comment by Max Hirschhorn [ 31/May/18 ] |
|
I don't think the current implementation of ReplicationCoordinator::isOplogDisabledFor() is a complete answer to the question of "is this collection replicated?" as the reads and writes inside of a transaction would also be in an entirely separate OperationContext and therefore wouldn't know about UnreplicatedWriteBlock previously being used for that namespace. |
| Comment by Spencer Brody (Inactive) [ 30/May/18 ] |
|
Yeah, I think Kyle's right. We should forbid all operations, reads or writes, on unreplicated collections inside transactions. |
| Comment by Kyle Suarez [ 30/May/18 ] |
|
|
| Comment by Eric Milkie [ 30/May/18 ] |
|
I wonder if we should simply prohibit reads to such collections. In the BF, a reader was attempting to read from a temporary unreplicated map/reduce collection; such reads have no real-world use case. |