[SERVER-44462] Run listIndexes with a single snapshot during initial sync Created: 06/Nov/19  Updated: 09/Dec/19  Resolved: 09/Dec/19

Status: Closed
Project: Core Server
Component/s: Catalog, Index Maintenance, Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Judah Schvimer Assignee: Eric Milkie
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-33946 Decrease number of initial sync attem... Blocked
Related
is related to SERVER-27122 Restart initial sync for known index ... Backlog
Operating System: ALL
Sprint: Execution Team 2019-12-02, Execution Team 2019-12-16
Participants:

 Description   

Without this, we can get idempotency problems during initial sync. Collection cloning can try to create two indexes that never existed simultaneously on the sync source. Eventually we will get a dropIndex oplog entry for one of the two, but at collection clone time we don't know which one, so we can't just drop one to fix the idempotency violation.
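
The hypothesized scenario can be illustrated with a toy model. This is only a sketch under stated assumptions: the catalog is a plain dict, and `list_indexes_without_snapshot`, `replace_text_index`, and `demo` are hypothetical names, not MongoDB internals. It assumes the index listing can interleave with a concurrent drop and create, which is the behavior this ticket guesses at rather than something confirmed here.

```python
def list_indexes_without_snapshot(catalog, write_after):
    """Scan `catalog` (index name -> key spec) in name order, one entry per step.

    `write_after` maps an index name to a function that mutates the catalog
    right after that entry is read, modelling a concurrent DDL operation.
    (Hypothetical helper; not how the server implements listIndexes.)
    """
    seen = {}
    last = ""
    while True:
        # Re-read the live catalog on every step, so changes made between
        # steps are visible to the rest of the scan.
        remaining = sorted(name for name in catalog if name > last)
        if not remaining:
            return seen
        last = remaining[0]
        seen[last] = catalog[last]
        if last in write_after:
            write_after[last](catalog)


def replace_text_index(catalog):
    # The sync source drops one text index and then builds a different one.
    del catalog["a_text"]
    catalog["b_text"] = {"b": "text"}


def demo():
    catalog = {"_id_": {"_id": 1}, "a_text": {"a": "text"}}
    before = dict(catalog)  # consistent point-in-time view before the change
    mixed = list_indexes_without_snapshot(catalog, {"a_text": replace_text_index})
    after = dict(catalog)   # consistent point-in-time view after the change
    return before, mixed, after
```

In this model, `mixed` contains both `a_text` and `b_text`, although neither `before` nor `after` does. A cloner that tried to build everything in `mixed` would attempt two text indexes that never coexisted on the source.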

This was hypothesized while I talked with milkie regarding SERVER-27122.



 Comments   
Comment by Judah Schvimer [ 09/Dec/19 ]

I think that this ticket describes idempotency problems that could happen at index creation time in collection cloning. We have usually thought about idempotency problems as happening at oplog application time. I'm fine closing this until we see this happen in reality. It's possible that fixing the index idempotency problems in oplog application will also fix idempotency problems during collection cloning, though I don't think that will happen automatically.

Comment by Eric Milkie [ 09/Dec/19 ]

I think Siyuan is correct; this won't help things. We need to do the work in SERVER-27122 to make headway on this.

Comment by Judah Schvimer [ 11/Nov/19 ]

Eric Milkie, can we relax those constraints in initial sync?

From discussing with milkie, our plan is for execution to do SERVER-27122 to allow SERVER-33946 to reduce the number of initial sync attempts to 1, and then solve the idempotency problems based on the amount of user pain. Some constraints, like the number of indexes and the number of text indexes, may be easy to relax; others, like having different specs but the same name, may be difficult to relax.

Comment by Judah Schvimer [ 11/Nov/19 ]

I agree reading from a snapshot won't solve all index idempotency problems. The concern was that solutions to the already known tickets linked to SERVER-33946 would not necessarily be sufficient to solve this class of problems.

Comment by Siyuan Zhou [ 11/Nov/19 ]

Reading indexes from a single snapshot isn't sufficient. In the cases robert.guo listed in SERVER-27122, the indexes are all read from a single snapshot. We probably need to fix them case by case.

1. creating indexes with different specs but the same name.
This will be fixed in SERVER-32225.

2. creating text indexes with different specs.
3. having more than 64 indexes combined, before and after dropping a collection.
milkie, can we relax those constraints in initial sync?

Alternatively, we can also develop a solution based on indexes read from a snapshot. We need to make sure the entries that changed indexes before this snapshot's timestamp are ignored.
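
A minimal sketch of that alternative, assuming oplog entries can be modelled as dicts with a comparable `ts` field and an `op` of either `createIndex` or `dropIndex`. The function name and the entry shape are hypothetical and chosen for illustration only; they are not the server's oplog format or API.

```python
def apply_index_ops_after_snapshot(snapshot_indexes, snapshot_ts, oplog_entries):
    """Start from the indexes read at `snapshot_ts` and replay later changes.

    Entries at or before the snapshot timestamp are skipped, because their
    effects are already reflected in `snapshot_indexes`.
    """
    indexes = dict(snapshot_indexes)
    for entry in sorted(oplog_entries, key=lambda e: e["ts"]):
        if entry["ts"] <= snapshot_ts:
            continue  # already included in the snapshot; replaying could conflict
        if entry["op"] == "createIndex":
            indexes[entry["name"]] = entry["key"]
        elif entry["op"] == "dropIndex":
            indexes.pop(entry["name"], None)
    return indexes


def demo():
    # Index set as read from a snapshot taken at timestamp 10.
    snapshot = {"_id_": {"_id": 1}, "b_text": {"b": "text"}}
    oplog = [
        {"ts": 5, "op": "createIndex", "name": "a_text", "key": {"a": "text"}},
        {"ts": 8, "op": "dropIndex", "name": "a_text"},
        {"ts": 9, "op": "createIndex", "name": "b_text", "key": {"b": "text"}},
        {"ts": 12, "op": "dropIndex", "name": "b_text"},
    ]
    return apply_index_ops_after_snapshot(snapshot, 10, oplog)
```

Here the entries at timestamps 5, 8, and 9 are ignored, so the sketch never tries to create `a_text` while `b_text` exists; only the drop at timestamp 12 is applied.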

Comment by Judah Schvimer [ 11/Nov/19 ]

tess.avitabile asked what happens today: do we create both indexes or choose one? I think we would attempt to create both and get various errors depending on how the indexes are mismatched. We haven't tested this, though, so the first step should be testing with various indexes known to be incompatible to see the behavior.

Comment by Eric Milkie [ 11/Nov/19 ]

I was mistaken; this should go on Execution for scheduling.

Comment by Judah Schvimer [ 11/Nov/19 ]

milkie thought this was a query request since the change lives in the listIndexes command. If you think that's incorrect, I can put it on repl.

Comment by Charlie Swanson [ 09/Nov/19 ]

judah.schvimer is this supposed to be on the query team's backlog? I'm not sure I'd know exactly where to start for such a task or how to test it. It does seem in-between query and repl, but I thought I'd check if this was a mistake before we triage.
