[SERVER-26134] Building sharding config database indexes holds up leaving drain mode Created: 16/Sep/16 Updated: 06/Dec/22 Resolved: 02/Jan/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.3.12 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | PM-108 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Sharding |
| Operating System: | ALL |
| Participants: | |
| Linked BF Score: | 0 |
| Description |
|
The config database's indexes are built while the CSRS config server's primary is still in drain mode, which means the builds happen under a global X lock. This holds up exiting drain mode and also blocks any incoming requests (such as initial sync), causing initial sync failures of this kind:
Note that the getmore command below took 18 seconds to acquire the global lock:
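The original log excerpt is not reproduced here. As a rough illustration only (not material from the ticket), a small pymongo sketch like the one below can surface operations stuck waiting on the global lock, assuming the currentOp admin command is available on the server; the connection URI, threshold, and helper name are hypothetical:

```python
# Hypothetical monitoring helper (not from the ticket): poll currentOp on the
# CSRS config server and flag operations that spent a long time acquiring the
# global lock while the primary was building the config indexes in drain mode.
from pymongo import MongoClient

def report_global_lock_waits(uri="mongodb://localhost:20000", threshold_secs=5.0):
    client = MongoClient(uri)
    inprog = client.admin.command("currentOp").get("inprog", [])
    for op in inprog:
        # lockStats.Global.timeAcquiringMicros is reported per lock mode.
        acquiring = op.get("lockStats", {}).get("Global", {}).get("timeAcquiringMicros", {})
        waited_secs = sum(acquiring.values()) / 1e6 if isinstance(acquiring, dict) else 0.0
        if op.get("waitingForLock") or waited_secs >= threshold_secs:
            print("%s on %s waited %.1fs for the global lock"
                  % (op.get("op"), op.get("ns"), waited_secs))

if __name__ == "__main__":
    report_global_lock_waits()
```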
|
| Comments |
| Comment by Sheeri Cabral (Inactive) [ 02/Jan/20 ] |
|
Won't Fix, and there's a workaround for the testing issue. |
| Comment by Judah Schvimer [ 30/Jan/17 ] |
|
I filed |
| Comment by Kaloian Manassiev [ 30/Jan/17 ] |
|
Unfortunately, we have no plans to fix this any time soon. At the time these indexes are built there should not be any data in the config database, so we aren't considering it a bug in practice. I understand that it is making testing difficult, but fixing it is not trivial. |
| Comment by Judah Schvimer [ 30/Jan/17 ] |
|
kaloian.manassiev, is there any plan to address this ticket in the near future if we do not do the above workaround? I would prefer to fix this rather than work around it, but if the fix is not going to happen for a while, it is worth doing the workaround. |
| Comment by Max Hirschhorn [ 27/Jan/17 ] |
|
judah.schvimer, spencer, milkie, is there consensus on whether we should work around this issue in our testing infrastructure for now by changing replica sets spawned by resmoke.py to do a reconfig after the primary of a 1-node replica set is elected? (akin to |
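For context, a minimal sketch of the workaround under discussion, assuming pymongo and mongod nodes already started with --configsvr; the helper name, ports, and set name are placeholders rather than the actual replicaset.py change:

```python
# Illustrative sketch only, not the actual resmoke.py/replicaset.py code:
# initiate a 1-node set, wait for it to become primary (so the config index
# builds finish before any other node starts initial sync), then reconfig to
# add the remaining members.
import time
from pymongo import MongoClient

def start_csrs_with_reconfig(ports, set_name="csrs"):
    first = MongoClient("localhost", ports[0])
    members = [{"_id": i, "host": "localhost:%d" % p} for i, p in enumerate(ports)]

    # Step 1: initiate the set with only the first member.
    first.admin.command("replSetInitiate",
                        {"_id": set_name, "configsvr": True, "members": members[:1]})

    # Step 2: wait for that node to report itself as primary.
    while not first.admin.command("isMaster").get("ismaster"):
        time.sleep(0.5)

    # Step 3: reconfig to add the rest; their initial sync now starts after
    # the primary has left drain mode and built the config indexes.
    config = first.admin.command("replSetGetConfig")["config"]
    config["version"] += 1
    config["members"] = members
    first.admin.command("replSetReconfig", config)

# Example with placeholder ports: start_csrs_with_reconfig([20000, 20001, 20002])
```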
| Comment by Judah Schvimer [ 24/Jan/17 ] |
|
Yes, I think changing this section of replicaset.py would fix the remaining cases. I am not aware of any other ways that we start up replica sets in our tests. That said, if there's a way to fix this problem by not holding a global lock for as long, that would be a much better fix. |
| Comment by Spencer Brody (Inactive) [ 24/Jan/17 ] |
|
judah.schvimer If we changed resmoke.py to do the same thing that the shell does in |
| Comment by Judah Schvimer [ 23/Jan/17 ] |
|
There are now examples of this occurring in ways that |
| Comment by Eric Milkie [ 15/Dec/16 ] |
|
I agree with Judah. Implementing |
| Comment by Judah Schvimer [ 15/Dec/16 ] |
|
I think |
| Comment by Eric Milkie [ 27/Sep/16 ] |
|
I'm not sure this is the root of the problem. There are certainly other operations that can occur during drain mode, such as foreground index builds for user collections with data, that would take longer than the config server's index builds on empty collections. Perhaps the real solution here is to enhance index builds to run concurrently with other replicated ops, and produce a "commit" operation that actually commits the index and makes it visible. Then index builds wouldn't necessarily block replication. |
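As a purely conceptual sketch of that idea (the entry names and applier below are hypothetical, not the server's actual oplog format): the applier would kick off the build in the background and only wait at a later commit entry, so ordinary replicated writes keep flowing while the index is being built.

```python
# Toy model of a commit-based index build protocol; entry names and structure
# are hypothetical and do not reflect MongoDB's real implementation.
import threading

visible_indexes = set()
pending_builds = {}

def build_index(name):
    pass  # stand-in for the actual key generation / bulk loading work

def apply_oplog_entry(entry):
    if entry["op"] == "startIndexBuild":
        # Kick off the build without blocking the applier thread.
        t = threading.Thread(target=build_index, args=(entry["index"],))
        pending_builds[entry["index"]] = t
        t.start()
    elif entry["op"] == "commitIndexBuild":
        # Only the commit entry waits for the build and makes the index visible.
        pending_builds[entry["index"]].join()
        visible_indexes.add(entry["index"])
    else:
        pass  # ordinary CRUD entries keep applying while builds run
```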