[SERVER-26717] PSA flapping during netsplit when using PV1 Created: 20/Oct/16 Updated: 06/Dec/22 Resolved: 18/Jan/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.2.10 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | James Kovacs | Assignee: | Backlog - Replication Team |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Replication |
| Operating System: | ALL |
| Steps To Reproduce: | 1. Create a 3-member PSA replset spread across 3 VMs, which will represent our 3 DCs. DC1 and DC2 contain data-bearing nodes and DC3 contains the arbiter. |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
Under PV1, when using a PSA (or PSSSA) replset spread across three data centres, the primary node flaps between DC1 and DC2 every 10 seconds during a netsplit between DC1 and DC2. Each data centre receives roughly half the writes (assuming roughly constant write traffic). When the netsplit is resolved, the data in the non-primary data centre is rolled back.

When the netsplit occurs, the following sequence of events happens:

Here is a snippet of logs from the arbiter demonstrating the flapping behaviour:

N.B. Flapping does not occur with PSS/PV1 or PSA/PV0.
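A minimal sketch, from the mongo shell, of how the PSA replset described in the repro steps could be initiated under PV1. The set name, hostnames, and ports are placeholders, not taken from the ticket:

```javascript
// Sketch of the 3-DC PSA topology from the repro steps.
// Set name, hostnames, and ports are placeholders; the _id must match the
// --replSet name the mongod processes were started with.
rs.initiate({
  _id: "psaFlapRepro",
  protocolVersion: 1,  // PV1; the report notes flapping does not occur under PV0
  members: [
    { _id: 0, host: "dc1-node:27017" },                       // data-bearing node in DC1
    { _id: 1, host: "dc2-node:27017" },                       // data-bearing node in DC2
    { _id: 2, host: "dc3-arbiter:27017", arbiterOnly: true }  // arbiter in DC3
  ]
});
```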
| Comments |
| Comment by Spencer Brody (Inactive) [ 18/Jan/17 ] |
The plan to address this is no longer to implement SERVER-14539. Instead we are going to do the smaller change of
| Comment by Spencer Brody (Inactive) [ 07/Nov/16 ] |
Since we didn't wind up doing
| Comment by Spencer Brody (Inactive) [ 21/Oct/16 ] |
Assuming current primary is in DC1... This works in PV1 with PSS because in that case the secondary in DC2 would vote no to the node in DC3 becoming primary, because it would be ahead of it in terms of replication from the writes that DC1 took. This will be fixed short term by