[SERVER-35996] Create performance tests for measuring failover speed for planned stepdowns Created: 06/Jul/18  Updated: 08/Jan/24  Resolved: 27/Mar/23

Status: Closed
Project: Core Server
Component/s: Performance, Replication
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: Backlog - Replication Team
Resolution: Won't Do Votes: 0
Labels: PM-1211, re-triaged-ticket
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File secondary_failover_time.js    
Issue Links:
Related
related to SERVER-35200 Speed up failure detection in the Opl... Closed
related to SERVER-35835 Allow quicker sync source change when... Closed
Assigned Teams:
Replication
Participants:

 Description   

We would like to add better performance testing for measuring the speed of planned replica set failovers using replSetStepDown. This includes testing the following:

  1. The amount of time between when the old primary is last able to accept writes and when the new primary is first able to accept writes.
  2. The amount of time between when the old primary is last able to commit majority writes and when the new primary is first able to commit majority writes.

We should test these scenarios with chaining enabled and disabled.
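
As a rough illustration of how the two measurements above might be computed, the sketch below takes a client-side log of write attempts and returns the gap between the last successful write before the stepdown and the first successful write after it. The function name `unavailabilityWindowMs` and the probe shape `{timeMs, ok}` are assumptions made for this example; they are not taken from the attached secondary_failover_time.js. Collecting the probes (for example by issuing writes in a loop with w:1 for item 1 and w:"majority" for item 2 while replSetStepDown runs) is outside the sketch.

```javascript
// Sketch only: the function name and probe shape are hypothetical.
// Each probe records one write attempt: {timeMs: client clock, ok: success flag}.
// Probes are assumed to be sorted by timeMs.
function unavailabilityWindowMs(probes) {
  const firstFail = probes.findIndex((p) => !p.ok);
  if (firstFail === -1) {
    return 0; // no failed write was observed
  }
  // Window starts at the last success before the first failure,
  // or at the first failure itself if no earlier success exists.
  const start = firstFail > 0
    ? probes[firstFail - 1].timeMs
    : probes[firstFail].timeMs;
  // Window ends at the first success after the first failure.
  const recovered = probes.slice(firstFail).find((p) => p.ok);
  if (recovered === undefined) {
    return null; // writes never succeeded again within the log
  }
  return recovered.timeMs - start;
}

// Illustrative data only, not measurements.
const w1Probes = [
  { timeMs: 0, ok: true },
  { timeMs: 10, ok: true },
  { timeMs: 20, ok: false },
  { timeMs: 30, ok: false },
  { timeMs: 40, ok: true },
];
const majorityProbes = [
  { timeMs: 0, ok: true },
  { timeMs: 10, ok: true },
  { timeMs: 20, ok: false },
  { timeMs: 30, ok: false },
  { timeMs: 40, ok: false },
  { timeMs: 50, ok: true },
];

console.log(unavailabilityWindowMs(w1Probes)); // 30
console.log(unavailabilityWindowMs(majorityProbes)); // 40
```

The resolution of the result is bounded by the probe interval, so a real test would likely want probes spaced much more tightly than the failover time being measured. Running the same computation with chaining enabled and disabled would give the comparison requested above.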



 Comments   
Comment by Lauren Lewis (Inactive) [ 09/Nov/21 ]

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Comment by Tess Avitabile (Inactive) [ 15/Aug/18 ]

Thanks, mira.carey@mongodb.com. I modified this ticket so that it is about time to write-availability and time to commit-availability for a single writer during planned maintenance.

Comment by Mira Carey [ 15/Aug/18 ]

I plan to do some testing around elections, but focused on what happens after the election clears (and targeted towards mongos behavior, such as how manipulating connection pooling strategies changes things). I also suspect that the easiest test for you would be a single writer (and the time for a w:majority write to clear), whereas I'm going to want some large number of writers across many connections.

It's related testing, but not substantially overlapping.

Comment by Tess Avitabile (Inactive) [ 25/Jul/18 ]

Thanks, I will bring this up in the developer productivity quarterly planning. Failover time is very important to Atlas, so I agree that either replication, service architecture, or perf should prioritize this work.

Comment by William Schultz (Inactive) [ 25/Jul/18 ]

The attached JS script is a good starting point for testing item 2 from the ticket description.

Comment by William Schultz (Inactive) [ 06/Jul/18 ]

This testing would help us detect issues like the ones mentioned in SERVER-35835. It may also serve as a means to test the performance of the new stepdown behavior being implemented as part of the Election Handoff project (PM-1082).

Generated at Thu Feb 08 04:41:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.