[SERVER-60580] Relax timings in change_stream_shard_failover to make it less flaky Created: 08/Oct/21  Updated: 18/Oct/21  Resolved: 18/Oct/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.17
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive) Assignee: Andrew Shuvalov (Inactive)
Resolution: Won't Fix Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Operating System: ALL
Sprint: Sharding 2021-10-18, Sharding 2021-11-01
Participants:
Linked BF Score: 0

 Description   

I don't see any actual failure in the logs of the failed change_stream_shard_failover.js test, except that it spends too much time in some of the steps:

  1. The `replSetStepDown: 300` command might be blocking re-election of the former primary for too long. Perhaps try a timeout of 60 or 100 seconds.
  2. The `awaitNodesAgreeOnPrimary()` default timeout is 10 minutes. Perhaps wait with a smaller timeout. I see in the logs that this step sometimes takes too long, maybe because of #1.

I would also add better logging around those cases.
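
A rough sketch of what the two changes plus the extra logging could look like. This is not taken from the test itself: the helper name `stepDownAndAwaitPrimary`, the option names, the 2-minute agreement timeout, and the use of `force: true` are all assumptions for illustration. `rst` is assumed to behave like the jstests `ReplSetTest` fixture, whose `awaitNodesAgreeOnPrimary()` takes a timeout in milliseconds as its first argument.

```javascript
// Hypothetical helper (not part of change_stream_shard_failover.js).
// Steps down the current primary with a shorter step-down period, waits for
// the nodes to agree on a primary with a shorter timeout, and logs how long
// each phase took.
function stepDownAndAwaitPrimary(rst, options) {
    const opts = options || {};
    // 60 seconds instead of 300, as suggested in item 1.
    const stepDownSecs = opts.stepDownSecs !== undefined ? opts.stepDownSecs : 60;
    // Assumed value: 2 minutes instead of the 10-minute default (item 2).
    const agreeTimeoutMs =
        opts.agreeTimeoutMs !== undefined ? opts.agreeTimeoutMs : 2 * 60 * 1000;
    // Use the shell's print() when available, otherwise console.log.
    const log = opts.log || (typeof print === "function" ? print : console.log);

    const oldPrimary = rst.getPrimary();

    let start = Date.now();
    log("Stepping down " + oldPrimary.host + " for " + stepDownSecs + " seconds");
    // force: true is an assumption; the original test may use different options.
    const res = oldPrimary.adminCommand({replSetStepDown: stepDownSecs, force: true});
    if (!res || res.ok !== 1) {
        throw new Error("replSetStepDown failed: " + JSON.stringify(res));
    }
    const stepDownMs = Date.now() - start;
    log("replSetStepDown returned after " + stepDownMs + " ms");

    start = Date.now();
    log("Waiting up to " + agreeTimeoutMs + " ms for nodes to agree on a primary");
    rst.awaitNodesAgreeOnPrimary(agreeTimeoutMs);
    const agreeMs = Date.now() - start;
    log("Nodes agreed on a primary after " + agreeMs + " ms");

    return {stepDownSecs, agreeTimeoutMs, stepDownMs, agreeMs};
}
```

In the test this might be called as `stepDownAndAwaitPrimary(st.rs0, {stepDownSecs: 100})`, assuming the sharded cluster fixture is named `st`. The logged durations would show directly whether the slow step is the step-down itself or the wait for agreement.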



 Comments   
Comment by Andrew Shuvalov (Inactive) [ 18/Oct/21 ]

Need more investigation...
