[SERVER-51100] Make dry-run elections write to lastVote like real elections do Created: 22/Sep/20  Updated: 07/Apr/23

Status: Blocked
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Amar Hamzeh Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 0
Labels: former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-64068 Relax requirement on catchup takeover... Open
Duplicate
is duplicated by SERVER-57612 Investigate whether we can perform a ... Closed
Related
related to SERVER-71536 Investigate if non-replicated collect... Backlog
is related to SERVER-75007 Rollback shouldn't block elections Backlog
Assigned Teams:
Replication
Sprint: Replication 2021-12-27, Replication 2022-01-24, Replication 2022-02-07, Repl 2022-02-21
Participants:

 Description   

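Dry-run elections do not write to lastVote, so they acquire no locks (note the empty locks:{} in the dry-run log line below). As a result, nodes blocked on the Global lock can vote in dry runs but would block on actual elections.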

Here is an example:

dry run:

2020-09-18T09:55:22.110+1000 I COMMAND  [conn15] command admin.$cmd command: replSetRequestVotes { replSetRequestVotes: 1, setName: "replset", dryRun: true, term: 1, candidateIndex: 2, configVersion: 1, lastCommittedOp: { ts: Timestamp(1600386902, 1), t: 1 }, $clusterTime: { clusterTime: Timestamp(1600386902, 1), signature: { hash: BinData(0, 6DB77428788E3FCF244CC5F5AD4BCBE5DC14E2FB), keyId: 6873609104389046273 } }, $db: "admin" } numYields:0 reslen:204 locks:{} protocol:op_msg 0ms

actual election:

2020-09-18T09:55:22.130+1000 I COMMAND  [conn15] command local.replset.election command: replSetRequestVotes { replSetRequestVotes: 1, setName: "replset", dryRun: false, term: 2, candidateIndex: 2, configVersion: 1, lastCommittedOp: { ts: Timestamp(1600386902, 1), t: 1 }, $clusterTime: { clusterTime: Timestamp(1600386902, 1), signature: { hash: BinData(0, 6DB77428788E3FCF244CC5F5AD4BCBE5DC14E2FB), keyId: 6873609104389046273 } }, $db: "admin" } numYields:0 reslen:204 locks:{ Global: { acquireCount: { r: 2, w: 1 } }, Database: { acquireCount: { r: 1, w: 1 } }, Collection: { acquireCount: { r: 1, w: 1 } } } storage:{} protocol:op_msg 6ms

Should we consider acquiring intent locks on dry runs?
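
To make the asymmetry concrete, here is a minimal toy model in Python. It is not MongoDB code: ToyVoter, request_vote, persist_on_dry_run, and the timeout are hypothetical names invented for illustration, and a plain mutex stands in for the Global lock (the real server uses intent and exclusive lock modes, which this sketch does not model). It only shows why a node whose lock is held elsewhere can answer a dry run yet stall on the real vote, and how a dry run that also needed the lock would stall in the same way.

```python
import threading


class ToyVoter:
    """Toy stand-in for a voting node. Hypothetical, not MongoDB code."""

    def __init__(self):
        # A plain mutex standing in for the Global lock (assumption:
        # lock modes such as IX/X are deliberately not modelled).
        self.global_lock = threading.Lock()
        # Records of votes that were "persisted" while holding the lock.
        self.persisted = []

    def request_vote(self, term, candidate, dry_run,
                     persist_on_dry_run=False, timeout=0.1):
        """Return True if a vote response could be produced in time."""
        needs_lock = (not dry_run) or persist_on_dry_run
        if not needs_lock:
            # Lock-free path: matches locks:{} in the dry-run log line.
            return True
        # A real server would block here; the toy model gives up after
        # `timeout` seconds so the example terminates.
        if not self.global_lock.acquire(timeout=timeout):
            return False
        try:
            self.persisted.append((term, candidate, dry_run))
            return True
        finally:
            self.global_lock.release()


voter = ToyVoter()
voter.global_lock.acquire()  # simulate another operation holding the lock

dry = voter.request_vote(term=1, candidate=2, dry_run=True)
real = voter.request_vote(term=2, candidate=2, dry_run=False)
proposed = voter.request_vote(term=1, candidate=2, dry_run=True,
                              persist_on_dry_run=True)
print(dry, real, proposed)
```

In this model the lock-free dry run succeeds while the real vote times out, which is the mismatch reported above. With persist_on_dry_run=True the dry run fails too, so the candidate would learn before stepping up that this voter cannot currently cast a real vote.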



 Comments   
Comment by Frederic Vitzikam [ 02/Mar/22 ]

This is blocked by SERVER-64068: the primary's vote is required for a catchup takeover dry run, but the primary might be unable to perform a write, either due to a disk issue (as in the HELP ticket) or because hangBeforeReconfigOnDrainComplete caused it to hang while holding the RSTL (as in catchup_takeover_with_higher_config.js).

Comment by Siyuan Zhou [ 28/Sep/20 ]

If I understand correctly, this is the problem described in SERVER-21307. Secondaries stopped acquiring the global lock in X mode in 4.2 due to SERVER-39372, so secondary index builds no longer conflict with voting, which acquires the global lock in IX mode. This issue therefore only happens on 4.0. CC geert.bosch to confirm.

Comment by Tess Avitabile (Inactive) [ 28/Sep/20 ]

Thanks for reporting this, amar.hamzeh. I'm changing this to an Improvement, rather than a Bug. I agree that this would provide better behavior in some cases, since we would avoid losing the primary. Any deadlock in which this feature would be helpful should be treated as a Bug, and we should fix it.
