Core Server / SERVER-22502

Replication Protocol 1 rollbacks are more likely during priority takeover


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Duplicate
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Replication
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL

    Description

      General note: I know the title is too general, but this is the 3rd bug I'm opening this week. We have another one coming for 3.2.1, related to sharding, which we will publish soon. We are thinking of moving off MongoDB; the reliability of 3.2 is horrible!

      2 bugs in this ticket:

      1. We removed a member using rs.remove(). After that, the removed member (whose log is attached) started a versioning mess and killed itself.
      (attachment filename = crash)

      2. The 2nd time, we saw the following behavior: a member elects itself, although it doesn't need to, and causes a rollback on the other member.
      Our setup: primary, secondary, and arbiter.
      Primary: rs.stepDown() for maintenance.
      Secondary takes over.
      When the old primary is back, it starts syncing. As you can see from the logs, during this time it receives 2 "no" votes since it is still stale, but then it receives only 1 "yes" vote (for some reason, the arbiter is quiet) and is elected before its time. This causes a rollback on the other node.
      All 3 nodes' logs are attached (primary, secondary, arb). Please note the following lines:

      2016-02-07T09:38:14.612+0000 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from in.db2m2.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
      2016-02-07T09:38:14.612+0000 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from in.db2arb.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
      

      and after 9 seconds, suddenly:

      2016-02-07T09:38:25.613+0000 I REPL     [ReplicationExecutor] VoteRequester: Got no vote from in.db2m2.mydomain.com:27017 because: candidate's data is staler than mine, resp:{ term: 3, voteGranted: false, reason: "candidate's data is staler than mine", ok: 1.0 }
      2016-02-07T09:38:25.613+0000 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
      2016-02-07T09:38:25.614+0000 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 4
      2016-02-07T09:38:25.614+0000 I REPL     [ReplicationExecutor] transition to PRIMARY
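The "dry election run succeeded, running for election" line above reflects a two-phase, raft-style election: the candidate first asks for votes without bumping the term, and only starts a real election (in the next term) if the dry run would win. A minimal sketch of that sequence, with illustrative names that are not MongoDB internals:

```python
# Hedged sketch of a raft-style two-phase election, matching the log
# sequence above (dry run in term 3, then "assuming primary role in term 4").
# `request_votes` and its signature are illustrative assumptions.
def run_for_primary(request_votes, current_term: int):
    # Phase 1: dry run -- solicit votes WITHOUT incrementing the term,
    # so a doomed candidacy cannot disturb an established primary.
    if not request_votes(term=current_term, dry_run=True):
        return current_term, False
    # Phase 2: real election in the next term.
    new_term = current_term + 1
    won = request_votes(term=new_term, dry_run=False)
    return new_term, won

# Example: if both rounds succeed starting from term 3, the node
# becomes primary in term 4, as in the logs above.
term, won = run_for_primary(lambda term, dry_run: True, 3)
assert (term, won) == (4, True)
```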
      

      All members are on protocol version 1. They were on version 0 but were upgraded according to your docs ~a week ago.
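On the "only 1 yes vote" point: in a 3-member set (primary, secondary, arbiter) the election majority is 2, and a candidate votes for itself, so a single "yes" from any other member is arithmetically enough to win even if the arbiter stays silent. A quick sketch of that quorum math (not MongoDB source, just the arithmetic):

```python
# Hedged sketch: majority quorum arithmetic for a 3-voting-member
# replica set (primary, secondary, arbiter). Function names are illustrative.
def majority(n_voting_members: int) -> int:
    """Votes needed to win an election."""
    return n_voting_members // 2 + 1

def election_wins(yes_votes_from_others: int, n_voting_members: int = 3) -> bool:
    # A candidate always counts its own vote.
    return 1 + yes_votes_from_others >= majority(n_voting_members)

# With 3 voting members, majority is 2: one external "yes" plus the
# self-vote wins, so the arbiter's silence does not block the election.
assert majority(3) == 2
assert election_wins(yes_votes_from_others=1)      # elected
assert not election_wins(yes_votes_from_others=0)  # self-vote alone fails
```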

      Attachments

        1. arb (3 kB, Yoni Douek)
        2. crash (19 kB, Yoni Douek)
        3. primary (25 kB, Yoni Douek)
        4. secondary (53 kB, Yoni Douek)

    Issue Links

    Activity

    People

        Assignee: milkie@mongodb.com Eric Milkie
        Reporter: yonido Yoni Douek
        Votes: 0
        Watchers: 22

    Dates

        Created:
        Updated:
        Resolved: