[SERVER-29265] STARTUP2 and WriteConcern=majority don't work well together Created: 18/May/17  Updated: 27/Oct/23  Resolved: 18/May/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: New Feature Priority: Major - P3
Reporter: Jan-Kees van Andel Assignee: Andy Schwerin
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:

 Description   

Hi,

Problem:

We just did a rolling upgrade from 3.0.14 to 3.4.3 and noticed some strange behavior while the new nodes were doing their initial sync: all writes take 5 seconds to complete, making the app very slow and leading to timeouts and, eventually, exhausted connection pools.

Our set-up:

The app is a Play-Scala application using the ReactiveMongo driver (0.12.2). We use w: majority and wtimeout: 5000 as our write concern (see the sketch below).
The DB is a replica set of three data-bearing nodes and two arbiters.
Everything is hosted on AWS EC2, using EBS volumes.
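
For reference, here is the write concern we use, sketched with the MongoDB Java driver's WriteConcern class rather than the ReactiveMongo API we actually configure it with; it is only meant to illustrate the w/wtimeout combination:

    import java.util.concurrent.TimeUnit
    import com.mongodb.WriteConcern

    // Writes performed with this concern wait until a majority of voting members
    // have acknowledged them, or the wait fails after 5000 ms.
    val majorityWithTimeout: WriteConcern =
      WriteConcern.MAJORITY.withWTimeout(5000, TimeUnit.MILLISECONDS)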

It's a pretty big database; the initial sync takes around 1.5 hours, including index creation.

Steps

  • We add the new instances to the replica set (on top of the three existing nodes), and they start an initial sync (from scratch, so not from an AWS snapshot).
  • Once the new nodes are in the cluster, all writes start to take exactly 5 seconds (the wtimeout value). We don't see much load on the app or on the primary, though.
  • After digging into the problem, we reconfigured the new nodes with priority: 0 and votes: 0 (sketched just below this list), and from that moment on the whole app worked properly again.
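
Roughly, the workaround amounts to the following reconfiguration (sketched here in Scala with the MongoDB Java driver; we did not run it exactly like this, the host names are placeholders, and the same change can be made from the mongo shell):

    import com.mongodb.client.MongoClients
    import org.bson.Document
    import scala.collection.JavaConverters._

    // Fetch the current replica set configuration from the primary, set priority
    // and votes to 0 for the still-syncing member, bump the config version and
    // submit it again via replSetReconfig. Host names are placeholders.
    val client = MongoClients.create("mongodb://primary-host:27017")
    val admin  = client.getDatabase("admin")

    val config = admin.runCommand(new Document("replSetGetConfig", 1))
      .get("config").asInstanceOf[Document]

    val members = config.get("members").asInstanceOf[java.util.List[Document]]
    for (m <- members.asScala if m.getString("host") == "new-node:27017") {
      m.put("priority", 0) // never eligible to become primary
      m.put("votes", 0)    // no longer counted towards the write majority
    }

    config.put("version", config.getInteger("version") + 1)
    admin.runCommand(new Document("replSetReconfig", config))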

So we have a workaround, but having to rely on it seems undesirable. It also looks a bit odd to me: since the rest of the cluster is fast, there should be no need to wait for the new instances in order to reach a majority at all, right? Three nodes out of five respond quickly, so that should be enough to acknowledge the write.

Also, during the initial sync I don't see the need for writes to be sent to the node, since it's far from caught up and will replay those writes afterwards anyway.

Is it possible to ignore nodes that are in STARTUP2 when replicating writes? Or maybe another solution would be for such a node to quickly return a special acknowledgement while it is in STARTUP2, so the primary can return the write to the application sooner?



 Comments   
Comment by Andy Schwerin [ 18/May/17 ]

To prevent w:majority writes from rolling back, it is mandatory that a majority of voting nodes confirm a write before the server responds to the client. Other behaviors could lead to loss of w:majority-confirmed writes in the event of an election. As such, the two-phase process for adding new nodes (first add them as non-voting members, then promote them to voting members once their initial sync has completed) is advisable, and it is implemented in MongoDB's Ops Manager and Cloud Manager products.
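
As a rough illustration of the second phase (this is not the exact Ops Manager implementation, and the host names are placeholders): the new member is only promoted back to a voting member once its initial sync has finished, which can be observed through replSetGetStatus, e.g. from Scala with the MongoDB Java driver:

    import com.mongodb.client.MongoClients
    import org.bson.Document
    import scala.collection.JavaConverters._

    // Read replSetGetStatus and look up the state of the newly added member.
    // It reports STARTUP2 while initial sync is running and SECONDARY once it
    // has caught up; only then is it reconfigured back to votes: 1 / priority: 1.
    val client = MongoClients.create("mongodb://primary-host:27017")
    val status = client.getDatabase("admin").runCommand(new Document("replSetGetStatus", 1))

    val newMemberState = status.get("members").asInstanceOf[java.util.List[Document]].asScala
      .find(_.getString("name") == "new-node:27017")
      .map(_.getString("stateStr"))

    println(newMemberState) // Some("STARTUP2") during initial sync, Some("SECONDARY") afterwards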

Having two arbiters is a bit of a surprise. Because you have two arbiters and 3 data-bearing nodes to begin with, and arbiters cannot acknowledge writes, all three data-bearing nodes must confirm every w:majority write (as opposed to only two of them if you had no arbiters). When you add the 4th data-bearing node, the total number of voting nodes in the system is now 6, and the number of nodes that must confirm majority writes increases from 3 to 4, giving you the behavior you're experiencing.
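
To make that arithmetic explicit (a plain Scala sketch of the counting, not MongoDB code):

    // The write majority is computed over voting members: floor(n / 2) + 1.
    // Arbiters vote, but can never acknowledge a write, so only data-bearing
    // members can count towards satisfying it.
    def writeMajority(votingMembers: Int): Int = votingMembers / 2 + 1

    val before = writeMajority(5) // 3 data-bearing + 2 arbiters: all 3 data-bearing nodes must ack
    val after  = writeMajority(6) // 4 data-bearing + 2 arbiters: 4 must ack, incl. the syncing node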

Your current architecture will make it impossible to confirm w:majority writes when one data-bearing node fails. Given that you have three (soon to be four) data-bearing nodes, you should consider whether the arbiters are providing you with any benefit.
