[SERVER-78115] Shard primaries must commit a majority write before using new routing information from the config server
Created: 15/Jun/23  Updated: 02/Feb/24  Resolved: 22/Aug/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.23, 4.0.28, 4.2.24, 7.1.0-rc0, 6.0.6, 4.4.22, 5.0.18, 7.0.0-rc3
Fix Version/s: 7.1.0-rc0, 7.0.3, 6.0.12, 5.0.23

Type: Bug
Priority: Major - P3
Reporter: Allison Easton
Assignee: Allison Easton
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: repro.js
Issue Links:
Backports
Depends
depends on SERVER-78505 Database cache does not use the 'allo... Closed
depends on SERVER-80183 Remove operationTime check from store... Closed
is depended on by SERVER-79609 Fix `findAndModify_upsert.js` test to... Closed
Problem/Incident
causes SERVER-84623 Shard-local re-execution of a command... In Code Review
causes SERVER-80712 Avoid leaving the replica set shard p... Closed
is caused by SERVER-35092 ShardServerCatalogCacheLoader should ... Closed
Related
related to SERVER-30797 Shard primaries must commit a majorit... Closed
related to SERVER-79483 Investigate if tests should check ope... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v7.0, v6.0, v5.0
Sprint: Sharding EMEA 2023-06-26, Sharding EMEA 2023-07-10, Sharding EMEA 2023-07-24, Sharding EMEA 2023-08-07, Sharding EMEA 2023-08-21, Sharding EMEA 2023-09-04
Participants:
Linked BF Score: 152

 Description   

In SERVER-30797, a majority write was added to the refresh path on primaries after fetching new routing information from the config server. This write ensured that the node which fetched the routing information was actually the majority primary, preventing incorrect filtering information from being applied in split brain scenarios.

This write was removed in SERVER-35092 since it was believed to be unnecessary and was causing stalls when a refresh happened without a majority of nodes available.

However, the split-brain scenario for which the majority write was added is still a problem, and since the removal of that write it can occur again. The scenario is as follows:

  • Suppose we have a two-shard cluster with three nodes per shard, where chunk (min, 0) is on shard 0 and chunk (0, max) is on shard 1, with one document in each chunk.
  • A network partition separates the primary of shard 0 from its secondaries, and one of those secondaries steps up (creating a split-brain scenario).
  • Chunk (0, max) is moved back to shard 0.
  • A mongoS that hasn't learned about the new primary on shard 0 routes a majority read to the old primary.
  • The old primary (which still believes itself to be primary) fetches the new routing information from the config server.

In this case, the old primary will respond to the majority read using the newest filtering information but without ever having seen the chunk migration.
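
The scenario above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not server code: the `Shard0Node` class, the routing dictionaries, and the document keys -5 and 5 are all hypothetical stand-ins chosen for illustration. It shows an isolated old primary adopting the post-migration filtering information while holding only pre-migration data.

```python
from dataclasses import dataclass, field

MIN, MAX = float("-inf"), float("inf")

# Hypothetical routing tables: shard name -> list of owned [lo, hi) ranges.
ROUTING_BEFORE = {"shard0": [(MIN, 0)], "shard1": [(0, MAX)]}
ROUTING_AFTER = {"shard0": [(MIN, 0), (0, MAX)], "shard1": []}


@dataclass
class Shard0Node:
    """Toy stand-in for one member of shard 0 (not real server code)."""
    docs: list = field(default_factory=list)
    owned: list = field(default_factory=list)

    def refresh(self, routing):
        # Adopt filtering information without checking primary status.
        self.owned = list(routing["shard0"])

    def read(self):
        # Return only documents inside ranges this node believes it owns.
        return sorted(d for d in self.docs
                      if any(lo <= d < hi for lo, hi in self.owned))


# Both nodes start with the one document in (min, 0).
old_primary = Shard0Node(docs=[-5])
new_primary = Shard0Node(docs=[-5])
old_primary.refresh(ROUTING_BEFORE)
new_primary.refresh(ROUTING_BEFORE)

# After the partition, the migration delivers the document in (0, max)
# only to the new primary's side of the replica set.
new_primary.docs.append(5)
new_primary.refresh(ROUTING_AFTER)

# The isolated old primary still refreshes from the config server.
old_primary.refresh(ROUTING_AFTER)

stale_result = old_primary.read()      # misses the migrated document
correct_result = new_primary.read()    # sees both documents
print(stale_result, correct_result)
```

In this model the old primary claims ownership of the whole key range but answers with only the document it had before the partition, which mirrors the incorrect majority read described above.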

This can also affect secondaries that refresh via the node that believes itself to be primary, causing their filtering information to be ahead of the data they have.

The solution here is to restore the majority noop write in the ShardServerCatalogCacheLoader (SSCCL). This ensures that if new filtering information is found, it can only be used and sent to secondaries by the actual primary of the replica set.
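
A minimal sketch of the gating idea, assuming a simplified model in which a node counts how many replica set members it can reach: the `ShardPrimary` class, `majority_noop_write`, and `NotMajorityPrimary` are hypothetical names for illustration and do not correspond to the actual SSCCL code. One deliberate simplification is that the toy write fails immediately when a majority is unreachable; a real majority write would instead wait for acknowledgement.

```python
class NotMajorityPrimary(Exception):
    """Raised when the toy no-op write cannot reach a majority."""


class ShardPrimary:
    """Toy model of a node that believes it is primary (not server code)."""

    def __init__(self, set_size, reachable_peers, routing_version):
        self.set_size = set_size
        self.reachable_peers = reachable_peers
        self.routing_version = routing_version

    def majority_noop_write(self):
        # The node acknowledges its own write, plus every peer it can reach.
        acks = 1 + self.reachable_peers
        if acks <= self.set_size // 2:
            raise NotMajorityPrimary(f"only {acks}/{self.set_size} acks")

    def refresh(self, fetched_version):
        if fetched_version > self.routing_version:
            # Gate: prove majority-primary status before adopting new info.
            self.majority_noop_write()
            self.routing_version = fetched_version
        return self.routing_version


# Three-node shard: the old primary is cut off, the new one reaches a peer.
isolated_old_primary = ShardPrimary(set_size=3, reachable_peers=0, routing_version=1)
real_primary = ShardPrimary(set_size=3, reachable_peers=1, routing_version=1)

try:
    isolated_old_primary.refresh(fetched_version=2)
    old_refresh_succeeded = True
except NotMajorityPrimary:
    old_refresh_succeeded = False

new_version = real_primary.refresh(fetched_version=2)
```

Under these assumptions the isolated node keeps its old routing version, so it can neither filter with the new information nor hand it to secondaries, while the node that can reach a majority adopts the new version.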



 Comments   
Comment by Githook User [ 30/Oct/23 ]

Author: Allison Easton <allison.easton@mongodb.com> (allisoneaston)

Message: SERVER-78115 Shard primaries must commit a majority write before using new routing information from the config server

(cherry picked from commit bd44ce15ffa79d6234221ece8320a9f1775b8042)
Branch: v5.0
https://github.com/mongodb/mongo/commit/76a3cd802008d52d40a0c138fd74f04ad07d8043

Comment by Githook User [ 05/Oct/23 ]

Author: Allison Easton <allison.easton@mongodb.com> (allisoneaston)

Message: SERVER-78115 Shard primaries must commit a majority write before using new routing information from the config server

(cherry picked from commit bd44ce15ffa79d6234221ece8320a9f1775b8042)
Branch: v6.0
https://github.com/mongodb/mongo/commit/8b0a1ea41575fa384c3de09c7e3db208bba42928

Comment by Githook User [ 19/Sep/23 ]

Author: Allison Easton <allison.easton@mongodb.com> (allisoneaston)

Message: SERVER-78115 Shard primaries must commit a majority write before using new routing information from the config server

(cherry picked from commit bd44ce15ffa79d6234221ece8320a9f1775b8042)
Branch: v7.0
https://github.com/mongodb/mongo/commit/fec65ddc5c7cb7b6c90efc5fd0eece6be1892a8c

Comment by Githook User [ 22/Aug/23 ]

Author: Allison Easton <allison.easton@mongodb.com> (allisoneaston)

Message: SERVER-78115 Shard primaries must commit a majority write before using new routing information from the config server
Branch: master
https://github.com/mongodb/mongo/commit/bd44ce15ffa79d6234221ece8320a9f1775b8042

Comment by Githook User [ 28/Jul/23 ]

Author: Wenbin Zhu <wenbin.zhu@mongodb.com> (WenbinZhu)

Message: Revert "SERVER-78115 Shard primaries must commit a majority write before using new routing information from the config server"

This reverts commit e80686185de1633cc657ea2c00e842dc4b402470.
Branch: master
https://github.com/mongodb/mongo/commit/c584baf01c3678fbbfe4585cb7366357d48ac59b

Comment by Githook User [ 27/Jul/23 ]

Author: Allison Easton <allison.easton@mongodb.com> (allisoneaston)

Message: SERVER-78115 Shard primaries must commit a majority write before using new routing information from the config server
Branch: master
https://github.com/mongodb/mongo/commit/e80686185de1633cc657ea2c00e842dc4b402470
