[SERVER-64730] The 'forceShardFilteringMetadataRefresh' methods don't synchronise with each other (5.0 and newer versions) Created: 21/Mar/22  Updated: 29/Oct/23  Resolved: 21/Dec/22

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.0.5, 5.1.1, 4.2.19, 5.2.1, 4.4.13, 5.3.0-rc4
Fix Version/s: 6.1.1, 5.0.14, 6.0.2, 6.2.0-rc0

Type: Bug Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: [DO NOT USE] Backlog - Sharding EMEA
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Problem/Incident
is caused by SERVER-40258 Relax locking requirements for shardi... Closed
Related
related to SERVER-72322 The 'forceShardFilteringMetadataRefre... Open
related to SERVER-42838 A slow thread in forceShardFilteringM... Closed
Assigned Teams:
Sharding EMEA
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.1, v6.0, v5.0
Sprint: Sharding EMEA 2022-04-04, Sharding EMEA 2022-04-18, Sharding EMEA 2022-05-02, Sharding EMEA 2022-05-30, Sharding EMEA 2022-06-13, Sharding EMEA 2022-06-27, Sharding EMEA 2022-07-11, Sharding EMEA 2022-07-25, Sharding EMEA 2022-08-08, Sharding EMEA 2022-08-22, Sharding EMEA 2022-09-05, Sharding EMEA 2022-09-19, Sharding EMEA 2022-10-03, Sharding EMEA 2022-10-17, Sharding EMEA 2022-10-31, Sharding EMEA 2022-11-14, Sharding EMEA 2022-12-12
Participants:

 Description   

The forceShardFilteringMetadataRefresh method is the lowest-level shard version causality utility on the shards, whose purpose is to always move the shard version forward.

In versions 4.0 and earlier, it acquired the collection X lock and checked that the version about to be installed was actually newer than the one on the CSS before installing it. Starting from version 4.2, as part of the transactions project, it was changed to no longer acquire the collection X lock.

This means that two concurrent invocations of forceShardFilteringMetadataRefresh could race with each other and install versions that are not monotonically increasing (i.e., the shard version on a shard can go back in time).
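The difference between the two behaviours can be illustrated with a minimal sketch. This is not the server's code: `ShardVersion`, `ToyCSS`, `install_if_newer` and `install_unchecked` are hypothetical names, the version is reduced to a (major, minor) pair, and a plain mutex stands in for the collection X lock.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True, order=True)
class ShardVersion:
    # Simplified: compared field by field as (major, minor).
    major: int
    minor: int


class ToyCSS:
    """Hypothetical stand-in for the per-collection sharding state."""

    def __init__(self):
        self._lock = threading.Lock()  # plays the role of the collection X lock
        self.version = ShardVersion(0, 0)

    def install_if_newer(self, candidate):
        # Check and install happen under the same lock, so a stale
        # refresh can never overwrite a newer installed version.
        with self._lock:
            if candidate > self.version:
                self.version = candidate
                return True
            return False

    def install_unchecked(self, candidate):
        # No lock and no comparison: the last writer wins, even if stale.
        self.version = candidate


# Two refreshes finish out of order: the one carrying version 5 installs
# first, then a slower one carrying the older version 3 arrives.
guarded = ToyCSS()
guarded.install_if_newer(ShardVersion(5, 0))
guarded.install_if_newer(ShardVersion(3, 0))  # rejected as stale

racy = ToyCSS()
racy.install_unchecked(ShardVersion(5, 0))
racy.install_unchecked(ShardVersion(3, 0))  # version goes back in time
```

With the guarded install the state ends at version 5; with the unchecked install it ends at version 3, which is the regression described above.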

UPDATE
After working a bit on this ticket and backporting it to previous versions, we believe it has already been addressed in 5.0 and newer versions (see the fix versions to find the minor version in which the fix landed). In short, when a DDL operation is installing new metadata using the critical section, we cancel any ongoing onShardVersionMismatch metadata refresh, so we don't have to worry about the interleaving of these two operations. Note that any onShardVersionMismatch that arrives after the critical section is acquired will block behind it. The same happens when we clear the filtering metadata.
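The interaction described in the update can be sketched roughly as follows. This is a toy model under stated assumptions, not the actual implementation: `ToyCollectionState`, `RefreshCancelled` and all method names are hypothetical, versions are plain integers, and at most one refresh is assumed to be in flight at a time.

```python
import threading


class RefreshCancelled(Exception):
    """Raised when a refresh was interrupted by the critical section."""


class ToyCollectionState:
    def __init__(self):
        self._mutex = threading.Lock()
        self._cs_free = threading.Event()  # set while no critical section is held
        self._cs_free.set()
        self._in_flight = None  # cancellation token of the ongoing refresh
        self.version = 0

    def enter_critical_section(self):
        with self._mutex:
            self._cs_free.clear()
            if self._in_flight is not None:
                self._in_flight.set()  # interrupt the ongoing refresh

    def exit_critical_section(self, new_version):
        with self._mutex:
            self.version = new_version
            self._cs_free.set()

    def begin_refresh(self):
        # A refresh arriving while the critical section is held waits here.
        while True:
            self._cs_free.wait()
            with self._mutex:
                if self._cs_free.is_set():
                    self._in_flight = threading.Event()
                    return self._in_flight

    def finish_refresh(self, token, fetched_version):
        with self._mutex:
            if self._in_flight is token:
                self._in_flight = None
            if token.is_set():
                raise RefreshCancelled("critical section entered during refresh")
            self.version = max(self.version, fetched_version)


state = ToyCollectionState()
token = state.begin_refresh()   # refresh starts and fetches stale version 3
state.enter_critical_section()  # DDL begins and cancels that refresh
try:
    state.finish_refresh(token, 3)
    cancelled = False
except RefreshCancelled:
    cancelled = True
state.exit_critical_section(7)  # DDL installs its metadata and releases

late = state.begin_refresh()    # arrives after release, proceeds normally
state.finish_refresh(late, 7)
```

In this run the first refresh is cancelled rather than installing its stale result, and the state ends at the version installed by the DDL operation. In a multi-threaded run, a `begin_refresh` call made while the critical section is held would wait until `exit_critical_section` is called.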

The versions that still have this bug are 4.4 and 4.2. I propose investigating these two versions and opening a new ticket on how to fix them. Sending it to Needs Scheduling so we properly triage this task. SERVER-72322 will track this issue on the 4.4 and 4.2 branches.



 Comments   
Comment by Kelsey Schubert [ 16/Dec/22 ]

sergi.mateo-bellido@mongodb.com, could you clarify the status of this ticket and where this issue is resolved for me?

It's currently open, but I see a lot of commits and fix versions.

Comment by Githook User [ 13/Oct/22 ]

Author: Sergi Mateo Bellido <sergi.mateo-bellido@mongodb.com> (smateo)

Message: SERVER-64730 Interrupt ongoing refreshes after entering into the critical section

(cherry picked from commit 343108041c5b3570e97418ee3204804535fbde4d)
Branch: v6.1
https://github.com/mongodb/mongo/commit/7841ccaf02170df756499cbe3258171e4567db45

Comment by Githook User [ 03/Oct/22 ]

Author: Sergi Mateo Bellido <sergi.mateo-bellido@mongodb.com> (smateo)

Message: SERVER-64730 Interrupt ongoing refreshes after entering into the critical section

Minor changes to resharding_test_fixture to work with legacy OP_QUERY
(cherry picked from commit 343108041c5b3570e97418ee3204804535fbde4d)
Branch: v5.0
https://github.com/mongodb/mongo/commit/c2a779e969333570bcaa7c8f03a59a27c5250d15

Comment by Githook User [ 14/Sep/22 ]

Author: Sergi Mateo Bellido <sergi.mateo-bellido@mongodb.com> (smateo)

Message: Revert "SERVER-64730 Interrupt ongoing refreshes after entering into the critical section"

This reverts commit e5de2ad1815d9a9a6a0783c520985cdb0d2a3f06.
Branch: v5.0
https://github.com/mongodb/mongo/commit/d23fde882fbf40b56b906a25d4f1a9c127574b99

Comment by Githook User [ 14/Sep/22 ]

Author: Sergi Mateo Bellido <sergi.mateo-bellido@mongodb.com> (smateo)

Message: SERVER-64730 Interrupt ongoing refreshes after entering into the critical section

(cherry picked from commit 343108041c5b3570e97418ee3204804535fbde4d)
Branch: v6.0
https://github.com/mongodb/mongo/commit/c94a993632dccda780e233f9e19f943c2a4f1707

Comment by Githook User [ 14/Sep/22 ]

Author: Sergi Mateo Bellido <sergi.mateo-bellido@mongodb.com> (smateo)

Message: SERVER-64730 Interrupt ongoing refreshes after entering into the critical section

(cherry picked from commit 343108041c5b3570e97418ee3204804535fbde4d)
Branch: v5.0
https://github.com/mongodb/mongo/commit/e5de2ad1815d9a9a6a0783c520985cdb0d2a3f06

Comment by Githook User [ 07/Sep/22 ]

Author: Sergi Mateo Bellido <sergi.mateo-bellido@mongodb.com> (smateo)

Message: SERVER-64730 Interrupt ongoing refreshes after entering into the critical section
Branch: master
https://github.com/mongodb/mongo/commit/343108041c5b3570e97418ee3204804535fbde4d

Generated at Thu Feb 08 06:01:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.