[SERVER-37051] ShardServerCatalogCacheLoader does not check the internal term after reading from the task queue
Created: 07/Sep/18  Updated: 29/Oct/23  Resolved: 11/Sep/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.7, 4.0.2
Fix Version/s: 3.6.10, 4.0.5, 4.1.3

Type: Bug
Priority: Major - P3
Reporter: Kaloian Manassiev
Assignee: Kaloian Manassiev
Resolution: Fixed
Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.0, v3.6
Sprint: Sharding 2018-09-24
Participants:
Linked BF Score: 57

 Description   

There is a race condition in ShardServerCatalogCacheLoader: if a shard node is running as primary and a step-down happens, it may read an in-memory task queue and persisted cache state that are not consistent with each other.

Specifically, consider a primary node that has found new data when reading from the config server. It schedules this data to be persisted to the cache collections, and then merges the task queue with what is already persisted in order to produce the list of changed chunks.

In a stepdown-free case, this would work fine. However, if the node has stepped down by the time it reads what it persisted and what is on the queue, the write to the cache collections may never have happened and nothing remains on the task queue because of the change in term. The load could then come back with incomplete data (which would amount to data loss), or with an empty list, which will trip an invariant.
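
The window described above can be illustrated with a small model. This is only a sketch under stated assumptions: `Chunk`, `LoaderModel`, and all of its members are hypothetical stand-ins invented for illustration, not the real ShardServerCatalogCacheLoader types, and the model assumes that a step-down discards every task queued under the old term before any of them was written.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a chunk entry; not the real MongoDB type.
struct Chunk {
    std::string minKey;
    std::uint64_t version;
};

// Hypothetical model of the loader state on one shard node.
struct LoaderModel {
    std::map<std::string, Chunk> persisted;    // models the cache collections
    std::deque<std::vector<Chunk>> taskQueue;  // models pending persist tasks
    long long term = 1;                        // models the internal term

    // As primary: schedule freshly fetched chunks to be persisted later.
    void schedulePersist(std::vector<Chunk> chunks) {
        assert(!chunks.empty());  // this model never schedules an empty diff
        taskQueue.push_back(std::move(chunks));
    }

    // Assumption: a step-down bumps the term and drops tasks that were
    // queued under the old term before they reached the cache collections.
    void stepDown() {
        ++term;
        taskQueue.clear();
    }

    // Merge persisted state with queued tasks; queued entries take precedence.
    std::vector<Chunk> readPersistedPlusQueued() const {
        std::map<std::string, Chunk> merged = persisted;
        for (const auto& task : taskQueue) {
            for (const auto& chunk : task) {
                merged[chunk.minKey] = chunk;
            }
        }
        std::vector<Chunk> out;
        for (const auto& entry : merged) {
            out.push_back(entry.second);
        }
        return out;
    }
};
```

In this model, calling `schedulePersist` and then `readPersistedPlusQueued` returns the scheduled chunks, while calling `stepDown` between the two makes the same read return an empty list, which corresponds to the case that trips the invariant.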

To fix this, after reading from the task queue and the persisted cache, we should check whether the term has changed and, if it has, throw a ConflictingOperationInProgress error so that the load can be retried as a secondary.
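
The shape of the proposed check can be sketched as follows. This is a minimal illustration, not the code from the actual commit: `LoaderState`, `readWithTermCheck`, and the exception struct are hypothetical names, with the struct standing in for the server's ConflictingOperationInProgress error code. The sketch assumes the caller captured the term before scheduling the persist task.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Hypothetical exception standing in for the ConflictingOperationInProgress error.
struct ConflictingOperationInProgress : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for a chunk entry.
struct Chunk {
    int version;
};

// Hypothetical snapshot of what the loader can observe.
struct LoaderState {
    long long term = 1;                      // current internal term
    std::vector<Chunk> persistedPlusQueued;  // merged view of cache + task queue
};

// Read the merged metadata first, then verify that the term is still the one
// captured when the load started. If it is not, the data just read may be
// incomplete or empty, so it is discarded by throwing instead of returned.
std::vector<Chunk> readWithTermCheck(const LoaderState& state, long long termAtStart) {
    assert(termAtStart <= state.term);  // terms only move forward in this model

    std::vector<Chunk> chunks = state.persistedPlusQueued;  // the racy read

    if (state.term != termAtStart) {
        throw ConflictingOperationInProgress(
            "term changed while reading queued and persisted metadata");
    }
    return chunks;
}
```

The check is placed after the read on purpose: a check made only before the read would still leave a window in which a step-down empties the queue. A caller would catch the exception and retry the load along the secondary path.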



 Comments   
Comment by Githook User [ 21/Nov/18 ]

Author: Kaloian Manassiev <kaloian.manassiev@mongodb.com> (kaloianm)

Message: SERVER-37051 Check for term change after fetching the queued metadata in ShardServerCatalogCacheLoader

(cherry picked from commit fe8f517a59d694b7577da564d19e4415e13831e8)
(cherry picked from commit 2745f873818a6a1689d8538f2a29f12e221c7af5)
Branch: v3.6
https://github.com/mongodb/mongo/commit/ed7228be701f5435553ae128718e6b7490376b84

Comment by Githook User [ 16/Nov/18 ]

Author: Kaloian Manassiev <kaloian.manassiev@mongodb.com> (kaloianm)

Message: SERVER-37051 Check for term change after fetching the queued metadata in ShardServerCatalogCacheLoader

(cherry picked from commit fe8f517a59d694b7577da564d19e4415e13831e8)
Branch: v4.0
https://github.com/mongodb/mongo/commit/2745f873818a6a1689d8538f2a29f12e221c7af5

Comment by Githook User [ 11/Sep/18 ]

Author: Kaloian Manassiev <kaloian.manassiev@mongodb.com> (kaloianm)

Message: SERVER-37051 Check for term change after fetching the queued metadata in ShardServerCatalogCacheLoader
Branch: master
https://github.com/mongodb/mongo/commit/fe8f517a59d694b7577da564d19e4415e13831e8

Comment by Kaloian Manassiev [ 07/Sep/18 ]

Yes - my mistake, forgot to add it.

Comment by Gregory McKeon (Inactive) [ 07/Sep/18 ]

kaloian.manassiev, is this required for 4.1 as well?

Generated at Thu Feb 08 04:44:50 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.