[SERVER-33954] CatalogCache refresh methods are not causally consistent Created: 16/Mar/18  Updated: 29/Oct/23  Resolved: 05/Jun/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.3, 3.7.3
Fix Version/s: 4.0.0-rc3, 4.1.1

Type: Bug Priority: Major - P3
Reporter: Janna Golden Assignee: Matthew Saltz (Inactive)
Resolution: Fixed Votes: 0
Labels: ShardingTechDebt, todo_in_code
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File repro.diff     File stale_mongos_updates_and_removes.js     Text File stalemongosfail.txt     File test.js    
Issue Links:
Backports
Depends
Duplicate
is duplicated by SERVER-31659 Investigate causal consistency violat... Closed
Related
related to SERVER-43520 Complete TODO listed in SERVER-33954 Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.0, v3.6
Sprint: Sharding 2018-04-09, Sharding 2018-04-23, Sharding 2018-05-07, Sharding 2018-05-21, Sharding 2018-06-04, Sharding 2018-06-18
Participants:
Linked BF Score: 73

 Description   

The CatalogCache refresh methods contain logic that simply 'joins' an existing refresh if another thread is already refreshing the same collection. As a result, a refresh requested after a change made at time Tn can miss that change if it joins an in-progress refresh that started at Tn-1.

Here's a concrete example based on the build failure:
1. moveChunk command is called.
2. At the end of the migration, source shardA sends setShardVersion command to recipient shardB asynchronously.
3. setShardVersion command triggers a CatalogCache refresh at shardB.
4. Since step 3 is happening asynchronously, the test can proceed to dropping the collection.
5. The final step of the drop is to send setShardVersion (0, 0) to the shards.
6. When this setShardVersion arrives at shardB while the refresh from step 3 is still in progress, it simply joins that refresh and gets stale info in which the collection has not been dropped yet.
7. setShardVersion on drop fails since version (0, 0) would not match the version found in the refresh.
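
The sequence above can be sketched as a small single-threaded simulation. This is only an illustration of the join behavior, not the server's code: ConfigServer, InFlightRefresh, JoiningCache and simulate_drop_race are hypothetical names, and the version strings "3|0" and "0|0" are shortened forms of the versions in the logs.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ConfigServer:
    """Hypothetical stand-in for the authoritative routing metadata."""
    version: str = "3|0"


@dataclass
class InFlightRefresh:
    """A refresh that already read the metadata when it started."""
    snapshot: str


class JoiningCache:
    """Hypothetical cache that joins any refresh already in progress."""

    def __init__(self, config: ConfigServer) -> None:
        self.config = config
        self.in_flight: Optional[InFlightRefresh] = None

    def start_refresh(self) -> InFlightRefresh:
        # If a refresh is already running, join it instead of starting anew.
        if self.in_flight is None:
            self.in_flight = InFlightRefresh(snapshot=self.config.version)
        return self.in_flight

    def finish_refresh(self, refresh: InFlightRefresh) -> str:
        # Every caller that joined sees the snapshot taken at start time.
        if self.in_flight is refresh:
            self.in_flight = None
        return refresh.snapshot


def simulate_drop_race() -> Tuple[str, str]:
    """Return (version found by the drop's refresh, actual version)."""
    config = ConfigServer(version="3|0")
    cache = JoiningCache(config)

    cache.start_refresh()                 # step 3: refresh triggered by the migration
    config.version = "0|0"                # steps 4-5: the collection is dropped
    drop_refresh = cache.start_refresh()  # step 6: joins the older refresh
    found = cache.finish_refresh(drop_refresh)
    return found, config.version
```

In this sketch the drop's refresh reports the pre-drop version even though the metadata has already changed, which corresponds to the mismatch in step 7.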

Original description:

After `test.foo` is dropped, the drop is not reflected in the catalogCache. After refreshing, the shard logs that it refreshed from the old version:

Refresh for collection test.foo took 92 ms and found version 3|0||5aac36f988c1f30185b7b1df

and the test fails because when trying to setShardVersion during the drop, the correct version is sent (version 0|0||000000000000000000000000), but the incorrect version is found.

Attached is the test to reproduce the error as well as the logs.



 Comments   
Comment by Githook User [ 05/Jun/18 ]

Author:

{'name': 'Matthew Saltz', 'email': 'matthew.saltz@mongodb.com'}

Message: SERVER-33954 Modified getCollectionRoutingInfoWithRefresh to refresh twice if the first refresh is not performed by its own thread

(cherry picked from commit b93fe0e61bf7e8bc96da2edeb66afa1b915b0b77)
Branch: v4.0
https://github.com/mongodb/mongo/commit/8aa6f18081ac029e4514082d434f99fb57f8d630
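
A rough sketch of the idea in the commit message (refresh a second time when the first refresh was not performed by the calling thread) might look like the following. It is a single-threaded simulation under assumed names (TwoPassCache, begin_refresh, end_refresh, get_with_refresh, simulate_fix are hypothetical) and does not reproduce the actual change in the linked commit.

```python
from typing import Optional, Tuple


class TwoPassCache:
    """Hypothetical cache whose callers re-refresh after joining."""

    def __init__(self, version: str) -> None:
        self.config_version = version            # authoritative metadata
        self._in_flight: Optional[str] = None    # snapshot of a running refresh

    def begin_refresh(self) -> Tuple[str, bool]:
        """Return (snapshot, performed_by_caller)."""
        if self._in_flight is not None:
            # Joined a refresh that may have started before our request.
            return self._in_flight, False
        self._in_flight = self.config_version
        return self._in_flight, True

    def end_refresh(self) -> None:
        self._in_flight = None

    def get_with_refresh(self) -> str:
        snapshot, own = self.begin_refresh()
        self.end_refresh()  # stands in for waiting until the refresh completes
        if not own:
            # The joined refresh may predate our request, so refresh once more.
            snapshot, _ = self.begin_refresh()
            self.end_refresh()
        return snapshot


def simulate_fix() -> str:
    cache = TwoPassCache("3|0")
    cache.begin_refresh()           # refresh triggered by the migration
    cache.config_version = "0|0"    # the collection is dropped
    return cache.get_with_refresh() # drop's refresh joins, then refreshes again
```

The second pass costs an extra round of refreshing whenever a caller joins, which is presumably why the performance of the call sites matters.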

Comment by Githook User [ 05/Jun/18 ]

Author:

{'name': 'Matthew Saltz', 'email': 'matthew.saltz@mongodb.com'}

Message: SERVER-33954 Modified getCollectionRoutingInfoWithRefresh to refresh twice if the first refresh is not performed by its own thread
Branch: master
https://github.com/mongodb/mongo/commit/b93fe0e61bf7e8bc96da2edeb66afa1b915b0b77

Comment by Githook User [ 26/Apr/18 ]

Author:

{'email': 'matthew.saltz@mongodb.com', 'name': 'Matthew Saltz'}

Message: Revert "SERVER-33954 Modified getDatabaseWithRefresh/getCollectionRoutingInfoWithRefresh to refresh twice if the first refresh is not performed by its own thread"

This reverts commit a000fcd684216a331356a3c1568ef7fa99ea4907.
Branch: master
https://github.com/mongodb/mongo/commit/316bcc98e2b89e266493612ee1cf4680a0265e0f

Comment by Matthew Saltz (Inactive) [ 24/Apr/18 ]

Note to self: After discussing with esha.maharishi and renctan, I need to go back and look at where getCollectionRoutingInfoWithRefresh is called, to ensure it's not called anywhere that would be problematic for performance.

Comment by Githook User [ 23/Apr/18 ]

Author:

{'email': 'matthew.saltz@mongodb.com', 'name': 'Matthew Saltz'}

Message: SERVER-33954 Modified getDatabaseWithRefresh/getCollectionRoutingInfoWithRefresh to refresh twice if the first refresh is not performed by its own thread
Branch: master
https://github.com/mongodb/mongo/commit/a000fcd684216a331356a3c1568ef7fa99ea4907

Comment by Randolph Tan [ 04/Apr/18 ]

Attaching repro.diff and test.js that can easily reproduce this issue.
