Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.1.0-rc0, 8.0.5
Affects Version/s: 7.3.0-rc0, 8.0.0-rc0
Component/s: None
Labels: None
Assigned Teams: Catalog and Routing
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v8.0
Steps To Reproduce:

Run the attached repro.js test in the "sharding" suite. I've tested it on r8.1.0-alpha-3304-g85bc2e2ee02.

For older versions you can use repro_old.js instead.

Sprint: CAR Team 2024-09-16, CAR Team 2024-09-30
Linked BF Score: 0
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

A shard role nested inside a router role does not handle StaleConfig exceptions correctly, which breaks the shard versioning protocol.

In particular, the StaleConfig exception is caught and handled by the router role, which invalidates and refreshes the catalog cache and retries the operation without updating the Database/Collection Sharding State (DSS/CSS).

Manifestation

In most cases, this simply adds latency to the execution of the query/command, because the router role retries 10 times before bubbling up the error. Only then can the ServiceEntryPoint on the shard finally update the DSS/CSS.
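
To make the non-transactional case concrete, here is a minimal toy model of the retry behaviour described above. It is only a sketch: none of it is server code, every function and field name (`makeShard`, `shardRole`, `routerRole`, `serviceEntryPoint`, the version fields) is hypothetical, and it assumes "retries 10 times" means 10 attempts in total.

```javascript
// Toy model only: all names are hypothetical, this is not MongoDB server code.
class StaleConfig extends Error {}

const MAX_ROUTER_ATTEMPTS = 10; // assumption: "retries 10 times" = 10 attempts

function makeShard() {
  return {
    authoritativeVersion: 2, // what the authoritative metadata currently says
    catalogCacheVersion: 2,  // routing info used by the router role (fresh)
    shardingStateVersion: 1, // stand-in for the DSS/CSS (stale)
    shardRoleAttempts: 0,
  };
}

// Shard role: fails while the sharding state disagrees with the received version.
function shardRole(shard, receivedVersion) {
  shard.shardRoleAttempts++;
  if (shard.shardingStateVersion !== receivedVersion) {
    throw new StaleConfig("sharding state is stale");
  }
  return "ok";
}

// Router role: on StaleConfig it refreshes only the catalog cache, then retries.
// The sharding state is never touched here, so every retry fails the same way.
function routerRole(shard, body) {
  for (let attempt = 1; ; attempt++) {
    try {
      return body(shard.catalogCacheVersion);
    } catch (e) {
      if (!(e instanceof StaleConfig) || attempt >= MAX_ROUTER_ATTEMPTS) throw e;
      shard.catalogCacheVersion = shard.authoritativeVersion;
    }
  }
}

// Service entry point: only here is the sharding state recovered on StaleConfig.
function serviceEntryPoint(shard, command) {
  try {
    return command(shard);
  } catch (e) {
    if (!(e instanceof StaleConfig)) throw e;
    shard.shardingStateVersion = shard.authoritativeVersion;
    return command(shard);
  }
}

const shard = makeShard();
const result = serviceEntryPoint(shard, (s) =>
  routerRole(s, (version) => shardRole(s, version)));
console.log(result, shard.shardRoleAttempts); // prints: ok 11
```

In this model the command eventually succeeds, but only after ten wasted shard role attempts inside the router loop, which is the extra latency mentioned above.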

If we are executing inside a transaction the situation is worse: the transaction may never succeed, even if the driver keeps retrying it.
Because it is executing a transaction, the shard needs to acquire the collection locks before entering the router role. This implies that after 10 retries the router role bubbles up ShardCannotRefreshDueToLocksHeld instead of StaleConfig. When this error reaches the service entry point, we only refresh the catalog cache and not the DSS/CSS.
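
The transactional case can be sketched with a similar toy model. Again, this is an illustration under stated assumptions and not server code: the function names, the version fields and the driver retry budget are all hypothetical, and the way the exhausted router loop surfaces ShardCannotRefreshDueToLocksHeld is simplified to follow the description above.

```javascript
// Toy model only: all names are hypothetical, this is not MongoDB server code.
class StaleConfig extends Error {}
class ShardCannotRefreshDueToLocksHeld extends Error {}

const MAX_ROUTER_ATTEMPTS = 10; // assumption: 10 attempts in total

// One transaction statement. The collection locks are (conceptually) already
// held when the router role loop starts.
function runTransactionStatement(shard) {
  for (let attempt = 1; attempt <= MAX_ROUTER_ATTEMPTS; attempt++) {
    try {
      if (shard.shardingStateVersion !== shard.catalogCacheVersion) {
        throw new StaleConfig("sharding state is stale");
      }
      return "ok";
    } catch (e) {
      if (!(e instanceof StaleConfig)) throw e;
      // The router role refreshes the catalog cache only.
      shard.catalogCacheVersion = shard.authoritativeVersion;
    }
  }
  // Simplification: with locks held, the exhausted loop surfaces this error
  // instead of StaleConfig.
  throw new ShardCannotRefreshDueToLocksHeld("cannot refresh while holding locks");
}

function serviceEntryPoint(shard) {
  try {
    return runTransactionStatement(shard);
  } catch (e) {
    if (e instanceof StaleConfig) {
      shard.shardingStateVersion = shard.authoritativeVersion; // DSS/CSS recovery
    } else if (e instanceof ShardCannotRefreshDueToLocksHeld) {
      shard.catalogCacheVersion = shard.authoritativeVersion; // catalog cache only
    }
    throw e; // the transaction fails and the driver has to retry it
  }
}

// Driver side: retry the whole transaction a bounded number of times.
function driverRetries(shard, maxTxnRetries) {
  let failures = 0;
  for (let i = 0; i < maxTxnRetries; i++) {
    try {
      serviceEntryPoint(shard);
      return { committed: true, failures };
    } catch (e) {
      failures++;
    }
  }
  return { committed: false, failures };
}

const txnShard = { authoritativeVersion: 2, catalogCacheVersion: 2, shardingStateVersion: 1 };
const outcome = driverRetries(txnShard, 5);
console.log(outcome.committed, outcome.failures, txnShard.shardingStateVersion);
// prints: false 5 1
```

Because no code path in this model ever updates `shardingStateVersion`, raising the driver retry budget does not help: every attempt fails in the same way.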

This is one example of where we use a shard role nested inside a router role, so transactions over views are definitely affected by this problem.

Attachments:
repro_old.js, 1 kB, Sep 05 2024 02:26:59 PM UTC, Tommaso Tocci
repro.js, 1 kB, Sep 05 2024 02:26:59 PM UTC, Tommaso Tocci
repro-lookup.js, 1 kB, Sep 10 2024 05:17:45 PM UTC, Jordi Serra Torrens

Issue Links:
is caused by: SERVER-81233 Prevent kickback to router when reading from views on unsplittable collections located on the db-primary (Closed)
is related to: SERVER-97013 Adjust 8.0.4 Backports (Closed)
related to: SERVER-77402 Create ShardRole retry loop utility (Backlog)

Assignee: Jordi Serra Torrens
Reporter: Tommaso Tocci
Participants: Githook User, Jordi Serra Torrens, Tommaso Tocci
Votes: 0
Watchers: 13

Created: Sep 05 2024 02:26:13 PM UTC
Updated: Jan 15 2025 12:08:29 PM UTC
Resolved: Sep 18 2024 08:21:35 AM UTC
Confidence Status Last Update: 09/Sep/24 1:51 PM
