[SERVER-77831] [test-only bug] CheckRoutingTableConsistency may be executing while sessions collection is being sharded Created: 06/Jun/23  Updated: 29/Oct/23  Resolved: 01/Sep/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.2.0-rc0

Type: Bug Priority: Major - P3
Reporter: Pierlauro Sciarelli Assignee: Pierlauro Sciarelli
Resolution: Fixed Votes: 0
Labels: car-71-backport-declined, shardingemea-qw
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-79157 Change create collection coordinator ... Closed
Assigned Teams:
Sharding EMEA
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding EMEA 2023-06-26, Sharding EMEA 2023-07-10, Sharding EMEA 2023-08-21, Sharding EMEA 2023-09-04
Participants:
Linked BF Score: 5
Story Points: 2

 Description   

The CheckRoutingTableConsistency hook can sporadically fail when it is executed exactly while the sessions collection is being sharded.

The logicalSessionRefreshMillis parameter (5 minutes by default) drives the logical session cache refresh which, when no sessions have been used during testing, triggers the creation and sharding of the sessions collection.

Since the refresh is asynchronous, it can overlap with the execution of teardown hooks.

The failing flow is the following, and it happens roughly 5 minutes after the sharded cluster has been spawned for testing:

  1. [logical session refresh] Create chunk entry for the sessions collection (insert in config.chunks)
  2. [test] CheckRoutingTableConsistency kicks in, looking for inconsistencies (it makes sure that no document in config.chunks refers to a collection UUID that is not present in config.collections)
  3. [logical session refresh] Create collection entry for the sessions collection (insert in config.collections)
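
The interleaving above can be reproduced with a minimal in-memory sketch. This is an illustration only: the two lists stand in for config.chunks and config.collections, the document fields are simplified, and `find_orphaned_chunks` is a hypothetical helper that mimics the kind of check described in step 2, not the hook's actual implementation.

```python
import uuid

# In-memory stand-ins for the config.collections and config.chunks catalogs.
config_collections = []
config_chunks = []


def find_orphaned_chunks(collections, chunks):
    """Return chunk documents whose collection UUID has no config.collections entry."""
    known_uuids = {coll["uuid"] for coll in collections}
    return [chunk for chunk in chunks if chunk["uuid"] not in known_uuids]


sessions_uuid = uuid.uuid4()

# Step 1 [logical session refresh]: the chunk entry is inserted first.
config_chunks.append({"uuid": sessions_uuid, "shard": "shard0"})

# Step 2 [test]: the check runs in the window between the two inserts.
orphans_mid_flight = find_orphaned_chunks(config_collections, config_chunks)

# Step 3 [logical session refresh]: the collection entry arrives afterwards.
config_collections.append({"_id": "config.system.sessions", "uuid": sessions_uuid})

# Once both inserts are done, the same check passes.
orphans_after = find_orphaned_chunks(config_collections, config_chunks)

print(len(orphans_mid_flight))  # 1: the spurious inconsistency
print(len(orphans_after))       # 0
```

The check only fails if it lands between steps 1 and 3, which is why the failure is sporadic.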


 Comments   
Comment by Pierlauro Sciarelli [ 01/Sep/23 ]

Closing as "Gone away" because SERVER-79157 moved the insertion of the collection and chunks documents into the same transaction.

Comment by Pierlauro Sciarelli [ 30/Aug/23 ]

There is no compelling reason for insertChunks not to happen in the transaction inserting collection and placement history entries.

SERVER-79157 is currently in progress and should move chunk creation into the transaction, which would address the root cause of this bug. Marking this as blocked on SERVER-79157 to double-check that it can be safely closed after that ticket is committed.
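
To illustrate why a single transaction removes the window, here is a toy sketch in which a lock stands in for the transaction. All names (`Catalog`, `shard_collection_atomically`, `find_orphaned_chunks`) are hypothetical, and a real transaction provides atomic visibility through different means; the point is only that a reader can never observe the chunk entry without its collection entry.

```python
import threading
import uuid


class Catalog:
    """Toy catalog in which both inserts become visible to readers together."""

    def __init__(self):
        self._lock = threading.Lock()  # stand-in for transactional isolation
        self.collections = []
        self.chunks = []

    def shard_collection_atomically(self, namespace, shard):
        coll_uuid = uuid.uuid4()
        with self._lock:
            # Both documents are written inside the same critical section.
            self.chunks.append({"uuid": coll_uuid, "shard": shard})
            self.collections.append({"_id": namespace, "uuid": coll_uuid})
        return coll_uuid

    def find_orphaned_chunks(self):
        with self._lock:
            known_uuids = {coll["uuid"] for coll in self.collections}
            return [ch for ch in self.chunks if ch["uuid"] not in known_uuids]


catalog = Catalog()
observed_orphans = []
stop = threading.Event()


def checker():
    # Keep running the consistency check while collections are being sharded.
    while not stop.is_set():
        observed_orphans.extend(catalog.find_orphaned_chunks())


checker_thread = threading.Thread(target=checker)
checker_thread.start()
for i in range(1000):
    catalog.shard_collection_atomically(f"test.coll{i}", "shard0")
stop.set()
checker_thread.join()

print(len(observed_orphans))  # 0: no intermediate state is ever visible
```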

Comment by Pierlauro Sciarelli [ 06/Jun/23 ]

"Is a similar spurious failure not possible for other collections as they are in the midst of being sharded?"

Correct, this is not possible for other collections because config.system.sessions is the only one that can be sharded "by the system" without a client requesting it. That's why the sharding can run during teardown (after the test has finished but before the cluster is shut down).

I believe a possible solution could be to transactionally insert collection and chunks entries. We may even consider doing it for all collections considering we never have to shard with "too many" chunks after SERVER-74747.

Comment by Max Hirschhorn [ 06/Jun/23 ]

Is a similar spurious failure not possible for other collections as they are in the midst of being sharded? If so, then what is making the config.system.sessions collection special in how it becomes sharded?

Generated at Thu Feb 08 06:36:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.