[SERVER-70784] Create TTL index for config.sampledQueries and config.sampledQueriesDiff on stepup Created: 23/Oct/22  Updated: 29/Oct/23  Resolved: 25/Jan/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.3.0-rc0

Type: Task Priority: Major - P3
Reporter: Cheahuychou Mao Assignee: Israel Hsu
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Problem/Incident
Backwards Compatibility: Fully Compatible
Sprint: Sharding NYC 2022-11-28, Sharding 2022-12-12, Sharding NYC 2022-12-26, Sharding NYC 2023-01-09, Sharding NYC 2023-01-23, Sharding NYC 2023-02-06
Participants:
Linked BF Score: 22

 Description   
  • Add mongod server parameter "sampledQueriesExpirationSeconds" (defaults to 7 days).
  • Give config.sampledQueries and config.sampledQueriesDiff documents an "expireAt" field set to the insert time plus "sampledQueriesExpirationSeconds".
  • Create a {key: {expireAt: 1}, expireAfterSeconds: 0} index for these collections upon stepup or setFCV upgrade if it doesn't already exist.
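
The three bullets above can be illustrated with a minimal Python sketch. It is not the server implementation (which is C++); the constant name, the helper `with_expire_at`, and the index name `expireAt_1` are assumptions made for illustration. Because the index uses expireAfterSeconds: 0, each document expires at the time stored in its own "expireAt" field.

```python
from datetime import datetime, timedelta, timezone

# Default from the description: 7 days, expressed in seconds.
# The constant name mirrors the server parameter but is only illustrative.
SAMPLED_QUERIES_EXPIRATION_SECONDS = 7 * 24 * 60 * 60

# Index definition in the shape accepted by the createIndexes command.
# The index name is an assumption (default naming for a single ascending key).
TTL_INDEX_SPEC = {
    "key": {"expireAt": 1},
    "name": "expireAt_1",
    "expireAfterSeconds": 0,
}


def with_expire_at(doc, inserted_at, ttl_seconds=SAMPLED_QUERIES_EXPIRATION_SECONDS):
    """Return a copy of doc with expireAt = insert time + ttl_seconds."""
    stamped = dict(doc)
    stamped["expireAt"] = inserted_at + timedelta(seconds=ttl_seconds)
    return stamped
```

With the default, a document inserted at 2023-01-01T00:00:00Z gets an "expireAt" of 2023-01-08T00:00:00Z and becomes eligible for deletion by the TTL monitor from that time on.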


 Comments   
Comment by Githook User [ 25/Jan/23 ]

Author:

{'name': 'Israel Hsu', 'email': 'israel.hsu@mongodb.com', 'username': 'israelhsu'}

Message: SERVER-70784 Create TTL index for config.sampledQueries and config.sampledQueriesDiff
Branch: master
https://github.com/mongodb/mongo/commit/ec6297a9b6460c60a416e1daa4bddeb217514868

Comment by Israel Hsu [ 18/Jan/23 ]

Instead of a JavaScript test, I implemented unit tests that use the failCommand failpoint to make createIndexes return specific errors such as IndexAlreadyExists and PrimarySteppedDown.
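
For readers unfamiliar with the failpoint, the sketch below builds the kind of configureFailPoint document that makes the next createIndexes fail with a chosen error. It is an illustration in Python, not the C++ unit test from the commit; the helper name is hypothetical, and the numeric codes are taken from the server's error code table as best recalled, so verify them before relying on them.

```python
# Numeric codes for the two errors named in the comment. Treat these values
# as assumptions to be checked against the server's error code list.
INDEX_ALREADY_EXISTS = 68
PRIMARY_STEPPED_DOWN = 189


def fail_create_indexes_once(error_code):
    """Build a configureFailPoint command that fails the next createIndexes.

    Hypothetical helper: it only constructs the command document and does
    not send it to a server.
    """
    return {
        "configureFailPoint": "failCommand",
        "mode": {"times": 1},  # trigger once, then disable automatically
        "data": {
            "failCommands": ["createIndexes"],
            "errorCode": error_code,
        },
    }
```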

Comment by Israel Hsu [ 12/Jan/23 ]

Our case is simpler than what Pierlauro mentioned. I implemented the following: the thread that creates the TTL indexes retries until it succeeds (the indexes are created or already exist) or catches a NotPrimaryError. If the indexes have not been created by then, the new primary will attempt to create them in onStepUpComplete.

I'm working on a JavaScript test that might cover this. A test in `internal_transactions_reap_service_test.js` called DoesNotReapAsSecondaryAndClearsSessionOnStepdown manipulates the state of the replica in order to test the associated behaviors.
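
The retry policy described in this comment can be sketched as follows. This is a Python illustration under stated assumptions, not the code from the commit: the exception classes stand in for the server's error categories, `create_index` is a hypothetical callable, and the optional `max_attempts` cap is an addition that the comment does not mention.

```python
class IndexAlreadyExists(Exception):
    """Stand-in for the server's IndexAlreadyExists error."""


class NotPrimaryError(Exception):
    """Stand-in for the server's not-primary error category."""


def create_ttl_indexes_with_retry(create_index, max_attempts=None):
    """Retry create_index until it succeeds, the index already exists,
    or the node is no longer primary.

    Returns "created", "already_exists", "not_primary", or "gave_up"
    (the last only when the illustrative max_attempts cap is reached).
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        try:
            create_index()
            return "created"
        except IndexAlreadyExists:
            # Someone else created the index; treat it as success.
            return "already_exists"
        except NotPrimaryError:
            # Stop retrying; the next primary creates the index on step-up.
            return "not_primary"
        except Exception:
            # Any other error is treated as transient and retried.
            continue
    return "gave_up"
```

A real implementation would likely wait between attempts; the sketch omits backoff to keep the control flow visible.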

Comment by Israel Hsu [ 11/Jan/23 ]

Chou rightly points out that the TTL index creation started asynchronously in onStepUpComplete() might take time, during which the server could step down again. If that happens, the index-creation thread should be stopped or canceled rather than allowed to retry forever. Pierlauro mentioned that this pattern of execution (starting a thread on step-up and stopping it on step-down) is common and provided some hints for implementing it. His comment is copied into a comment in the PR.

I'm working on implementing this for QueryAnalysisWriter.
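
The start-on-step-up, stop-on-step-down pattern mentioned above might look like the following Python sketch. It is not the QueryAnalysisWriter implementation; the class name, the `try_create` callable (assumed to return True once the indexes exist), and the retry interval are all hypothetical.

```python
import threading


class TtlIndexCreator:
    """Runs index creation in a background thread that step-down can cancel."""

    def __init__(self, try_create, retry_interval_secs=0.01):
        self._try_create = try_create  # returns True once the indexes exist
        self._retry_interval_secs = retry_interval_secs
        self._cancel = threading.Event()
        self._thread = None

    def on_step_up(self):
        # Stop any thread left over from a previous term before starting anew.
        self.on_step_down()
        self._cancel.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def on_step_down(self):
        # Signal cancellation and wait for the worker to exit.
        self._cancel.set()
        if self._thread is not None:
            self._thread.join()
            self._thread = None

    def _run(self):
        while not self._cancel.is_set():
            if self._try_create():
                return
            # Sleep between attempts, waking early if canceled.
            self._cancel.wait(self._retry_interval_secs)
```

Waiting on the cancellation event instead of sleeping means a step-down interrupts the worker promptly, so it never keeps retrying after the node stops being primary.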

Comment by Israel Hsu [ 21/Dec/22 ]

@Jack Mulrow advised checking with the replication team about making local writes (in our case, creating the TTL index) in the onStepUpComplete hook. In ReplicationCoordinatorImpl::signalDrainComplete(), onStepUpComplete is invoked before _canAcceptNonLocalWrites is set, so index creation may fail. Possible solutions include using an executor to create the indexes slightly later and/or retrying on failure, or using a lower-level API than DBClient to avoid the _canAcceptNonLocalWrites check.
