[SERVER-62394] Follow up work for SERVER-61131 Created: 06/Jan/22  Updated: 29/Oct/23  Resolved: 27/Jan/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.3.0

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: Christopher Caplinger
Resolution: Fixed Votes: 0
Labels: shard-merge-milestone-1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-61131 Store backup cursor results and set s... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Server Serverless 2022-01-24, Server Serverless 2022-02-07
Participants:

 Description   

The contract is that fetcher callback should not throw exceptions,  breaching that contract causes the server to crash (due to uncaught exceptions). 

[js_test:tenant_migration_donor_try_abort] d20067| {"t":{"$date":"2022-01-04T18:44:02.375+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"TenantMigrationRecipientService-1","msg":"Writing fatal message","attr":{"message":"terminate() called. An exception is active; attempting to gather more information"}}
[js_test:tenant_migration_donor_try_abort] d20067| {"t":{"$date":"2022-01-04T18:44:02.375+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"TenantMigrationRecipientService-1","msg":"Writing fatal message","attr":{"message":"DBException::toString(): InterruptedDueToReplStateChange: operation was interrupted\nActual exception type: mongo::error_details::ExceptionForImpl<(mongo::ErrorCodes::Error)11602, mongo::ExceptionForCat<(mongo::ErrorCategory)2>, mongo::ExceptionForCat<(mongo::ErrorCategory)3>, mongo::ExceptionForCat<(mongo::ErrorCategory)14> >\n"}}
 

We do some write ops (create collection and insert documents)  in _donorFilenameBackupCursorFileFetcher's callback. If a node shuts down or steps down, we interrupt the opCtx used by the writeOps that causes the write operation to fail and throw.



 Comments   
Comment by Githook User [ 27/Jan/22 ]

Author:

{'name': 'Christopher Caplinger', 'email': 'christopher.caplinger@mongodb.com', 'username': 'UnicodeSnowman'}

Message: SERVER-62394: Follow-up work for SERVER-61131
Branch: master
https://github.com/mongodb/mongo/commit/fb529499ff831684ee281bd13486b62b2bbcc3a3

Comment by Suganthi Mani [ 20/Jan/22 ]

Also, just saw that we currently hold mutex and do disk-write which is not a best practice. Mutex locks are taken for short-critical section. christopher.caplinger could you please address this in this ticket as well?

Comment by Suganthi Mani [ 06/Jan/22 ]

Pasting the slack conversation 

We can fix this by  by adding  wrapping those write ops in try{..}catch() block.*But my recommendation is to use write_ops_exec::performInserts*() (It’s the entry point to user insert commands).
1) It has logic to _convertExceptionToReturnStatus()
2) we are currently  taking X lock on Config database (it’s not correct). We should take IX lock for the Database. Otherwise, it will block any operations in other collections present in config database.

  • We don’t need to take collection or db locks in the recipient file, performInserts will take necessary lock.
  • Also, if the collection is not present, it will create the collection implicitly before insert.

3) We should disable the document validation for this reason mentioned here. In the current code, we do collection validation().

4) I am not recommending StorageInterfaceImpl::insertDocuments(), it has some limitations. In case, in future, if we try to batch write and do it in single WriteUnitOfWork. Here is an example, where we use performInserts() to insert documents to a replicated collection in config db

 
Other things I noted:

1) After updating the in-memory state doc state , we are not persisting the updated state doc. It seems it’s depending on this step for state doc persistence, that is  after killing backup cursor, and that’s wrong.  I believe, my ticket will add logic of copying the donor files to recipient after updating the state and before killing the backup cursor. We need to inform secondaries before the primary starts the file cloning and killing backup cursor.
2)  minor NIT: It seems you recently wrapped this shouldStopSendingRecipientSyncDataCommand() in mutex and I assume it’s due to calling getProtocol(). We don’t need to take any mutex lock to read getProtocol() which reads _protocol value.

Generated at Thu Feb 08 05:55:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.