[SERVER-56882] unable to complete full validation on collection after failed hashed index insert Created: 12/May/21  Updated: 29/Oct/23  Resolved: 28/Sep/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Benety Goh Assignee: Benety Goh
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: validate_empty_collection.js, validate_hashed_index.js
Issue Links:
Depends
depends on WT-8126 Mark btree as dirty only if not newly... Closed
Related
is related to SERVER-56877 insert operations may fail to set ind... Closed
is related to WT-7750 exclusive handle access fails if cach... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution Team 2021-10-04, Storage - Ra 2021-09-06
Participants:
Linked BF Score: 18
Story Points: 0

 Description   

Add regression tests for WT-8126 to jstests/noPassthrough.

The validation warnings reported previously in this ticket have been resolved in the storage engine in this commit containing WT-8126. The work remaining for this ticket is to provide regression coverage at the server level.


OLD DESCRIPTION

A failed insert into a collection with a hashed index causes the server to end up in a state where validate({full: true}) fails to complete and returns the warning "Could not complete validation of table:index-NNN. This is a transient issue as the collection was actively in use by other operations."

This state seems to persist for an extended period in spite of checkpointing activity observed in the server logs.

It would be useful to understand if this is an issue in the storage engine or at the integration layer above it.



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 28/Sep/21 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-56882 add js test for reproducing transient validation warnings with hashed indexes
Branch: master
https://github.com/mongodb/mongo/commit/efeb503a68bbbc5283f3e3d813ae189ab70984dc

Comment by Githook User [ 27/Sep/21 ]

Author:

{'name': 'Benety Goh', 'email': 'benety@mongodb.com', 'username': 'benety'}

Message: SERVER-56882 add js test to ensure a full validation runs without warnings on an empty collection
Branch: master
https://github.com/mongodb/mongo/commit/8f7c3dd749c5f2a6c410e19e78fad96fb0bf03cb

Comment by Benety Goh [ 22/Sep/21 ]

etienne.petrel, thanks for chasing this one down. I've verified that the JS tests attached to this ticket are now passing and hope to include these two tests in our CI system soon.

Comment by Etienne Petrel [ 21/Sep/21 ]

benety.goh, WT-8126 has been merged and should be vendored in MongoDB repo later today. Thanks for your patience!

Comment by Benety Goh [ 21/Sep/21 ]

etienne.petrel, thank you for the update!

Comment by Etienne Petrel [ 21/Sep/21 ]

benety.goh, I have set this ticket to depend on WT-8126 as it specifically addresses the issue you are observing. WT-7750 is now related to WT-8126. The work of WT-8126 will be in code review today.

Comment by Etienne Petrel [ 09/Sep/21 ]

I will be posting updates about the investigation in the WT ticket from now on. I hope that is ok.

Comment by Benety Goh [ 08/Sep/21 ]

etienne.petrel, thank you for looking into this. Let us know if you need any assistance on the reproducers.

Comment by Etienne Petrel [ 06/Sep/21 ]

I realized there is an existing WT ticket that describes the same issue, so I am linking the two. Assigning this one back to you, benety.goh.

Comment by Etienne Petrel [ 03/Sep/21 ]

We end up in __session_verify, where we call __wt_schema_worker. The first two URIs passed on to the function are:

Up to this point everything works fine: __wt_schema_worker does not return any error.

But then we call __wt_schema_worker again with the following URIs:

The URI that starts with file calls __wt_exclusive_handle_operation, which calls __wt_conn_dhandle_close_all, which in turn calls __conn_dhandle_close_one:

    /*
     * Lock the live handle first. This ordering is important: we rely on locking the live handle to
     * fail fast if the tree is busy (e.g., with cursors open or in a checkpoint).
     */
    WT_ERR(__conn_dhandle_close_one(session, uri, NULL, removed, mark_dead)); 

This call returns EBUSY. I eventually found the piece of code that returns EBUSY:

    /*
     * Don't flush data from modified trees independent of system-wide checkpoint when either there
     * is a stable timestamp set or the connection is configured to disallow such operation.
     * Flushing trees can lead to files that are inconsistent on disk after a crash.
     */
    if (btree->modified && !bulk && !__wt_btree_immediately_durable(session) &&
      (S2C(session)->txn_global.has_stable_timestamp ||
        (!F_ISSET(S2C(session), WT_CONN_FILE_CLOSE_SYNC) && !metadata)))
        return (__wt_set_return(session, EBUSY)); 

I will need to investigate further why this happens on the index but not on the collection.
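
Editorial sketch: the EBUSY condition quoted in this comment can be read as a small truth table. The standalone model below is not WiredTiger code. The struct close_state and the function close_check are hypothetical names, and each field is an assumed stand-in for one value that the quoted check reads. It only illustrates which combinations of flags make the check return EBUSY.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the values read by the quoted check. */
struct close_state {
    bool modified;             /* stands in for btree->modified */
    bool bulk;                 /* stands in for the local `bulk` flag */
    bool immediately_durable;  /* stands in for __wt_btree_immediately_durable(session) */
    bool has_stable_timestamp; /* stands in for txn_global.has_stable_timestamp */
    bool file_close_sync;      /* stands in for WT_CONN_FILE_CLOSE_SYNC being set */
    bool metadata;             /* stands in for the local `metadata` flag */
};

/* Mirrors the boolean structure of the quoted condition: EBUSY or 0. */
static int
close_check(const struct close_state *s)
{
    if (s->modified && !s->bulk && !s->immediately_durable &&
      (s->has_stable_timestamp || (!s->file_close_sync && !s->metadata)))
        return (EBUSY);
    return (0);
}

/* A few cases worked through by hand. */
static void
close_check_examples(void)
{
    /* Modified tree with a stable timestamp set: the close is refused. */
    struct close_state s = {.modified = true, .has_stable_timestamp = true,
      .file_close_sync = true};
    assert(close_check(&s) == EBUSY);

    /* Same state, but the tree is not marked modified: the check passes. */
    s.modified = false;
    assert(close_check(&s) == 0);

    /* Modified, no stable timestamp, close-sync allowed: the check passes. */
    s.modified = true;
    s.has_stable_timestamp = false;
    assert(close_check(&s) == 0);
}
```

In this model, a tree that is not marked modified never reaches the EBUSY return, whatever the other flags are. That makes the modified flag on the index the natural thing to examine when the check fires on the index but not on the collection.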

Comment by Etienne Petrel [ 03/Sep/21 ]

Thank you benety.goh for your reproducers. I will be having a look today and get back to you.

Generated at Thu Feb 08 05:40:27 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.