Fix metadata cursors

XMLWordPrintableJSON

    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Cursors
    • None
    • Storage Engines, Storage Engines - Foundations
    • SE Foundations - Q3+ Backlog
    • 13

      Summary

      We are currently not using metadata cursors as expected. Most places that open a cursor on the metadata file are using file cursors not metadata cursors. It is important that we use the metadata cursor (a special type of cursor) as it provides the correct isolation guarantees when performing reads on the metadata table. The function __wt_metadata_cursor is currently used in a way that it is expected to open a metadata cursor; however, it instead returns (or opens) a file cursor opened on the metadata file that is cached in the session.

      Problem

      1. __wt_metadata_cursor() in meta_table.c opens a file cursor (via __wt_metadata_cursor_open()__wt_open_cursor(session, WT_METAFILE_URI, ...))
      2. Snapshot isolation:
        • File cursors on metadata use the session's current transaction isolation level
        • Metadata cursors force WT_ISO_READ_UNCOMMITTED via WT_WITH_TXN_ISOLATION() for consistent metadata reads
      3. Session Cursor Caching: Some parts of the code expect the session cursor cache to be a file cursor. These places need to be fixed if we change this to a metadata cursor.

      This was originally investigated in WT-15259 Use uncommitted snapshot isolation for metadata reads. There's some more investigation context there, but I raised this ticket as the fix was more involved than expected. The following are the solutions I explored and problems encountered:

      Suggested Solutions

      Solution 1: Change __wt_metadata_cursor() to return metadata cursors

      Problems Encountered:

      • Recursive Loop: __wt_metadata_insert()__wt_metadata_cursor() → metadata cursor → cursor->insert()__curmetadata_insert()__wt_metadata_insert() 
      • Macro Incompatibility: WT_SESSION_META_DHANDLE(s) macro expects session->meta_cursor to be a file cursor but would need to access the underlying file cursor: ((WT_CURSOR_BTREE *)(((WT_CURSOR_METADATA *)session->meta_cursor)>file_cursor))>dhandle
      • Connection Startup Errors: Connection setup fails because metadata tracking expects file cursor behavior

      Solution 2: Use metadata new cursors where needed

      Problems Encountered:

      • Missing Method Support: Several methods are unsupported:
        • bound: __wt_cursor_config_notsup - cursor bounds not supported
        • largest_key: __wt_cursor_notsup
        • cache: __wt_cursor_notsup
        • modify: __wt_cursor_modify_notsup
        • reserve: __wt_cursor_notsup
        • reconfigure: __wt_cursor_config_notsup
        • reopen: __wt_cursor_reopen_notsup
      • get_key() Returns Wrong Values: When positioned on the metadata table entry itself (WT_MDC_ONMETADATA flag set), get_key() returns "metadata:" instead of the underlying file cursor's current key
      • get_value() Positioning Issues: The cursor positioning logic doesn't properly set the WT_CURSTD_VALUE_EXT flag in all cases, causing __wt_cursor_kv_not_set() errors

      Questions

      In the places where we use __wt_metadata_cursor, we should double check if we're okay with using the session cached metadata cursor or whether we should be creating a new cursor.

      Extra Context

      The arch-metadata guide currently states:

      To read the metadata, a caller can open a cursor, via WT_SESSION::open_cursor, with a uri equal to "metadata:". Changes to the metadata cannot be written through the metadata cursor interface i.e. the metadata cursor is a read-only interface. Metadata is only changed/affected by API calls.

      Use the following diff to find all the places where we use file cursors on the metadata file:

      Unable to find source-code formatter for language: diff. Available languages are: actionscript, ada, applescript, bash, c, c#, c++, cpp, css, erlang, go, groovy, haskell, html, java, javascript, js, json, lua, none, nyan, objc, perl, php, python, r, rainbow, ruby, scala, sh, sql, swift, visualbasic, xml, yamldiff --git a/src/cursor/cur_file.c b/src/cursor/cur_file.c
      index bc063ce727..8637e7f43e 100644
      --- a/src/cursor/cur_file.c
      +++ b/src/cursor/cur_file.c
      @@ -180,6 +180,10 @@ __curfile_next(WT_CURSOR *cursor)
            CURSOR_REPOSITION_ENTER(cursor, session);
            WT_ERR(__cursor_copy_release(cursor));
      
      +    /* Ensure cursors open on the metadata file use read uncommitted isolation. */
      +    if (strcmp(cursor->uri, WT_METAFILE_URI) == 0 && session->txn->isolation != WT_ISO_READ_UNCOMMITTED)
      +         WT_ASSERT(session, session->txn->isolation == WT_ISO_READ_UNCOMMITTED);
      +
            WT_ERR(__curfile_check_cbt_txn(session, cbt));
      
            WT_WITH_CHECKPOINT(session, cbt, ret = __wt_btcur_next(cbt, false));
      @@ -239,6 +243,10 @@ __curfile_prev(WT_CURSOR *cursor)
            CURSOR_API_CALL(cursor, session, ret, prev, cbt->dhandle);
            CURSOR_REPOSITION_ENTER(cursor, session);
            WT_ERR(__cursor_copy_release(cursor));
      +    
      +    /* Ensure cursors open on the metadata file use read uncommitted isolation. */
      +    if (strcmp(cursor->uri, WT_METAFILE_URI) == 0)
      +         WT_ASSERT(session, session->txn->isolation == WT_ISO_READ_UNCOMMITTED);
      
            WT_ERR(__curfile_check_cbt_txn(session, cbt));
      
      @@ -308,6 +316,10 @@ __curfile_search(WT_CURSOR *cursor)
            WT_ERR(__cursor_copy_release(cursor));
            WT_ERR(__cursor_checkkey(cursor));
      
      +    /* Ensure cursors open on the metadata file use read uncommitted isolation. */
      +    if (strcmp(cursor->uri, WT_METAFILE_URI) == 0)
      +         WT_ASSERT(session, session->txn->isolation == WT_ISO_READ_UNCOMMITTED);
      +
            WT_ERR(__curfile_check_cbt_txn(session, cbt));
      
            time_start = __wt_clock(session);
      @@ -346,6 +358,10 @@ __wti_curfile_search_near(WT_CURSOR *cursor, int *exact)
            WT_ERR(__cursor_copy_release(cursor));
            WT_ERR(__cursor_checkkey(cursor));
      
      +    /* Ensure cursors open on the metadata file use read uncommitted isolation. */
      +    if (strcmp(cursor->uri, WT_METAFILE_URI) == 0)
      +         WT_ASSERT(session, session->txn->isolation == WT_ISO_READ_UNCOMMITTED);
      +
            WT_ERR(__curfile_check_cbt_txn(session, cbt));
      
            time_start = __wt_clock(session);
      

      From the current metdata cursor code comments:

               * When applications open metadata cursors, they expect to see all schema-level operations
               * reflected in the results. Query at read-uncommitted to avoid confusion caused by the
               * current transaction state.
      
      and
      
          * All metadata reads are at read-uncommitted isolation. That's because once a schema-level
           * operation completes, subsequent operations must see the current version of checkpoint
           * metadata, or they may try to read blocks that may have been freed from a file. Metadata
           * updates use non-transactional techniques (such as the schema and metadata locks) to protect
           * access to in-flight updates.
      

      There are also some more notes here WT-15259 Notes

            Assignee:
            [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            Sean Watt
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

              Created:
              Updated: