[SERVER-31573] WT table not dropped on primary after collection is dropped Created: 14/Oct/17  Updated: 06/Dec/22  Resolved: 17/Oct/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.6.0-rc0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Backlog - Storage Execution Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-31101 WT table not dropped after collection... Closed
Related
related to SERVER-31591 HMAC key refresh thread holds onto an... Closed
is related to SERVER-26870 Sometimes collection data file is not... Closed
is related to SERVER-31101 WT table not dropped after collection... Closed
Assigned Teams:
Storage Execution
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

In a replica set

  • create a collection by inserting a document
  • restart the replica set
  • then drop the collection or database

On the primary, the corresponding WT tables may not be dropped, and the collection's .wt file will remain on disk.
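
For a manual spot check (the same mapping the fn helper in the script below performs), read the collection's WiredTiger ident from collStats and look for the matching .wt file under the member's dbpath; a minimal sketch, assuming collection test.c, port 27017, and dbpath /ssd/db/r0:

# manual check: map test.c to its WiredTiger ident and look for its .wt file
# (sketch; assumes port 27017 and dbpath /ssd/db/r0, adjust as required)
# collStats reports wiredTiger.uri as "statistics:table:<ident>"; substr(17) drops the prefix
ident=$(mongo --quiet --port 27017 --eval '
    print(db.getSiblingDB("test").runCommand({collStats: "c"}).wiredTiger.uri.substr(17))
')
ls -l /ssd/db/r0/$ident.wt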

The following script reproduces the problem about half the time. It ends with a loop that waits for the collection's .wt file to disappear from every member; the problem is reproduced if that loop never terminates and the lingering file belongs to the member that was elected primary. Because the script also reliably reproduces SERVER-31101, you have to check specifically whether the .wt file on the primary remains in order to determine whether this issue has been reproduced. The script prints the replica set status before entering the wait loop to help with this; a helper that checks only the primary's file is sketched after the script.

Note also that the script uses killall -w, so as written it will work on Linux but not on OSX; a portable alternative is sketched below.
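
A portable way to wait for mongod processes to exit, e.g. on OSX where killall has no -w flag, is to poll with pgrep; a minimal sketch (not part of the original repro):

# portable stop: signal mongod, then poll until every process has exited
# (sketch for systems whose killall lacks -w, e.g. OSX)
function stop_portable {
    killall mongod
    while pgrep -x mongod > /dev/null; do
        sleep 1
    done
}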

function repro {
 
    db=/ssd/db # change this as required
    uri='mongodb://localhost:27017/test?replicaSet=rs'
 
    function clean {
        killall -9 -w mongod
        rm -rf $db
    }
        
    function start {
        for i in 0 1 2; do 
            mkdir -p $db/r$i
            mongod --dbpath $db/r$i --logpath ./r$i.log --port 27${i}17 --replSet rs --fork
        done
    }
 
    function stop {
        killall -w mongod
    }
 
    function initiate {
        mongo --quiet --eval '
            rs.initiate({
                _id: "rs",
                members: [
                    {_id: 0, host: "localhost:27017"},
                    {_id: 1, host: "localhost:27117"},
                    {_id: 2, host: "localhost:27217"}
                ]
            })
        '
    }
 
    # get collection filename for port $1
    function fn {
        mongo --quiet --port $1 --eval '
            rs.slaveOk()
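            // wiredTiger.uri is "statistics:table:<ident>"; substr(17) strips the "statistics:table:" prefix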
            print(db.runCommand({collStats: "c"}).wiredTiger.uri.substr(17))
        '
    }
 
    # start new replica set
    clean; start; initiate
 
    # create test.c, wait for replication
    mongo --quiet $uri --eval 'db.c.insert({})'
    sleep 5
 
    # note collection filenames on each member
    fn0=$db/r0/$(fn 27017).wt
    fn1=$db/r1/$(fn 27117).wt
    fn2=$db/r2/$(fn 27217).wt
    
    # restart
    stop; start
 
    # drop collection
    sleep 5
    mongo --quiet $uri --eval 'print("drop:", db.c.drop())'
 
    # print member status so we can tell which is offending member
    sleep 5
    mongo --quiet $uri --eval '
        members = rs.status().members
        for (var i in members)
            print(members[i].name, members[i].stateStr)
    '
 
    # wait for all files to disappear
    # problem is reproduced if this waits forever
    while [[ -e $fn0 || -e $fn1 || -e $fn2 ]]; do
        ls -l $fn0 $fn1 $fn2
        sleep 1
    done
}
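
Since SERVER-31101 can leave the secondaries' files behind as well, the sketch below checks only the file belonging to the elected primary. It assumes the fn0/fn1/fn2 variables set by repro are still in scope (e.g. after interrupting the wait loop) and that a member is reachable on port 27017; it is not part of the original repro.

# check only the file belonging to the elected primary
# (sketch; assumes fn0/fn1/fn2 from repro are still set and port 27017 is reachable)
function check_primary_file {
    primary_port=$(mongo --quiet --port 27017 --eval '
        print(rs.isMaster().primary.split(":")[1])
    ')
    case $primary_port in
        27017) f=$fn0 ;;
        27117) f=$fn1 ;;
        27217) f=$fn2 ;;
    esac
    if [[ -e $f ]]; then
        echo "reproduced: $f still present on primary (port $primary_port)"
    else
        echo "primary file $f is gone"
    fi
}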



 Comments   
Comment by Bruce Lucas (Inactive) [ 17/Oct/17 ]

daniel.gottlieb has identified the cause of the repro above: a long-running OperationContext created by the HMAC key refresh thread holds a cached cursor, which prevents the WT table from being dropped. This was identified on SERVER-31101 as one of the long-running OperationContexts that could potentially cause this problem, so I'll dup this ticket to that one.
