[SERVER-17319] invariant at shutdown rc9, rc10, rc11 with wiredTiger Created: 18/Feb/15  Updated: 05/Feb/16  Resolved: 12/Mar/15

Status: Closed
Project: Core Server
Component/s: Storage, WiredTiger
Affects Version/s: 3.0.0-rc8
Fix Version/s: 3.0.1

Type: Bug Priority: Critical - P2
Reporter: Quentin Conner Assignee: Quentin Conner
Resolution: Done Votes: 1
Labels: 28qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File invariant.html     PNG File invariant.png     Text File iostat.log     Text File mongodb-sysbench.log     Text File ss.log    
Issue Links:
Duplicate
is duplicated by SERVER-17220 Invariant failure on shutdown: Device... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Sysbench workload, same as SERVER-17261

Participants:

 Description   

Using the 2/12 nightly build, pre-rc9 79492d9cc1885d74b31b5fe24194dbc227096d6e, found an invariant failure on Amazon Linux in EC2 (C3-8XL) upon shutdown.

Cache was full and checkpoints outstanding at time of shutdown (kill -15) but mixed read/write sysbench execute workload had completed.

2015-02-18T18:49:15.895+0000 I NETWORK  [conn56] end connection 127.0.0.1:43053 (1 connection now open)
2015-02-18T18:49:26.135+0000 I NETWORK  [conn10] end connection 127.0.0.1:43004 (0 connections now open)
2015-02-18T18:49:44.740+0000 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2015-02-18T18:49:44.753+0000 I CONTROL  [signalProcessingThread] now exiting
2015-02-18T18:49:44.753+0000 I NETWORK  [signalProcessingThread] shutdown: going to close listening sockets...
2015-02-18T18:49:44.753+0000 I NETWORK  [signalProcessingThread] closing listening socket: 6
2015-02-18T18:49:44.753+0000 I NETWORK  [signalProcessingThread] closing listening socket: 7
2015-02-18T18:49:44.756+0000 I NETWORK  [signalProcessingThread] removing socket file: /tmp/mongodb-27017.sock
2015-02-18T18:49:44.761+0000 I NETWORK  [signalProcessingThread] shutdown: going to flush diaglog...
2015-02-18T18:49:44.761+0000 I NETWORK  [signalProcessingThread] shutdown: going to close sockets...
2015-02-18T18:49:44.766+0000 I STORAGE  [signalProcessingThread] WiredTigerKVEngine shutting down
2015-02-18T18:51:28.501+0000 E STORAGE  [signalProcessingThread] WiredTiger (0) [1424285488:501446][30154:0x7f2de5bbf700], log-server: Connection has open file handles: sizeStorer.wt
2015-02-18T18:51:28.502+0000 E STORAGE  WiredTiger (0) [1424285488:502494][30154:0x7f2de53be700], eviction-server: cache server: exiting with 2352566 pages in memory and 2352564 pages evicted
2015-02-18T18:51:28.502+0000 E STORAGE  WiredTiger (0) [1424285488:502567][30154:0x7f2de53be700], eviction-server: cache server: exiting with 7026 bytes in memory
2015-02-18T18:51:28.502+0000 E STORAGE  WiredTiger (0) [1424285488:502582][30154:0x7f2de53be700], eviction-server: cache server: exiting with 6777 bytes dirty and 1 pages dirty
2015-02-18T18:51:28.511+0000 I -        [signalProcessingThread] Invariant failure: _conn->close(_conn, config) resulted in status UnknownError 16: Device or resource busy at src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp 206
2015-02-18T18:51:28.596+0000 I CONTROL  [signalProcessingThread]
 0xf41659 0xeeb791 0xed176a 0xd56707 0xa69d1a 0xa99c09 0xf40506 0xf8ee04 0x7f2ded0b7f18 0x7f2dec1c9b2d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B41659"},{"b":"400000","o":"AEB791"},{"b":"400000","o":"AD176A"},{"b":"400000","o":"956707"},{"b":"400000","o":"669D1A"},{"b":"400000","o":"699C09"},{"b":"400000","o":"B40506"},{"b":"400000","o":"B8EE04"},{"b":"7F2DED0B0000","o":"7F18"},{"b":"7F2DEC0E7000","o":"E2B2D"}],"processInfo":{ "mongodbVersion" : "3.0.0-rc9-pre-", "gitVersion" : "79492d9cc1885d74b31b5fe24194dbc227096d6e", "uname" : { "sysname" : "Linux", "release" : "3.14.20-20.44.amzn1.x86_64", "version" : "#1 SMP Mon Oct 6 22:52:46 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFFCA0FE000", "elfType" : 3 }, { "b" : "7F2DED0B0000", "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "b" : "7F2DECEA8000", "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "b" : "7F2DECCA4000", "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "b" : "7F2DEC9A0000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3 }, { "b" : "7F2DEC6A2000", "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "b" : "7F2DEC48C000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7F2DEC0E7000", "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "b" : "7F2DED2CC000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf41659]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xeeb791]
 mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0xDA) [0xed176a]
 mongod(_ZN5mongo18WiredTigerKVEngine13cleanShutdownEv+0x107) [0xd56707]
 mongod(_ZN5mongo23GlobalEnvironmentMongoD34shutdownGlobalStorageEngineCleanlyEv+0x1A) [0xa69d1a]
 mongod(_ZN5mongo11exitCleanlyENS_8ExitCodeE+0x219) [0xa99c09]
 mongod(+0xB40506) [0xf40506]
 mongod(+0xB8EE04) [0xf8ee04]
 libpthread.so.0(+0x7F18) [0x7f2ded0b7f18]
 libc.so.6(clone+0x6D) [0x7f2dec1c9b2d]
-----  END BACKTRACE  -----
2015-02-18T18:51:28.596+0000 I -        [signalProcessingThread]
 
***aborting after invariant() failure



 Comments   
Comment by Quentin Conner [ 12/Mar/15 ]

13 successful test trials using r3.0.1-rc0

Comment by Quentin Conner [ 11/Mar/15 ]

Testing with 085ad3aa18938262a6664ccda6b5fad509c1f68e on the v3.0 branch is going well. No invariants so far with three trials completed.

My plan is to conduct at least 12 trials to achieve a reasonable degree of confidence. Previously we would see the invariant in one out of three trials.

Comment by Oleg Rekutin [ 03/Mar/15 ]

Saw same while shutting down a replica on rc9 after being up for a while and having a constant insert workload:

2015-03-03T18:08:39.142+0000 I -        [signalProcessingThread] Invariant failure: _conn->close(_conn, config) resulted in status UnknownError 16: Device or resource busy at src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp 206
2015-03-03T18:08:39.268+0000 I CONTROL  [signalProcessingThread] 
 0xf5ae39 0xefb651 0xee25aa 0xd71ed7 0xa96bea 0xac6789 0xf5a026 0xfa7894 0x7fb114d65f18 0x7fb113822b2d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B5AE39"},{"b":"400000","o":"AFB651"},{"b":"400000","o":"AE25AA"},{"b":"400000","o":"971ED7"},{"b":"400000","o":"696BEA"},{"b":"400000","o":"6C6789"},{"b":"400000","o":"B5A026"},{"b":"400000","o":"BA7894"},{"b":"7FB114D5E000","o":"7F18"},{"b":"7FB113740000","o":"E2B2D"}],"processInfo":{ "mongodbVersion" : "3.0.0-rc9", "gitVersion" : "e6577bc37a2edba81b99146934cf7bad00c6e1b2", "uname" : { "sysname" : "Linux", "release" : "3.4.73-64.112.amzn1.x86_64", "version" : "#1 SMP Tue Dec 10 01:50:05 UTC 2013", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "0EC0EADFA8F4AD45610DC4515F6732A685D68742" }, { "b" : "7FFF06C10000", "elfType" : 3, "buildId" : "6A7E58423C94E707E8B4A45F098E78D557DD5A7D" }, { "b" : "7FB114D5E000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "A8E8015F84374807F6920DF95D16BBAAAC02AB41" }, { "b" : "7FB114AF1000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "buildId" : "A5C6FCE56D21542B9535D5FBCFE74EBA71D24613" }, { "b" : "7FB11470D000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "75565934551BECB21A027BE0A3AE28A62AEEB7EE" }, { "b" : "7FB114505000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "CC3700E5EACA3A520D1FD1BA3B469F2EFB1B8F11" }, { "b" : "7FB114301000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "3D5EB706F159BCCB9A3B3E5F528B512E2C500F90" }, { "b" : "7FB113FF8000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "66F1CF311C61879639BD3DC0034DEE0D6D042261" }, { "b" : "7FB113CFA000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "226E54A756FDC288551092B4FDE818C8D6D6FC04" }, { "b" : "7FB113AE5000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "E77BA674F63D5C56373C03316B5E74C5C781A0BC" }, { "b" : "7FB113740000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "D40AD0538A2FFAB96E5540EF73816C675CC0AF5E" }, { "b" : "7FB114F7A000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "37F70DE2ABEA184F24518EB5CC538FF0F54E081F" }, { "b" : "7FB1134FD000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "9DF61878D8918F25CC74AD01F417FDB051DFE3DA" }, { "b" : "7FB113218000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "6F1DB0F811D1B210520443442D4437BC43BF9A80" }, { "b" : "7FB113015000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "1A6E97644CC9149C2E1871C6AE1DB51975E78A41" }, { "b" : "7FB112DEA000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "F7DF34078FD7BFD684FE46D5F677EEDA1D9B9DC9" }, { "b" : "7FB112BD4000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "87B4EBF2183C8EA4AB657212203EFFE6340E2F4F" }, { "b" : "7FB1129C9000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "381960ACAB9C39461D58BDE7B272C4F61BB3582F" }, { "b" : "7FB1127C6000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "BF48CD5658DE95CE058C4B828E81C97E2AE19643" }, { "b" : "7FB1125AF000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "F01971089728188EFA4AA473D8B2DE22CA6DE963" }, { "b" : "7FB11238E000", "path" : "/usr/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "803D7EF21A989677D056E52BAEB9AB5B154FB9D9" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf5ae39]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xefb651]
 mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0xDA) [0xee25aa]
 mongod(_ZN5mongo18WiredTigerKVEngine13cleanShutdownEv+0x107) [0xd71ed7]
 mongod(_ZN5mongo23GlobalEnvironmentMongoD34shutdownGlobalStorageEngineCleanlyEv+0x1A) [0xa96bea]
 mongod(_ZN5mongo11exitCleanlyENS_8ExitCodeE+0x219) [0xac6789]
 mongod(+0xB5A026) [0xf5a026]
 mongod(+0xBA7894) [0xfa7894]
 libpthread.so.0(+0x7F18) [0x7fb114d65f18]
 libc.so.6(clone+0x6D) [0x7fb113822b2d]
-----  END BACKTRACE  -----
2015-03-03T18:08:39.268+0000 I -        [signalProcessingThread] 
 
***aborting after invariant() failure

Using rc9 Amazon build on Amazon Linux.3.4.73-64.112.amzn1.x86_64

Comment by Nick Judson [ 25/Feb/15 ]

Yes, just ^C. IIRC data was still being written (bulk insert/update via c# beta 2 driver) when the db was shutdown. I'll see if I can repro with another test.

Comment by Quentin Conner [ 25/Feb/15 ]

I have been using the following javascript to "fill up" the wired tiger cache (watch top to see when RSS gets to about half the physical DRAM). After this mongo script finishes, send signal 2 or signal 15 to the mongod process. A few trials may be needed.

'use strict';
 
load('jstests/core/parallelTester.js');
 
function insert(myCollection, desired, randSeed) {
    var batchSize = 1000
 
    print('bulk loading ' + desired + ' documents to ' + myCollection + ' with batch size of ' + batchSize)
    var count = 0
 
    var mydb = db.getSiblingDB('sbtest')
 
    var seed = randSeed
 
    while (count < desired) {
        var bulk = mydb.getCollection(myCollection).initializeUnorderedBulkOp();
        for (var j=0; j < batchSize; count++, j++) {
            var k = Math.round(random()*10000000)
            var c = sysbenchString()
            var pad = '' + Math.round(random()*100000000000)
            for (var y = 0; y < 4; y++)
                pad += '-' + Math.round(random()*100000000000)
            var d = { _id: count, k: k, c: c, pad: pad }
            //print(tojson(d))
            bulk.insert(d)
        }
        bulk.execute();
    }
 
    // create indexes
    print('creating index')
    mydb.getCollection(myCollection).ensureIndex({k: 1})
 
    function sysbenchString() {
        var s = '' + Math.round(random()*100000000000)
        for (var x = 0; x < 8; x++)
            s += '-' + Math.round(random()*100000000000)
        return s
    }
 
    // PRNG borrowed from http://stackoverflow.com/questions/521295/javascript-random-seeds
    function random() {
        var x = Math.sin(seed++) * 10000
        return x - Math.floor(x)
    }
 
}
 
 
function simulate_sysbench_load(num_collections, num_docs_per_collection) {
    // data load phase
    print("creating data for", num_collections, " collections")
 
    // drop old collections
    print('dropping old collections:')
    var mydb = db.getSiblingDB('sbtest')
    for (var offset=1; offset <= num_collections; offset++) {
        var s = 'sbtest' + offset
        print(s)
        mydb.getCollection(s).drop()
    }
 
    var threads = []
 
    print('\nloading new collections:')
    var t = Math.round(Date.now() / num_collections)
    for (var offset=0; offset < num_collections; offset++) {
        var n = offset + 1
        var s = 'sbtest' + n
        threads[offset] = new ScopedThread(insert, s, num_docs_per_collection, (offset + 1) * t)
        threads[offset].start()
    }
 
    print("all threads started")
    threads.forEach(function(t) {
        t.join();
    });
    threads = []
}
 
// load data
var collection_count = 16
var docs_per_collection = 20000000
simulate_sysbench_load(collection_count, docs_per_collection)

Comment by Ramon Fernandez Marina [ 25/Feb/15 ]

Thanks for the screenshot nickj. Are you able to reproduce the invariant failure? I do see the command line you used to launch the server, but how did you stop it? Just ^C?

Thanks,
Ramón.

Comment by Nick Judson [ 25/Feb/15 ]

Invariant.png: Windows 8.1 64bit rc10.

Comment by Nick Judson [ 25/Feb/15 ]

I have seen this on win 8.1 64-bit. Screenshot attached. 3.0 RC10.

Comment by Quentin Conner [ 24/Feb/15 ]

Looks like I have one in rc9 (e6577bc37a2edba81b99146934cf7bad00c6e1b2) when delivered signal 2 (CTRL-C) with the cache full.

I was running my javascript-based sysbench-mongo workload (simulates sysbench but using javascript in the mongo shell).

2015-02-24T22:16:30.922+0000 I NETWORK  [conn1248] end connection 127.0.0.1:43957 (10 connections now open)
2015-02-24T22:16:30.922+0000 I NETWORK  [conn1250] end connection 127.0.0.1:43959 (9 connections now open)
2015-02-24T22:16:30.922+0000 I NETWORK  [conn1251] end connection 127.0.0.1:43960 (9 connections now open)
2015-02-24T22:16:30.922+0000 I NETWORK  [conn1249] end connection 127.0.0.1:43958 (9 connections now open)
2015-02-24T22:16:30.922+0000 I NETWORK  [conn1252] end connection 127.0.0.1:43961 (3 connections now open)
2015-02-24T22:16:30.922+0000 I NETWORK  [conn1253] end connection 127.0.0.1:43962 (3 connections now open)
2015-02-24T22:16:30.922+0000 I NETWORK  [conn1254] end connection 127.0.0.1:43963 (3 connections now open)
2015-02-24T22:16:30.922+0000 I NETWORK  [conn1255] end connection 127.0.0.1:43964 (2 connections now open)
^C2015-02-24T22:16:53.196+0000 I CONTROL  [signalProcessingThread] got signal 2 (Interrupt), will terminate after current cmd ends
2015-02-24T22:16:53.197+0000 I CONTROL  [signalProcessingThread] now exiting
2015-02-24T22:16:53.197+0000 I NETWORK  [signalProcessingThread] shutdown: going to close listening sockets...
2015-02-24T22:16:53.197+0000 I NETWORK  [signalProcessingThread] closing listening socket: 5
2015-02-24T22:16:53.197+0000 I NETWORK  [signalProcessingThread] closing listening socket: 6
2015-02-24T22:16:53.197+0000 I NETWORK  [signalProcessingThread] removing socket file: /tmp/mongodb-27017.sock
2015-02-24T22:16:53.197+0000 I NETWORK  [signalProcessingThread] shutdown: going to flush diaglog...
2015-02-24T22:16:53.197+0000 I NETWORK  [signalProcessingThread] shutdown: going to close sockets...
2015-02-24T22:16:53.197+0000 I STORAGE  [signalProcessingThread] WiredTigerKVEngine shutting down
2015-02-24T22:16:56.195+0000 E STORAGE  [signalProcessingThread] WiredTiger (0) [1424816216:195428][2998:0x7f3024dba700], log-server: Connection has open file handles: sizeStorer.wt
2015-02-24T22:16:56.195+0000 E STORAGE  WiredTiger (0) [1424816216:195514][2998:0x7f30245b9700], eviction-server: cache server: exiting with 900287 pages in memory and 900285 pages evicted
2015-02-24T22:16:56.195+0000 E STORAGE  WiredTiger (0) [1424816216:195547][2998:0x7f30245b9700], eviction-server: cache server: exiting with 12600 bytes in memory
2015-02-24T22:16:56.195+0000 E STORAGE  WiredTiger (0) [1424816216:195570][2998:0x7f30245b9700], eviction-server: cache server: exiting with 12351 bytes dirty and 1 pages dirty
2015-02-24T22:16:56.198+0000 I -        [signalProcessingThread] Invariant failure: _conn->close(_conn, config) resulted in status UnknownError 16: Device or resource busy at src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp 206
2015-02-24T22:16:56.202+0000 I CONTROL  [signalProcessingThread]
 0xf41fa9 0xeec0d1 0xed20aa 0xd57057 0xa6acfa 0xa9abe9 0xf40e56 0xf8f754 0x7f3025d9a0a4 0x7f3024ea0ccd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B41FA9"},{"b":"400000","o":"AEC0D1"},{"b":"400000","o":"AD20AA"},{"b":"400000","o":"957057"},{"b":"400000","o":"66ACFA"},{"b":"400000","o":"69ABE9"},{"b":"400000","o":"B40E56"},{"b":"400000","o":"B8F754"},{"b":"7F3025D92000","o":"80A4"},{"b":"7F3024DBB000","o":"E5CCD"}],"processInfo":{ "mongodbVersion" : "3.0.0-rc9", "gitVersion" : "e6577bc37a2edba81b99146934cf7bad00c6e1b2", "uname" : { "sysname" : "Linux", "release" : "3.12.13-gentoo", "version" : "#5 SMP Sun May 4 12:39:57 CDT 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFF850FF000", "path" : "linux-vdso.so.1", "elfType" : 3 }, { "b" : "7F3025D92000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3 }, { "b" : "7F3025B8A000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3 }, { "b" : "7F3025986000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3 }, { "b" : "7F302567B000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3 }, { "b" : "7F302537A000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3 }, { "b" : "7F3025164000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7F3024DBB000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3 }, { "b" : "7F3025FAF000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf41fa9]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xeec0d1]
 mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0xDA) [0xed20aa]
 mongod(_ZN5mongo18WiredTigerKVEngine13cleanShutdownEv+0x107) [0xd57057]
 mongod(_ZN5mongo23GlobalEnvironmentMongoD34shutdownGlobalStorageEngineCleanlyEv+0x1A) [0xa6acfa]
 mongod(_ZN5mongo11exitCleanlyENS_8ExitCodeE+0x219) [0xa9abe9]
 mongod(+0xB40E56) [0xf40e56]
 mongod(+0xB8F754) [0xf8f754]
 libpthread.so.0(+0x80A4) [0x7f3025d9a0a4]
 libc.so.6(clone+0x6D) [0x7f3024ea0ccd]
-----  END BACKTRACE  -----
2015-02-24T22:16:56.202+0000 I -        [signalProcessingThread]
 
***aborting after invariant() failure

Comment by Daniel Pasette (Inactive) [ 19/Feb/15 ]

Seeing a very similar invariant, this time on _mdb_catalog.wt in an MCI run.

Logs here: http://buildlogs.mongodb.org/MCI_rhel55/builds/40566/test/slow2_WT_0/32bit.js
Full MCI run log: https://mci.10gen.com/task_log_raw/mongodb_mongo_v3.0_rhel55_ecffa7f73a4199848bffd276140781086c6a451d_15_02_18_05_43_49_slow2_WT_rhel55/0?type=T

2015-02-18T06:42:11.179+0000 E STORAGE  [signalProcessingThread] WiredTiger (0) [1424241731:179101][13959:0x2af443a32940], log-server: Connection has open file handles: _mdb_catalog.wt
2015-02-18T06:42:11.179+0000 E STORAGE  WiredTiger (0) [1424241731:179186][13959:0x2af444434940], eviction-server: cache server: exiting with 1089 pages in memory and 1087 pages evicted
2015-02-18T06:42:11.179+0000 E STORAGE  WiredTiger (0) [1424241731:179222][13959:0x2af444434940], eviction-server: cache server: exiting with 9214 bytes in memory
2015-02-18T06:42:11.179+0000 E STORAGE  WiredTiger (0) [1424241731:179247][13959:0x2af444434940], eviction-server: cache server: exiting with 8965 bytes dirty and 1 pages dirty
2015-02-18T06:42:11.180+0000 I -        [signalProcessingThread] Invariant failure: _conn->close(_conn, config) resulted in status UnknownError 16: Device or resource busy at src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp 206
2015-02-18T06:42:11.184+0000 I CONTROL  [signalProcessingThread] 
 0xf56b19 0xef48b1 0xeda81a 0xd5f737 0xa71bfa 0xaa2909 0xf559c6 0xfa4514 0x2af44040183d 0x2af44182afcd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B56B19"},{"b":"400000","o":"AF48B1"},{"b":"400000","o":"ADA81A"},{"b":"400000","o":"95F737"},{"b":"400000","o":"671BFA"},{"b":"400000","o":"6A2909"},{"b":"400000","o":"B559C6"},{"b":"400000","o":"BA4514"},{"b":"2AF4403FB000","o":"683D"},{"b":"2AF441756000","o":"D4FCD"}],"processInfo":{ "mongodbVersion" : "3.0.0-rc9-pre-", "gitVersion" : "ecffa7f73a4199848bffd276140781086c6a451d", "uname" : { "sysname" : "Linux", "release" : "2.6.18-194.el5xen", "version" : "#1 SMP Tue Mar 16 22:01:26 EDT 2010", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "2AF4403FB000", "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "b" : "2AF440617000", "path" : "/lib64/libssl.so.6", "elfType" : 3 }, { "b" : "2AF440865000", "path" : "/lib64/libcrypto.so.6", "elfType" : 3 }, { "b" : "2AF440BB7000", "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "b" : "2AF440DC0000", "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "b" : "2AF440FC4000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3 }, { "b" : "2AF4412C5000", "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "b" : "2AF441548000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "2AF441756000", "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "b" : "2AF4401DD000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 }, { "b" : "2AF441AB0000", "path" : "/usr/lib64/libgssapi_krb5.so.2", "elfType" : 3 }, { "b" : "2AF441CDE000", "path" : "/usr/lib64/libkrb5.so.3", "elfType" : 3 }, { "b" : "2AF441F73000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3 }, { "b" : "2AF442176000", "path" : "/usr/lib64/libk5crypto.so.3", "elfType" : 3 }, { "b" : "2AF44239B000", "path" : "/lib64/libz.so.1", "elfType" : 3 }, { "b" : "2AF4425B0000", "path" : "/usr/lib64/libkrb5support.so.0", "elfType" : 3 }, { "b" : "2AF4427B8000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3 }, { "b" : "2AF4429BA000", "path" : "/lib64/libresolv.so.2", "elfType" : 3 }, { "b" : "2AF442BD0000", "path" : "/lib64/libselinux.so.1", "elfType" : 3 }, { "b" : "2AF442DE8000", "path" : "/lib64/libsepol.so.1", "elfType" : 3 } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf56b19]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xef48b1]
 mongod(_ZN5mongo17invariantOKFailedEPKcRKNS_6StatusES1_j+0xDA) [0xeda81a]
 mongod(_ZN5mongo18WiredTigerKVEngine13cleanShutdownEv+0x107) [0xd5f737]
 mongod(_ZN5mongo23GlobalEnvironmentMongoD34shutdownGlobalStorageEngineCleanlyEv+0x1A) [0xa71bfa]
 mongod(_ZN5mongo11exitCleanlyENS_8ExitCodeE+0x219) [0xaa2909]
 mongod(+0xB559C6) [0xf559c6]
 mongod(+0xBA4514) [0xfa4514]
 libpthread.so.0(+0x683D) [0x2af44040183d]
 libc.so.6(clone+0x6D) [0x2af44182afcd]
-----  END BACKTRACE  -----
2015-02-18T06:42:11.184+0000 I -        [signalProcessingThread] 
***aborting after invariant() failure

Comment by Michael Cahill (Inactive) [ 19/Feb/15 ]

I just noticed that we're grabbing _cursorMutex in ~WiredTigerSizeStorer, so that should prevent this – the reset could above would block until thread 2's syncSizeInfo completed. So there must be something else going on...

One thing that is unusual about sizeStorer is that WiredTiger keeps thinking it is idle and closing the underlying file: maybe there is some bad interaction between that code and connection close.

Comment by Mathias Stearn [ 19/Feb/15 ]

michael.cahill Clean shutdown requires a global X lock which would prevent the racing insert. The sessionCache mutex was needed because sessions are returned to the cache after their thread has released all of its locks. I think other than session (RecoveryUnit) creation and destruction, no other interaction with the storage engine should be possible without at least an IS lock which conflicts with shutdown's X lock.

Comment by Michael Cahill (Inactive) [ 19/Feb/15 ]

Hi redbeard0531, here's the sequence I'm concerned about:

Thread 1                      Thread 2
cleanShutdown
  syncSizeInfo                syncSizeInfo [blocks]
                              [gets mutex, checks magic]
  _sizeStorer.reset           ...
  ...
  _conn->close
                              _cursor->insert

There's nothing obviously wrong with the patch, but I believe that previously it acquired _shutdownLock implicitly in syncSizeInfo as part of getting a session, preventing the race above.

Comment by Mathias Stearn [ 19/Feb/15 ]

michael.cahill I don't think that is possible. The destructor is guarded by both a mutex and a "magic flag": https://github.com/mongodb/mongo/blob/master/src/mongo/db/storage/wiredtiger/wiredtiger_size_storer.cpp#L72-76. Neither of which should be necessary, but should ensure that this would be impossible even if we screwed something up. Note that the sizeStorer is destroyed before the connection is closed: https://github.com/mongodb/mongo/blob/master/src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp#L198-206. This was changed to be this way in https://github.com/mongodb/mongo/commit/79492d9cc1885d74b31b5fe24194dbc227096d6e which happens to be the commit where this failed.

I'll take a look tomorrow to see if there is any other way that an errant syncCache could be happening at shutdown, but I'd expect a different error (or crash) when using a closed cursor. Do you have any other theories? Is there anything obviously wrong in this patch: https://github.com/mongodb/mongo/commit/3eeddde0149962a3e3ed3656da7f851d17028ec7#diff-395b373fbfa1755cb5f50b30da24e670?

Comment by Michael Cahill (Inactive) [ 19/Feb/15 ]

The main clue here is this message:

Connection has open file handles: sizeStorer.wt

What I believe that indicates is that a sizeStorer syncCache ran concurrently with WT_CONNECTION::close. There is WiredTigerSessionCache::_shutdownLock to prevent regular operations racing in this way: I think we need the same thing for sizeStorer, which doesn't go through the session cache.

Comment by Daniel Pasette (Inactive) [ 18/Feb/15 ]

michael.cahill, note the log messages from the eviction server.

Comment by Quentin Conner [ 18/Feb/15 ]

logs and telemetry used with support-tools/timeseries

Generated at Thu Feb 08 03:44:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.