[SERVER-17029] Fatal Assertion 16137 Created: 23/Jan/15  Updated: 12/Jun/15  Resolved: 12/Jun/15

Status: Closed
Project: Core Server
Component/s: Concurrency
Affects Version/s: 2.6.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: quantomass Assignee: David Storch
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Sprint: Quint Iteration 4, Quint Iteration 5
Participants:

 Description   

Hello,

We run a single mongod instance on an Ubuntu 14.10 server. The database version is:

db version v2.6.5
git version: e99d4fcb4279c0279796f237aa92fe3b64560bf6
build info: Linux build8.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49

Today mongod crashed with the following error:

2015-01-23T19:29:41.183+0100 [conn219322] Fatal Assertion 16137
2015-01-23T19:29:41.196+0100 [conn219322] 0x11e9b11 0x118b849 0x116e37d 0xa1d354 0xa1e084 0xd5f0b9 0xba1052 0xba2630 0x770d5f 0x119f93e 0x7f4167bd10a5 0x7f4166eca84d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11e9b11]
 /usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x118b849]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xcd) [0x116e37d]
 /usr/bin/mongod(_ZN5mongo4Lock6DBRead8unlockDBEv+0x1a4) [0xa1d354]
 /usr/bin/mongod(_ZN5mongo4Lock6DBReadD0Ev+0x24) [0xa1e084]
 /usr/bin/mongod(_ZN5mongo11newRunQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x1b89) [0xd5f0b9]
 /usr/bin/mongod() [0xba1052]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x580) [0xba2630]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x9f) [0x770d5f]
 /usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x4ee) [0x119f93e]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x80a5) [0x7f4167bd10a5]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f4166eca84d]
2015-01-23T19:29:41.197+0100 [conn219322]
 
***aborting after fassert() failure
 
 
2015-01-23T19:29:41.200+0100 [conn219322] SEVERE: Got signal: 6 (Aborted).
Backtrace:0x11e9b11 0x11e8eee 0x7f4166e06da0 0x7f4166e06d27 0x7f4166e08418 0x116e3ea 0xa1d354 0xa1e084 0xd5f0b9 0xba1052 0xba2630 0x770d5f 0x119f93e 0x7f4167bd10a5 0x7f4166eca84d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11e9b11]
 /usr/bin/mongod() [0x11e8eee]
 /lib/x86_64-linux-gnu/libc.so.6(+0x36da0) [0x7f4166e06da0]
 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37) [0x7f4166e06d27]
 /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f4166e08418]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0x13a) [0x116e3ea]
 /usr/bin/mongod(_ZN5mongo4Lock6DBRead8unlockDBEv+0x1a4) [0xa1d354]
 /usr/bin/mongod(_ZN5mongo4Lock6DBReadD0Ev+0x24) [0xa1e084]
 /usr/bin/mongod(_ZN5mongo11newRunQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x1b89) [0xd5f0b9]
 /usr/bin/mongod() [0xba1052]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x580) [0xba2630]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x9f) [0x770d5f]
 /usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x4ee) [0x119f93e]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x80a5) [0x7f4167bd10a5]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f4166eca84d]

The previous log message for this connection (conn219322) is from about 13 minutes earlier:

2015-01-23T19:16:58.394+0100 [conn219322] command probtn.$cmd command: findAndModify { findandmodify: "UserStatistic", query: { _id: ObjectId('54a042b94e27d95663057f32') }, sort: { _id: 1 }, new: 0, remove: 0, upsert: 0, update: { $set: { ...  } } } keyUpdates:0 numYields:0 locks(micros) w:130567 reslen:373 130ms

(Note that I removed the data from $set for privacy reasons.)
After a restart, mongod did not crash again.

Our application communicates with the database via the node-mongodb-native driver (http://mongodb.github.io/node-mongodb-native/), version 1.3.15.
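For illustration, here is a minimal sketch of how such a findAndModify is issued through the 1.x node driver; the connection URL and the $set field are placeholders, not our real application code:

// Minimal sketch (not our actual application code): a findAndModify like the
// one in the log above, issued through node-mongodb-native 1.x.
// The connection URL and the $set field are placeholders.
var MongoClient = require('mongodb').MongoClient;
var ObjectID = require('mongodb').ObjectID;

MongoClient.connect('mongodb://localhost:27017/probtn', function (err, db) {
  if (err) throw err;
  db.collection('UserStatistic').findAndModify(
    { _id: new ObjectID('54a042b94e27d95663057f32') },  // query
    [['_id', 1]],                                        // sort
    { $set: { lastSeen: new Date() } },                  // update (placeholder field)
    { new: false, upsert: false },
    function (err, doc) {
      if (err) console.error(err);
      db.close();
    }
  );
});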



 Comments   
Comment by David Storch [ 12/Jun/15 ]

Hi quantomass,

Unfortunately, after further investigating this issue, we have been unable to reproduce it or identify the exact root cause. I am going to close this ticket as "Cannot Reproduce". If you see this again or have further clues for reproducing the issue, please re-open the ticket and we will investigate further.

As mentioned before, this appears to be a problem with yield recovery in 2.6. However, the code responsible for yielding and reacquiring coarse locks (i.e. global, database and collection-level locks) changed substantially between version 2.6 and version 3.0, including a completely rewritten lock manager. Therefore, I expect that this problem will not manifest on 3.0.x stable versions.

Please let me know if you have any further questions.

Best,
Dave

Comment by Andy Schwerin [ 03/Feb/15 ]

The stack trace actually indicates that the problem occurred on a find, not a findAndModify. Somehow, it appears that the database lock was released before the function returned, and then the RAII DBRead object also tried to unlock it on destruction. david.storch should take a look to see whether there's a problem in yield recovery somewhere.

Comment by quantomass [ 27/Jan/15 ]

Hi,

It's OK for us to share the latest database backup. Please let us know whether there will still be a need to run a test application server.

Comment by Ramon Fernandez Marina [ 26/Jan/15 ]

quantomass, I'm still unable to trigger this assertion. I tried using iibench to create large databases/collections and add load; this is what my dataset looks like:

> db.stats()
{
        "db" : "iibench2",
        "collections" : 4,
        "objects" : 74905025,
        "avgObjSize" : 3582.442145209884,
        "dataSize" : 268342918448,
        "storageSize" : 271270017152,
        "numExtents" : 170,
        "indexes" : 6,
        "indexSize" : 13186514432,
        "fileSize" : 291849109504,
        "nsSizeMB" : 16,
        "dataFileVersion" : {
                "major" : 4,
                "minor" : 5
        },
        "extentFreeList" : {
                "num" : 0,
                "totalSize" : 0
        },
        "ok" : 1
}
> db.bigcol1.stats()
{
        "ns" : "iibench2.bigcol1",
        "count" : 50448000,
        "size" : 218646272000,
        "avgObjSize" : 4334,
        "storageSize" : 221181954496,
        "numExtents" : 124,
        "nindexes" : 4,
        "lastExtentSize" : 2146426864,
        "paddingFactor" : 1,
        "systemFlags" : 1,
        "userFlags" : 1,
        "totalIndexSize" : 11480673792,
        "indexSizes" : {
                "_id_" : 2670943856,
                "price_1_customerid_1" : 2473378992,
                "cashregisterid_1_price_1_customerid_1" : 3170677328,
                "price_1_dateandtime_1_customerid_1" : 3165673616
        },
        "ok" : 1
}
> db.bigcol2.stats()
{
        "ns" : "iibench2.bigcol2",
        "count" : 24457010,
        "size" : 49696644320,
        "avgObjSize" : 2032,
        "storageSize" : 50088046272,
        "numExtents" : 44,
        "nindexes" : 2,
        "lastExtentSize" : 2146426864,
        "paddingFactor" : 1,
        "systemFlags" : 0,
        "userFlags" : 1,
        "totalIndexSize" : 1705840640,
        "indexSizes" : {
                "price_1_customerid_1" : 840190288,
                "price_1_datetime_1_customerid_1" : 865650352
        },
        "ok" : 1
}

I launched 100 threads doing a multi-update and 10000 threads doing a findAndModify, both on bigcol1, but no assertion.
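For reference, the shape of the operations was roughly as follows (a simplified mongo shell sketch; the exact queries and values used by the threads may have differed, and the field names are taken from the index definitions above):

// Simplified sketch of the load shape; exact queries/values in the actual run
// may have differed.
// Multi-update, as issued by the 100 updater threads:
db.bigcol1.update(
    { price: { $lt: 100 } },
    { $set: { dateandtime: new Date() } },
    { multi: true }
);
// findAndModify, as issued by the 10000 findAndModify threads:
db.bigcol1.findAndModify({
    query: { cashregisterid: Math.floor(Math.random() * 1000) },
    sort: { price: 1 },
    update: { $set: { dateandtime: new Date() } }
});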

I was going to ask you for more details about your application, but I'm afraid that may not be sufficient to reproduce the issue, so I'm working on other options. For example, would you be able to run a test environment with the same dataset and same application but at logLevel 2? It's very strange that an established connection triggers this assertion after some apparent inactivity.
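For reference, the verbosity can be raised either by starting mongod with the -vv flag or at runtime from the mongo shell:

// Raise the server log verbosity to level 2 at runtime;
// alternatively, start mongod with -vv.
db.adminCommand({ setParameter: 1, logLevel: 2 })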

Thanks,
Ramón.

Comment by quantomass [ 25/Jan/15 ]

Here is our hardware info; let me know if you need anything else.
Note that I've renamed those two collections (as I said I was going to in my previous post).

CPU:

$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 60
Model name:            Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
Stepping:              3
CPU MHz:               3891.273
CPU max MHz:           3900.0000
CPU min MHz:           800.0000
BogoMIPS:              6800.48
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0-7

Memory:

$ cat /proc/meminfo
MemTotal:       32807612 kB
MemFree:          334912 kB
MemAvailable:   31861308 kB
Buffers:          141752 kB
Cached:         30742020 kB
SwapCached:         8648 kB
Active:         19735168 kB
Inactive:       11502316 kB
Active(anon):     172024 kB
Inactive(anon):   190852 kB
Active(file):   19563144 kB
Inactive(file): 11311464 kB
Unevictable:        4720 kB
Mlocked:            4720 kB
SwapTotal:      16768892 kB
SwapFree:       16711648 kB
Dirty:              3964 kB
Writeback:             0 kB
AnonPages:        352664 kB
Mapped:          1892484 kB
Shmem:              5664 kB
Slab:             932784 kB
SReclaimable:     905204 kB
SUnreclaim:        27580 kB
KernelStack:        6640 kB
PageTables:        72012 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    33172696 kB
Committed_AS:    1169620 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      127360 kB
VmallocChunk:   34359605356 kB
HardwareCorrupted:     0 kB
AnonHugePages:    116736 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
DirectMap4k:       82368 kB
DirectMap2M:     3971072 kB
DirectMap1G:    29360128 kB

Disks:

$ sudo lshw -class disk
  *-disk                  
       description: ATA Disk
       product: INTEL SSDSC2BW24
       physical id: 0.0.0
       bus info: scsi@0:0.0.0
       logical name: /dev/sda
       version: DC02
       serial: CVDA330003BF2403GN
       size: 223GiB (240GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 sectorsize=512 signature=000eea1f
  *-disk
       description: ATA Disk
       product: INTEL SSDSC2BW24
       physical id: 0.0.0
       bus info: scsi@1:0.0.0
       logical name: /dev/sdb
       version: DC02
       serial: CVDA3296013Q2403GN
       size: 223GiB (240GB)
       capabilities: partitioned partitioned:dos
       configuration: ansiversion=5 sectorsize=512 signature=000bc782

Comment by Ramon Fernandez Marina [ 25/Jan/15 ]

Thanks for the additional info quantomass. I'll work on reproducing the issue on my end by creating large collections and running some queries, but it may be very hard to trigger the same assertion. If I'm unsuccessful, would you consider sharing your database files with us? If the answer is yes you can privately and securely upload them as you did with the log for access to MongoDB staff only and for the purpose of reproducing this issue – please let me know.

I'm also interested in the hardware configuration you're using: number of CPUs and cores, CPU kind and speed, memory, and storage configuration.

Lastly, you may want to consider running a replica set for robustness (a minimal initiation sketch is below). If you do, I'd recommend you run Ubuntu 14.04 instead of 14.10 on one of the nodes, on the off chance that the issue is triggered by an OS instability.
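For reference, a minimal sketch of initiating a three-member replica set from the mongo shell; the hostnames are placeholders, and each mongod must already be running with the same --replSet name:

// Minimal sketch: initiate a 3-member replica set. Hostnames are placeholders;
// each mongod must already be started with --replSet rs0.
rs.initiate({
    _id: "rs0",
    members: [
        { _id: 0, host: "mongo1.example.com:27017" },
        { _id: 1, host: "mongo2.example.com:27017" },
        { _id: 2, host: "mongo3.example.com:27017" }   // e.g. the Ubuntu 14.04 node
    ]
});
rs.status();   // confirm all members reach PRIMARY/SECONDARY state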

Thanks,
Ramón

Comment by quantomass [ 24/Jan/15 ]

Actually, it crashed again just now with the same error. I've uploaded mongolog2.log, starting from yesterday's restart (so it partially overlaps with the previous file). The crash happens at:

2015-01-24T17:30:17.809+0100 [conn107] Fatal Assertion 16137

This time the previous records for conn107 are updates to the CustomStatistic collection.

I don't know if this is relevant, but these two collections (CustomStatistic and UserStatistic) are among the biggest in our db and have the largest indexes. Here are some stats:

> db.stats()
{
    "db" : "probtn",
    "collections" : 49,
    "objects" : 147020281,
    "avgObjSize" : 374.03227784607486,
    "dataSize" : 54990330592,
    "storageSize" : 62272278384,
    "numExtents" : 298,
    "indexes" : 82,
    "indexSize" : 20767015472,
    "fileSize" : 90084212736,
    "nsSizeMB" : 16,
    "dataFileVersion" : {
        "major" : 4,
        "minor" : 5
    },
    "extentFreeList" : {
        "num" : 0,
        "totalSize" : 0
    },
    "ok" : 1
}
> db.UserStatistic.stats()
{
    "ns" : "probtn.UserStatistic",
    "count" : 29321430,
    "size" : 7906213280,
    "avgObjSize" : 269,
    "storageSize" : 9305935856,
    "numExtents" : 25,
    "nindexes" : 2,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1.0100000000000016,
    "systemFlags" : 1,
    "userFlags" : 1,
    "totalIndexSize" : 4143318816,
    "indexSizes" : {
        "_id_" : 1579823952,
        "bundle_id_1_device_type_1_device_uid_1" : 2563494864
    },
    "ok" : 1
}
> db.CustomStatistic.stats()
{
    "ns" : "probtn.CustomStatistic",
    "count" : 44846317,
    "size" : 18787568688,
    "avgObjSize" : 418,
    "storageSize" : 20038070176,
    "numExtents" : 30,
    "nindexes" : 2,
    "lastExtentSize" : 2146426864,
    "paddingFactor" : 1,
    "systemFlags" : 1,
    "userFlags" : 1,
    "totalIndexSize" : 6433547232,
    "indexSizes" : {
        "_id_" : 2580451888,
        "bundle_id_1_device_type_1_device_uid_1" : 3853095344
    },
    "ok" : 1
}

Do you think the problem may be caused by the size of these collections/indexes? If so, perhaps for now I should rename them to something like CustomStatistic_bkp so that new ones start from scratch (roughly as sketched below)? That would cause us less damage than crashing.
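If renaming is the way to go, I assume from the shell it would look something like this (please correct me if there's a better approach):

// Assumed approach (please correct me if this is wrong): rename the large
// collections so the application starts writing to fresh, empty ones.
// renameCollection blocks other operations on the database while it runs.
db.UserStatistic.renameCollection("UserStatistic_bkp");
db.CustomStatistic.renameCollection("CustomStatistic_bkp");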

Comment by quantomass [ 24/Jan/15 ]

I ran the same command as the one logged at 2015-01-23T19:16:58.394+0100 ([conn219322] command probtn.$cmd command: findAndModify ...) from the mongo client, and so far the server is working fine.
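Concretely, what I ran from the shell was equivalent to the following (with the real $set payload replaced by a placeholder field for privacy):

// Re-run of the logged command from the mongo shell; the real $set payload
// is replaced by a placeholder field.
db.UserStatistic.findAndModify({
    query: { _id: ObjectId('54a042b94e27d95663057f32') },
    sort: { _id: 1 },
    new: false,
    remove: false,
    upsert: false,
    update: { $set: { placeholderField: 1 } }
});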

Comment by Ramon Fernandez Marina [ 24/Jan/15 ]

Thanks for uploading the logs, quantomass. I take it that if you re-run the findAndModify command the assertion is not triggered again, is that correct?

Comment by quantomass [ 24/Jan/15 ]

Hi,
I've uploaded the log file starting from the previous restart.

Comment by Ramon Fernandez Marina [ 23/Jan/15 ]

Well, I've never seen this assertion or backtrace before, so I was interested in seeing whether there are any earlier clues in the log. A log from the last restart until the crash would be a good place to start. One option is to compress it and upload it as follows:

scp -P 722 -r compressed_log_file  SERVER-17029@www.mongodb.com:

When prompted for a password just hit enter.

Thanks,
Ramón.

Comment by quantomass [ 23/Jan/15 ]

Yes, but do you need only a limited time period? The service has been running for about 20 days now, so the log is huge.

Comment by Ramon Fernandez Marina [ 23/Jan/15 ]

This is what addr2line has to say:

$ addr2line -ie ~/debug/2.6.5/mongod 0x11e9b11 0x118b849 0x116e37d 0xa1d354 0xa1e084 0xd5f0b9 0xba1052 0xba2630 0x770d5f 0x119f93e 0x7f4167bd10a5 0x7f4166eca84d
/srv/10gen/mci-exec/mci/shell/src/src/mongo/util/stacktrace.cpp:304
/srv/10gen/mci-exec/mci/shell/src/src/mongo/util/log.cpp:127
/srv/10gen/mci-exec/mci/shell/src/src/mongo/util/debug_util.h:62
/srv/10gen/mci-exec/mci/shell/src/src/mongo/util/assert_util.cpp:139
/srv/10gen/mci-exec/mci/shell/src/src/third_party/boost/boost/thread/locks.hpp:401
/srv/10gen/mci-exec/mci/shell/src/src/mongo/util/concurrency/qlock.h:349
/srv/10gen/mci-exec/mci/shell/src/src/mongo/db/d_concurrency.cpp:678
/srv/10gen/mci-exec/mci/shell/src/src/mongo/db/d_concurrency.cpp:637
/srv/10gen/mci-exec/mci/shell/src/src/mongo/db/namespace_string.h:49
/srv/10gen/mci-exec/mci/shell/src/src/mongo/db/query/new_find.cpp:830
/srv/10gen/mci-exec/mci/shell/src/src/mongo/db/instance.cpp:269
/srv/10gen/mci-exec/mci/shell/src/src/mongo/db/instance.cpp:510
/srv/10gen/mci-exec/mci/shell/src/src/mongo/util/net/hostandport.h:31
/srv/10gen/mci-exec/mci/shell/src/src/mongo/db/db.cpp:202
/srv/10gen/mci-exec/mci/shell/src/src/third_party/boost/boost/smart_ptr/shared_ptr.hpp:424
/srv/10gen/mci-exec/mci/shell/src/src/mongo/util/net/message_server_port.cpp:210
??:0
??:0

Comment by Ramon Fernandez Marina [ 23/Jan/15 ]

Are you able to upload the full logs for this instance? It's OK to sanitize operation contents, IP addresses, etc.
