[SERVER-3869] mongod 2.0.0 crash Created: 15/Sep/11  Updated: 06/Apr/23  Resolved: 16/Oct/11

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Querying
Affects Version/s: 2.0.0
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Liu Qishuai Assignee: Aaron Staple
Resolution: Duplicate Votes: 4
Labels: FRVIa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 10.04.2, http://downloads-distro.mongodb.org/repo/ubuntu-upstart/dists/dist/10gen/binary-amd64/mongodb-10gen_2.0.0_amd64.deb


Attachments: Text File crash.log    
Operating System: Linux
Participants:

 Description   

The MongoDB primary server in a replica set crashed after running for hours.

Here is the log:

Thu Sep 15 19:33:57 Invalid access at address: 0x7fd8469db000

Thu Sep 15 19:33:57 Got signal: 11 (Segmentation fault).

Thu Sep 15 19:33:57 Backtrace:
0xa83fc9 0xa845a0 0x7fe935a718f0 0x7fe934fb80ba 0x909ee2 0x95e73c 0x880494 0x88526a 0xa96a46 0x635dd7 0x7fe935a689ca 0x7fe93501770d
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0xa83fc9]
/usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220) [0xa845a0]
/lib/libpthread.so.0(+0xf8f0) [0x7fe935a718f0]
/lib/libc.so.6(memcpy+0x42a) [0x7fe934fb80ba]
/usr/bin/mongod(_ZN5mongo22fillQueryResultFromObjERNS_11_BufBuilderINS_16TrivialAllocatorEEEPNS_10ProjectionERKNS_7BSONObjEPNS_7DiskLocE+0x8c2) [0x909ee2]
/usr/bin/mongod(_ZN5mongo14processGetMoreEPKcixRNS_5CurOpEiRb+0x86c) [0x95e73c]
/usr/bin/mongod(_ZN5mongo15receivedGetMoreERNS_10DbResponseERNS_7MessageERNS_5CurOpE+0x1b4) [0x880494]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xf6a) [0x88526a]
/usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76) [0xa96a46]
/usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x635dd7]
/lib/libpthread.so.0(+0x69ca) [0x7fe935a689ca]
/lib/libc.so.6(clone+0x6d) [0x7fe93501770d]

Thu Sep 15 19:33:57 dbexit:
Thu Sep 15 19:33:57 [conn36194] shutdown: going to close listening sockets...
Thu Sep 15 19:33:57 [conn36194] closing listening socket: 5
Thu Sep 15 19:33:57 [conn36194] closing listening socket: 6
Thu Sep 15 19:33:57 [conn36194] closing listening socket: 7
Thu Sep 15 19:33:57 [conn36194] removing socket file: /tmp/mongodb-27017.sock
Thu Sep 15 19:33:57 [conn36194] shutdown: going to flush diaglog...
Thu Sep 15 19:33:57 [conn36194] shutdown: going to close sockets...
Thu Sep 15 19:33:57 [conn36194] shutdown: waiting for fs preallocator...
Thu Sep 15 19:33:57 [conn36194] shutdown: closing all files...
Thu Sep 15 19:33:57 [conn1] end connection 127.0.0.1:59305
Thu Sep 15 19:33:57 [conn36224] got request after shutdown()
Thu Sep 15 19:33:57 [conn38857] got request after shutdown()
Thu Sep 15 19:33:57 [conn32693] got request after shutdown()
Thu Sep 15 19:33:57 [conn36192] got request after shutdown()
Thu Sep 15 19:33:57 [conn36202] got request after shutdown()
Thu Sep 15 19:33:57 [conn36198] got request after shutdown()
Thu Sep 15 19:33:57 Invalid access at address: 0

Thu Sep 15 19:33:57 Got signal: 11 (Segmentation fault).

Thu Sep 15 19:33:57 [conn32527] got request after shutdown()
Thu Sep 15 19:33:58 [conn39067] got request after shutdown()
Thu Sep 15 19:33:58 [conn36218] got request after shutdown()
Thu Sep 15 19:33:58 [conn32643] got request after shutdown()
Thu Sep 15 19:33:58 [conn32668] got request after shutdown()
Thu Sep 15 19:33:58 [conn39312] got request after shutdown()
Thu Sep 15 19:33:58 [conn38971] got request after shutdown()
Thu Sep 15 19:33:58 [conn33274] got request after shutdown()
Thu Sep 15 19:33:58 [conn36200] got request after shutdown()
Thu Sep 15 19:33:58 [conn33230] got request after shutdown()
Thu Sep 15 19:33:58 ERROR: Client::shutdown not called: slaveTracking
Thu Sep 15 19:33:58 [conn36210] got request after shutdown()
Thu Sep 15 19:33:58 Backtrace:
0xa83fc9 0xa845a0 0x7fe935a718f0 0x7fe934fab392 0x7fe934faee83 0xa9d8e1 0x7fe935a68a61 0x7fe93501770d
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0xa83fc9]
/usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220) [0xa845a0]
/lib/libpthread.so.0(+0xf8f0) [0x7fe935a718f0]
/lib/libc.so.6(+0x7a392) [0x7fe934fab392]
/lib/libc.so.6(cfree+0x73) [0x7fe934faee83]
/usr/bin/mongod(tls_destructor+0xb1) [0xa9d8e1]
/lib/libpthread.so.0(+0x6a61) [0x7fe935a68a61]
/lib/libc.so.6(clone+0x6d) [0x7fe93501770d]

Thu Sep 15 19:33:58 dbexit: ; exiting immediately

        *** glibc detected *** /usr/bin/mongod: double free or corruption (fasttop): 0x0000000003688700 ***
        ======= Backtrace: =========
        /lib/libc.so.6(+0x775b6)[0x7fe934fa85b6]
        /lib/libc.so.6(cfree+0x73)[0x7fe934faee83]
        /usr/bin/mongod(_ZNSt8_Rb_treeIPN5mongo9MongoFileES2_St9_IdentityIS2_ESt4lessIS2_ESaIS2_EE5eraseERKS2_+0x144)[0x67b7b4]
        /usr/bin/mongod(_ZN5mongo9MongoFile9destroyedEv+0x2c)[0x67a59c]
        /usr/bin/mongod(_ZN5mongo8MongoMMF5closeEv+0x59)[0x759149]
        /usr/bin/mongod(_ZN5mongo9MongoFile13closeAllFilesERSt18basic_stringstreamIcSt11char_traitsIcESaIcEE+0x151)[0x67b1c1]
        /usr/bin/mongod(_ZN5mongo14shutdownServerEv+0x6d1)[0x881071]
        /usr/bin/mongod(_ZN5mongo6dbexitENS_8ExitCodeEPKcb+0x2a0)[0x881710]
        /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x4f6)[0xa84126]
        /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220)[0xa845a0]
        /lib/libpthread.so.0(+0xf8f0)[0x7fe935a718f0]
        /lib/libc.so.6(memcpy+0x42a)[0x7fe934fb80ba]
        /usr/bin/mongod(_ZN5mongo22fillQueryResultFromObjERNS_11_BufBuilderINS_16TrivialAllocatorEEEPNS_10ProjectionERKNS_7BSONObjEPNS_7DiskLocE+0x8c2)[0x909ee2]
        /usr/bin/mongod(_ZN5mongo14processGetMoreEPKcixRNS_5CurOpEiRb+0x86c)[0x95e73c]
        /usr/bin/mongod(_ZN5mongo15receivedGetMoreERNS_10DbResponseERNS_7MessageERNS_5CurOpE+0x1b4)[0x880494]
        /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xf6a)[0x88526a]
        /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76)[0xa96a46]
        /usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287)[0x635dd7]
        /lib/libpthread.so.0(+0x69ca)[0x7fe935a689ca]
        /lib/libc.so.6(clone+0x6d)[0x7fe93501770d]


 Comments   
Comment by Eliot Horowitz (Inactive) [ 16/Oct/11 ]

See SERVER-3961

Comment by Scott Hernandez (Inactive) [ 11/Oct/11 ]

Another workaround for the geo-query getMore (cursor bug) issue is to use a limit that avoids keeping a cursor open and sending additional batches. Something like limit(100), for example, should eliminate the problem for testing. It is possible that not all of you have the same issue, but if everyone is using geo-queries, then that is most likely it.
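The workaround above works because the crash sits in the getMore code path: if the whole result set fits in the first reply batch, the client never sends a getMore. A minimal sketch of that reasoning, assuming the server's usual first-batch cap of 101 documents (an assumption about default batching behaviour, not something stated in this ticket):

```python
# Assumed default: the first query reply holds at most ~101 documents
# (or 1 MB of data); this constant is an illustration, not from the ticket.
FIRST_BATCH_MAX_DOCS = 101

def may_issue_getmore(limit: int) -> bool:
    """Return True if a query with this limit can still require getMore batches.

    limit == 0 means "no limit" in the wire protocol, so the cursor stays
    open and further batches are fetched via getMore.
    """
    return limit == 0 or limit > FIRST_BATCH_MAX_DOCS

# limit(100) fits entirely in the first reply, so no getMore is sent
assert may_issue_getmore(100) is False
# an unlimited query (or a large limit) keeps the cursor open
assert may_issue_getmore(0) is True
assert may_issue_getmore(1000) is True
```

This is why limit(100) is a useful probe: it exercises the same query plan while skipping the code path shown in the backtraces (processGetMore → fillQueryResultFromObj).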

Comment by Christian Tonhäuser [ 06/Oct/11 ]

I analyzed this problem together with Scott, and he found out that it's probably a bug in the spatial search code that may occur when concurrent reads and writes are executed on the same dataset.
We got rid of the spatial indexes in our data model and replaced them with something else and now the crash is gone.

According to Scott, the error should be corrected in the current dev branch, but it might take a little longer until an updated build containing the fix is available.

Comment by Christian Tonhäuser [ 27/Sep/11 ]

@Eliot: Our current data model looks roughly like this:
{
  "_id": NumberLong(123456789),
  "instanceId": "Test",
  "description": null,
  "createdTS": NumberLong("1313506888027"),
  "lastUpdateTS": NumberLong("1316443719486"),
  "expiresTS": NumberLong(0),
  "properties": { "p1": 5, "p2": "B", "p3": "", "p4": "", "p5": "", "p6": 991, "p7": 1, "p8": 6, "p9": "", "p10": "0:0-0", "p11": 0, "p12": 1, "p13": 0, "p14": "", "p15": "-", "p16": "-", "p17": [ "MYS" ] },
  "version": 0,
  "binData": null,
  "type": "Line",
  "coordinates": [ [114.00405, 4.41043, 0], [114.00463, 4.41081, 0] ],
  "bbox": [ [114.00405, 4.41043], [114.00463, 4.41081] ]
}

(Note: The number of coordinates in the "coordinates" array is usually between 2 and 8 entries for most objects. However, it is not bounded by our application; there are objects with > 50 coordinate values in the array.)

Currently, the only indexes on the collection are on the _id field (which is also used for sharding) and on the "bbox" array, which is a spatial index. (Note: The bbox field was introduced when we noticed the first crashes. We still had the spatial index on the coordinates array then, and therefore tried to reduce the size of the index by simply storing the spatial bounding box of each object. It didn't help, though...)
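The stored bbox can be derived from the coordinates array. A minimal sketch of such a derivation (a hypothetical helper, not the reporter's code), assuming each coordinate is a [lon, lat, alt] triple and the altitude is ignored:

```python
def bounding_box(coordinates):
    """Compute a 2D [[min_lon, min_lat], [max_lon, max_lat]] bounding box
    from a list of [lon, lat, alt] triples, ignoring the altitude component."""
    lons = [c[0] for c in coordinates]
    lats = [c[1] for c in coordinates]
    return [[min(lons), min(lats)], [max(lons), max(lats)]]

# Using the coordinates from the sample document above:
coords = [[114.00405, 4.41043, 0], [114.00463, 4.41081, 0]]
assert bounding_box(coords) == [[114.00405, 4.41043], [114.00463, 4.41081]]
```

For the two-point sample document the result matches the bbox field stored alongside it.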

The only requests we are doing are rectangle ($within) queries on the bbox field for reading, and "fake" updates, i.e. updates that don't really change the object but trigger a write operation on the DB anyway.
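A rectangle query of the kind described can be expressed as a $within/$box filter document on the spatially indexed bbox field. The sketch below only builds the filter as plain data (the corner coordinates are placeholders; whether this matches the reporter's exact query shape is an assumption):

```python
def box_query(lower_left, upper_right):
    """Build a MongoDB 2.0-era $within/$box filter document for a
    rectangle query on the 2d-indexed "bbox" field."""
    return {"bbox": {"$within": {"$box": [lower_left, upper_right]}}}

# Hypothetical query rectangle near the sample document's coordinates:
q = box_query([114.0, 4.4], [114.1, 4.5])
assert q == {"bbox": {"$within": {"$box": [[114.0, 4.4], [114.1, 4.5]]}}}
```

Issued repeatedly against a collection receiving concurrent updates, this is the read pattern that, per the comments above, triggers the getMore crash.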

Our biggest collection contains ~124 million objects at the moment, but it will slowly grow in the future.

When running load tests with around 50-100 simulated users that do about 98% read operations and 2% write operations (with 1-5 objects contained in every write), it takes at most half an hour until one of the primaries from our replica sets crashes.
When we are running lower loads (e.g. 10 concurrent users) everything is fine.

Also, when running read-only or write-only tests everything's fine, too.

Comment by Dennis Hoene [ 26/Sep/11 ]

@Christian:

Thx, I filed another issue: https://jira.mongodb.org/browse/SERVER-3947

Comment by Christian Tonhäuser [ 26/Sep/11 ]

@Eliot: Please contact me by email so we can discuss what kind of data and code I might be able to provide.
We have also conducted further testing on single instances, and it seems that with a write-only load the problem does not manifest itself. Will try that out on our sharded environment soon.

@Dennis: This seems to be a different error from the one we experience; at least the backtrace is radically different.

Comment by Dennis Hoene [ 26/Sep/11 ]

I'm getting a similar misbehaviour during heavy document inserts and transformations with multiple connections on:

$ uname -a
Linux winlager-eval 2.6.18-194.32.1.el5 #1 SMP Wed Jan 5 17:52:25 EST 2011 x86_64 x86_64 x86_64 GNU/Linux
$ cat /etc/redhat-release
CentOS release 5.5 (Final)

I'm running a single server instance of 2.0.0, no replication/sharding. The problem has occurred since updating the server from 1.8 to 2.0 and the PHP driver from 1.1.4 to 1.2.5. mongod terminates with the following log output:

Mon Sep 26 14:52:22 Invalid access at address: 0

Mon Sep 26 14:52:22 Got signal: 11 (Segmentation fault).

Mon Sep 26 14:52:22 Backtrace:
0xa83fc9 0xa845a0 0x3a5ba0eb10 0x54d7dc 0x83d551 0x83d756 0x83d8b0 0x95e2e9 0x880494 0x88526a 0xa96a46 0x635dd7 0x3a5ba0673d 0x3a5b2d3f6d
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0xa83fc9]
/usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220) [0xa845a0]
/lib64/libpthread.so.0 [0x3a5ba0eb10]
/usr/bin/mongod(_ZN5mongo24FieldRangeVectorIterator7advanceERKNS_7BSONObjE+0x4c) [0x54d7dc]
/usr/bin/mongod(_ZN5mongo11BtreeCursor29skipOutOfRangeKeysAndCheckEndEv+0x81) [0x83d551]
/usr/bin/mongod(_ZN5mongo11BtreeCursor12skipAndCheckEv+0x26) [0x83d756]
/usr/bin/mongod(_ZN5mongo11BtreeCursor7advanceEv+0x100) [0x83d8b0]
/usr/bin/mongod(_ZN5mongo14processGetMoreEPKcixRNS_5CurOpEiRb+0x419) [0x95e2e9]
/usr/bin/mongod(_ZN5mongo15receivedGetMoreERNS_10DbResponseERNS_7MessageERNS_5CurOpE+0x1b4) [0x880494]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xf6a) [0x88526a]
/usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76) [0xa96a46]
/usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x635dd7]
/lib64/libpthread.so.0 [0x3a5ba0673d]
/lib64/libc.so.6(clone+0x6d) [0x3a5b2d3f6d]

Logstream::get called in uninitialized state
Mon Sep 26 14:52:22 ERROR: Client::~Client _context should be null but is not; client:conn
Logstream::get called in uninitialized state
Mon Sep 26 14:52:22 ERROR: Client::shutdown not called: conn

Comment by Eliot Horowitz (Inactive) [ 23/Sep/11 ]

Can you send the queries or operations you are doing?
Would it be possible to get a copy of the data set with ops or code? (can be private)

Comment by Christian Tonhäuser [ 23/Sep/11 ]

We have now found out that this issue also crashes single server instances, albeit not as frequently (or quickly) as with a replica set.

Please raise the priority to blocking.

Comment by Christian Tonhäuser [ 22/Sep/11 ]

Turning off NUMA-support in the kernel does not help.
Also, the crash already happened back in some 1.9 nightly build that we still had around for some reason.

The crash is triggered quite quickly when we're running load tests on our application.
Single server instances don't exhibit this behaviour.

Comment by Christian Tonhäuser [ 22/Sep/11 ]

We are encountering the exact same stack trace on CentOS 5.6 64-bit using the prebuilt (non-static) binaries.

We are running a sharded environment with three shards, each consisting of a primary, a secondary, and an arbiter.
The primary replica set members seem to crash at random, i.e. it's always a different mongod process going down.
The whole cluster is running on six machines in total, including three config server instances.

Currently running a NUMA kernel (2.6.18-238.el5); will test soon with a non-NUMA kernel (in case this helps...).

If you need further logs or would like some tests to be executed just ask...

Backtrace of the crash:
Thu Sep 22 12:34:46 Got signal: 11 (Segmentation fault).

Thu Sep 22 12:34:46 Backtrace:
0xa83fc9 0xa845a0 0x395540eb10 0x395487c651 0x909ee2 0x95e73c 0x880494 0x88526a 0xa96a46 0x635dd7 0x395540673d 0x39548d40cd
/data/project/mongodb/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0xa83fc9]
/data/project/mongodb/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220) [0xa845a0]
/lib64/libpthread.so.0 [0x395540eb10]
/lib64/libc.so.6(memcpy+0x411) [0x395487c651]
/data/project/mongodb/bin/mongod(_ZN5mongo22fillQueryResultFromObjERNS_11_BufBuilderINS_16TrivialAllocatorEEEPNS_10ProjectionERKNS_7BSONObjEPNS_7DiskLocE+0x8c2) [0x909ee2]
/data/project/mongodb/bin/mongod(_ZN5mongo14processGetMoreEPKcixRNS_5CurOpEiRb+0x86c) [0x95e73c]
/data/project/mongodb/bin/mongod(_ZN5mongo15receivedGetMoreERNS_10DbResponseERNS_7MessageERNS_5CurOpE+0x1b4) [0x880494]
/data/project/mongodb/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xf6a) [0x88526a]
/data/project/mongodb/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76) [0xa96a46]
/data/project/mongodb/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x635dd7]
/lib64/libpthread.so.0 [0x395540673d]
/lib64/libc.so.6(clone+0x6d) [0x39548d40cd]

Logstream::get called in uninitialized state
Thu Sep 22 12:34:46 ERROR: Client::~Client _context should be null but is not; client:conn
Logstream::get called in uninitialized state
Thu Sep 22 12:34:46 ERROR: Client::shutdown not called: conn

Comment by Liu Qishuai [ 16/Sep/11 ]

Attached a more detailed log.

Comment by Eliot Horowitz (Inactive) [ 16/Sep/11 ]

Can you send the full log?

Generated at Thu Feb 08 03:04:17 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.