[SERVER-3869] mongod 2.0.0 crash Created: 15/Sep/11 Updated: 06/Apr/23 Resolved: 16/Oct/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Index Maintenance, Querying |
| Affects Version/s: | 2.0.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Liu Qishuai | Assignee: | Aaron Staple |
| Resolution: | Duplicate | Votes: | 4 |
| Labels: | FRVIa | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Ubuntu 10.04.2, http://downloads-distro.mongodb.org/repo/ubuntu-upstart/dists/dist/10gen/binary-amd64/mongodb-10gen_2.0.0_amd64.deb |
||
| Attachments: |
|
| Operating System: | Linux |
| Participants: |
| Description |
|
The MongoDB primary server in a replica set crashed after running for hours. Here is the log:
Thu Sep 15 19:33:57 Invalid access at address: 0x7fd8469db000
Thu Sep 15 19:33:57 Got signal: 11 (Segmentation fault).
Thu Sep 15 19:33:57 Backtrace:
Thu Sep 15 19:33:57 dbexit:
Thu Sep 15 19:33:57 Got signal: 11 (Segmentation fault).
Thu Sep 15 19:33:57 [conn32527] got request after shutdown()
Thu Sep 15 19:33:58 dbexit: ; exiting immediately
|
| Comments |
| Comment by Eliot Horowitz (Inactive) [ 16/Oct/11 ] |
|
See |
| Comment by Scott Hernandez (Inactive) [ 11/Oct/11 ] |
|
Another workaround for the geoquery getmore (cursor bug) issue is to use a limit that avoids needing a cursor or sending additional batches. Something like limit(100), for example, should eliminate the problem for testing. It is possible that not all of you have the same issue, but if everyone is using geo-queries then that is most likely it. |
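Editorial sketch of the limit workaround described in this comment, not code from the thread. It assumes the spatially indexed field is named "bbox" (as in a later comment) and that the rectangle query uses the `$within`/`$box` operators of that server generation. The helper name `bounded_box_query` is hypothetical; it only builds the query arguments and does not contact a server.

```python
def bounded_box_query(lower_left, upper_right, limit=100):
    """Build arguments for a rectangle query that carries an explicit limit.

    lower_left / upper_right are (x, y) pairs. The field name "bbox" is an
    assumption taken from the data model described elsewhere in this thread.
    """
    if limit <= 0:
        # A non-positive limit would not cap the result size,
        # which defeats the purpose of the workaround.
        raise ValueError("limit must be a positive integer")
    return {
        "filter": {
            "bbox": {"$within": {"$box": [list(lower_left), list(upper_right)]}}
        },
        "limit": limit,
    }


args = bounded_box_query((13.0, 52.0), (14.0, 53.0))
print(args)
```

With a current PyMongo, these arguments could be passed as `collection.find(**args)`, where `collection` is a hypothetical handle to the affected collection. Whether a given limit really avoids a follow-up batch depends on document size, so the value 100 is only the starting point suggested above.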
| Comment by Christian Tonhäuser [ 06/Oct/11 ] |
|
I analyzed this problem together with Scott, and he found out that it's probably a bug in the spatial search code that may occur when concurrent reads and writes are executed on the same dataset. According to Scott, the error should be corrected in the current dev branch, but it might take a little longer until an updated build containing the fix is available. |
| Comment by Christian Tonhäuser [ 27/Sep/11 ] |
|
@Eliot: Our current data model looks roughly like this: ,
(Note: The number of coordinates in the "coordinates" array is usually between 2 and 8 entries for most objects. However, it is not bounded by our application; there are objects with > 50 coordinate values in the array.)
Currently, the only indexes on the collection are on the _id field (which is also used for sharding) and on the "bbox" array, which is a spatial index. (Note: The bbox field was added when we noticed the first crashes. At that time the spatial index was still on the coordinates array, so we tried to reduce the size of the index by storing only the spatial bounding box of the objects. Didn't help, though...)
The only requests we are doing are rectangle ($within) queries on the bbox field for reading and "fake" updates, i.e. updates that don't really change the object, but trigger a write operation on the DB anyway. Our biggest collection contains ~124 million objects at the moment, but it will slowly grow in the future.
When running load tests with around 50-100 simulated users that do about 98% read operations and 2% write operations (with 1-5 objects contained in every write), it takes at most half an hour until one of the primaries from our replica sets crashes. When running read-only or write-only tests, everything's fine. |
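Editorial sketch of the bounding-box idea mentioned in this comment, not the reporter's code. The actual document layout is not shown in the thread, so this assumes each entry in "coordinates" is an [x, y] pair and that "bbox" is stored as [[min_x, min_y], [max_x, max_y]]. The function name `bounding_box` is hypothetical.

```python
def bounding_box(coordinates):
    """Return [[min_x, min_y], [max_x, max_y]] for a list of [x, y] pairs.

    The pair layout and the bbox layout are assumptions; the thread does
    not show the real schema.
    """
    if not coordinates:
        raise ValueError("at least one coordinate is required")
    xs = [point[0] for point in coordinates]
    ys = [point[1] for point in coordinates]
    return [[min(xs), min(ys)], [max(xs), max(ys)]]


# One object with three coordinate pairs collapses to a single rectangle,
# so the spatial index holds one small entry per object instead of many.
print(bounding_box([[1, 5], [3, 2], [2, 4]]))  # [[1, 2], [3, 5]]
```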
| Comment by Dennis Hoene [ 26/Sep/11 ] |
|
@Christian: Thx, I filed another issue: https://jira.mongodb.org/browse/SERVER-3947 |
| Comment by Christian Tonhäuser [ 26/Sep/11 ] |
|
@Eliot: Please contact me by email so we can discuss what kind of data and code I might be able to provide. @Dennis: This seems to be a different error from the one we experience; at least the backtrace is radically different. |
| Comment by Dennis Hoene [ 26/Sep/11 ] |
|
I'm getting a similar misbehaviour during heavy doc inputs and transformations with multiple connections on a
I'm running a single server instance 2.0.0, no replication/sharding. The problem has occurred since the server update from 1.8 to 2.0 and the php-driver update from 1.1.4 to 1.2.5. mongod terminates with the following log output:
Mon Sep 26 14:52:22 Invalid access at address: 0
Mon Sep 26 14:52:22 Got signal: 11 (Segmentation fault).
Mon Sep 26 14:52:22 Backtrace:
Logstream::get called in uninitialized state |
| Comment by Eliot Horowitz (Inactive) [ 23/Sep/11 ] |
|
Can you send the queries or operations you are doing? |
| Comment by Christian Tonhäuser [ 23/Sep/11 ] |
|
We now found out that this issue will also crash single server instances, albeit not as frequently (or quickly) as with a ReplicaSet. Please raise the priority to blocking. |
| Comment by Christian Tonhäuser [ 22/Sep/11 ] |
|
Turning off NUMA-support in the kernel does not help. The crash is triggered quite quickly when we're running load tests on our application. |
| Comment by Christian Tonhäuser [ 22/Sep/11 ] |
|
We are encountering the exact same stacktrace on CentOS 5.6 64-bit using the prebuilt (non-static) binaries. We are running a sharded environment with three shards, each consisting of a primary, a secondary and an arbiter. We are currently running a NUMA kernel (2.6.18-238.el5) and will test soon with a non-NUMA kernel (in case this helps...). If you need further logs or would like some tests to be executed, just ask...
Backtrace of the crash:
Thu Sep 22 12:34:46 Backtrace:
Logstream::get called in uninitialized state |
| Comment by Liu Qishuai [ 16/Sep/11 ] |
|
more detailed log |
| Comment by Eliot Horowitz (Inactive) [ 16/Sep/11 ] |
|
Can you send the full log? |