[SERVER-6012] Seg Fault, Invalid access at address, Logstream::get called in uninitialized state, on 64bit mongodb 2.0.6 Created: 05/Jun/12  Updated: 15/Aug/12  Resolved: 07/Aug/12

Status: Closed
Project: Core Server
Component/s: Internal Code, Logging
Affects Version/s: 2.0.6
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Alex Gaudio Assignee: Eric Milkie
Resolution: Incomplete Votes: 0
Labels: crash
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

$ uname -a
Linux <NODENAME_HIDDEN> #50-Ubuntu SMP Mon Sep 12 21:51:23 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

mongod --version
db version v2.0.6, pdfile version 4.5
Tue Jun 5 23:37:53 git version: e1c0cbc25863f6356aa4e31375add7bb49fb05bc

mongod --version
db version v2.0.6, pdfile version 4.5
Tue Jun 5 23:37:53 git version: e1c0cbc25863f6356aa4e31375add7bb49fb05bc


Operating System: Linux
Participants:

 Description   

Hello. I'm hoping you can help me figure out why my mongod instance crashes when I execute a certain query. When I make the below query, I consistently kill the running mongod instance and receive same error message.

I make the below query to a collection containing 170,941,610 documents. The collection has an index on the field, "transformed".

Tue Jun 5 22:40:25 [conn933] query <DB>.stats_cumulative_stats query:

{ transformed: false }

ntoreturn:500000 nscanned:58 255 nreturned:58254 reslen:4194308 607ms

The log shows this error:

Tue Jun 5 22:40:33 Invalid access at address: 0x3d27000

Tue Jun 5 22:40:33 Got signal: 11 (Segmentation fault).

Tue Jun 5 22:40:33 Backtrace:
0xa95ce9 0xa9640c 0x7fbcbcb5dc60 0x7fbcbc0a3885 0x571617 0x911b3f 0x941b2c 0x885a24 0x88e6da 0xaa0bc8 0x6389f7 0x7fbcbcb54d8c 0x7fbcbc0fec2d.
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x3a9) [0xa95ce9]
/usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x22c) [0xa9640c]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfc60) [0x7fbcbcb5dc60]
/lib/x86_64-linux-gnu/libc.so.6(memcpy+0x425) [0x7fbcbc0a3885]
/usr/bin/mongod(_ZNK5mongo10Projection9transformERKNS_7BSONObjERNS_14BSONObjBuilderE+0x147) [0x571617]
/usr/bin/mongod(_ZN5mongo22fillQueryResultFromObjERNS_11_BufBuilderINS_16TrivialAllocatorEEEPNS_10ProjectionERKNS_7BSONObjEPNS_7DiskLocE+0xaf) [0x911b3f]
/usr/bin/mongod(_ZN5mongo14processGetMoreEPKcixRNS_5CurOpEiRb+0x87c) [0x941b2c]
/usr/bin/mongod(_ZN5mongo15receivedGetMoreERNS_10DbResponseERNS_7MessageERNS_5CurOpE+0x1b4) [0x885a24]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xf5a) [0x88e6da]
/usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x78) [0xaa0bc8]
/usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x6389f7]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6d8c) [0x7fbcbcb54d8c]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fbcbc0fec2d]

Logstream::get called in uninitialized state
Tue Jun 5 22:40:33 ERROR: Client::~Client _context should be null but is not; client:conn
Logstream::get called in uninitialized state
Tue Jun 5 22:40:33 ERROR: Client::shutdown not called: conn

To recover the server, I follow these steps (my db is journaled): First, I start it up with /etc/init.d/mongodb start. At this point, I cannot log in via mongo client (ie mongo --host <HOST>) returns "connect failed". This is the error message:

MongoDB shell version: 2.0.4
connecting to: <HOST>:27017/<DB>
Tue Jun 5 19:03:12 Error: couldn't connect to server <HOST>:<PORT> shell/mongo.js:86
exception: connect failed

Then, I restart server via /etc/init.d/mongodb restart and I can log in as usual via the mongo client.



 Comments   
Comment by Eric Milkie [ 07/Aug/12 ]

Closing as incomplete; we will watch for this crash to happen again.

Comment by Eric Milkie [ 26/Jun/12 ]

Hi Alex,
Without any log data, it's hard to tell exactly what happened, and I can't really diagnose the problem without something further to go on.

Comment by Alex Gaudio [ 26/Jun/12 ]

Hey Eric,

Thanks for your response - I repaired the collection and am operating normally.

At this point, I'm confused why the server crashed and corrupted my data when the log partition filled up, given that the db data and logs are in two separate partitions. I wouldn't be surprised if there's a bug or two that contributed to the problem.

Also, a possible future feature: Given mongo's instability when it runs out of disk space mid-write, it may be helpful to add some sort of corruption protection or ability to shutdown the server instance when the server log approaches max disk usage.

Thanks for your help. Alex

Comment by Eric Milkie [ 15/Jun/12 ]

Let me see if I can surmise what happened here.

 
$ sudo mv /var/logs/mongodb/{mongodb.log,mongodb.log2} ; sudo rm -rf /var/logs/mongodb/mongodb.log2
$ sudo -u mongodb /usr/bin/mongod --port 27017 --dbpath /var/lib/mongodb --logpath /var/log/mongodb/mongodb2.log --auth # --auth may not have been there

The move of the file does not affect the open file handle, so mongod keeps attempting to extend the file (but can't).
The delete of the file does not affect the open file handle either, although it does make it impossible to look up the file in the directory again. mongod keeps attempting to extend the file.
When you attempted to start another mongod, you give it the same log path, which writes a new file in that location since the directory entry was marked as deleted in the prior command. There are now two mongod's attempting to write to the disk. The new mongod hasn't yet attempted to open the network port nor the lock file on disk, so it doesn't know that the port is already in use and that the data directory is already in use.
Eventually, something must have freed up disk space enough for both instances of mongod to exit.

At this point, it would seem that your datafiles are corrupt. The usual procedures for recovery apply here: restore from backup, or from another node in a replica set, or, failing that, running repair – which might lose any number of records and which will only fix structural corruption – data corruption could remain unless you have a way of checking for it yourself.

Comment by Alex Gaudio [ 11/Jun/12 ]

>> Did you delete the log file while the server was still running?
Yea, see below. The delete-pending fact is interesting and may explain some of my response to your third question

>> What do you mean specifically by "forked another mongod instance"

Sorry, I mis-spoke a little. Here's basically a replay of what happened when I deleted the log file and managed to restart the server:
$ sudo /etc/init.d/mongodb stop # hangs ...

$ df -h
/dev/xvda1 7.9G 7.9G 0 100%

$ ps -ef|grep mongo
mongodb 17665 1 0 Jun05 ? 00:46:01 /usr/bin/mongod --port 27017 --dbpath /var/lib/mongodb --logpath /var/log/mongodb/mongodb.log --auth

$ sudo mv /var/logs/mongodb/

{mongodb.log,mongodb.log2}

; sudo rm -rf /var/logs/mongodb/mongodb.log2
$ sudo -u mongodb /usr/bin/mongod --port 27017 --dbpath /var/lib/mongodb --logpath /var/log/mongodb/mongodb2.log --auth # --auth may not have been there

  1. ps -ef now shows 2 mongod instances
    $ sudo /etc/init.d/mongodb stop # nothing in this output indicated an improper shutdown
  2. ps -ef now shows 0 mongod instances

>> Do you have a log from the server where you attempted to start it after the failure and yet couldn't connect?

Don't have that log anymore, but if I remember correctly, I don't think the server logged anything (which suggests that perhaps the server log was still somehow "delete-pending"?). I just tried to regenerate the situation where I couldn't connect my mongo client to the server, but can't reproduce that error anymore. Maybe I can't reproduce because I removed indexes on this collection last Friday.

That said, while my server isn't freezing up since deleting the indexes, I still do get this message about corrupt data from a mongo shell:
> db.stats_cumulative_stats.find(

{parsed: false}

) │

error:

{ "$err" : "BSONElement: bad type -48", "code" : 10320 }

2;;bW9uZ28gLS1ob3N0IGVjMi0xMDctMjItNjEtODIuY29tcHV0ZS0xLmFtYXpvCk1>

The corresponding log (similar to initial log I posted):

<< line about conn1 authentication>>>
Mon Jun 11 22:17:07 [clientcursormon] mem (MB) res:836 virt:601889 mapped:300862
Mon Jun 11 22:18:07 [clientcursormon] mem (MB) res:3335 virt:601889 mapped:300862
Mon Jun 11 22:20:07 [clientcursormon] mem (MB) res:3952 virt:601889 mapped:300862
Mon Jun 11 22:20:48 [conn1] Assertion: 10320:BSONElement: bad type -48
0x5848a2 0x507cdb 0x664e11 0x666afd 0x86e5bd 0x86e7f8 0x9506c2 0x8c9616 0x8d9700 0x8d9d56 0x8dd66e 0x8df083 0x8e0357 0x943295 0x8891
07 0x88dcd9 0xaa0bc8 0x6389f7 0x7f2f23776d8c 0x7f2f22d20c2d
/usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x112) [0x5848a2]
/usr/bin/mongod(_ZNK5mongo11BSONElement4sizeEv+0x1cb) [0x507cdb]
/usr/bin/mongod(_ZN5mongo7Matcher13matchesDottedEPKcRKNS_11BSONElementERKNS_7BSONObjEiRKNS_14ElementMatcherEbPNS_12MatchDetailsE+0x
1b01) [0x664e11]
/usr/bin/mongod(_ZN5mongo7Matcher7matchesERKNS_7BSONObjEPNS_12MatchDetailsE+0xdd) [0x666afd]
/usr/bin/mongod(_ZN5mongo19CoveredIndexMatcher7matchesERKNS_7BSONObjERKNS_7DiskLocEPNS_12MatchDetailsEb+0xcd) [0x86e5bd]
/usr/bin/mongod(_ZN5mongo19CoveredIndexMatcher14matchesCurrentEPNS_6CursorEPNS_12MatchDetailsE+0xa8) [0x86e7f8]
/usr/bin/mongod(_ZN5mongo11UserQueryOp4nextEv+0x262) [0x9506c2]
/usr/bin/mongod(_ZN5mongo12QueryPlanSet6Runner6nextOpERNS_7QueryOpE+0x56) [0x8c9616]
/usr/bin/mongod(_ZN5mongo12QueryPlanSet6Runner4nextEv+0x110) [0x8d9700]
/usr/bin/mongod(_ZN5mongo12QueryPlanSet6Runner22runUntilFirstCompletesEv+0x56) [0x8d9d56]
/usr/bin/mongod(_ZN5mongo12QueryPlanSet5runOpERNS_7QueryOpE+0x11e) [0x8dd66e]
/usr/bin/mongod(_ZN5mongo16MultiPlanScanner9runOpOnceERNS_7QueryOpE+0x523) [0x8df083]
/usr/bin/mongod(_ZN5mongo16MultiPlanScanner5runOpERNS_7QueryOpE+0x17) [0x8e0357]
/usr/bin/mongod(ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1+0xfa5) [0x943295]
/usr/bin/mongod() [0x889107]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x559) [0x88dcd9]
/usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x78) [0xaa0bc8]
/usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x6389f7]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6d8c) [0x7f2f23776d8c]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f2f22d20c2d]
Mon Jun 11 22:20:48 [conn1] assertion 10320 BSONElement: bad type -48 ns:<DB>.stats_cumulative_stats query:

{ parsed: false }

Mon Jun 11 22:20:48 [conn1] query <DB>.stats_cumulative_stats query:

{ parsed: false }

exception: BSONElement: bad type -4
8 code:10320 reslen:71 231049ms
Mon Jun 11 22:22:58 [conn1] end connection <IP>:48274
Mon Jun 11 22:23:00 [initandlisten] connection accepted from <IP>:48332 #2
Mon Jun 11 22:25:07 [clientcursormon] mem (MB) res:3940 virt:601889 mapped:300862
Mon Jun 11 22:30:07 [clientcursormon] mem (MB) res:3940 virt:601889 mapped:300862

----------------------
So at this point, I'd say there were/are 2 problems:

1. partition filled up mid-write and corrupted data.
2. indexes on that table were broken and caused a query using the index to crash the database.

Comment by Eric Milkie [ 11/Jun/12 ]

Hi Alex.
Did you delete the log file while the server was still running? This would have just made the file delete-pending, and the server would have continued try to extend the log file.
What do you mean specifically by "forked another mongod instance" – you pointed the new instance at a different data directory?
Do you have a log from the server where you attempted to start it after the failure and yet couldn't connect?

Comment by Alex Gaudio [ 11/Jun/12 ]

Hi Eric, thanks for taking this one. I should also add that this might be due to an event that occurred a few days before I filed this bug: namely, our log partition filled up and locked the mongo server. After I deleted the log file, the mongod instance wouldn't shut down (ie "/$ etc/init.d/mongodb stop" would just hang). I didn't explicitly kill -9 the mongod instance, but instead forked another mongod instance and used /etc/init.d/mongodb stop to shut down both.

I also didn't mention that we are not replicated or sharded.

Hope that helps! Thanks, Alex

Generated at Thu Feb 08 03:10:31 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.