[SERVER-2915] Segfault while creating compound index Created: 07/Apr/11  Updated: 29/Feb/12  Resolved: 21/Jan/12

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Michael Conigliaro Assignee: Mathias Stearn
Resolution: Done Votes: 0
Labels: rsc1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 10.10


Operating System: Linux
Participants:

 Description   

This just happened on one of the masters on one of my replica sets while I was creating an index:

Thu Apr 7 17:26:34 Invalid access at address: 0x3255000

Thu Apr 7 17:26:34 Got signal: 11 (Segmentation fault).

Thu Apr 7 17:26:34 Backtrace:
0x8a2e59 0x8a3430 0x7f220ddeb8f0 0x7f220d33208b 0x7f220db707be 0x7f220db7104c 0x6d6a74 0x6d0694 0x7dae1e 0x7dc741 0x642c25 0x6477c6 0x752225 0x757938 0x8a3b3e 0x8b6a40 0x7f220dde29ca 0x7f220d39170d
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x8a2e59]
/usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220) [0x8a3430]
/lib/libpthread.so.0(+0xf8f0) [0x7f220ddeb8f0]
/lib/libc.so.6(memcpy+0x3fb) [0x7f220d33208b]
/usr/lib/libstdc++.so.6(_ZNSs4_Rep8_M_cloneERKSaIcEm+0x7e) [0x7f220db707be]
/usr/lib/libstdc++.so.6(_ZNSsC1ERKSs+0x3c) [0x7f220db7104c]
/usr/bin/mongod(_ZNK5mongo11ReplSetImpl16_summarizeStatusERNS_14BSONObjBuilderE+0xa94) [0x6d6a74]
/usr/bin/mongod(_ZN5mongo19CmdReplSetGetStatus3runERKSsRNS_7BSONObjERSsRNS_14BSONObjBuilderEb+0x114) [0x6d0694]
/usr/bin/mongod(_ZN5mongo11execCommandEPNS_7CommandERNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x5ae) [0x7dae1e]
/usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_10BufBuilderERNS_14BSONObjBuilderEbi+0x831) [0x7dc741]
/usr/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_10BufBuilderERNS_14BSONObjBuilderEbi+0x35) [0x642c25]
/usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x2e86) [0x6477c6]
/usr/bin/mongod() [0x752225]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x5b8) [0x757938]
/usr/bin/mongod(_ZN5mongo10connThreadEPNS_13MessagingPortE+0x21e) [0x8a3b3e]
/usr/bin/mongod(thread_proxy+0x80) [0x8b6a40]
/lib/libpthread.so.0(+0x69ca) [0x7f220dde29ca]
/lib/libc.so.6(clone+0x6d) [0x7f220d39170d]

Thu Apr 7 17:26:34 dbexit:
Thu Apr 7 17:26:34 [conn30] shutdown: going to close listening sockets...
Thu Apr 7 17:26:34 [conn30] closing listening socket: 5
Thu Apr 7 17:26:34 [conn30] closing listening socket: 6
Thu Apr 7 17:26:34 [conn30] closing listening socket: 7
Thu Apr 7 17:26:34 [conn30] closing listening socket: 8
Thu Apr 7 17:26:34 [conn30] removing socket file: /tmp/mongodb-27018.sock
Thu Apr 7 17:26:34 [conn30] removing socket file: /tmp/mongodb-28018.sock
Thu Apr 7 17:26:34 [conn30] shutdown: going to flush diaglog...
Thu Apr 7 17:26:34 [conn30] shutdown: going to close sockets...
Thu Apr 7 17:26:34 [conn30] shutdown: waiting for fs preallocator...
Thu Apr 7 17:26:34 [conn30] shutdown: closing all files...
1/55 1%
Thu Apr 7 17:26:37 Invalid access at address: 0x7f220273e8c0

Thu Apr 7 17:26:37 Got signal: 11 (Segmentation fault).

Thu Apr 7 17:26:37 Backtrace:
0x8a2e59 0x8a3430 0x7f220ddeb8f0 0x730420 0x73048a 0x77447d 0x710acc 0x70ee12 0x757044 0x758caf 0x8a3b3e 0x8b6a40 0x7f220dde29ca 0x7f220d39170d
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x8a2e59]
/usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220) [0x8a3430]
/lib/libpthread.so.0(+0xf8f0) [0x7f220ddeb8f0]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails6_allocEPKci+0) [0x730420]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails5allocEPKciRNS_7DiskLocE+0x3a) [0x73048a]
/usr/bin/mongod(_ZN5mongo11DataFileMgr17fast_oplog_insertEPNS_16NamespaceDetailsEPKci+0x19d) [0x77447d]
/usr/bin/mongod() [0x710acc]
/usr/bin/mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pb+0x42) [0x70ee12]
/usr/bin/mongod(_ZN5mongo14receivedInsertERNS_7MessageERNS_5CurOpE+0x394) [0x757044]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_8SockAddrE+0x192f) [0x758caf]
/usr/bin/mongod(_ZN5mongo10connThreadEPNS_13MessagingPortE+0x21e) [0x8a3b3e]
/usr/bin/mongod(thread_proxy+0x80) [0x8b6a40]
/lib/libpthread.so.0(+0x69ca) [0x7f220dde29ca]
/lib/libc.so.6(clone+0x6d) [0x7f220d39170d]

Thu Apr 7 17:26:37 dbexit: ; exiting immediately
Thu Apr 7 17:26:38 ERROR: Client::~Client _context should be null but is not; client:conn



 Comments   
Comment by Mathias Stearn [ 02/Sep/11 ]

Do you still have the log file? Also could you try again on 1.8.3?

Comment by Michael Conigliaro [ 07/Apr/11 ]

Definitely on 1.8.1. I upgraded yesterday afternoon, and this crash happened this morning when I tried recreating an index that had previously been created with the keys in the wrong order. I also didn't have journaling enabled until about an hour ago (I thought it was enabled by default for some reason).
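
On the journaling point, a rough way to confirm whether journaling is active from the mongo shell, assuming the 1.8-era behavior where serverStatus only reports a "dur" section while journaling (--journal / --dur) is on:

// check for the "dur" (durability/journaling) section in serverStatus
var status = db.serverStatus();
print(status.dur ? "journaling appears to be enabled" : "journaling appears to be disabled");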

Comment by Eliot Horowitz (Inactive) [ 07/Apr/11 ]

Are you sure this was on 1.8.1 and not on 1.8.0?
Or did it happen on 1.8.0 first?
There was a bug in 1.8.0, fixed in 1.8.1, involving background indexing with journaling.

Comment by Michael Conigliaro [ 07/Apr/11 ]
  • Background : true?
  • Sparse?
  • Number of fields?

Here's the command I ran:

db.userEventsJournal.ensureIndex({"adId":1, "eventId":1, "variationId":1}, {background:true});
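
For what it's worth, a rough sketch of how the earlier, wrongly-ordered index could be confirmed and replaced from the mongo shell (the old key order shown below is only a guess):

// list existing indexes to confirm the key order of the bad one
db.userEventsJournal.getIndexes();

// drop the index that was built with the keys in the wrong order (key pattern here is hypothetical)
db.userEventsJournal.dropIndex({"eventId": 1, "adId": 1, "variationId": 1});

// rebuild it in the background with the intended key order
db.userEventsJournal.ensureIndex({"adId": 1, "eventId": 1, "variationId": 1}, {background: true});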

  • Estimated size?

Here's the size of the index from a shard where the index was built successfully:

"adId_1_eventId_1_variationId_1" : 15024842448

  • When did it start running?

Thu Apr 7 16:20:20 [conn6268] building new index on { adId: 1.0, eventId: 1.0, variationId: 1.0 } for socialmedia.userEventsJournal background
Thu Apr 7 17:26:34 Invalid access at address: 0x3255000

  • How fast was it moving? Was it slowing down / speeding up?

This is the last progress line I saw before the crash:

24787900/77278058 32%

Speed-wise, I would say it was about as consistent as on the other shards where the build completed successfully. It usually starts out pretty fast for the first few percent, then slows down considerably; sometimes it takes up to 5 minutes to advance 1 percent. From that point on it's pretty consistent, just slow. I'm not sure if that means anything.
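
A rough sketch of how that progress can be watched from another mongo shell session; db.currentOp() does report a progress message for index builds, though the exact msg format varies by version (counters like "24787900/77278058 32%" show up there):

// scan in-progress operations for ones carrying a progress message
db.currentOp().inprog.forEach(function (op) {
    if (op.msg) print(op.opid + "  " + op.ns + "  " + op.msg);
});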

Comment by Gaetan Voyer-Perrault [ 07/Apr/11 ]

Just re-read the title. Looks like you're building an index in the background.

Can you provide any insight on the index you're building?

  • Background : true?
  • Sparse?
  • Number of fields?
  • Estimated size?
  • When did it start running?
  • How fast was it moving? Was it slowing down / speeding up?
Comment by Michael Conigliaro [ 07/Apr/11 ]

To answer Gaetan Voyer-Perrault's questions on the mailing list:

  • Is journaling enabled?

Not at the moment.

  • Was there a change in connections?

Not really, as far as I could tell. The connection count had been slowly increasing until the crash; it was around 3500 connections when the crash occurred, but I have other healthy machines with far more connections than that (6000-7000).
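
For reference, those connection counts can be read from serverStatus in the mongo shell, roughly:

// current vs. available connection counts as reported by the server
db.serverStatus().connections;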

  • Was there a change in network traffic?

Traffic looked pretty consistent for the hour or so before the crash.

  • Was there any unusual behavior in the logs before / after?

The logs look pretty normal from what I can tell. This is basically what the logs look like before the crash:

Thu Apr 7 17:26:22 [conn8548] query admin.$cmd ntoreturn:1 command: { getlasterror: 1 } reslen:99 450ms
Thu Apr 7 17:26:22 [conn8549] query admin.$cmd ntoreturn:1 command: { getlasterror: 1 } reslen:99 430ms
Thu Apr 7 17:26:22 [conn8550] query admin.$cmd ntoreturn:1 command: { getlasterror: 1 } reslen:99 450ms
Thu Apr 7 17:26:22 [conn8551] query admin.$cmd ntoreturn:1 command: { getlasterror: 1 } reslen:99 325ms
Thu Apr 7 17:26:22 [conn8545] query admin.$cmd ntoreturn:1 command: { getlasterror: 1 } reslen:99 431ms
Thu Apr 7 17:26:22 [conn8543] query admin.$cmd ntoreturn:1 command: { getlasterror: 1 } reslen:99 431ms
Thu Apr 7 17:26:22 [conn8542] query admin.$cmd ntoreturn:1 command: { getlasterror: 1 } reslen:99 325ms
Thu Apr 7 17:26:22 [conn8547] query admin.$cmd ntoreturn:1 command: { getlasterror: 1 } reslen:99 325ms
Thu Apr 7 17:26:22 [conn8538] query admin.$cmd ntoreturn:1 command: { getlasterror: 1 } reslen:99 404ms
Thu Apr 7 17:26:22 [conn8541] query admin.$cmd ntoreturn:1 command: { getlasterror: 1 } reslen:99 451ms
Thu Apr 7 17:26:22 [conn8539] query admin.$cmd ntoreturn:1 command: { getlasterror: 1 } reslen:99 431ms
Thu Apr 7 17:26:22 [conn8540] query admin.$cmd ntoreturn:1 command: { getlasterror: 1 } reslen:99 431ms
Thu Apr 7 17:26:22 [conn3835] insert socialmedia.userEventsJournal 878ms
Thu Apr 7 17:26:27 [conn3835] insert socialmedia.userEventsJournal 5158ms
Thu Apr 7 17:26:29 [conn2527] insert socialmedia.userEventsJournal 2122ms
Thu Apr 7 17:26:30 [conn3436] insert socialmedia.userEventsJournal 1118ms
24787900/77278058 32%
Thu Apr 7 17:26:33 [conn8565] query admin.$cmd ntoreturn:1 command: { serverStatus: 1 } reslen:1274 2927ms
Thu Apr 7 17:26:34 [conn2099] insert socialmedia.userEventsJournal 1031ms
Thu Apr 7 17:26:34 [conn8566] query admin.$cmd ntoreturn:1 command: { serverStatus: 1.0 } reslen:1274 1033ms

  • Was the shutdown "clean", did it re-start correctly?

No, the old lockfile was left behind.

  • Are there any associated hardware messages?

Well, I don't see anything strange in dmesg around the time of the crash. I do see quite a few of these from earlier though:

[2719077.488378] INFO: task mongod:16140 blocked for more than 120 seconds.
[2719077.488388] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2719077.488393] mongod D ffff880885106898 0 16140 1 0x00000000
[2719077.488397] ffff88087834bdd8 0000000000000286 ffff88087834bd28 ffff88087834bd58
[2719077.488400] ffff88087834be50 ffff88087834bda0 ffff8808859ae9f0 ffff88087834bfd8
[2719077.488402] ffff8808859ae640 ffff8808859ae640 ffff8808859ae640 ffff88087834bfd8
[2719077.488404] Call Trace:
[2719077.488413] [<ffffffff8120f44d>] jbd2_log_wait_commit+0xcd/0x160
[2719077.488418] [<ffffffff81059fb0>] ? autoremove_wake_function+0x0/0x40
[2719077.488420] [<ffffffff8120f511>] ? jbd2_log_start_commit+0x31/0x60
[2719077.488424] [<ffffffff811c6fb4>] ext4_sync_file+0x134/0x2e0
[2719077.488428] [<ffffffff811157d9>] vfs_fsync_range+0x99/0xd0
[2719077.488430] [<ffffffff81115878>] vfs_fsync+0x18/0x20
[2719077.488434] [<ffffffff810d6ef9>] sys_msync+0x149/0x1d0
[2719077.488438] [<ffffffff81009ba8>] system_call_fastpath+0x16/0x1b
[2719077.488440] [<ffffffff81009b40>] ? system_call+0x0/0x52

I don't think they're related though, based on the timestamp I see here:

# cat /proc/uptime
4298798.63 17002923.91

These messages are almost definitely from when I upgraded to 1.8.1 yesterday. For some reason, upgrading MongoDB causes the whole machine to hang for several minutes and shoots the load average to 100+, but I think this is an entirely different issue...
