[SERVER-12650] Segfault when replicating from 2.4.6 Created: 06/Feb/14  Updated: 11/Jul/16  Resolved: 18/Feb/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.5.5
Fix Version/s: 2.6.0-rc0

Type: Bug Priority: Major - P3
Reporter: Bob Potter Assignee: Matt Dannenberg
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Set up a 2.5.5 server as a replica of a 2.4.6 server.

Participants:

 Description   

I'm running 2.5.5 as a hidden replica in a replSet composed of 2.4.6 instances. About once a day the server segfaults. I've included the backtrace and some of the log prior to the segfault. Let me know if there is any additional information I can provide that would be useful.

Invalid access at address: 0 from thread: rsSyncNotifier
Got signal: 11 (Segmentation fault).
Backtrace:
0x114d711 0x751cba 0x757402 0x7f090424dcb0 0x1100c2d 0x1102086 0x1102671 0x7b20a5 0x7dc027 0x7dc
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x114d711]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x3ba) [0x751cba]
 /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x282) [0x757402]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f090424dcb0]
 /usr/bin/mongod(_ZN5mongo13MessagingPort4recvERNS_7MessageE+0x87d) [0x1100c2d]
 /usr/bin/mongod(_ZN5mongo13MessagingPort4recvERKNS_7MessageERS1_+0x36) [0x1102086]
 /usr/bin/mongod(_ZN5mongo13MessagingPort4callERNS_7MessageES2_+0x31) [0x1102671]
 /usr/bin/mongod(_ZN5mongo18DBClientConnection4callERNS_7MessageES2_bPSs+0x55) [0x7b20a5]
 /usr/bin/mongod(_ZN5mongo14DBClientCursor11requestMoreEv+0x397) [0x7dc027]
 /usr/bin/mongod(_ZN5mongo14DBClientCursor4moreEv+0x4d) [0x7dc4ed]
 /usr/bin/mongod(_ZN5mongo7replset14BackgroundSync9markOplogEv+0xb3e) [0xd8d00e]
 /usr/bin/mongod(_ZN5mongo7replset14BackgroundSync14notifierThreadEv+0x15e) [0xd8f8ce]
 /usr/bin/mongod() [0x1192459]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f0904245e9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f09035583fd]

There were also some interesting logs prior to the segfault:

2014-02-06T13:34:45.883-0500 [rsBackgroundSync] Socket recv() timeout  10.145.170.96:27017
2014-02-06T13:34:45.883-0500 [rsBackgroundSync] SocketException: remote: 10.145.170.96:27017 error: 9001 socket exception [RECV_TIMEOUT] server [10.145.170.96:27017] 
2014-02-06T13:34:45.883-0500 [rsBackgroundSync] DBClientCursor::init call() failed
2014-02-06T13:35:08.642-0500 [rsBackgroundSync] upstream updater is unsupported on this version
2014-02-06T13:35:08.911-0500 [rsSyncNotifier] replset setting oplog notifier to [server]:27017
2014-02-06T13:35:08.914-0500 [rsSyncNotifier] upstream updater is unsupported on this version
2014-02-06T13:35:08.922-0500 [rsBackgroundSync] upstream updater is unsupported on this version
2014-02-06T13:35:08.995-0500 [rsBackgroundSync] upstream updater is unsupported on this version
2014-02-06T13:35:09.174-0500 [rsBackgroundSync] upstream updater is unsupported on this version
2014-02-06T13:35:09.219-0500 [rsBackgroundSync] upstream updater is unsupported on this version
2014-02-06T13:35:09.240-0500 [rsBackgroundSync] upstream updater is unsupported on this version
2014-02-06T13:35:09.250-0500 [rsBackgroundSync] upstream updater is unsupported on this version
2014-02-06T13:35:09.259-0500 [rsBackgroundSync] upstream updater is unsupported on this version
2014-02-06T13:35:09.269-0500 [rsBackgroundSync] upstream updater is unsupported on this version
2014-02-06T13:35:09.294-0500 [rsBackgroundSync] upstream updater is unsupported on this version
2014-02-06T13:35:09.307-0500 [rsBackgroundSync] upstream updater is unsupported on this version
2014-02-06T13:35:09.322-0500 [rsBackgroundSync] upstream updater is unsupported on this version
2014-02-06T13:35:09.334-0500 [rsBackgroundSync] upstream updater is unsupported on this version
2014-02-06T13:35:09.344-0500 [rsBackgroundSync] upstream updater is unsupported on this version
2014-02-06T13:35:09.391-0500 [rsBackgroundSync] upstream updater is unsupported on this version



 Comments   
Comment by Githook User [ 20/Feb/14 ]

Author: matt dannenberg (dannenberg) <matt.dannenberg@10gen.com>

Message: SERVER-12650 locking to protect the fallback oplogreader
Branch: master
https://github.com/mongodb/mongo/commit/3172b29a062d7480ab0b7e0b9a5f13cb271415c1
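
The commit message above describes locking to protect the fallback oplogreader. Below is a minimal sketch of that kind of guard, using hypothetical names (OplogReader, GuardedReader, readMore) rather than the actual mongod change; the point is only that the notifier thread and the background-sync thread can never touch the shared connection at the same time.

#include <iostream>
#include <mutex>
#include <thread>

// Stand-in for the real reader/connection object.
struct OplogReader {
    bool connected = true;
    bool more() { return connected; }             // pretend network read
    void resetConnection() { connected = false; }
};

// Wrapper that serializes every use of the shared reader.
class GuardedReader {
    OplogReader _reader;   // shared between threads
    std::mutex _connmtx;   // the guard the fix adds

public:
    bool readMore() {
        std::lock_guard<std::mutex> lk(_connmtx);   // hold while reading
        return _reader.more();
    }
    void resetConnection() {
        std::lock_guard<std::mutex> lk(_connmtx);   // never reset mid-read
        _reader.resetConnection();
    }
};

int main() {
    GuardedReader reader;
    // Rough analogue of rsSyncNotifier reading while rsBackgroundSync resets.
    std::thread notifier([&] { for (int i = 0; i < 1000; ++i) reader.readMore(); });
    std::thread bgsync([&] { reader.resetConnection(); });
    notifier.join();
    bgsync.join();
    std::cout << "no concurrent use of the shared connection\n";
}

Built with something like g++ -std=c++11 -pthread, the two threads above serialize on _connmtx, which is the kind of exclusion the crash inside MessagingPort::recv suggests was missing.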

Comment by Githook User [ 20/Feb/14 ]

Author: matt dannenberg (dannenberg) <matt.dannenberg@10gen.com>

Message: Revert "SERVER-12650 make syncSourceFeedback's oplogreader effectively single threaded"

This reverts commit 617b4896241649e2a3203e33e5901907790002b5.
Branch: master
https://github.com/mongodb/mongo/commit/b6a6d3a7231291aaa748523033b46158ba2332c5

Comment by Githook User [ 18/Feb/14 ]

Author: Matt Dannenberg (dannenberg) <matt.dannenberg@10gen.com>

Message: SERVER-12650 make syncSourceFeedback's oplogreader effectively single threaded
Branch: master
https://github.com/mongodb/mongo/commit/617b4896241649e2a3203e33e5901907790002b5
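
This earlier commit (later reverted in favor of the locking approach above) made syncSourceFeedback's oplogreader "effectively single threaded". One way such a confinement pattern can look, again with hypothetical names and not the actual mongod code, is a worker thread that owns the reader while other threads only enqueue requests:

#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>

// One worker thread owns the reader; other threads only enqueue requests.
class ReaderWorker {
    std::queue<std::function<void()>> _work;
    std::mutex _mtx;
    std::condition_variable _cv;
    bool _stop = false;
    std::thread _thread;   // declared last so it starts after the other members

public:
    ReaderWorker() : _thread([this] { run(); }) {}
    ~ReaderWorker() {
        { std::lock_guard<std::mutex> lk(_mtx); _stop = true; }
        _cv.notify_one();
        _thread.join();
    }

    // Any thread may post work, but only the worker thread ever runs it,
    // so only the worker thread ever touches the reader/connection.
    void post(std::function<void()> task) {
        { std::lock_guard<std::mutex> lk(_mtx); _work.push(std::move(task)); }
        _cv.notify_one();
    }

private:
    void run() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(_mtx);
                _cv.wait(lk, [this] { return _stop || !_work.empty(); });
                if (_work.empty()) return;   // only reached once _stop is set
                task = std::move(_work.front());
                _work.pop();
            }
            task();   // e.g. "read more oplog entries" or "reset the connection"
        }
    }
};

int main() {
    ReaderWorker worker;
    // rsSyncNotifier and rsBackgroundSync would hand their reads to the worker
    // instead of sharing the connection directly.
    worker.post([] { std::cout << "read more oplog entries\n"; });
    worker.post([] { std::cout << "reset the connection\n"; });
}

Per the revert comment above, this serialization was later replaced by the simpler mutex guard, but both approaches keep more than one thread off the connection at a time.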

Comment by Bob Potter [ 13/Feb/14 ]

This is a production environment so I'm not currently able to update the primary to 2.4.9. I've updated the 2.5.5 replica to log level 3 and can provide you with better logs if it happens again.

I saw the crash a couple times before I created the issue but haven't seen it since.

Comment by Asya Kamsky [ 09/Feb/14 ]

I've replicated this failure on my Mac (twice), running 2.5.5 against a 2.4.8 primary.
The second time I had the secondary at logLevel 3.

The stack trace looks a little different but only in the system library files...

libc++abi.dylib: Pure virtual function called!
Got signal: 6 (Abort trap: 6).
Backtrace:
0x100662deb 0x100000efb 0x7fff8c49a5aa 0 0x7fff927fabba 0x7fff8e7b7141 0x7fff8e7db617 0x10062680f 0x100626a6f 0x100626f71 0x100037e4a 0x1000380a0 0x100053c2f 0x10005253b 0x1003ff1b7 0x1003ff4e1 0x100697215 0x7fff90be2899 0x7fff90be272a 0x7fff90be6fc9
 0   mongod                              0x0000000100662deb _ZN5mongo15printStackTraceERSo + 43
 1   mongod                              0x0000000100000efb _ZN5mongo10abruptQuitEi + 251
 2   libsystem_platform.dylib            0x00007fff8c49a5aa _sigtramp + 26
 3   ???                                 0x0000000000000000 0x0 + 0
 4   libsystem_c.dylib                   0x00007fff927fabba abort + 125
 5   libc++abi.dylib                     0x00007fff8e7b7141 __cxa_bad_cast + 0
 6   libc++abi.dylib                     0x00007fff8e7db617 __cxa_deleted_virtual + 0
 7   mongod                              0x000000010062680f _ZN5mongo13MessagingPort4recvERNS_7MessageE + 1493
 8   mongod                              0x0000000100626a6f _ZN5mongo13MessagingPort4recvERKNS_7MessageERS1_ + 37
 9   mongod                              0x0000000100626f71 _ZN5mongo13MessagingPort4callERNS_7MessageES2_ + 49
 10  mongod                              0x0000000100037e4a _ZN5mongo18DBClientConnection4callERNS_7MessageES2_bPSs + 94
 11  mongod                              0x00000001000380a0 _ZThn136_N5mongo18DBClientConnection4callERNS_7MessageES2_bPSs + 16
 12  mongod                              0x0000000100053c2f _ZN5mongo14DBClientCursor11requestMoreEv + 965
 13  mongod                              0x000000010005253b _ZN5mongo14DBClientCursor4moreEv + 75
 14  mongod                              0x00000001003ff1b7 _ZN5mongo7replset14BackgroundSync9markOplogEv + 825
 15  mongod                              0x00000001003ff4e1 _ZN5mongo7replset14BackgroundSync14notifierThreadEv + 655
 16  mongod                              0x0000000100697215 thread_proxy + 229
 17  libsystem_pthread.dylib             0x00007fff90be2899 _pthread_body + 138
 18  libsystem_pthread.dylib             0x00007fff90be272a _pthread_struct_init + 0
 19  libsystem_pthread.dylib             0x00007fff90be6fc9 thread_start + 13

Logs from the 2.5.5 secondary (it was redoing the full initial sync because that is what it was in the middle of when it died the time before, when I had it at logLevel 1):

2014-02-09T00:14:42.907-0500 [rsSync] replSet initial sync cloning indexes for : exxx
2014-02-09T00:14:42.907-0500 [rsSync] 		cloning collection exxx.system.indexes to exxx.system.indexes on asyasmacbook.local:40001 with filter { name: { $ne: "_id_" }, ns: { $nin: [] } }
2014-02-09T00:14:42.915-0500 [rsSync] Database::_addNamespaceToCatalog ns: exxx.profs.$profile_1_dims.n_1_dims.v_1
2014-02-09T00:14:42.915-0500 [rsSync] build index on: exxx.profs properties: { v: 1, key: { profile: 1.0, dims.n: 1.0, dims.v: 1.0 }, name: "profile_1_dims.n_1_dims.v_1", ns: "exxx.profs" }
2014-02-09T00:14:42.915-0500 [rsSync] exxx.profs: clearing plan cache - collection info cache reset
2014-02-09T00:14:42.915-0500 [rsSync] allocating new extent
2014-02-09T00:14:42.916-0500 [rsSync] ExtentManager::increaseStorageSize ns:exxx.profs.$profile_1_dims.n_1_dims.v_1 desiredSize:131072 fromFreeList: 0 eloc: 2:6e881000
2014-02-09T00:14:42.916-0500 [rsSync] _reuse extent was: now:exxx.profs.$profile_1_dims.n_1_dims.v_1
2014-02-09T00:14:42.918-0500 [rsSync] 	 building index using bulk method
2014-02-09T00:14:42.927-0500 [rsSync] 	 bulk commit starting
2014-02-09T00:14:42.927-0500 [rsSync] exxx.profs: clearing plan cache - index { profile: 1.0, dims.n: 1.0, dims.v: 1.0 } set to multi key.

For completeness, the log from the first crash ended with:

2014-02-08T23:18:34.319-0500 [rsSync] replSet initial sync cloning indexes for : exxx
2014-02-08T23:18:34.327-0500 [rsSync] Database::_addNamespaceToCatalog ns: exxx.profs.$profile_1_dims.n_1_dims.v_1
2014-02-08T23:18:34.328-0500 [rsSync] build index on: exxx.profs properties: { v: 1, key: { profile: 1.0, dims.n: 1.0, dims.v: 1.0 }, name: "profile_1_dims.n_1_dims.v_1", ns: "exxx.profs" }
2014-02-08T23:18:34.328-0500 [rsSync] exxx.profs: clearing plan cache - collection info cache reset
2014-02-08T23:18:34.329-0500 [rsSync] allocating new extent
2014-02-08T23:18:34.330-0500 [rsSync] ExtentManager::increaseStorageSize ns:exxx.profs.$profile_1_dims.n_1_dims.v_1 desiredSize:131072 fromFreeList: 0 eloc: 2:6e881000

Comment by Matt Dannenberg [ 07/Feb/14 ]

We did some preliminary research and are struggling to find the code path it's taking. If you could post longer logs from the crashing node with log verbosity set to 1 or greater, the extra output would give us clues as to what's happening here.

Comment by Scott Hernandez (Inactive) [ 06/Feb/14 ]

Can you please upload the full logs from the node that is crashing (including a few minutes before the crash), as well as from the primary at the time of the crash? If this happens regularly, it would also be helpful if you could increase the verbosity of the logs (to level 3) before it happens: http://docs.mongodb.org/manual/reference/parameters/#param.logLevel

Are you testing with the 2.5.5 development release in your production environment, or is this a test environment? If it is a test environment, can you please upgrade to the latest 2.4.x release (currently 2.4.9) to see if the problem still happens?

It is important that we be able to see which members are communicating with which other members, so if you do replace any hostnames in the logs, please be consistent with the replacement so we can still track the relationships. If you want to provide the logs privately, let us know and we can create a private issue (which will not be public) where you can post the logs.
