[SERVER-19322] Segmentation fault during replication in a sharded replicated mongodb environment Created: 07/Jul/15  Updated: 13/Jul/15  Resolved: 13/Jul/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.3
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Praveen Akinapally Assignee: Sam Kleinman (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-18460 Segfault during eviction under load Closed
Operating System: ALL
Steps To Reproduce:

Deployment Environment:
=======================
Ubuntu 14.04.
Sharded Replica Set Deployment.

Start with a primary sharded mongodb server, mongodb1 (priority set to 2), that holds the sharded data. Add replica set members mongodb0 (priority set to 3) and mongodb2 (priority set to 1). Data starts syncing to replica set members mongodb0 and mongodb2. As soon as mongodb0 transitions from STARTUP2 to PRIMARY, the mongod process on mongodb2 fails with the stack trace shown in the Description.

The mongod instance running on mongodb0 transitions to primary:
2015-07-07T01:42:11.940+0000 I REPL [ReplicationExecutor] transition to PRIMARY

The mongod instance running on mongodb2 fails with a segmentation fault:
2015-07-07T01:42:21.012+0000 F - Invalid access at address: 0xc8
2015-07-07T01:42:21.055+0000 F - Got signal: 11 (Segmentation fault).

This process ran fine until the WiredTiger upgrade.


 Description   

We have three mongodb servers: mongodb0 (priority=3), mongodb1 (priority=2), and mongodb2 (priority=1). mongodb1 holds the full data set, and we start syncing data to the empty mongodb0 and mongodb2 machines by adding them as replica set members.
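
For reference, the member layout described above can be sketched as a replica set configuration document. This is only an illustration: the set name "rs0" and port 27017 are assumptions and do not come from this report, and both helper functions are hypothetical. The `_id`, `version`, `members`, `host` and `priority` fields follow the usual replica set configuration shape.

```python
# Sketch only: "rs0" and port 27017 are assumed values, not taken from the report.
def build_replset_config(set_name, priorities, port=27017):
    """Build a replica set configuration document from a host -> priority map."""
    members = [
        {"_id": i, "host": f"{host}:{port}", "priority": priority}
        for i, (host, priority) in enumerate(sorted(priorities.items()))
    ]
    return {"_id": set_name, "version": 1, "members": members}


def preferred_primary(config):
    """Return the host of the member with the highest priority."""
    return max(config["members"], key=lambda m: m["priority"])["host"]


config = build_replset_config("rs0", {"mongodb0": 3, "mongodb1": 2, "mongodb2": 1})
print(preferred_primary(config))
```

With these priorities mongodb0 is the preferred primary once it has caught up, which is consistent with the transition to PRIMARY seen in the log.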

As soon as the mongodb0 sync completes successfully and mongodb0 transitions to primary, the mongodb2 sync stops and fails with a segmentation fault. This process ran fine until the WiredTiger upgrade.

Error Stack Trace:
==================

2015-07-07T01:42:21.012+0000 F -        Invalid access at address: 0xc8
2015-07-07T01:42:21.055+0000 F -        Got signal: 11 (Segmentation fault).
 
0xf51949 0xf51212 0xf5156e 0x7f0f37228340 0x13652c1 0x13668b3 0x1368ae3 0x1339b74 0x133757d 0x133968c 0x7f0f37220182 0x7f0f35ce930d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B51949"},{"b":"400000","o":"B51212"},{"b":"400000","o":"B5156E"},{"b":"7F0F37218000","o":"10340"},{"b":"400000","o":"F652C1"},{"b":"400000","o":"F668B3"},{"b":"400000","o":"F68AE3"},{"b":"400000","o":"F39B74"},{"b":"400000","o":"F3757D"},{"b":"400000","o":"F3968C"},{"b":"7F0F37218000","o":"8182"},{"b":"7F0F35BEE000","o":"FB30D"}],"processInfo":{ "mongodbVersion" : "3.0.3", "gitVersion" : "b40106b36eecd1b4407eb1ad1af6bc60593c6105", "uname" : { "sysname" : "Linux", "release" : "3.13.0-29-generic", "version" : "#53-Ubuntu SMP Wed Jun 4 21:00:20 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "F56F80CB96B4DBFC070BEB0ADAC7D6B274BFC6B1" }, { "b" : "7FFF0EAFE000", "elfType" : 3, "buildId" : "3D068D088E7EAC15D9DA7C3AC912E783C0897EE7" }, { "b" : "7F0F37218000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "FE662C4D7B14EE804E0C1902FB55218A106BC5CB" }, { "b" : "7F0F36FBA000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "6C7AE380840DB9034D7763771B55E51B31BCAF14" }, { "b" : "7F0F36BE0000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "3D522D8E04F5FD7904AE69B50CA8835A71024490" }, { "b" : "7F0F369D8000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7F0F367D4000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7F0F364D0000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "19EFDDAB11B3BF5C71570078C59F91CF6592CE9E" }, { "b" : "7F0F361CA000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "574C6350381DA194C00FF555E0C1784618C05569" }, { "b" : "7F0F35FB4000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "CC0D578C2E0D86237CA7B0CE8913261C506A629A" }, { "b" : "7F0F35BEE000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "B571F83A8A6F5BB22D3558CDDDA9F943A2A67FD1" }, { "b" : "7F0F37436000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf51949]
 mongod(+0xB51212) [0xf51212]
 mongod(+0xB5156E) [0xf5156e]
 libpthread.so.0(+0x10340) [0x7f0f37228340]
 mongod(+0xF652C1) [0x13652c1]
 mongod(+0xF668B3) [0x13668b3]
 mongod(__wt_reconcile+0x1B3) [0x1368ae3]
 mongod(__wt_evict+0x104) [0x1339b74]
 mongod(__wt_evict_page+0x2D) [0x133757d]
 mongod(+0xF3968C) [0x133968c]
 libpthread.so.0(+0x8182) [0x7f0f37220182]
 libc.so.6(clone+0x6D) [0x7f0f35ce930d]
-----  END BACKTRACE  -----
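
As a reading aid for the backtrace above: each entry in the "backtrace" array appears to pair a module load base ("b") with an offset inside that module ("o"), both in hex, and their sum matches the absolute addresses printed on the line before BEGIN BACKTRACE. A minimal sketch, using four frames copied from the log; the helper name `absolute_address` is hypothetical:

```python
# Four frames copied from the "backtrace" array in the log above.
# "b" is the module load base and "o" the offset within it, both hexadecimal.
frames = [
    {"b": "400000", "o": "B51949"},        # mongod: printStackTrace
    {"b": "7F0F37218000", "o": "10340"},   # libpthread.so.0
    {"b": "400000", "o": "F68AE3"},        # mongod: __wt_reconcile+0x1B3
    {"b": "7F0F35BEE000", "o": "FB30D"},   # libc.so.6: clone+0x6D
]


def absolute_address(frame):
    """Add the load base to the offset to get the runtime address."""
    return int(frame["b"], 16) + int(frame["o"], 16)


addresses = [hex(absolute_address(f)) for f in frames]
for address in addresses:
    print(address)
```

The offsets within mongod (for example F68AE3) are the values that show up in the symbolized frames such as mongod(+0xF652C1).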



 Comments   
Comment by Praveen Akinapally [ 10/Jul/15 ]

Thanks Alexander and Sam. I upgraded my Mongo Cluster to 3.0.4 and it works well now.

Comment by Alexander Gorrod [ 08/Jul/15 ]

I agree with samk that this should be fixed in 3.0.4, but I think the fix came from a different issue, SERVER-18460.

The particular problem was a WiredTiger bug in which a page could be evicted from cache at the same time as the collection was being removed. This led to a race condition in which the page could be freed by two different threads at the same time.
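
To illustrate the class of bug described above (not the actual WiredTiger code or its fix), here is a minimal sketch of two threads racing to free the same page, with a lock-protected flag so that only one of them performs the free. The `Page` class, `free_once` method and thread names are all hypothetical.

```python
import threading


class Page:
    """Toy stand-in for a cached page; not a WiredTiger structure."""

    def __init__(self):
        self._lock = threading.Lock()
        self._freed = False
        self.free_count = 0

    def free_once(self):
        """Free the page only if no other thread has already claimed it."""
        with self._lock:
            if self._freed:
                return False      # another thread got here first
            self._freed = True    # claim the page under the lock
        self.free_count += 1      # only the claiming thread reaches this line
        return True


page = Page()
results = []


def worker():
    results.append(page.free_once())


# One thread plays the eviction path, the other the collection drop path.
threads = [threading.Thread(target=worker, name=n) for n in ("evict", "drop")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results), page.free_count)
```

Without the claim step, both threads could reach the free, which is the double-free pattern the comment describes.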

Comment by Sam Kleinman (Inactive) [ 08/Jul/15 ]

Thanks for the report, and sorry that you've hit this issue.

This looks related to SERVER-17047, which was fixed in 3.0.4.

Can you upgrade to 3.0.4 and see if this resolves your issue?

Regards,
sam

Comment by Sam Kleinman (Inactive) [ 08/Jul/15 ]

full addr2line:

 0: /data/mci/src/src/mongo/util/stacktrace_posix.cpp:105
 1: /data/mci/src/src/mongo/util/signal_handlers_synchronous.cpp:127
 2: /data/mci/src/src/mongo/util/signal_handlers_synchronous.cpp:240
 3: ??:0
 4: /data/mci/src/src/third_party/wiredtiger/src/reconcile/rec_write.c:1635
 5: /data/mci/src/src/third_party/wiredtiger/src/reconcile/rec_write.c:4272 (discriminator 3)
 6: /data/mci/src/src/third_party/wiredtiger/src/reconcile/rec_write.c:413
 7: /data/mci/src/src/third_party/wiredtiger/src/evict/evict_page.c:360
 8: /data/mci/src/src/third_party/wiredtiger/src/evict/evict_lru.c:697
 9: /data/mci/src/src/third_party/wiredtiger/src/include/btree.i:1060
10: ??:0
11: ??:0

Potentially related to SERVER-17047. cc: alexander.gorrod

Will recommend an upgrade to 3.0.4 in the meantime.
