[SERVER-32677] Segmentation fault converting ReplicaSet to Replicated Shard Cluster Created: 12/Jan/18  Updated: 30/Oct/23  Resolved: 20/Apr/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.1
Fix Version/s: 3.6.4, 3.7.6

Type: Bug Priority: Major - P3
Reporter: Gianluca De Cicco Assignee: Blake Oler
Resolution: Fixed Votes: 0
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Docker container mongo:3.6.1 (Debian jessie)

https://github.com/docker-library/mongo/blob/657b1a53a9680b972a6344f3d958a17775dd8719/3.6/Dockerfile


Attachments: File arbitrer.conf     File mongo_primary.log     File mongo_secondary.log     File mongod.conf    
Issue Links:
Backports
Depends
depends on SERVER-29908 Libraries db/s/sharding and db/query/... Closed
is depended on by TOOLS-2011 Restore sharded cluster testing after... Closed
Duplicate
is duplicated by SERVER-32921 Invalid access at address: 0x18 Closed
is duplicated by SERVER-33385 MongoDB 3.6 crashes on Ubuntu 16.04 A... Closed
is duplicated by SERVER-34206 All replica nodes crash Closed
is duplicated by SERVER-34530 Shard server crashes after access vio... Closed
Problem/Incident
causes SERVER-33376 MongoDB 3.6.3-rc0 config server segfa... Closed
causes SERVER-34746 Segmentation fault when shard is star... Closed
Related
related to SERVER-71106 Access to Grid members should be prot... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Start 2 data nodes with (config attached):

$ mongod --config /data/mongod.conf --replSet rs
Start the arbiter with:
$ mongod --config /data/arbitrer.conf --replSet rs

Connect to one data node and run the replica set initialization:

> rs.initiate({ _id: "rs", members: [{ _id: 1, host: "mongo_replica1:27017" }, { _id: 2, host: "mongo_replica2:27017" }], settings: { getLastErrorDefaults: { w: "majority", wtimeout: 30000 }}})

Connect to the replica set "rs/mongo_replica1:27017,mongo_replica2:27017" to add the arbiter:

> rs.addArb("mongo_arbitrer:27017")

Now follow the steps at https://docs.mongodb.com/manual/tutorial/convert-replica-set-to-replicated-shard-cluster/#restart-the-replica-set-as-a-shard

Stop the secondary and run:

$ mongod --config /data/mongod.conf --shardsvr --replSet rs
Stop the arbiter and run:
$ mongod --config /data/arbitrer.conf --shardsvr --replSet rs

Connect to the primary and step down:

> rs.stepDown()

Restart the old primary with:

$ mongod --config /data/mongod.conf --shardsvr --replSet rs

Everything reconnects. After some minutes of idle time (around 5; it is cyclic), one data node receives SIGSEGV, and in cascade the other data node (but not the arbiter) receives the same SIGSEGV.

Sprint: Sharding 2018-02-12, Sharding 2018-02-26, Sharding 2018-03-12, Sharding 2018-03-26, Sharding 2018-04-23
Participants:
Case:

 Description   

The starting point is 3 nodes: 2 data-bearing nodes and 1 arbiter. All nodes are started without the --shardsvr flag. Once the replica set is initialized (initialization + addArb), it cannot be converted to a replicated shard.

Once the nodes are restarted with the flags --replSet --shardsvr, after a while they access a bad memory address (Invalid access at address: 0x18) and receive the signal SIGSEGV.

If restarted again, they continue to receive the same signal after some idle time (roughly 5 minutes).



 Comments   
Comment by Githook User [ 01/May/18 ]

Author:

{'email': 'blake.oler@mongodb.com', 'name': 'Blake Oler', 'username': 'BlakeIsBlake'}

Message: SERVER-32677 Prevent sessions periodic refresh from prematurely accessing sharding internals

(cherry picked from commit 60cb34ea7351d25b0eb6bee947d21ada09cf438b)
Branch: v3.6
https://github.com/mongodb/mongo/commit/9e4b78f198fa6a0bca75fd1012d8437d98a3c825

Comment by Githook User [ 01/May/18 ]

Author:

{'email': 'blake.oler@mongodb.com', 'name': 'Blake Oler', 'username': 'BlakeIsBlake'}

Message: Revert "SERVER-32677 Fix segmentation fault when converting a replica set to a replicated sharded cluster"

This reverts commit 424111b2a3f4c30b7e637f4eadda6a18df9bf065.
Branch: v3.6
https://github.com/mongodb/mongo/commit/3df74e4206d910946abf9f1d7e6c281de329bf5e

Comment by Kaloian Manassiev [ 01/May/18 ]

On the 3.6 branch: https://github.com/mongodb/mongo/commit/9e4b78f198fa6a0bca75fd1012d8437d98a3c825

Comment by Githook User [ 20/Apr/18 ]

Author:

{'email': 'blake.oler@mongodb.com', 'username': 'BlakeIsBlake', 'name': 'Blake Oler'}

Message: SERVER-32677 Prevent sessions periodic refresh from prematurely accessing sharding internals
Branch: master
https://github.com/mongodb/mongo/commit/60cb34ea7351d25b0eb6bee947d21ada09cf438b

Comment by Gregory McKeon (Inactive) [ 22/Mar/18 ]

Clarifying the confusing state of this ticket:

The fix for the 3.6 branch has been completed, and is in the 3.6.4 release.

The fix in master is currently blocked, and this ticket will remain unresolved until the fix in master is pushed.

Comment by Githook User [ 28/Feb/18 ]

Author:

{'email': 'blake.oler@mongodb.com', 'name': 'Blake Oler', 'username': 'BlakeIsBlake'}

Message: SERVER-32677 Fix segmentation fault when converting a replica set to a replicated sharded cluster
Branch: v3.6
https://github.com/mongodb/mongo/commit/424111b2a3f4c30b7e637f4eadda6a18df9bf065

Comment by Githook User [ 16/Feb/18 ]

Author:

{'email': 'blake.oler@mongodb.com', 'name': 'Blake Oler', 'username': 'BlakeIsBlake'}

Message: Revert "SERVER-32677 Fix segmentation fault when converting a replica set to a replicated sharded cluster"

This reverts commit cad0d35091f98b5c2bb37765861841844bd9e16d.
Branch: v3.6
https://github.com/mongodb/mongo/commit/9586e557d54ef70f9ca4b43c26892cd55257e1a5

Comment by Githook User [ 06/Feb/18 ]

Author:

{'email': 'blake.oler@mongodb.com', 'name': 'Blake Oler', 'username': 'BlakeIsBlake'}

Message: SERVER-32677 Fix segmentation fault when converting a replica set to a replicated sharded cluster
Branch: v3.6
https://github.com/mongodb/mongo/commit/cad0d35091f98b5c2bb37765861841844bd9e16d

Comment by Kaloian Manassiev [ 24/Jan/18 ]

The problem occurs when the logical session cache refresh happens to run before sharding initialization has completed. I think the fix should be to add a check that ShardingState::get(opCtx)->enabled() returns true before attempting to reference the catalog cache here.

As part of this fix we should see whether there are other places that may be accessing sharding infrastructure that is initialized late.
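
A minimal sketch of the guard proposed above. The classes below are hypothetical stand-ins that only borrow the names ShardingState and CatalogCache; they are not the server source, and refreshSessions, initialize and RefreshResult are made up for illustration. The point is only the ordering: check enabled() first, and skip the periodic refresh when sharding has not finished initializing, instead of dereferencing a pointer that is still null.

```cpp
#include <atomic>
#include <cassert>
#include <memory>

// Hypothetical stand-ins for illustration only; NOT the real MongoDB classes.

// Models the catalog cache, which exists only after sharding initialization.
class CatalogCache {
public:
    bool isSessionsCollectionSharded() const { return _sharded; }

private:
    bool _sharded = true;
};

// Models the sharding state of a node started with --shardsvr.
class ShardingState {
public:
    // True only once initialize() has finished.
    bool enabled() const { return _enabled.load(); }

    // Assumed to run late in startup, after periodic jobs may already be running.
    void initialize() {
        _catalogCache = std::make_unique<CatalogCache>();
        _enabled.store(true);  // published only after the cache exists
    }

    // Returns nullptr until initialize() has run.
    const CatalogCache* catalogCache() const { return _catalogCache.get(); }

private:
    std::atomic<bool> _enabled{false};
    std::unique_ptr<CatalogCache> _catalogCache;
};

enum class RefreshResult { kSkippedShardingNotReady, kRefreshed };

// Body of a periodic session refresh (hypothetical name).
RefreshResult refreshSessions(const ShardingState& state) {
    // The guard: without it, the dereference below would go through a null
    // pointer whenever the refresh fires before initialization completes.
    if (!state.enabled()) {
        return RefreshResult::kSkippedShardingNotReady;
    }

    const CatalogCache* cache = state.catalogCache();
    assert(cache != nullptr);  // holds because enabled() was true
    (void)cache->isSessionsCollectionSharded();
    return RefreshResult::kRefreshed;
}
```

In this model, a refresh that runs before initialize() returns kSkippedShardingNotReady and simply tries again on its next period; once initialize() has run, the same call returns kRefreshed.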

Comment by Mark Agarunov [ 12/Jan/18 ]

Hello gdecicco,

Thank you for the report. I've set the fixVersion to "Needs Triage" for this new feature to be scheduled against our currently planned work. Updates will be posted on this ticket as they happen.

Thanks,
Mark

Comment by Gianluca De Cicco [ 12/Jan/18 ]

I cannot edit the description, there is a typo:
"Once you restart the node with the flag --replSet --shardsvr"

Comment by Gianluca De Cicco [ 12/Jan/18 ]

Addendum: if the --shardsvr flag is removed, the replica set stops receiving SIGSEGV.

Generated at Thu Feb 08 04:30:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.