[SERVER-21105] Active find/getmore commands segfault when repl PV changes from 1->0 Created: 23/Oct/15  Updated: 27/Oct/15  Resolved: 26/Oct/15

Status: Closed
Project: Core Server
Component/s: Querying, Replication
Affects Version/s: 3.2.0-rc0
Fix Version/s: 3.2.0-rc1

Type: Bug Priority: Critical - P2
Reporter: Timothy Olsen (Inactive) Assignee: Scott Hernandez (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: primary.log, secondary1.log, secondary2.log (text files)
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:
  1. Bring up a 3-node 3.2.0-rc1-pre- replica set on Mac OS X
  2. Reconfigure the replica set with protocolVersion=0 (a minimal shell sketch follows this list)
  3. The primary will crash
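
For reference, a minimal shell sketch of step 2, assuming the 3-node set from step 1 is up and the shell is connected to the primary (the full session is in the comments below):

var cfg = rs.conf();
cfg.protocolVersion = 0;        // downgrade from PV1 to PV0
cfg.version = cfg.version + 1;  // a reconfig needs a bumped config version
rs.reconfig(cfg);               // step 3: the primary crashes shortly after this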
Participants:

Description

Reconfiguring a MongoDB 3.2 replica set with protocolVersion = 0 crashes the primary on Mac OS X. This does not happen on Linux, and it does not happen with a 1-member replica set; it does happen with a 3-member replica set.

Logs for all 3 members attached.

This was with 3.2.0-rc1-pre-, commit dbbc9a2e3d8c4d7fe1748fa980ba7d01b9489dbe.



Comments
Comment by Githook User [ 26/Oct/15 ]

Author: Siyuan Zhou (visualzhou) <siyuan.zhou@mongodb.com>

Message: SERVER-21105 Fix potential segfault after downgrade to pv0.
Branch: master
https://github.com/mongodb/mongo/commit/649fab49c27ce39c1f06505c861c84cb4e493649

Comment by Scott Hernandez (Inactive) [ 24/Oct/15 ]

This is triggered when a getmore (from a secondary doing replication) is active and the protocol version changes out from under it. The replication source (the primary in this case) tries to update the term even though the protocol version is now 0, because the client is still sending a term, one that was valid at the start of the operation but not at the end.
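
To make the race concrete, this is roughly the shape of the internal oplog getmore a PV1 secondary sends; the field names follow the 3.2 replication protocol, but the values here are placeholders:

// Illustrative only; cursorId stands in for a cursor opened on the oplog
// while the set was still running protocol version 1.
var cursorId = NumberLong("123456789");
db.getSiblingDB("local").runCommand({
    getMore: cursorId,
    collection: "oplog.rs",
    maxTimeMS: 5000,
    term: NumberLong(1)  // valid when the cursor was opened, stale once the set is PV0
});

If the reconfig to protocolVersion 0 lands between the start and end of that operation, the primary tries to process a term it no longer tracks, which matches the crash described above.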

Comment by Scott Hernandez (Inactive) [ 23/Oct/15 ]

From the logs this looks like a problem for all platforms, so I'll take a look at it this weekend to get a general repro jstest and see whether it triggers on linux/win64.
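
In case it helps, here is a sketch of what such a jstest might look like. ReplSetTest is the standard jstest helper; whether a bare reconfig reliably catches an in-flight getmore, or the test needs a concurrent write workload, is an assumption to verify:

// Hypothetical repro sketch, not a confirmed test.
var rst = new ReplSetTest({nodes: 3});
rst.startSet();
rst.initiate();

var primary = rst.getPrimary();

// Generate writes so replication getmores are likely in flight during the reconfig.
var coll = primary.getDB("test").repro;
for (var i = 0; i < 1000; i++) {
    coll.insert({_id: i});
}

// Downgrade the set to protocol version 0 out from under the active getmores.
var cfg = rst.getReplSetConfigFromNode();
cfg.protocolVersion = 0;
cfg.version = cfg.version + 1;
assert.commandWorked(primary.adminCommand({replSetReconfig: cfg}));

// Before the fix, the primary could segfault around here.
rst.awaitReplication();
rst.stopSet();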

Comment by Timothy Olsen (Inactive) [ 23/Oct/15 ]

It has reproduced every time I've tried it.

Comment by Timothy Olsen (Inactive) [ 23/Oct/15 ]

Here is my shell session:

[tim@neurofunk dbs]$ /tmp/mms-automation//test/versions/mongodb-osx-x86_64-3.2.0-rc1-pre-/bin/mongo
MongoDB shell version: 3.2.0-rc1-pre-
connecting to: test
Server has startup warnings: 
2015-10-23T15:21:08.222-0400 I CONTROL  [initandlisten] 
2015-10-23T15:21:08.222-0400 I CONTROL  [initandlisten] ** WARNING: soft rlimits too low. Number of files is 256, should be at least 1000
> rs.initiate()
{
	"info2" : "no configuration specified. Using a default configuration for the set",
	"me" : "neurofunk.local:27017",
	"ok" : 1
}
rs0:SECONDARY> 
rs0:PRIMARY> rs.add("neurofunk.local:27018")
{ "ok" : 1 }
rs0:PRIMARY> rs.add("neurofunk.local:27019")
{ "ok" : 1 }
rs0:PRIMARY> cfg = rs.conf()
{
	"_id" : "rs0",
	"version" : 3,
	"protocolVersion" : NumberLong(1),
	"members" : [
		{
			"_id" : 0,
			"host" : "neurofunk.local:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 1,
			"host" : "neurofunk.local:27018",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 2,
			"host" : "neurofunk.local:27019",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		}
	],
	"settings" : {
		"chainingAllowed" : true,
		"heartbeatIntervalMillis" : 2000,
		"heartbeatTimeoutSecs" : 10,
		"electionTimeoutMillis" : 5000,
		"getLastErrorModes" : {
			
		},
		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		}
	}
}
rs0:PRIMARY> cfg["protocolVersion"] = 0
0
rs0:PRIMARY> cfg["version"] = 4
4
rs0:PRIMARY> rs.reconfig(cfg)
{ "ok" : 1 }
rs0:PRIMARY> 

I had no other connections open to the replica set or any of its members at the time.

Comment by Scott Hernandez (Inactive) [ 23/Oct/15 ]

Can you include your (shell) script which repro'd this? It looks like a getmore caused the crash, maybe while the reconfig was executing. Could there have been any other client traffic/ops going on at the same time?
