[SERVER-16073] Allow disabling SSL Ciphers via hidden flag: sslCipherConfig Created: 11/Nov/14  Updated: 06/Apr/17  Resolved: 13/Mar/15

Status: Closed
Project: Core Server
Component/s: Networking, Security
Affects Version/s: 2.6.4, 2.6.5
Fix Version/s: 2.6.9, 3.0.3, 3.1.0

Type: Bug Priority: Major - P3
Reporter: Nemanja Dubravac Assignee: Andreas Nilsson
Resolution: Done Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 6.5


Attachments: Text File mongod.log    
Issue Links:
Depends
Related
Backwards Compatibility: Fully Compatible
Operating System: Linux
Backport Completed:
Participants:
Case:

 Description   
Issue Status as of Mar 17, 2015

ISSUE SUMMARY
In some distributions, versions of OpenSSL, including the current OpenSSL packages for Red Hat Enterprise Linux and CentOS, use a faulty version of the AES-GCM cipher, that causes a crash after transmitting 64GB of data.

USER IMPACT
The instability created by issue makes it difficult to complete initial sync for replica set members and will cause frequent crashes.

WORKAROUNDS

  1. Upgrade to an upstream release of OpenSSL, avoiding the version provided by the package manager.
  2. Downgrade to an older version of OpenSSL.
  3. Use the new flag net.ssl.sslCipherConfig in MongoDB to configure the the OpenSSL cipher suites directly and disable AES-GCM. NOTE: configuring the cipher suites for OpenSSL should be done with care to avoid enabling unsafe ciphers.

AFFECTED VERSIONS
MongoDB 2.6 and 3.0 that use an affected OpenSSL library.

FIX VERSION
The fix is included in the 2.6.9 and 3.0.2 production releases.

RESOLUTION DETAILS
Implemented a configuration option to prevent mongod from using the problematic cipher, to provide an additional option for users who cannot change their OpenSSL version.

Original description

we're using mongodb (tried both 2.6.4 & 2.6.5) compiled with ssl on our own, and we're having problems recovering servers when they have to do a complete (initial) sync

2014-11-11T12:34:38.769+0100 [rsSync] ERROR: SSL: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac
2014-11-11T12:34:38.782+0100 [rsSync] SocketException: remote: 148.251.189.232:27018 error: 9001 socket exception [CONNECT_ERROR] 
2014-11-11T12:34:38.818+0100 [rsSync] ERROR: SSL: error:140D00CF:SSL routines:SSL_write:protocol is shutdown
2014-11-11T12:34:38.818+0100 [rsSync] caught exception (socket exception [CONNECT_ERROR] for ) in destructor (~PiggyBackData)
2014-11-11T12:34:38.819+0100 [rsSync] replSet initial sync exception: 16465 recv failed while exhausting cursor 9 attempts remaining

after it fails multiple times, it finally gives up the initial sync and shuts down
what's interesting is that it always fails at the approximately same position (number of cloned objects), which is around 290K..

we could reproduce the same behavior on another server in a different shard of the same cluster

we're using openssl 1.0.1e 30.el6_6.4 on our mongod instances and the same version was used while compiling mongodb
as far as i understand, mongodb is using a dynamically linked library, so it actually shouldn't matter with which version it was compiled

we did find a bug in openssl which would output the same error message, but this bug was fixed in June 2014, and the version of openssl we're using already has the fix applied

did anyone else have this problem occur to them? is it an openssl bug, or mongodb bug?

i've attached a log file of one of the affected servers, with debug mode enabled



 Comments   
Comment by Githook User [ 17/Apr/15 ]

Author:

{u'username': u'spencerjackson', u'name': u'Spencer Jackson', u'email': u'spencer.jackson@mongodb.com'}

Message: SERVER-16073: Allow overrides to OpenSSL ciphers

(cherry picked from commit 83ae47b5780cddca30fd09b40fa4d897895a595f)
Branch: v3.0
https://github.com/mongodb/mongo/commit/4975cdb4ce711e18d16f737dabb1bac488b3456e

Comment by Githook User [ 13/Mar/15 ]

Author:

{u'username': u'spencerjackson', u'name': u'Spencer Jackson', u'email': u'spencer.jackson@mongodb.com'}

Message: SERVER-16073: Allow overrides to OpenSSL ciphers
Branch: v2.6
https://github.com/mongodb/mongo/commit/a5a48528b5349cc9af56b784f4fd583dcfb11e32

Comment by Githook User [ 13/Mar/15 ]

Author:

{u'username': u'spencerjackson', u'name': u'Spencer Jackson', u'email': u'spencer.jackson@mongodb.com'}

Message: SERVER-16073: Allow overrides to OpenSSL ciphers
Branch: master
https://github.com/mongodb/mongo/commit/83ae47b5780cddca30fd09b40fa4d897895a595f

Comment by Ramon Fernandez Marina [ 02/Mar/15 ]

james.wahlin@10gen.com tells me peter.garafano has a reproducer for this ticket, so I'm reopening it.

Comment by Ramon Fernandez Marina [ 02/Mar/15 ]

pcmaniac, we haven't heard back from you for a while so I'm resolving this ticket. If you find more time to investigate and have more information for us please feel free to re-open the ticket.

Regards,
Ramón.

Comment by Andreas Nilsson [ 07/Jan/15 ]

Hi pcmaniac did you find out anything more?

Comment by Nemanja Dubravac [ 15/Dec/14 ]

Ljuba tested with 2.8.0-rc0, but now when i check this version's changelog (https://jira.mongodb.org/issues/?jql=project%20%3D%20SERVER%20AND%20fixVersion%20%3D%20%222.8.0-rc0%22%20ORDER%20BY%20updated%20DESC%2C%20priority%20DESC%2C%20created%20ASC), i don't see that SERVER-15673 fix is included

you're using versions >= 2.6.6, in which SSLv3 is disabled (SERVER-15673), so I'm not sure it's a good test of the problem we had
when we find time (hopefully until the end of this week), we will try out 2.6.6 in our setup and see if the problem is resolved by this

Comment by Andreas Nilsson [ 09/Dec/14 ]

Just a short update. We have set up several initial sync sessions over SSL with > 100GB without being able to regenerate the crash. Now we are trying to exactly mimic your setup in order to reproduce.

Comment by Andreas Nilsson [ 01/Dec/14 ]

Thank you ljuba.ned@gmail.com. We will try to mimic your environment and get back to you with the results.

Comment by Ljuba Nedeljkovic [ 28/Nov/14 ]

Andreas,

We've tried with 2.8.0 rc0 built with ssl support and same thing happens.

@setup: you are correct. Below you can find output of rs.conf() for this minimal replica set as well as log excerpts upon mongod start.

>rs.conf()
{
	"_id" : "test_rs",
	"version" : 2,
	"members" : [
		{
			"_id" : 0,
			"host" : "server118.finderly.com:27018",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
 
			},
			"slaveDelay" : 0,
			"votes" : 1
		},
		{
			"_id" : 1,
			"host" : "server118.finderly.com:27118",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
 
			},
			"slaveDelay" : 0,
			"votes" : 1
		}
	],
	"settings" : {
		"chainingAllowed" : true,
		"heartbeatTimeoutSecs" : 10,
		"getLastErrorModes" : {
 
		},
		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		}
	}
}
 

Log exerpts:

Primary:

2014-11-20T16:51:22.759+0100 ***** SERVER RESTARTED *****
2014-11-20T16:51:22.766+0100 [initandlisten] MongoDB starting : pid=29721 port=27018 dbpath=/var/lib/mongo 64-bit host=server118.finderly.com
2014-11-20T16:51:22.766+0100 [initandlisten] db version v2.6.5
2014-11-20T16:51:22.766+0100 [initandlisten] git version: e99d4fcb4279c0279796f237aa92fe3b64560bf6
2014-11-20T16:51:22.766+0100 [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013

Would-be secondary:

2014-11-20T16:52:55.717+0100 ***** SERVER RESTARTED *****
2014-11-20T16:52:55.740+0100 [initandlisten] MongoDB starting : pid=29993 port=27118 dbpath=/var/lib/mongo-27118 64-bit host=server118.finderly.com
2014-11-20T16:52:55.740+0100 [initandlisten] db version v2.6.5
2014-11-20T16:52:55.740+0100 [initandlisten] git version: e99d4fcb4279c0279796f237aa92fe3b64560bf6
2014-11-20T16:52:55.740+0100 [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013

I hope this helps.

Best,
-Ljuba

Comment by Andreas Nilsson [ 26/Nov/14 ]

I cannot come up with any explanation on top of my head. Maybe there is a resource exhaustion or something similar in OpenSSL that causes this for long-lived, data intensive sessions.

If you have possibility to try it in your environment it would be interesting to see if the issue occurs in 2.8 RC0. We have disabled all SSLv3 ciphers due to POODLE, that might have a potential effects since the crash seems to occur in SSLv3 code.

We're gonna try to reproduce this on our side with large data sets. Let me know if I understand the setup correctly:

  • CentOS 6.5
  • mongod 2.6.5
  • Small replica set with primary with > 100 GB data directory, collection and document structure not important
  • Spin up new secondary, during initial sync SSL breaks down somewhere 100GB+ during the data transfer

Regards,
Andreas

Comment by Ljuba Nedeljkovic [ 24/Nov/14 ]

Hello Andreas,

Further experiments with initial sync of large database over SSL connection showed that:

  • the issue is not related to data types in individual documents; same error occurs with both combined long, binary, text and text fields only documents.
  • number of documents is irrelevant to this issue, errors occur with lower number of large documents and with high number of smaller documents
  • starting with empty data directory, sync fails with ~108GB of data in data directory. I was unable to determine how much of that amount is payload and how much is db engine internal stuff (engine wipes out "bad" data and starts over), but general feeling is that something happens at ~100GB of synched data which causes exceptions. Sync on databases smaller than 100GB works perfectly, again, no matter what data types are in documents, no matter if documents are large or small – it works.

I hope that this helps you to track down the issue.

Comment by Nemanja Dubravac [ 20/Nov/14 ]

unfortunately, that's not the reason

what mongodb lists in the log file is the output the same as the output of "openssl version" command: OpenSSL 1.0.1e-fips 11 Feb 2013

however, if i issue "rpm -q --changelog openssl", i see in the changelog that the latest change to openssl was made on "Thu Oct 16 2014", and the bugfix is included:

- fix CVE-2010-5298 - possible use of memory after free

this serverfault page contains more information about how CentOS and RedHat issue package updates: http://serverfault.com/a/604277/102901

Comment by Andreas Nilsson [ 17/Nov/14 ]

Hey, I spent some time looking through the logs. I couldn't find anything conclusive right away. One thing I did note however is that the version of OpenSSL used by the server seems to be from Feb 2013.

2014-11-10T10:22:12.928+0100 [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013

so if there was a fix in June 2014 I guess it wouldn't be included. Can you double check the OpenSSL version on the target system.

Comment by Nemanja Dubravac [ 14/Nov/14 ]

Hi Andreas,

Thanks for looking into this issue!

in the log file I provided, there is first a part near one sample failure (2014-11-09T22:33:30.511+0100)

then, there is a server restart with an emptied data directory, where you can see the whole process from initial startup, initial sync and including the first SSL exception (2014-11-10T10:54:30.514+0100)
during our testing, it happened every time at round 290K synced documents, there would be such an SSL3 exception, the initial sync would be completely restarted 9 more times, it would fail 9 more times with the same exception after syncing almost the same number of documents, and then it would finally give up with a fatal assertion

2014-11-09T22:33:30.511+0100 [rsSync] ERROR: SSL: error:1408F119:SSL routines:SSL3_GET_RECORD:decryption failed or bad record mac
2014-11-09T22:33:30.511+0100 [rsSync] SocketException: remote: ###ip-address-1###:27018 error: 9001 socket exception [CONNECT_ERROR] 
2014-11-09T22:33:30.630+0100 [rsSync] ERROR: SSL: error:140D00CF:SSL routines:SSL_write:protocol is shutdown
2014-11-09T22:33:30.630+0100 [rsSync] caught exception (socket exception [CONNECT_ERROR] for ) in destructor (~PiggyBackData)
2014-11-09T22:33:30.630+0100 [rsSync] replSet initial sync exception: 16465 recv failed while exhausting cursor 0 attempts remaining
...
2014-11-09T22:34:00.630+0100 [rsSync] Fatal Assertion 16233
2014-11-09T22:34:00.753+0100 [rsSync] 0xf53ad6 0xf01462 0xeea693 0xce8bce 0xd01c1d 0xd01d4a 0xd01fe2 0xf8c4cc 0x7f1f5f5e49d1 0x7f1f5e33a86d 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x26) [0xf53ad6]
 /usr/bin/mongod(_ZN5mongo10logContextEPKc+0x1a2) [0xf01462]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xeea693]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x1ae) [0xce8bce]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0xcd) [0xd01c1d]
 /usr/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x8a) [0xd01d4a]
 /usr/bin/mongod(_ZN5mongo15startSyncThreadEv+0x82) [0xd01fe2]
 /usr/bin/mongod() [0xf8c4cc]
 /lib64/libpthread.so.0(+0x79d1) [0x7f1f5f5e49d1]
 /lib64/libc.so.6(clone+0x6d) [0x7f1f5e33a86d]
2014-11-09T22:34:00.753+0100 [rsSync] 
 
***aborting after fassert() failure

the replica set is a part of a sharded cluster, but i don't think it makes a difference in this case
there were no stepdowns during many attempts of an initial sync

we were able to do a resync after we stopped requiring ssl on this cluster, so it tells us that the problem is either in the openssl library, the way we compiled mongodb to use openssl, or in mongodb's implementation of communication over ssl
if you could give us ideas what we could try to narrow down the cause, we would be grateful

thanks!

Comment by Andreas Nilsson [ 14/Nov/14 ]

Hi pcmaniac, we will need some more information in order to reproduce and determine where the problem lies. It would be helpful to see more of the logs some time before the problem starts happening.

Specifically it would be interesting to know details about the replica set and whether or not there were any stepdowns during the initial sync process.

Generated at Thu Feb 08 03:39:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.