[SERVER-9846] Slave replication failure replHandshake unauthorized Created: 01/Jun/13 Updated: 10/Dec/14 Resolved: 10/Jun/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Security |
| Affects Version/s: | 2.4.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Peter Chan | Assignee: | Spencer Brody (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
I previously had a database on Mongo 1.8 that was running master/slave replication. The data store has three databases: local, admin, and appdb. I upgraded directly to Mongo 2.4.3, and I am starting to get sporadic master/slave replication failures.

Specifically, I am seeing on the slaves:

Sat Jun 1 14:48:29.702 [replslave] Socket recv() timeout 127.0.0.1:37017

And on the master:

Sat Jun 01 09:49:39.024 [conn1767] authenticate db: local { authenticate: 1, nonce: "XXXXXXXXXX", user: "repl", key: "XXXXXXXXX" }
Sat Jun 01 09:49:39.118 [conn1767] command denied: { handshake: ObjectId('XXXXXXXX-changes every time it outputs') }

The error is not always present. I have three slave instances. One is replicating fine with no error messages. The other two start replication, but at some point encounter the "unauthorized" errmsg. I tried deleting and re-creating the slave database, including adding the "repl" user to the local database. That works for a while, then the error comes up again. I tried changing the "repl" user on the master's local database, including removing the old "readOnly" field and adding the roles ['read', 'readWrite', 'dbAdmin'], but it still doesn't work. |
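For readers unfamiliar with the pre-2.4 scheme described above, this is roughly what the legacy "repl" user setup looked like. It is a sketch only: the host, dbpath, and password are placeholders, not values from this report.

    # Legacy master/slave auth (pre-2.4): add a "repl" user to the local
    # database on the master (and, as described above, on the slave too).
    mongo MASTER_IP:27017/local --eval 'db.addUser("repl", "REPL_PASSWORD")'

    # Slave started against that master with auth enabled.
    mongod --slave --source MASTER_IP:27017 --auth --dbpath /data/slave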
| Comments |
| Comment by Peter Chan [ 10/Jun/13 ] |
|
Thank you. Appreciate it. |
| Comment by Spencer Brody (Inactive) [ 10/Jun/13 ] |
|
To summarize the original master/slave auth issues: The old way of enabling access control in a master/slave system, adding a user named "repl" to the local database of all nodes, is no longer valid as of 2.4. The new way to enable auth on a master/slave system is to use --keyFile, just as is done for Replica Sets. |
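A minimal sketch of the keyFile-based setup described here, assuming a master on MASTER_IP:27017 and one slave syncing through localhost:37017; the key path, ports, and dbpaths are placeholders:

    # Generate one key file and copy it to every node; mongod refuses to
    # start if its permissions are too open.
    openssl rand -base64 741 > /etc/mongodb-keyfile
    chmod 600 /etc/mongodb-keyfile

    # Master: --keyFile implies auth.
    mongod --master --port 27017 --dbpath /data/master --keyFile /etc/mongodb-keyfile

    # Slave: same key file, syncing from the master (here via the SSH
    # tunnel endpoint used elsewhere in this ticket).
    mongod --slave --source localhost:37017 --port 27018 --dbpath /data/slave --keyFile /etc/mongodb-keyfile

Under this scheme the nodes authenticate to each other with the shared key; no "repl" user in the local database is involved.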
| Comment by Spencer Brody (Inactive) [ 10/Jun/13 ] |
|
Hi Peter, |
| Comment by Peter Chan [ 08/Jun/13 ] |
|
Ok, I upgraded the master back to 2.4.3. The problem is now recurring on one of the 2.4.3 slaves. It is now producing:

Fri Jun 07 23:04:56.207 [replslave] DBClientCursor::init call() failed

Another 2.4.3 slave, on the same server, connected to the same SSH tunnel, is replicating fine (note the contemporary timestamps):

Fri Jun 07 23:02:25.520 [replslave] repl: checkpoint applied 1320 operations

I also connected to the localhost endpoint of the SSH tunnel with a 2.4 shell, and this is the transcript:

MongoDB shell version: 2.4.3 |
| Comment by Peter Chan [ 07/Jun/13 ] |
|
Ok. This will take me a day or two, as I need to schedule taking the master offline and upgrading it back to 2.4.3 from 2.2.4. Are there any commands you want me to run in the 2.4 shell over the SSH tunnel, if I can establish a connection successfully? I ask because, from the look of the log, the initial connection is fine, but something subsequently goes wrong at the "DBClientCursor::init call() failed" line. |
| Comment by Spencer Brody (Inactive) [ 07/Jun/13 ] |
|
Can you try connecting to the master with a 2.4 shell from the slave machine using the same hostname/IP that the slave node is using (I believe that's localhost:37017)? |
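For example (a sketch only; which diagnostic commands to run is up to you, and neither of these requires authentication):

    # From the slave machine, connect through the same endpoint the slave
    # node uses and confirm the connection stays up.
    mongo localhost:37017/admin --eval 'printjson(db.runCommand({ ping: 1 })); printjson(db.runCommand({ isMaster: 1 }))'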
| Comment by Peter Chan [ 06/Jun/13 ] |
|
Also, if it helps to have more direct access, feel free to contact me offline. I can set something up. |
| Comment by Peter Chan [ 06/Jun/13 ] |
|
This is the expected setup. The master is listening on port 27017 on MASTER_IP. The slave node has an SSH tunnel that forwards SLAVE_IP:37017 to MASTER_IP:27017, and the slave mongod then connects to localhost:37017.

I ruled out a problem with the SSH tunnel because, on one occasion, I had two slaves connecting to the same tunnel: one was replicating in real time successfully, while the other, which has a slave delay, was failing with the SocketException (I don't believe the slave-delay option is a factor here).

All arguments and options are EXACTLY the same. I have a db start script that keeps all options the same, and the only difference is that I swap between the 2.2.4 and 2.4.3 versions of bin/mongod.exe to invoke.

Yes, when I use keyFile with 2.2 nodes, it works fine with no SocketException. As noted above, it appears that this particular SocketException only arises when ALL nodes are running 2.4. If either the master or the slave is running 2.2, then everything (so far, in the last day) runs fine. Currently, I have a 2.2.4 master running, along with 3 x 2.4.3 slaves, using keyFile, and they are replicating fine. Obviously, I want to move the master up to 2.4, assuming we can figure this out. |
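For context, roughly what such a tunnel looks like; the SSH user, key handling, and mongod paths are assumptions, and only the 37017 -> 27017 forwarding comes from the comment above:

    # On the slave machine: forward local port 37017 to port 27017 on the
    # master. -f backgrounds the tunnel, -N runs no remote command.
    ssh -f -N -L 37017:localhost:27017 tunneluser@MASTER_IP

    # The slave mongod then syncs through the local tunnel endpoint.
    mongod --slave --source localhost:37017 --dbpath /data/slave --keyFile /etc/mongodb-keyfile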
| Comment by Spencer Brody (Inactive) [ 06/Jun/13 ] |
|
From the logs you posted, it looks like your master is listening on port 27017, but your secondary is trying to connect to port 37017. When you've been downgrading to 2.2, have you been keeping the same startup options? Were you still using keyFile with 2.2? Did authentication with keyFile work in 2.2? |
| Comment by Peter Chan [ 06/Jun/13 ] |
|
So, here are the logs from the slave and master, both running 2.4.3. The logs are captured while only the master and that single failing slave are running; no other drivers / nodes are connecting. Timestamps should be approximately correct.

Log from slave:

Wed Jun 05 18:47:50.943 [initandlisten] db version v2.4.3
Wed Jun 05 18:47:50.959 [initandlisten] journal dir=/data/slave-delay\journal

Log from master:

Wed Jun 05 18:47:37.988 [initandlisten] MongoDB starting : pid=4296 port=27017 dbpath=\datadata\ master=1 64-bit host=master
Wed Jun 05 18:47:38.003 [initandlisten] journal dir=\datdata\journal
Wed Jun 05 18:48:29.847 [initandlisten] connection accepted from MASTER_IP:30932 #2 (2 connections now open)
Wed Jun 05 18:49:03.597 [initandlisten] connection accepted from MASTER_IP:30933 #3 (3 connections now open)
Wed Jun 05 18:49:37.800 [initandlisten] connection accepted from MASTER_IP:30935 #4 (4 connections now open)
Wed Jun 05 18:49:38.972 [conn1] query local.oplog.$main query: { ts: { $gte: Timestamp 1370420848000|22 } } cursorid:332362094268386 ntoreturn:0 ntoskip:0 nscanned:102 keyUpdates:0 numYields: 6567 locks(micros) r:3683308 nreturned:101 reslen:11750 102609ms

Later, I took the chance to downgrade the master to 2.2 while running the slaves at 2.4, and replication still works on all of them, so it appears the problem has to do with the use of keyFile AND the use of 2.4.3 on ALL nodes. |
| Comment by Peter Chan [ 05/Jun/13 ] |
|
Ok, I will post the console log from the master when I get a chance to isolate the display. But at first glance, there is no obvious error in the master log.

This error stops replication on the affected node. The problem is not visible to the application, since this is M/S replication and the master is entirely unaffected. I have swapped the slaves back to 2.2.4, which immediately resolves the issue. Once I put 2.4.3 back on, the problem with the SocketException creeps up again on 2 of 3 nodes. It is also sporadic: on my Linux node it happened on 2.4, I swapped to 2.2 and it worked, I swapped back to 2.4 and it is, at the moment, still replicating (but it obviously happened before, and is currently still happening on the Windows node on 2.4).

It does look like a networking issue, but it seems more likely to be a combination of the changes in 2.4 plus keyFile, because 1) this particular error didn't happen before the use of keyFile, 2) the two Windows slaves share the same underlying SSH tunnel, yet one of them is replicating fine while the other has this SocketException, and 3) 2.2.4 does not have this issue across multiple restarts. |
| Comment by Spencer Brody (Inactive) [ 05/Jun/13 ] |
|
Also, when you see this error, does replication stop? Are there problems visible to your application, or is this just something you're seeing in the logs? |
| Comment by Spencer Brody (Inactive) [ 05/Jun/13 ] |
|
Hard to tell. At first glance it seems more likely to be a networking issue than an auth issue, but it's weird that you didn't see it before changing to use keyFile... Can you post logs for the same time period from the primary? |
| Comment by Peter Chan [ 05/Jun/13 ] |
|
I tried out the keyFile option and it seems to work; replication proceeds, at least initially. But after a few hours, I am getting this cursor message:

Wed Jun 5 17:09:41.267 [replslave] repl: syncing from host:localhost:37017

Again, this is running on a 2.4.3 master and 2.4.3 slaves. Of the three slaves set up, one is syncing properly, but the other two have the above error messages (one on Windows and the other on Linux). Restarting the slave nodes has little or no effect on resolving the above error. This seems to be triggered by the use of keyFile and 2.4.3, as I haven't seen it before. Is it related? |
| Comment by Spencer Brody (Inactive) [ 05/Jun/13 ] |
|
Hi Peter, I filed |
| Comment by Peter Chan [ 05/Jun/13 ] |
|
Ok, this requires that I take the production database offline, so I need to schedule it. It will take me another day or two to report back. In the meantime, can you give a little background on what this is supposed to do? I know keyFile is for ReplicaSets, but is it also used for master/slave? Is this a one-time operation, or do I need to switch to using keyFile for authentication from now on? How does it relate to the previous "repl" user in the local database? |
| Comment by Spencer Brody (Inactive) [ 04/Jun/13 ] |
|
Hi Peter, |
| Comment by Peter Chan [ 03/Jun/13 ] |
|
Spencer, unfortunately, I am not able to upgrade to ReplicaSets at this point, due to operational requirements. Let me know if there is anything I can help with developing a fix / workaround. |
| Comment by Spencer Brody (Inactive) [ 03/Jun/13 ] |
|
Hi Peter,

If using ReplicaSets isn't an option for you, let me know - there may be a workaround to make auth work with master/slave, but it will require some work to confirm and test it. |
| Comment by Peter Chan [ 03/Jun/13 ] |
|
I managed to mostly resolve the operational impact by downgrading only the slaves to Mongo 2.2.4 while the master remains at 2.4.3. All three slave instances (running 2.2.4) are now replicating fine, but I still see sporadic "unauthorized" messages. Here is a sample:

Sun Jun 2 03:14:05 [replslave] repl: syncedTo: Jun 2 03:14:11 51aab883:e
Sun Jun 2 03:17:27 [replslave] repl: checkpoint applied 3031 operations
Sun Jun 2 03:19:44 [replslave] repl: checkpoint applied 2885 operations

It appears that the unauthorized message comes up at the beginning of connection establishment, but the slave can continue after the message. I have no idea why the 2.2.4 slaves can continue, while the same slaves running 2.4.3 have only partial success. Obviously this is not a long-term workaround, as eventually I want all mongod instances to be running 2.4+, not 2.2. |
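If it helps to confirm whether a given slave is actually keeping up during these tests, the standard shell helpers can be run on each side (a sketch; host names and ports are placeholders, and authentication may be required depending on the setup):

    # On the master: oplog size and the time range it covers.
    mongo MASTER_IP:27017/local --eval 'db.printReplicationInfo()'

    # On a slave: how far behind each sync source it is (syncedTo).
    mongo SLAVE_IP:27018/local --eval 'db.printSlaveReplicationInfo()'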
| Comment by Peter Chan [ 01/Jun/13 ] |
|
Just to add some information: The error seems to come up sporadically. On the one good instance, it comes up once upon startup, then replication continues successfully:

Sat Jun 01 10:08:54.310 [replslave] repl: syncing from host:localhost:37017
Sat Jun 01 10:09:55.951 [replslave] repl: checkpoint applied 2880 operations |