[SERVER-9846] Slave replication failure replHandshake unauthorized Created: 01/Jun/13  Updated: 10/Dec/14  Resolved: 10/Jun/13

Status: Closed
Project: Core Server
Component/s: Replication, Security
Affects Version/s: 2.4.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Peter Chan Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to DOCS-1571 Update master/slave documentation wit... Closed
related to SERVER-9865 Deprecate and remove code for "repl" ... Closed
Operating System: ALL
Participants:

 Description   

I previously had a database in Mongo 1.8 that was running master/slave replication.

The data store has three databases: local, admin, and appdb

I upgraded directly to Mongo 2.4.3, and I am starting to get sporadic master/slave replication failures. Specifically, I am seeing the following on the slaves:

Sat Jun 1 14:48:29.702 [replslave] Socket recv() timeout 127.0.0.1:37017
Sat Jun 1 14:48:29.703 [replslave] SocketException: remote: 127.0.0.1:37017 error: 9001 socket exception [3] server [127.0.0.1:37017]
Sat Jun 1 14:48:29.703 [replslave] DBClientCursor::init call() failed
Sat Jun 1 14:48:29.703 [replslave] repl: dbclient::query returns null (conn closed?)
Sat Jun 1 14:48:29.704 [replslave] repl: sleep 3 sec before next pass
Sat Jun 1 14:48:32.710 [replslave] repl: syncing from host:localhost:37017
Sat Jun 1 14:48:33.080 [replslave] replHandshake res not: 0 res:

{ ok: 0.0, errmsg: "unauthorized" }

And on the master:

Sat Jun 01 09:49:39.024 [conn1767] authenticate db: local

{ authenticate: 1, nonce: "XXXXXXXXXX", user: "repl", key: "XXXXXXXXX" }

Sat Jun 01 09:49:39.118 [conn1767] command denied:

{ handshake: ObjectId('XXXXXXXX-changes every time it outputs') }

The error is not always present. I have three slave instances. One is replicating fine with no error messages. The other two would start replication, but at some point encounter the "unauthorized" errmsg.

I tried deleting and re-creating the slave database, including adding the "repl" user to the local database. That works for a while, then the error comes up again.

I tried changing the "repl" user on the master's local database, including removing the old "readOnly" field and adding the roles ['read', 'readWrite', 'dbAdmin'], but it still doesn't work.
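
For reference, what I did on the master's local database was roughly the following (paraphrased; the user name is real, but "password123" is a placeholder, not the actual password):

> use local
switched to db local
> // original 1.8-era user:
> db.addUser("repl", "password123")
> // later attempt: drop the old "readOnly" field and set roles directly on the user document
> db.system.users.update({ user: "repl" }, { $unset: { readOnly: "" }, $set: { roles: [ "read", "readWrite", "dbAdmin" ] } })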



 Comments   
Comment by auto [ 10/Jun/13 ]

Author:

{u'username': u'stbrody', u'name': u'Spencer T Brody', u'email': u'spencer@10gen.com'}

Message: SERVER-9846 Make master/slave auth test use keyFile
Branch: master
https://github.com/mongodb/mongo/commit/64a1136d69b700d9e7e40e7fd8c43f9a8108ac76

Comment by Peter Chan [ 10/Jun/13 ]

Thank you. Appreciate it.

Comment by Spencer Brody (Inactive) [ 10/Jun/13 ]

To summarize the original master/slave auth issues:

The old way of enabling access control in a master/slave system, adding a user named "repl" to the local database of all nodes, is no longer valid as of 2.4. The new way to enable auth on a master/slave system is to use --keyFile, just as is done for Replica Sets.
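
As a rough illustration (paths, ports, and hostnames below are placeholders, and every node must point at a keyfile with identical contents), a keyFile-based master/slave startup looks something like:

# on the master
mongod --master --auth --keyFile /data/keyfile.txt --dbpath /data/master --port 27017

# on each slave
mongod --slave --source master.example.com:27017 --auth --keyFile /data/keyfile.txt --dbpath /data/slave --port 27018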

Comment by Spencer Brody (Inactive) [ 10/Jun/13 ]

Hi Peter,
This ticket has diverged a bit from the original problem with authentication. I have created SUPPORT-593 on your behalf; it is a private ticket visible only to you and employees of 10gen, so that you can investigate the network issues you are seeing further with members of our support team.
I'd like to go ahead and resolve this ticket as the original problems with authentication seem to be resolved by the switch to using --keyFile.
Please let us know if you have any further questions/issues.

Comment by Peter Chan [ 08/Jun/13 ]

Ok, I upgraded the master back to 2.4.3. The problem is now recurring on one of the 2.4.3 slaves. It is now producing:

Fri Jun 07 23:04:56.207 [replslave] DBClientCursor::init call() failed
Fri Jun 07 23:04:56.207 [replslave] repl: dbclient::query returns null (conn closed?)
Fri Jun 07 23:04:56.207 [replslave] repl: sleep 3 sec before next pass
Fri Jun 07 23:04:59.207 [replslave] repl: syncing from host:localhost:37017
Fri Jun 07 23:05:30.395 [replslave] Socket recv() timeout 127.0.0.1:37017
Fri Jun 07 23:05:30.395 [replslave] SocketException: remote: 127.0.0.1:37017 error: 9001 socket exception [3] server [127.0.0.1:37017]
Fri Jun 07 23:05:30.395 [replslave] DBClientCursor::init call() failed
Fri Jun 07 23:05:30.395 [replslave] repl: dbclient::query returns null (conn closed?)
Fri Jun 07 23:05:30.395 [replslave] repl: sleep 3 sec before next pass
Fri Jun 07 23:05:33.395 [replslave] repl: syncing from host:localhost:37017
Fri Jun 07 23:06:04.660 [replslave] Socket recv() timeout 127.0.0.1:37017
Fri Jun 07 23:06:04.660 [replslave] SocketException: remote: 127.0.0.1:37017 error: 9001 socket exception [3] server [127.0.0.1:37017]
Fri Jun 07 23:06:04.660 [replslave] DBClientCursor::init call() failed
Fri Jun 07 23:06:04.660 [replslave] repl: dbclient::query returns null (conn closed?)
Fri Jun 07 23:06:04.660 [replslave] repl: sleep 3 sec before next pass
Fri Jun 07 23:06:07.660 [replslave] repl: syncing from host:localhost:37017

Another 2.4.3 slave, on the same server, connected to the same SSH tunnel, is replicating fine (note contemporary timestamps):

Fri Jun 07 23:02:25.520 [replslave] repl: checkpoint applied 1320 operations
Fri Jun 07 23:02:25.520 [replslave] repl: syncedTo: Jun 07 23:02:25 51b2acd1:a
Fri Jun 07 23:03:26.004 [replslave] repl: checkpoint applied 1410 operations
Fri Jun 07 23:03:26.004 [replslave] repl: syncedTo: Jun 07 23:03:25 51b2ad0d:15
Fri Jun 07 23:04:27.098 [replslave] repl: checkpoint applied 1440 operations
Fri Jun 07 23:04:27.098 [replslave] repl: syncedTo: Jun 07 23:04:26 51b2ad4a:11
Fri Jun 07 23:05:28.567 [replslave] repl: checkpoint applied 1440 operations
Fri Jun 07 23:05:28.567 [replslave] repl: syncedTo: Jun 07 23:05:28 51b2ad88:6
Fri Jun 07 23:06:29.785 [replslave] repl: checkpoint applied 1470 operations
Fri Jun 07 23:06:29.785 [replslave] repl: syncedTo: Jun 07 23:06:29 51b2adc5:9

I also connected to the localhost end point of the SSH tunnel with a 2.4 shell, and this is the transcript:

MongoDB shell version: 2.4.3
connecting to: localhost:37017/test
> use admin
switched to db admin
> db.auth('username', 'password123')
1
> db.isMaster()
{
"ismaster" : true,
"maxBsonObjectSize" : 16777216,
"maxMessageSizeBytes" : 48000000,
"localTime" : ISODate("2013-06-08T03:59:40.425Z"),
"ok" : 1
}
> show dbs
admin 0.03125GB
xxxxdb 6.996826171875GB
local 2.0302734375GB
>

Comment by Peter Chan [ 07/Jun/13 ]

Ok. This will take me a day or two, as I need to schedule the master offline and upgrade it back to 2.4.3 from 2.2.4.

Are there any commands you want me to run in the 2.4 shell over the SSH tunnel, if I can establish a connection successfully?

I ask because, from the look of the log, the initial connection is fine, but something subsequently happens with the "DBClientCursor::init call() failed" line.

Comment by Spencer Brody (Inactive) [ 07/Jun/13 ]

Can you try connecting to the master with a 2.4 shell from the slave machine using the same hostname/IP that the slave node is using (I believe that's localhost:37017)?

Comment by Peter Chan [ 06/Jun/13 ]

Also, if it helps to have more direct access, feel free to contact me offline. I can set something up.

Comment by Peter Chan [ 06/Jun/13 ]

This is the expected setup. The master is listening on port 27017 on MASTER_IP. The slave node has an SSH tunnel that forwards SLAVE_IP:37017 to MASTER_IP:27017, and the slave mongod then connects to localhost:37017.
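
For reference, the tunnel is established from the slave host with something along these lines (the user and exact ssh options here are placeholders):

# run on the slave: local port 37017 is forwarded to the master's 27017
ssh -N -L 37017:localhost:27017 tunneluser@MASTER_IP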

I ruled out a problem with the SSH tunnel because, on one occasion, I had two slaves connecting to the same tunnel: one was replicating in real time successfully, while the other had a slave delay configured and was failing with the SocketException (I don't believe the slave-delay option is a factor here).

All arguments and options are EXACTLY the same. I have a db start script that keeps all options the same, and the only difference is that I swap between the 2.2.4 and 2.4.3 versions of bin/mongod.exe.

Yes, when I use keyFile with 2.2 nodes, it works fine with no SocketException. As noted above, it appears that this particular SocketException only arises when ALL nodes are running 2.4. If either the master or slave is running 2.2, then everything (so far in the last day) runs fine.

Currently, I have a 2.2.4 master running, along with 3 x 2.4.3 slaves, using keyFile, and they are replicating fine. Obviously, I want to move the master up to 2.4, assuming we can figure this out.

Comment by Spencer Brody (Inactive) [ 06/Jun/13 ]

From the logs you posted, it looks like your master is listening on port 27017, but your secondary is trying to connect to port 37017.

When you've been downgrading to 2.2, have you been keeping the same startup options? Were you still using keyFile with 2.2? Did authentication with keyFile work in 2.2?

Comment by Peter Chan [ 06/Jun/13 ]

So, here are the logs from the slave and master, both running 2.4.3. The logs were captured while only the master and that single failing slave were running. No other drivers / nodes are connecting. Timestamps should be approximately correct.

Log from slave:

Wed Jun 05 18:47:50.943 [initandlisten] db version v2.4.3
Wed Jun 05 18:47:50.943 [initandlisten] git version: fe1743177a5ea03e91e0052fb5e2cb2945f6d95f
Wed Jun 05 18:47:50.943 [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_49
Wed Jun 05 18:47:50.943 [initandlisten] allocator: system
Wed Jun 05 18:47:50.943 [initandlisten] options:

{ auth: true, autoresync: true, dbpath: "/data/slave-delay", journal: true, keyFile: "\keyfile.txt", port: 47018 , slave: true, slavedelay: 43200, smallfiles: true, source: "localhost:37017" }

Wed Jun 05 18:47:50.959 [initandlisten] journal dir=/data/slave-delay\journal
Wed Jun 05 18:47:50.959 [initandlisten] recover : no journal files present, no recovery needed
Wed Jun 05 18:47:50.990 [initandlisten] waiting for connections on port 47018
Wed Jun 05 18:47:50.990 [websvr] admin web console waiting for connections on port 48018
Wed Jun 05 18:47:51.990 [replslave] repl: syncing from host:localhost:37017
Wed Jun 05 18:47:53.006 [replslave] repl: couldn't connect to server localhost:37017
Wed Jun 05 18:47:53.006 [replslave] repl: sleep 3 sec before next pass
Wed Jun 05 18:47:56.006 [replslave] repl: syncing from host:localhost:37017
Wed Jun 05 18:48:26.803 [replslave] Socket recv() timeout 127.0.0.1:37017
Wed Jun 05 18:48:26.803 [replslave] SocketException: remote: 127.0.0.1:37017 error: 9001 socket exception [3] server [127.0.0.1:37017]
Wed Jun 05 18:48:26.803 [replslave] DBClientCursor::init call() failed
Wed Jun 05 18:48:26.803 [replslave] repl: dbclient::query returns null (conn closed?)
Wed Jun 05 18:48:26.803 [replslave] repl: sleep 3 sec before next pass
Wed Jun 05 18:48:29.803 [replslave] repl: syncing from host:localhost:37017
Wed Jun 05 18:49:00.553 [replslave] Socket recv() timeout 127.0.0.1:37017
Wed Jun 05 18:49:00.553 [replslave] SocketException: remote: 127.0.0.1:37017 error: 9001 socket exception [3] server [127.0.0.1:37017]
Wed Jun 05 18:49:00.553 [replslave] DBClientCursor::init call() failed
Wed Jun 05 18:49:00.553 [replslave] repl: dbclient::query returns null (conn closed?)
Wed Jun 05 18:49:00.553 [replslave] repl: sleep 3 sec before next pass
Wed Jun 05 18:49:03.553 [replslave] repl: syncing from host:localhost:37017
Wed Jun 05 18:49:34.334 [replslave] Socket recv() timeout 127.0.0.1:37017
Wed Jun 05 18:49:34.334 [replslave] SocketException: remote: 127.0.0.1:37017 error: 9001 socket exception [3] server [127.0.0.1:37017]
Wed Jun 05 18:49:34.334 [replslave] DBClientCursor::init call() failed
Wed Jun 05 18:49:34.334 [replslave] repl: dbclient::query returns null (conn closed?)
Wed Jun 05 18:49:34.334 [replslave] repl: sleep 3 sec before next pass

Log from master:

Wed Jun 05 18:47:37.988 [initandlisten] MongoDB starting : pid=4296 port=27017 dbpath=\datadata\ master=1 64-bit host=master
Wed Jun 05 18:47:37.988 [initandlisten] db version v2.4.3
Wed Jun 05 18:47:37.988 [initandlisten] git version: fe1743177a5ea03e91e0052fb5e2cb2945f6d95f
Wed Jun 05 18:47:37.988 [initandlisten] build info: windows sys.getwindowsversion(major=6, minor=1, build=7601, platform=2, service_pack='Service Pack 1') BOOST_LIB_VERSION=1_49
Wed Jun 05 18:47:37.988 [initandlisten] allocator: system
Wed Jun 05 18:47:37.988 [initandlisten] options:

{ auth: true, journal: true, keyFile: "\keyfile.txt", master: true, oplogSize: 2000, rest: true, smallfiles: true }

Wed Jun 05 18:47:38.003 [initandlisten] journal dir=\datdata\journal
Wed Jun 05 18:47:38.003 [initandlisten] recover : no journal files present, no recovery needed
Wed Jun 05 18:47:38.050 [initandlisten] waiting for connections on port 27017
Wed Jun 05 18:47:38.050 [websvr] admin web console waiting for connections on port 28017
Wed Jun 05 18:47:56.050 [initandlisten] connection accepted from MASTER_IP:30930 #1 (1 connection now open)
Wed Jun 05 18:47:56.191 [conn1] authenticate db: local

{ authenticate: 1, nonce: "asdfasdfasdfasdf", user: "__system", key: "afdfsfadfasdfasdfasdfsdf" }

Wed Jun 05 18:48:29.847 [initandlisten] connection accepted from MASTER_IP:30932 #2 (2 connections now open)
Wed Jun 05 18:48:29.956 [conn2] authenticate db: local

{ authenticate: 1, nonce: "asdfasdfasdfasdf", user: "__system", key: "9d65c92fsfsfdfsdfsdf51518a79fa4a51e" }

Wed Jun 05 18:49:03.597 [initandlisten] connection accepted from MASTER_IP:30933 #3 (3 connections now open)
Wed Jun 05 18:49:03.738 [conn3] authenticate db: local

{ authenticate: 1, nonce: "asdfasdfasdfasdf", user: "__system", key: "f8094c85adfasdfasdfasdfsdfde1adf75" }

Wed Jun 05 18:49:37.800 [initandlisten] connection accepted from MASTER_IP:30935 #4 (4 connections now open)
Wed Jun 05 18:49:37.941 [conn4] authenticate db: local

{ authenticate: 1, nonce: "asdfasdfasdfasdf", user: "__system", key: "f2d24asfasdfasdfasdfasdfasdfdsfe1c2fe" }

Wed Jun 05 18:49:38.972 [conn1] query local.oplog.$main query: { ts:

{ $gte: Timestamp 1370420848000|22 }

} cursorid:332362094268386 ntoreturn:0 ntoskip:0 nscanned:102 keyUpdates:0 numYields: 6567 locks(micros) r:3683308 nreturned:101 reslen:11750 102609ms
Wed Jun 05 18:49:38.972 [conn1] end connection MASTER_IP:30930 (3 connections now open)

Later, I took the chance to downgrade the master to 2.2 while running the slaves at 2.4, and replication still works on all of them, so it appears the problem has to do with the use of keyFile AND the use of 2.4.3 on ALL nodes.

Comment by Peter Chan [ 05/Jun/13 ]

Ok, I will post the console log from the master when I get a chance to isolate the display. But at first glance, there is no obvious error in the master log.

This error stops replication on the affected node. The problem is not visible to the application, since this is M/S replication and the master is entirely unaffected.

I have swapped the slaves back to 2.2.4, which immediately resolves the issue. Once I put 2.4.3 back on, the SocketException problem creeps up again on 2 of 3 nodes.

It is also sporadic. On my Linux node it happened on 2.4; I swapped to 2.2 and it worked; I swapped back to 2.4 and, at the moment, it is still replicating (but the problem obviously happened before, and it is currently still happening on the Windows node on 2.4).

It does look like a networking issue, but it seems likely to be a combination of the changes in 2.4 and keyFile, because 1) this particular error didn't happen before the use of keyFile, 2) the two Windows slaves share the same underlying SSH tunnel, yet one of them is replicating fine while the other has this SocketException, and 3) 2.2.4 does not have this issue across multiple restarts.

Comment by Spencer Brody (Inactive) [ 05/Jun/13 ]

Also, when you see this error, does replication stop? Are there problems visible to your application, or is this just something you're seeing in the logs?

Comment by Spencer Brody (Inactive) [ 05/Jun/13 ]

Hard to tell. At first glance it seems more likely to be a networking issue than an auth issue, but it's weird that you didn't see it before changing to use keyFile...

Can you post logs for the same time period from the primary?

Comment by Peter Chan [ 05/Jun/13 ]

I tried out the keyFile option and it seems to work and replication proceeds, at least initially. But after a few hours, I am getting this cursor message:

Wed Jun 5 17:09:41.267 [replslave] repl: syncing from host:localhost:37017
Wed Jun 5 17:09:41.267 [replslave] repl: couldn't connect to server localhost:37017
Wed Jun 5 17:09:41.267 [replslave] repl: sleep 3 sec before next pass
Wed Jun 5 17:09:44.275 [replslave] repl: syncing from host:localhost:37017
Wed Jun 5 17:10:14.607 [replslave] Socket recv() timeout 127.0.0.1:37017
Wed Jun 5 17:10:14.607 [replslave] SocketException: remote: 127.0.0.1:37017 error: 9001 socket exception [3] server [127.0.0.1:37017]
Wed Jun 5 17:10:14.607 [replslave] DBClientCursor::init call() failed
Wed Jun 5 17:10:14.607 [replslave] repl: dbclient::query returns null (conn closed?)
Wed Jun 5 17:10:14.608 [replslave] repl: sleep 3 sec before next pass
Wed Jun 5 17:10:17.615 [replslave] repl: syncing from host:localhost:37017
Wed Jun 5 17:10:47.994 [replslave] Socket recv() timeout 127.0.0.1:37017
Wed Jun 5 17:10:47.995 [replslave] SocketException: remote: 127.0.0.1:37017 error: 9001 socket exception [3] server [127.0.0.1:37017]
Wed Jun 5 17:10:47.995 [replslave] DBClientCursor::init call() failed
Wed Jun 5 17:10:47.995 [replslave] repl: dbclient::query returns null (conn closed?)
Wed Jun 5 17:10:47.995 [replslave] repl: sleep 3 sec before next pass
Wed Jun 5 17:10:51.002 [replslave] repl: syncing from host:localhost:37017

Again, this is running on a 2.4.3 master and 2.4.3 slaves. Of the three slaves set up, one is syncing properly, but the other two have the above error messages (one on Windows and the other on Linux). Restarting the slave nodes has little or no effect on resolving the above error.

This seems to be triggered by the use of keyFile and 2.4.3, as I haven't seen it before. Is it related?

Comment by Spencer Brody (Inactive) [ 05/Jun/13 ]

Hi Peter,
I did some further digging, and it turns out that if you use keyFile, you don't need the "repl" user at all. The slaves will authenticate to the master using the keyFile, much like they do in a replica set configuration, and the local db's "repl" user will be ignored. Switching to --keyFile is a configuration setting, so once you've made that change you should be able to leave it alone and from now on your cluster will continue to perform internal authentication using the keyFile. Client authentication (for example from your app or from the shell) should be unaffected - the keyFile is only used for cluster members to authenticate to each other.

I filed DOCS-1571 to update the documentation to reflect this.

Comment by Peter Chan [ 05/Jun/13 ]

Ok, this requires that I take the production database offline, so I need to schedule it. It will take me another day or two to report back.

In the meantime, can you give a little background on what this is supposed to do? I know keyFile is for ReplicaSet, but is it also used for Master/slave? Is this a one-time operation or do I need to switch to using keyFile for authentication from now on? How does it relate to the previous "repl" user in the local database?

Comment by Spencer Brody (Inactive) [ 04/Jun/13 ]

Hi Peter,
Can you please try restarting all your nodes with --keyFile, giving the contents of the keyfile the same password as the "repl" user you set up in the local database? I believe that should resolve this issue. More information about setting up keyFile is available here:
http://docs.mongodb.org/manual/core/replication/#replica-set-security
and
http://docs.mongodb.org/manual/tutorial/generate-key-file/
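
For example (a rough sketch only; the path is a placeholder, and in this case the file contents would simply be the existing "repl" password rather than a generated secret):

# create the shared key file and restrict its permissions
openssl rand -base64 741 > /data/keyfile.txt
chmod 600 /data/keyfile.txt

# then restart every node with --keyFile /data/keyfile.txt added to its existing options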

Comment by Peter Chan [ 03/Jun/13 ]

Spencer, unfortunately, I am not able to upgrade to ReplicaSets at this point, due to operational requirements.

Let me know if there is anything I can do to help with developing a fix / workaround.

Comment by Spencer Brody (Inactive) [ 03/Jun/13 ]

Hi Peter,
This seems to be a bug with auth in master/slave. Would it be possible for you to upgrade to using ReplicaSets? This blog post has some discussion about how to do that upgrade: https://wiki.10gen.com/pages/viewpage.action?pageId=29819353#UpgradingtoReplicaSets-UpgradingFromReplicaPairsorMaster%2FSlave.

If using ReplicaSets isn't an option for you let me know - there may be a workaround to make auth work with master/slave, but it will require some work to confirm and test it.

Comment by Peter Chan [ 03/Jun/13 ]

I managed to mostly resolve the operational impact by downgrading only the slaves to Mongo 2.2.4 while the master remains at 2.4.3.

All three slave instances (running 2.2.4) are now replicating fine, but I still see sporadic "unauthorized" messages. Here is a sample:

Sun Jun 2 03:14:05 [replslave] repl: syncedTo: Jun 2 03:14:11 51aab883:e
Sun Jun 2 03:15:06 [replslave] repl: checkpoint applied 1155 operations
Sun Jun 2 03:15:06 [replslave] repl: syncedTo: Jun 2 03:15:05 51aab8b9:5
Sun Jun 2 03:16:14 [replslave] repl: AssertionException dbclient error communicating with server: localhost:37017
repl: sleep 2 sec before next pass
Sun Jun 2 03:16:16 [replslave] repl: syncing from host:localhost:37017
Sun Jun 2 03:16:16 [replslave] repl: couldn't connect to server localhost:37017
Sun Jun 2 03:16:16 [replslave] repl: sleep 3 sec before next pass
Sun Jun 2 03:16:19 [replslave] repl: syncing from host:localhost:37017
Sun Jun 2 03:16:19 [replslave] repl: couldn't connect to server localhost:37017
Sun Jun 2 03:16:19 [replslave] repl: sleep 3 sec before next pass
Sun Jun 2 03:16:22 [replslave] repl: syncing from host:localhost:37017
Sun Jun 2 03:16:26 [replslave] replHandshake res not: 0 res:

{ ok: 0.0, errmsg: "unauthorized" }

Sun Jun 2 03:17:27 [replslave] repl: checkpoint applied 3031 operations
Sun Jun 2 03:17:27 [replslave] repl: syncedTo: Jun 2 03:17:25 51aab945:a
Sun Jun 2 03:18:32 [replslave] repl: AssertionException dbclient error communicating with server: localhost:37017
repl: sleep 2 sec before next pass
Sun Jun 2 03:18:34 [replslave] repl: syncing from host:localhost:37017
Sun Jun 2 03:18:34 [replslave] repl: couldn't connect to server localhost:37017
Sun Jun 2 03:18:34 [replslave] repl: sleep 3 sec before next pass
Sun Jun 2 03:18:37 [replslave] repl: syncing from host:localhost:37017
Sun Jun 2 03:18:37 [replslave] repl: couldn't connect to server localhost:37017
Sun Jun 2 03:18:37 [replslave] repl: sleep 3 sec before next pass
Sun Jun 2 03:18:40 [replslave] repl: syncing from host:localhost:37017
Sun Jun 2 03:18:43 [replslave] replHandshake res not: 0 res:

{ ok: 0.0, errmsg: "unauthorized" }

Sun Jun 2 03:19:44 [replslave] repl: checkpoint applied 2885 operations
Sun Jun 2 03:19:44 [replslave] repl: syncedTo: Jun 2 03:19:42 51aab9ce:5
Sun Jun 2 03:20:45 [replslave] repl: checkpoint applied 1320 operations
Sun Jun 2 03:20:45 [replslave] repl: syncedTo: Jun 2 03:20:44 51aaba0c:9
Sun Jun 2 03:21:46 [replslave] repl: checkpoint applied 1305 operations
Sun Jun 2 03:21:46 [replslave] repl: syncedTo: Jun 2 03:21:44 51aaba48:f
Sun Jun 2 03:22:55 [replslave] repl: AssertionException dbclient error communicating with server: localhost:37017
repl: sleep 2 sec before next pass
Sun Jun 2 03:22:57 [replslave] repl: syncing from host:localhost:37017
Sun Jun 2 03:22:57 [replslave] repl: couldn't connect to server localhost:37017
Sun Jun 2 03:22:57 [replslave] repl: sleep 3 sec before next pass
Sun Jun 2 03:23:00 [replslave] repl: syncing from host:localhost:37017
Sun Jun 2 03:23:00 [replslave] repl: couldn't connect to server localhost:37017
Sun Jun 2 03:23:00 [replslave] repl: sleep 3 sec before next pass
Sun Jun 2 03:23:03 [replslave] repl: syncing from host:localhost:37017
Sun Jun 2 03:23:06 [replslave] replHandshake res not: 0 res:

{ ok: 0.0, errmsg: "unauthorized" }

It appears that the unauthorized message comes up at the beginning of connection establishment, but the slave can continue after the message. No idea why the 2.2.4 slaves can continue, while the same slaves running 2.4.3 have only partial success.

Obviously this is not a long-term workaround, as eventually I want all mongod instances to be running 2.4+, not 2.2.

Comment by Peter Chan [ 01/Jun/13 ]

Just to add some information:

The error seems to come up sporadically. On the one good instance, it comes up once upon startup, then replication continues successfully:

Sat Jun 01 10:08:54.310 [replslave] repl: syncing from host:localhost:37017
Sat Jun 01 10:08:54.529 [replslave] replHandshake res not: 0 res:

{ ok: 0.0, errmsg: "unauthorized" }

Sat Jun 01 10:09:55.951 [replslave] repl: checkpoint applied 2880 operations
Sat Jun 01 10:09:55.951 [replslave] repl: syncedTo: Jun 01 10:09:55 51aa0ec3:a
Sat Jun 01 10:10:56.920 [replslave] repl: checkpoint applied 1320 operations
Sat Jun 01 10:10:56.920 [replslave] repl: syncedTo: Jun 01 10:10:56 51aa0f00:d
Sat Jun 01 10:11:57.013 [replslave] repl: checkpoint applied 1305 operations
...
