|
Great!
|
|
Checked this today, works as you described.
|
|
Will check this on Monday, thanks
|
|
Can you give it a try? It'll take a few seconds but will fail over:
> db.items.find()
{ "_id" : ObjectId("4d35bea8ba15dc15b0d3878e"), "value" : 123456789012345 }
{ "_id" : ObjectId("4d35c228abfec6a6bac2d04b"), "value" : 98 }
> // killed servers
> db.items.find()
error:
{ "$err" : "ReplicaSetMonitor no master found for set: foo", "code" : 10009 }
> db.items.find()
error: {
"$err" : "mongos connectionpool: connect failed foo/ubuntu:27017,ubuntu:27019,ubuntu:27018 : connect failed to set foo/ubuntu:27017,ubuntu:27019,ubuntu:27018",
"code" : 11002
}
> db.items.find()
error:
{
"$err" : "mongos connectionpool: connect failed foo/ubuntu:27017,ubuntu:27019,ubuntu:27018 : connect failed to set foo/ubuntu:27017,ubuntu:27019,ubuntu:27018",
"code" : 11002
}
> db.items.find()
{ "_id" : ObjectId("4d35bea8ba15dc15b0d3878e"), "value" : 123456789012345 }
{ "_id" : ObjectId("4d35c228abfec6a6bac2d04b"), "value" : 98 }
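The session above shows the expected shape of a failover: a few transient errors while the set elects a new primary, then normal results again. Application code can mask that window with a bounded retry loop. A minimal sketch, with a hypothetical helper name (`retryDuringFailover` is not part of the driver), assuming the driver surfaces the failure as an exception:

```cpp
#include <chrono>
#include <stdexcept>
#include <thread>

// Hypothetical helper (not part of the driver): retry an operation a bounded
// number of times, sleeping between attempts, to ride out a replica-set
// election like the one in the session above.
template <typename Op>
auto retryDuringFailover(Op op, int maxAttempts,
                         std::chrono::milliseconds delay) -> decltype(op()) {
    for (int attempt = 1;; ++attempt) {
        try {
            return op();  // e.g. a find() issued through mongos
        }
        catch (const std::runtime_error&) {
            if (attempt >= maxAttempts)
                throw;  // election took too long; surface the error
            std::this_thread::sleep_for(delay);
        }
    }
}
```

The cap matters: if no node ever becomes primary, the loop rethrows instead of spinning forever.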
|
|
Author:
{u'login': u'kchodorow', u'name': u'Kristina', u'email': u'kristina@10gen.com'}
Message: rs throws out master on connection exception SERVER-2377
https://github.com/mongodb/mongo/commit/f0eb24f2245d438a47b677a3c33d3c02c73f7097
|
|
You are correct. Will fix in a sec... (there's a simpler fix)
|
|
Below is the version of the patch with the proposed loop (note that the slave case is completely untested, and I removed the "if ( h != _slaveHost )" check in the slave version).
With this patch, after killing the primary I get the following on the mongos client:
> db.items.find()
error:
{
"$err" : "dbclient error communicating with server: celestine-3:27100",
"code" : 10278
}
> db.items.find()
{ "_id" : ObjectId("4d384a2061befa37946181af"), "v" : 19959 }
...
has more
--- client/dbclient_rs.cpp.orig	2011-01-21 18:06:33.478345002 +0300
+++ client/dbclient_rs.cpp	2011-01-21 18:40:54.338344995 +0300
@@ -365,9 +365,26 @@
             _monitor->notifyFailure( _masterHost );
         }
-        _masterHost = _monitor->getMaster();
-        _master.reset( new DBClientConnection( true ) );
-        _master->connect( _masterHost );
+        size_t attempts = _monitor->getNodesCount();
+        while (true) {
+            _masterHost = _monitor->getMaster();
+            _master.reset( new DBClientConnection( true ) );
+            try {
+                _master->connect( _masterHost );
+            }
+            catch (ConnectException&) {
+                _monitor->notifyFailure( _masterHost );
+                if (attempts > 0) {
+                    --attempts;
+                    continue;
+                }
+                throw;
+            }
+            break;
+        }
         _auth( _master.get() );
         return _master.get();
     }
@@ -379,13 +396,27 @@
             _monitor->notifySlaveFailure( _slaveHost );
         }
-        HostAndPort h = _monitor->getSlave();
-        if ( h != _slaveHost ) {
-            _slaveHost = h;
+        size_t attempts = _monitor->getNodesCount();
+        while (true) {
+            _slaveHost = _monitor->getSlave();
             _slave.reset( new DBClientConnection( true ) );
-            _slave->connect( _slaveHost );
-            _auth( _slave.get() );
-        }
+            try {
+                _slave->connect( _slaveHost );
+            }
+            catch (ConnectException&) {
+                _monitor->notifySlaveFailure( _slaveHost );
+                if (attempts > 0) {
+                    --attempts;
+                    continue;
+                }
+                throw;
+            }
+            break;
+        }
+        _auth( _slave.get() );
         return _slave.get();
     }
--- client/dbclient_rs.h.orig	2011-01-21 18:27:11.490345008 +0300
+++ client/dbclient_rs.h	2011-01-21 18:27:46.702345006 +0300
@@ -83,6 +83,8 @@
         string getServerAddress() const;
+        size_t getNodesCount() const { return _nodes.size(); }
+
     private:
         /**
            This populates a list of hosts from the list of seeds (discarding the
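The retry loop in the patch above can be exercised in isolation. A sketch with stand-in types (`connectWithRetry` and the callbacks are illustrative, not driver API), following the same give-up-after-N shape: `attempts` starts at the monitor's node count, each connect failure notifies the monitor and decrements it, and the exception is rethrown once it is exhausted:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Stand-in for the driver's ConnectException; names here are illustrative.
struct ConnectException : std::runtime_error {
    ConnectException() : std::runtime_error("connect failed") {}
};

// Same shape as the loop in the patch: try candidates supplied by the
// monitor, notify on each failure, and rethrow once `attempts` (initialised
// to the node count) is exhausted.
template <typename GetHost, typename Connect, typename Notify>
std::string connectWithRetry(std::size_t nodeCount, GetHost getHost,
                             Connect connect, Notify notifyFailure) {
    std::size_t attempts = nodeCount;
    while (true) {
        std::string host = getHost();
        try {
            connect(host);
        }
        catch (ConnectException&) {
            notifyFailure(host);
            if (attempts > 0) {
                --attempts;
                continue;
            }
            throw;  // all candidates failed; give up
        }
        return host;  // connected
    }
}
```

Note this makes up to nodeCount + 1 connect attempts before rethrowing, matching the patch's decrement-then-continue ordering.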
|
|
I explored client/dbclient_rs.cpp a bit and found that _nodes[..].ok won't get set to false in my scenario (connection errors aren't handled at all, and the ReplicaSetMonitorWatcher job executes only once for each replica set).
I put together a small patch (I'm fairly sure it's the wrong way to solve this problem, but it at least appears to work; I also don't know whether slaves should be handled this way).
With this patch, after killing primary I get the following on the mongos client:
> db.items.find()
error: {
"$err" : "dbclient error communicating with server: celestine-2:27100",
"code" : 10278
}
> db.items.find()
error:
{
"$err" : "mongos connectionpool: connect failed testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100 : connect failed to set testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100",
"code" : 11002
}
> db.items.find()
{ "_id" : ObjectId("4d384a2061befa37946181af"), "v" : 19959 }
...
has more
So it looks like the error does not become permanent after a failure. I'm sure this can be improved so that there is at most one error per server failure (adding a loop in DBClientReplicaSet::checkMaster() that gives up after N attempts, where N is ReplicaSetMonitor._nodes.size(), might be feasible).
--- client/dbclient_rs.cpp.orig	2011-01-21 18:06:33.478345002 +0300
+++ client/dbclient_rs.cpp	2011-01-21 17:58:03.874345000 +0300
@@ -367,7 +367,15 @@
         _masterHost = _monitor->getMaster();
         _master.reset( new DBClientConnection( true ) );
-        _master->connect( _masterHost );
+        try {
+            _master->connect( _masterHost );
+        }
+        catch (ConnectException&) {
+            _monitor->notifyFailure( _masterHost );
+            throw;
+        }
         _auth( _master.get() );
         return _master.get();
     }
@@ -383,7 +391,15 @@
         if ( h != _slaveHost ) {
             _slaveHost = h;
             _slave.reset( new DBClientConnection( true ) );
-            _slave->connect( _slaveHost );
+            try {
+                _slave->connect( _slaveHost );
+            }
+            catch (ConnectException&) {
+                _monitor->notifySlaveFailure( _slaveHost );
+                throw;
+            }
             _auth( _slave.get() );
         }
         return _slave.get();
|
|
Looks like the original problem isn't solved.
Built from git master (c31c7f322943034bd294993e832f51383799ee5f), mongos still behaves in the same way:
xm@celestine-1:~$ ./db/bin/mongo celestine-1/test1
MongoDB shell version: 1.7.5-pre-
connecting to: celestine-1/test1
> db.items.find()
{ "_id" : ObjectId("4d384a2061befa37946181af"), "v" : 19959 }
...
has more
-- kill primary here --
> db.items.find()
error: {
"$err" : "dbclient error communicating with server: celestine-2:27100",
"code" : 10278
}
> db.items.find()
error: {
"$err" : "mongos connectionpool: connect failed testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100 : connect failed to set testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100",
"code" : 11002
}
> db.items.find()
error: {
"$err" : "mongos connectionpool: connect failed testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100 : connect failed to set testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100",
"code" : 11002
}
Mongos log:
Fri Jan 21 14:21:20 /home/xm/db/bin/mongos db version v1.7.5-pre-, pdfile version 4.5 starting (--help for usage)
Fri Jan 21 14:21:20 git version: nogitversion
Fri Jan 21 14:21:20 sys info: Linux celestine-1 2.6.35-22-generic-pae #34~lucid1-Ubuntu SMP Mon Oct 11 14:53:39 UTC 2010 i686 BOOST_LIB_VERSION=1_40
Fri Jan 21 14:21:20 SyncClusterConnection connecting to [celestine-1:27200]
Fri Jan 21 14:21:20 SyncClusterConnection connecting to [celestine-2:27200]
Fri Jan 21 14:21:20 SyncClusterConnection connecting to [celestine-3:27200]
Fri Jan 21 14:21:20 [mongosMain] waiting for connections on port 27017
Fri Jan 21 14:21:20 [Balancer] about to contact config servers and shards
Fri Jan 21 14:21:20 [websvr] web admin interface listening on port 28017
Fri Jan 21 14:21:20 [Balancer] updated set (testRS) to: testRS/celestine-1:27100,celestine-3:27100
Fri Jan 21 14:21:20 [Balancer] updated set (testRS) to: testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100
Fri Jan 21 14:21:20 [Balancer] config servers and shards contacted successfully
Fri Jan 21 14:21:20 [Balancer] balancer id: celestine-1:27017 started at Jan 21 14:21:20
Fri Jan 21 14:21:20 [LockPinger] SyncClusterConnection connecting to [celestine-1:27200]
Fri Jan 21 14:21:20 [LockPinger] SyncClusterConnection connecting to [celestine-2:27200]
Fri Jan 21 14:21:20 [Balancer] SyncClusterConnection connecting to [celestine-1:27200]
Fri Jan 21 14:21:20 [LockPinger] SyncClusterConnection connecting to [celestine-3:27200]
Fri Jan 21 14:21:20 [Balancer] SyncClusterConnection connecting to [celestine-2:27200]
Fri Jan 21 14:21:20 [Balancer] SyncClusterConnection connecting to [celestine-3:27200]
Fri Jan 21 14:21:28 [mongosMain] connection accepted from 192.168.7.241:35334 #1
Fri Jan 21 14:21:32 [conn1] creating WriteBackListener for: testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100,celestine-3:27100,celestine-2:27100
Fri Jan 21 14:21:40 checking replica set: testRS
Fri Jan 21 14:21:50 [Balancer] ~ScopedDbConnection: _conn != null
Fri Jan 21 14:21:50 [Balancer] caught exception while doing balance: mongos connectionpool: connect failed testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100 : connect failed to set testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100
Fri Jan 21 14:22:20 [Balancer] ~ScopedDbConnection: _conn != null
Fri Jan 21 14:22:20 [Balancer] caught exception while doing balance: mongos connectionpool: connect failed testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100 : connect failed to set testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100
Fri Jan 21 14:22:50 [Balancer] ~ScopedDbConnection: _conn != null
Fri Jan 21 14:22:50 [Balancer] caught exception while doing balance: mongos connectionpool: connect failed testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100 : connect failed to set testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100
Fri Jan 21 14:22:51 [LockPinger] SyncClusterConnection connecting to [celestine-1:27200]
Fri Jan 21 14:22:51 [LockPinger] SyncClusterConnection connecting to [celestine-2:27200]
Fri Jan 21 14:22:51 [LockPinger] SyncClusterConnection connecting to [celestine-3:27200]
Fri Jan 21 14:23:00 [conn1] end connection 192.168.7.241:35334
|
|
It should now find the new master.
|
|
Author:
{u'login': u'kchodorow', u'name': u'Kristina', u'email': u'kristina@10gen.com'}
Message: make c++ client check that rs master is ok before returning it SERVER-2377
https://github.com/mongodb/mongo/commit/a756cf2375574732d77479fe3a4ed1c4138feb4e
|
|
My fault on the setSlaveOk() part, yes. I thought (I don't know why; the docs don't say that) that mongos defaults to reading from slaves.
|
|
This is only part of the problem, but
{ "$err" : "not master and slaveok=false", "code" : 13435 }
just means you're trying to query a slave without saying db.getMongo().setSlaveOk() first. Try:
> db.getMongo().setSlaveOk()
> db.items.find()
once all of the servers are up again.
|
|
~$ ./db/bin/mongo celestine-1/test1
MongoDB shell version: 1.7.5-pre-
connecting to: localhost/test1
> db.items.find()
error:
{ "$err" : "not master and slaveok=false", "code" : 13435 }
~$ ./db/bin/mongo celestine-1:27100
MongoDB shell version: 1.7.5-pre-
connecting to: celestine-1:27100/test
testRS:PRIMARY> rs.status()
{
"set" : "testRS",
"date" : ISODate("2011-01-20T08:27:40Z"),
"myState" : 1,
"members" : [
{
"_id" : 0,
"name" : "celestine-1:27100",
"health" : 1,
"state" : 1,
"stateStr" : "PRIMARY",
"optime" : {
"t" : 1295443467000,
"i" : 1
},
"optimeDate" : ISODate("2011-01-19T13:24:27Z"),
"self" : true
},
{
"_id" : 1,
"name" : "celestine-2:27100",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 58665,
"optime" : {
"t" : 1295443467000,
"i" : 1
},
"optimeDate" : ISODate("2011-01-19T13:24:27Z"),
"lastHeartbeat" : ISODate("2011-01-20T08:27:39Z")
},
{
"_id" : 2,
"name" : "celestine-3:27100",
"health" : 1,
"state" : 2,
"stateStr" : "SECONDARY",
"uptime" : 58668,
"optime" : {
"t" : 1295443467000,
"i" : 1
},
"optimeDate" : ISODate("2011-01-19T13:24:27Z"),
"lastHeartbeat" : ISODate("2011-01-20T08:27:39Z")
}
],
"ok" : 1
}
|
|
I'm a little confused about what state things were in when the various error messages appeared.
When you got:
error:
{ "$err" : "not master and slaveok=false", "code" : 13435 }
What state was the replica set in?
|