[SERVER-3049] com.mongodb.MongoException: setShardVersion failed host[mongodb04.example.com:27018] { errmsg: "not master", ok: 0.0 } Created: 05/May/11  Updated: 12/Jul/16  Resolved: 12/May/11

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 1.8.1
Fix Version/s: 1.8.3, 1.9.1

Type: Bug Priority: Major - P3
Reporter: Michael Conigliaro Assignee: Greg Studer
Resolution: Done Votes: 0
Labels: failed, master, not, setShardVersion
Remaining Estimate: 0 minutes
Time Spent: Not Specified
Original Estimate: 0 minutes
Environment:

Ubuntu Linux on EC2


Attachments: Text File appserver11.log     Text File appserver12.log     Text File failover_to_slave.log    
Issue Links:
Depends
Operating System: Linux
Participants:

Description

After replica set failover, I get the following exception in my apps:

com.mongodb.MongoException: setShardVersion failed host[mongodb04.example.com:27018]

{ errmsg: "not master", ok: 0.0 }

Strangely, I do not get this message for every request. It seems to happen about 50% of the time. The other 50% of the requests succeed without errors. I have seen this go on for several minutes (e.g. 15-20). The only way I can resolve it is by failing back to the original master, or by restarting the mongos on the appserver. Reproducing this problem is very simple:

1.) Run "watch -n 1 curl -v http://example.com/something/that/queries/mongodb" on the appserver
2.) Run "rs.stepDown()" on a MongoDB master
3.) Watch your curl command intermittently fail (seemingly) forever
4.) Run "rs.stepDown()" on the new MongoDB master (fail back to original master)
5.) Watch your curl command succeed
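For illustration only (not the reporter's app code): a minimal, hypothetical Java-driver probe that stands in for the watch/curl loop in step 1, querying through the local mongos once per second and logging failures. It assumes the 2.x-era com.mongodb driver API; the database, collection, and query values are placeholders borrowed from elsewhere in this ticket.

// Hypothetical probe loop, analogous to step 1 above: issue a read through the
// local mongos once per second and log any failure. Uses the 2.x-era Java driver;
// database/collection names are placeholders, not the reporter's application code.
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;
import com.mongodb.MongoException;

public class FailoverProbe {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017); // local mongos
        DBCollection coll = mongo.getDB("example").getCollection("userfile");

        while (true) {
            try {
                coll.findOne(new BasicDBObject("userId", "testuser2"));
                System.out.println("ok");
            } catch (MongoException e) {
                // During the problem window, roughly half of these fail with
                // "setShardVersion failed ... not master".
                System.out.println("failed: " + e.getMessage());
            }
            Thread.sleep(1000);
        }
    }
}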

Additionally, after failover, I see several messages like this in my mongos.log (on the order of 30-40 per second):

Thu May 5 16:34:43 [conn78] ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018

{ setName: "2", ismaster: true, secondary: false, hosts: [ "mongodb05.example.com:27018", "mongodb04.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

These go away as soon as I fail back to the original master. I don't know if this is related to the same issue, so I created another ticket for this: https://jira.mongodb.org/browse/SERVER-3040



Comments
Comment by Greg Studer [ 08/Jul/11 ]

The fix for this will be in 1.8.3, and should be in the 1.8 nightlies now.

Comment by Raymond Lu [ 08/Jul/11 ]

So has this been backported? If I download MongoDB from the downloads page, does it include the fix?

Comment by Greg Studer [ 12/May/11 ]

A fix which we're backporting to 1.8.2 should solve this problem - pretty sure it's a bad combination of the old random slave choice behavior with an edge case that allows duplicate replica set nodes...
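Illustrative sketch only (the real logic lives in mongos, which is C++): it shows why a duplicated host entry, as in the "_check : 2/mongodb05...,mongodb04...,mongodb04..." lines quoted elsewhere in this ticket, biases a uniform random member choice toward the stale node, and how de-duplicating by host:port removes that bias.

// Toy illustration only, not mongos source code: a duplicated host entry
// biases uniform random member selection toward the stale (demoted) node.
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

public class DuplicateMemberSketch {
    public static void main(String[] args) {
        // Host list as it appears in the "_check : 2/..." log lines in this ticket.
        List<String> tracked = Arrays.asList(
                "mongodb05.example.com:27018",
                "mongodb04.example.com:27018",
                "mongodb04.example.com:27018"); // duplicate of the demoted member

        // Uniform random choice over the raw list hits the stale mongodb04
        // about 2 times out of 3, so routed requests fail intermittently.
        Random rnd = new Random();
        int stale = 0, trials = 10000;
        for (int i = 0; i < trials; i++) {
            if (tracked.get(rnd.nextInt(tracked.size())).startsWith("mongodb04")) {
                stale++;
            }
        }
        System.out.printf("stale picks: %.1f%%%n", 100.0 * stale / trials);

        // De-duplicating by host:port before selecting removes the bias.
        Set<String> unique = new LinkedHashSet<String>(tracked);
        System.out.println("unique members: " + unique);
    }
}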

Comment by Michael Conigliaro [ 09/May/11 ]

I was finally able to fail back to the original master. Here's what that looked like. That insane amount of logging also stopped:

Mon May 9 21:02:55 [conn18] ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018

{ setName: "2", ismaster: true, secondary: false, hosts: [ "mongodb05.example.com:27018", "mongodb04.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

_check : 2/mongodb05.example.com:27018,mongodb04.example.com:27018,mongodb04.example.com:27018
Mon May 9 21:02:55 [conn23] ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018

{ setName: "2", ismaster: true, secondary: false, hosts: [ "mongodb05.example.com:27018", "mongodb04.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

_check : 2/mongodb05.example.com:27018,mongodb04.example.com:27018,mongodb04.example.com:27018
Mon May 9 21:02:55 [conn18] ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018

{ setName: "2", ismaster: true, secondary: false, hosts: [ "mongodb05.example.com:27018", "mongodb04.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

_check : 2/mongodb05.example.com:27018,mongodb04.example.com:27018,mongodb04.example.com:27018
Mon May 9 21:02:55 [conn23] ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018

{ setName: "2", ismaster: true, secondary: false, hosts: [ "mongodb05.example.com:27018", "mongodb04.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

_check : 2/mongodb05.example.com:27018,mongodb04.example.com:27018,mongodb04.example.com:27018
Mon May 9 21:02:55 [conn18] ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018

{ setName: "2", ismaster: true, secondary: false, hosts: [ "mongodb05.example.com:27018", "mongodb04.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

_check : 2/mongodb05.example.com:27018,mongodb04.example.com:27018,mongodb04.example.com:27018
Mon May 9 21:02:55 [WriteBackListener] SocketException: remote: error: 9001 socket exception [0]
Mon May 9 21:02:55 [WriteBackListener] DBClientCursor::init call() failed
Mon May 9 21:02:55 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: mongodb05.example.com:27018 query:

{ writebacklisten: ObjectId('4dc46ef31dc02d61ad0bcce3') }

Mon May 9 21:02:55 [conn18] ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018

{ setName: "2", ismaster: true, secondary: false, hosts: [ "mongodb05.example.com:27018", "mongodb04.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

MessagingPort recv() errno:104 Connection reset by peer 10.120.14.79:27018
Mon May 9 21:02:55 [conn18] SocketException: remote: error: 9001 socket exception [1]
Mon May 9 21:02:55 [conn18] DBClientCursor::init call() failed
Mon May 9 21:02:55 [conn18] warning: Could not get last error.DBClientBase::findOne: transport error: mongodb05.example.com:27018 query:

{ getlasterror: 1 }

Mon May 9 21:02:55 [conn21] ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018

{ setName: "2", ismaster: true, secondary: false, hosts: [ "mongodb05.example.com:27018", "mongodb04.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

_check : 2/mongodb05.example.com:27018,mongodb04.example.com:27018,mongodb04.example.com:27018
Mon May 9 21:02:55 [conn21] MessagingPort recv() errno:104 Connection reset by peer 10.120.14.79:27018
Mon May 9 21:02:55 [conn21] SocketException: remote: error: 9001 socket exception [1]
Mon May 9 21:02:55 [conn21] DBClientCursor::init call() failed
Mon May 9 21:02:55 [conn21] ReplicaSetMonitor::_checkConnection: caught exception mongodb05.example.com:27018 DBClientBase::findOne: transport error: mongodb05.example.com:27018 query:

{ ismaster: 1 }

Mon May 9 21:02:55 [conn21] SocketException: remote: error: 9001 socket exception [0]
Mon May 9 21:02:55 [conn21] DBClientCursor::init call() failed
Mon May 9 21:02:55 [conn21] ReplicaSetMonitor::_checkConnection: caught exception mongodb04.example.com:27018 DBClientBase::findOne: transport error: mongodb04.example.com:27018 query:

{ ismaster: 1 }

Mon May 9 21:02:55 [conn21] SocketException: remote: error: 9001 socket exception [0]
Mon May 9 21:02:55 [conn21] DBClientCursor::init call() failed
Mon May 9 21:02:55 [conn21] ReplicaSetMonitor::_checkConnection: caught exception mongodb04.example.com:27018 DBClientBase::findOne: transport error: mongodb04.example.com:27018 query:

{ ismaster: 1 }

Mon May 9 21:02:55 [conn18] _check : 2/mongodb05.example.com:27018,mongodb04.example.com:27018,mongodb04.example.com:27018
Mon May 9 21:02:55 [conn18] trying reconnect to mongodb05.example.com:27018
Mon May 9 21:02:55 BackgroundJob starting: ConnectBG
Mon May 9 21:02:55 [conn18] reconnect mongodb05.example.com:27018 ok
Mon May 9 21:02:55 [conn18] ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018

{ setName: "2", ismaster: false, secondary: true, hosts: [ "mongodb05.example.com:27018", "mongodb04.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

trying reconnect to mongodb04.example.com:27018
Mon May 9 21:02:55 BackgroundJob starting: ConnectBG
Mon May 9 21:02:55 [conn18] reconnect mongodb04.example.com:27018 ok
Mon May 9 21:02:55 [conn18] ReplicaSetMonitor::_checkConnection: mongodb04.example.com:27018

{ setName: "2", ismaster: false, secondary: true, hosts: [ "mongodb04.example.com:27018", "mongodb05.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], primary: "mongodb05.example.com:27018", maxBsonObjectSize: 16777216, ok: 1.0 }

trying reconnect to mongodb04.example.com:27018
Mon May 9 21:02:55 BackgroundJob starting: ConnectBG
Mon May 9 21:02:55 [conn18] reconnect mongodb04.example.com:27018 ok
Mon May 9 21:02:56 [WriteBackListener] MessagingPort recv() errno:104 Connection reset by peer 10.120.14.79:27018
Mon May 9 21:02:56 [WriteBackListener] SocketException: remote: error: 9001 socket exception [1]
Mon May 9 21:02:56 [WriteBackListener] DBClientCursor::init call() failed
Mon May 9 21:02:56 [WriteBackListener] WriteBackListener exception : DBClientBase::findOne: transport error: mongodb05.example.com:27018 query:

{ writebacklisten: ObjectId('4dc46ef31dc02d61ad0bcce3') }

Mon May 9 21:02:56 [conn23] ReplicaSetMonitor::_checkConnection: mongodb05.example.com:27018

{ setName: "2", ismaster: true, secondary: false, hosts: [ "mongodb05.example.com:27018", "mongodb04.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

_check : 2/mongodb05.example.com:27018,mongodb04.example.com:27018,mongodb04.example.com:27018
Mon May 9 21:02:57 [LockPinger] dist_lock pinged successfully for: appserver11:1304719091:1804289383
Mon May 9 21:02:57 BackgroundJob starting: ConnectBG
Mon May 9 21:02:57 BackgroundJob starting: ConnectBG
Mon May 9 21:02:57 BackgroundJob starting: ConnectBG
Mon May 9 21:02:58 [WriteBackListener] creating new connection to:mongodb05.example.com:27018
Mon May 9 21:02:58 BackgroundJob starting: ConnectBG
Mon May 9 21:02:58 BackgroundJob starting: ConnectBG
Mon May 9 21:03:14 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb10.example.com:27018

{ setName: "4", ismaster: true, secondary: false, hosts: [ "mongodb10.example.com:27018", "mongodb11.example.com:27018" ], arbiters: [ "mongodb12.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 1
Mon May 9 21:03:14 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb01.example.com:27018

{ setName: "1", ismaster: true, secondary: false, hosts: [ "mongodb01.example.com:27018", "mongodb02.example.com:27018" ], arbiters: [ "mongodb03.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 2
Mon May 9 21:03:14 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb04.example.com:27018

{ setName: "2", ismaster: true, secondary: false, hosts: [ "mongodb04.example.com:27018", "mongodb05.example.com:27018" ], arbiters: [ "mongodb06.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 3
Mon May 9 21:03:14 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: mongodb07.example.com:27018

{ setName: "3", ismaster: true, secondary: false, hosts: [ "mongodb07.example.com:27018", "mongodb08.example.com:27018" ], arbiters: [ "mongodb09.example.com:27018" ], maxBsonObjectSize: 16777216, ok: 1.0 }

checking replica set: 4
Mon May 9 21:03:16 [WriteBackListener] writebacklisten result:

{ noop: true, ok: 1.0 }

Mon May 9 21:03:17 [WriteBackListener] writebacklisten result:

{ noop: true, ok: 1.0 }

Mon May 9 21:03:18 [WriteBackListener] writebacklisten result:

{ noop: true, ok: 1.0 }

Mon May 9 21:03:19 [WriteBackListener] writebacklisten result:

{ noop: true, ok: 1.0 }

Mon May 9 21:03:19 [WriteBackListener] writebacklisten result:

{ noop: true, ok: 1.0 }

Mon May 9 21:03:21 [WriteBackListener] writebacklisten result:

{ noop: true, ok: 1.0 }

Comment by Greg Studer [ 09/May/11 ]

I'm investigating that duplicate now - if you have a chance, can you send additional mongos logs for the period before and after that duplicate host is added?

Comment by Michael Conigliaro [ 09/May/11 ]

No chunks should have migrated. I actually have that disabled at the moment, since we experienced performance problems with it in the past.

I've actually never noticed that duplicate host in there. When I look back in the logs, the only time I see that message with the duplicate host is immediately after a failover. I just restarted the mongos on one machine, and the message looked like this instead:

Mon May 9 20:33:18 [conn1] _check : 2/mongodb05.example.com:27018,mongodb04.example.com:27018

Comment by Greg Studer [ 09/May/11 ]

Has a chunk recently migrated to/from that rs? There's an issue where processing a migration (all at once) can put a lot of load on the slave.

Also, regarding this line of your mongos logs:

Mon May 9 18:58:56 [conn960] _check : 2/mongodb05.example.com:27018,mongodb04.example.com:27018,mongodb04.example.com:27018

does the duplicate mongodb04 always appear, or does it only start appearing after the stepdown?

Comment by Michael Conigliaro [ 09/May/11 ]

Additionally, I see that the replica set slave delay is slowly climbing (1688 seconds and counting), so I can't fail back to the original master because it's too far behind. There are no errors in the mongodb.log on either the master or the slave. Here's an example of what I'm looking at. Everything seems normal to me.

Mon May 9 19:48:22 [dur] lsn set 213147014
Mon May 9 19:48:57 [conn43108] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4dc46ef36c891bb10035067e') }

reslen:60 300006ms
Mon May 9 19:48:57 [conn43107] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4dc46ef21cdc26a42769eebf') }

reslen:60 300008ms
Mon May 9 19:48:59 [conn43125] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4dc46ef3849f0a1c88287d78') }

reslen:60 300005ms
Mon May 9 19:48:59 [conn43122] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4dc46ef3b3948f87df7f48ac') }

reslen:60 300008ms
Mon May 9 19:48:59 [conn43124] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4dc46ef3eaeb10781fbe5b24') }

reslen:60 300006ms
Mon May 9 19:48:59 [conn43120] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4dc46ef35a579354135266ba') }

reslen:60 300003ms
Mon May 9 19:48:59 [conn43121] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4dc46ef3a8567aa60347bcd4') }

reslen:60 300006ms
Mon May 9 19:48:59 [conn43126] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4dc46ef3944c9431f24cc4d0') }

reslen:60 300005ms
Mon May 9 19:48:59 [conn43128] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4dc46ef32c217a86054ca7f2') }

reslen:60 300005ms
Mon May 9 19:48:59 [conn43123] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4dc46ef380f90945682bc23b') }

reslen:60 300004ms
Mon May 9 19:48:59 [conn43127] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4dc46ef31dc02d61ad0bcce3') }

reslen:60 300007ms
Mon May 9 19:48:59 [conn43129] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4dc46ef386adf40920898eb0') }

reslen:60 300005ms
Mon May 9 19:49:16 [dur] lsn set 213178119
Mon May 9 19:50:17 [dur] lsn set 213209004
Mon May 9 19:50:17 [LockPinger] dist_lock pinged successfully for: mongodb04:1304633912:16024625

EDIT: I take it back. There are some errors on the slave about a cursor now. I've seen these before, but I don't know if they mean anything:

Mon May 9 19:50:37 [replica set sync] example.userEventsJournal warning: cursor loc null does not match byLoc position a:711ba56c !
Mon May 9 19:51:19 [dur] lsn set 213240040
Mon May 9 19:51:27 [replica set sync] example.userEventsJournal warning: cursor loc null does not match byLoc position 7:6bc64614 !
Mon May 9 19:51:32 [conn43467] query admin.$cmd ntoreturn:1 command:

{ serverStatus: 1.0 }

reslen:1643 101ms
Mon May 9 19:51:40 [replica set sync] example.userEventsJournal warning: cursor loc null does not match byLoc position 7:2a4c2130 !
Mon May 9 19:51:48 [replica set sync] example.userEventsJournal warning: cursor loc null does not match byLoc position 9:b76d398 !
Mon May 9 19:52:28 [dur] lsn set 213271095
Mon May 9 19:52:57 [replica set sync] example.userEventsJournal warning: cursor loc null does not match byLoc position 8:27fe053c !

EDIT: Just for the record, here's the output of rs.status(). Nothing wrong there that I can see...

2:PRIMARY> rs.status()
{
    "set" : "2",
    "date" : ISODate("2011-05-09T19:56:29Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 0,
            "name" : "mongodb04.example.com:27018",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 873,
            "optime" : { "t" : 1304968941000, "i" : 520 },
            "optimeDate" : ISODate("2011-05-09T19:22:21Z"),
            "lastHeartbeat" : ISODate("2011-05-09T19:56:28Z")
        },
        {
            "_id" : 1,
            "name" : "mongodb05.example.com:27018",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "optime" : { "t" : 1304970988000, "i" : 2 },
            "optimeDate" : ISODate("2011-05-09T19:56:28Z"),
            "self" : true
        },
        {
            "_id" : 2,
            "name" : "mongodb06.example.com:27018",
            "health" : 1,
            "state" : 7,
            "stateStr" : "ARBITER",
            "uptime" : 873,
            "optime" : { "t" : 0, "i" : 0 },
            "optimeDate" : ISODate("1970-01-01T00:00:00Z"),
            "lastHeartbeat" : ISODate("2011-05-09T19:56:28Z")
        }
    ],
    "ok" : 1
}
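(One detail worth noting in that output: the SECONDARY's optimeDate of 19:22:21 versus the PRIMARY's of 19:56:28 is a gap of about 34 minutes, roughly 2,050 seconds, which is consistent with the climbing slave delay mentioned above.)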

EDIT: Shard status looks OK to me too...

> db.printShardingStatus()
--- Sharding Status ---
  sharding version: { "_id" : 1, "version" : 3 }
  shards:
      { "_id" : "2", "host" : "2/mongodb05.example.com:27018,mongodb04.example.com:27018" }
      { "_id" : "1", "host" : "1/mongodb01.example.com:27018,mongodb02.example.com:27018" }
      { "_id" : "3", "host" : "3/mongodb07.example.com:27018,mongodb08.example.com:27018" }
      { "_id" : "4", "host" : "4/mongodb10.example.com:27018,mongodb11.example.com:27018" }
  databases:
      { "_id" : "admin", "partitioned" : false, "primary" : "config" }
      { "_id" : "test", "partitioned" : false, "primary" : "2" }
      { "_id" : "example", "partitioned" : true, "primary" : "2" }
          example.stats chunks:
              4    22
              3    24
              1    22
              2    23
              too many chunks to print, use verbose if you want to force print
          example.userEventsJournal chunks:
              4    194
              1    202
              3    189
              2    172
              too many chunks to print, use verbose if you want to force print
          example.userfile chunks:
              4    2
              2    5
              { "userId" : { $minKey : 1 } } -->> { "userId" : "000009a6-02e5-4277-90a3-fb38264440f9" } on : 4 { "t" : 7000, "i" : 0 }
              { "userId" : "000009a6-02e5-4277-90a3-fb38264440f9" } -->> { "userId" : "3295759f-14d4-46b3-bcac-22e6b8c35400" } on : 2 { "t" : 8000, "i" : 2 }
              { "userId" : "3295759f-14d4-46b3-bcac-22e6b8c35400" } -->> { "userId" : "651a8cb3-7df8-43bf-bbee-6c58a1f88b28" } on : 2 { "t" : 8000, "i" : 4 }
              { "userId" : "651a8cb3-7df8-43bf-bbee-6c58a1f88b28" } -->> { "userId" : "97cf4d36-c30d-4888-bd26-6a3ef8dcb662" } on : 2 { "t" : 8000, "i" : 6 }
              { "userId" : "97cf4d36-c30d-4888-bd26-6a3ef8dcb662" } -->> { "userId" : "ca921ed8-67bb-44e7-a802-8c48a8ae8c8d" } on : 2 { "t" : 8000, "i" : 8 }
              { "userId" : "ca921ed8-67bb-44e7-a802-8c48a8ae8c8d" } -->> { "userId" : "testuser2" } on : 2 { "t" : 8000, "i" : 9 }
              { "userId" : "testuser2" } -->> { "userId" : { $maxKey : 1 } } on : 4 { "t" : 8000, "i" : 0 }
      { "_id" : "local", "partitioned" : false, "primary" : "2" }

EDIT: Looks like I'm IO-bound on the slave, so that would explain why it can't catch up. Though I have no idea what the slave is doing to cause all that IO...

Device:  rrqm/s  wrqm/s     r/s     w/s   rsec/s   wsec/s  avgrq-sz  avgqu-sz  await  svctm  %util
sda1       0.00    0.00    0.00    0.00     0.00     0.00      0.00      0.00   0.00   0.00   0.00
sdb        0.00    0.00    0.00    0.00     0.00     0.00      0.00      0.00   0.00   0.00   0.00
sdh        0.00  284.08  162.19  329.35  4187.06  7657.71     24.10     19.39  38.89   2.02  99.50

Comment by Michael Conigliaro [ 09/May/11 ]

This is appearing in some of my other logs. That little piece of a stack trace looks scary:

Mon May 9 19:24:21 [conn22] going to retry checkShardVersion
Mon May 9 19:24:21 [conn22] DBConfig unserialize: example

{ _id: "example", partitioned: true, primary: "2" }

Mon May 9 19:24:21 [conn22] setShardVersion 2 mongodb04.example.com:27018 example.userfile

{ setShardVersion: "example.userfile", configdb: "mongodb-config03.example.com:27019,mongodb-config02.example.co...", version: Timestamp 8000|9, serverID: ObjectId('4dc46ef31dc02d61ad0bcce3'), authoritative: true, shard: "2", shardHost: "2/mongodb05.example.com:27018,mongodb04.example.com:27018" }

0x7fbfa0002750
Mon May 9 19:24:21 [conn22] setShardVersion failed!

{ errmsg: "not master", ok: 0.0 }

Mon May 9 19:24:21 [conn22] setShardVersion failed host[mongodb04.example.com:27018]

{ errmsg: "not master", ok: 0.0 }

Mon May 9 19:24:21 [conn22] Assertion: 10429:setShardVersion failed host[mongodb04.example.com:27018]

{ errmsg: "not master", ok: 0.0 }

0x51fc59 0x69ca53 0x69c5e2
/usr/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x129) [0x51fc59]
/usr/bin/mongos() [0x69ca53]
/usr/bin/mongos() [0x69c5e2]
Mon May 9 19:24:21 [conn22] AssertionException in process: setShardVersion failed host[mongodb04.example.com:27018]

{ errmsg: "not master", ok: 0.0 }

Comment by Michael Conigliaro [ 09/May/11 ]

OK, I reproduced the problem. The log file is ridiculous due to this problem: https://jira.mongodb.org/browse/SERVER-3040

Comment by Greg Studer [ 06/May/11 ]

Been there, no worries - but yes, definitely send us the logs when/if you see the behavior again. One thing to look out for may be an unresponsive member of the set after stepdown, while your queries are failing intermittently.

Comment by Michael Conigliaro [ 06/May/11 ]

And of course, now I can't reproduce the problem, even though I could reproduce it easily yesterday afternoon. The only thing that's changed is that I truncated all the mongos logs before restarting all the mongos processes. I wonder if they need to be running for a long time before I can get them stuck in that state. Anyway, I'll keep trying...

Comment by Michael Conigliaro [ 06/May/11 ]

Yes, this is a web app with a RESTful interface that communicates with a local mongos instance pointing to 4 replica set shards. We're using a Sinatra-like Scala framework called BlueEyes, which was developed in-house, so there may be issues with the way we're using the MongoDB driver (I know I've found some issues with what they were doing in the past). Here's a link to the source for reference:

https://github.com/jdegoes/blueeyes/tree/master/src/main/scala/blueeyes/persistence/mongo

I'll try reproducing this again with more log output...

Comment by Greg Studer [ 06/May/11 ]

Can't reproduce here - it would be really helpful to see the full logs from the period slightly before you run stepDown() until you run the second stepDown() to restore the master.

Also, is your app doing writes or reads to the db, and is slaveOk set? I assume your setup is an app connected to a mongos instance with replica set shards...
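For reference, a minimal sketch of what a slaveOk-enabled client looks like with the 2.x-era com.mongodb Java driver (the reporter's app actually goes through the BlueEyes Scala layer; the host/port and names below are placeholders, not taken from his configuration).

// Hedged sketch: a client connected to a local mongos with slaveOk enabled,
// using the 2.x-era com.mongodb Java driver API (slaveOk() was later deprecated
// in favor of read preferences). Host/port and names are placeholders.
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.Mongo;

public class SlaveOkExample {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017); // local mongos
        mongo.slaveOk();                             // allow reads from secondaries

        DB db = mongo.getDB("example");
        DBCollection coll = db.getCollection("userfile");

        // A read routed through mongos; with slaveOk set, mongos may target a
        // secondary of the owning shard instead of the primary.
        System.out.println(coll.findOne(new BasicDBObject("userId", "testuser2")));

        mongo.close();
    }
}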
