[SERVER-3296] mongos still attempting to setShardVersion on slave MongoDB Created: 20/Jun/11  Updated: 12/Jul/16  Resolved: 06/Jul/11

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 1.8.2
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Joachim
Assignee: Greg Studer
Resolution: Done
Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu
Attachments: File mongos.log.gz     File mongos.log.gz    
Operating System: ALL
Participants:

 Description   

Despite upgrading to MongoDB 1.8.2, we're still seeing mongos attempt setShardVersion on slave MongoDB instances (like those described in SERVER-2961).

Here's what we see in the mongos logs:
Sun Jun 19 17:38:33 [conn627] ns: users.user_badges ClusteredCursor::query ShardConnection had to change attempt: 0
Sun Jun 19 17:38:33 [conn627] setShardVersion failed host[mongo-c01r03s03:27018]

{ errmsg: "not master", ok: 0.0 }

Sun Jun 19 17:38:33 [conn627] Assertion: 10429:setShardVersion failed host[mongo-c01r03s03:27018]

{ errmsg: "not master", ok: 0.0 }

0x5204fa 0x6a15ed 0x6a1152
/usr/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x12a) [0x5204fa]
/usr/bin/mongos() [0x6a15ed]
/usr/bin/mongos() [0x6a1152]
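
For anyone triaging a similar log, a minimal sketch of how the failing hosts could be tallied is shown here. It assumes the line format in the excerpt above; the function name `failures_by_host` is hypothetical, and the "Assertion:" lines are skipped because they repeat the same failure.

```python
import re
from collections import Counter

# Pattern taken from the excerpt above: "setShardVersion failed host[<host:port>]".
FAILED = re.compile(r"setShardVersion failed host\[([^\]]+)\]")


def failures_by_host(lines):
    """Count setShardVersion failures per host, one count per failure."""
    counts = Counter()
    for line in lines:
        # The follow-up "Assertion: 10429:..." line reports the same failure again.
        if "Assertion:" in line:
            continue
        match = FAILED.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "Sun Jun 19 17:38:33 [conn627] ns: users.user_badges ClusteredCursor::query ShardConnection had to change attempt: 0",
    "Sun Jun 19 17:38:33 [conn627] setShardVersion failed host[mongo-c01r03s03:27018]",
    "Sun Jun 19 17:38:33 [conn627] Assertion: 10429:setShardVersion failed host[mongo-c01r03s03:27018]",
]
print(failures_by_host(sample))
```

To run it against a real log, pass an open file object (for example `failures_by_host(open("mongos.log"))`); if every failure names a host that is currently a secondary, that matches the behaviour reported here.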

Here's the output of --version for that mongos instance:
Sun Jun 19 17:45:09 mongos db version v1.8.2, pdfile version 4.5 starting (--help for usage)
Sun Jun 19 17:45:09 git version: 433bbaa14aaba6860da15bd4de8edf600f56501b
Sun Jun 19 17:45:09 build sys info: Linux bs-linux64.10gen.cc 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_41

All mongod instances in the cluster (including config server instances) are running 1.8.2.



 Comments   
Comment by Greg Studer [ 06/Jul/11 ]

Thanks for the update. The warning is a double-check added to the newer version because this was a backport; it will only trigger once in non-verbose mode. As it says, it's safe, but we want to know about it. It basically means that you're performing a sharded operation on a non-sharded connection, which is done for getLastError(). In newer versions we'll want to migrate the operations we can away from this, but the underlying issue you were having with checking the version of these non-shard connections should now be fixed.

Comment by Eliot Horowitz (Inactive) [ 04/Jul/11 ]

You only need to do mongos.

Comment by Eliot Horowitz (Inactive) [ 04/Jul/11 ]

The patch is now in the 1.8 nightly.
Can you try that?

Comment by Luc Suryo [ 04/Jul/11 ]

Any update or any patch? The issue is affecting our site pretty badly...
We tried stepping down the slave and restarted mongos; still the same issue.
thanks

Comment by Greg Studer [ 23/Jun/11 ]

I don't think there's a manual workaround aside from stepping down again to the original host or bouncing mongos - each mongos has a collection of hosts which sticks around for the life of the instance. Reconfiguring your shard to remove the host from the RSet URL may work temporarily, but on a second failover the same could happen to the remaining hosts.

Comment by Joachim [ 23/Jun/11 ]

Are there any steps we can take to fix this manually?

Comment by Greg Studer [ 22/Jun/11 ]

patch is in 1.9 now, reviewing for potential backport

Comment by Greg Studer [ 21/Jun/11 ]

Thanks for the verbose logs, we see what we believe to be the problem, and are working on a patch.

Comment by Greg Studer [ 20/Jun/11 ]

*verbose = true

(not sure if verbose = yes works)

Comment by Greg Studer [ 20/Jun/11 ]

Yes, it will be fine to make the ticket private. That configuration is good, it will show any ReplicaSetMonitor messages.

Comment by Greg Studer [ 20/Jun/11 ]

Something strange seems to occur with replica set monitoring... nothing ever gets updated. Do you have gdb installed on any of these machines? If so, is it possible to get a stack trace of the running threads a few minutes after failure starts happening? If not, can you up the log verbosity for a mongos run, and wait again for the errors?
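
One possible way to capture the requested stack trace of all running threads is sketched below. It assumes gdb is installed and that the caller has permission to attach to the mongos process; the helper names and the output path are hypothetical.

```python
import subprocess


def gdb_backtrace_command(pid):
    """Build a gdb invocation that prints a backtrace for every thread."""
    # -p attaches to the running process, -batch exits once the commands finish,
    # and -ex runs the given gdb command ("thread apply all bt").
    return ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"]


def dump_threads(pid, out_path):
    """Run gdb against `pid` and write the combined output to `out_path`."""
    with open(out_path, "w") as out:
        subprocess.run(
            gdb_backtrace_command(pid),
            stdout=out,
            stderr=subprocess.STDOUT,
            check=False,  # gdb may exit non-zero; keep whatever output it produced
        )
```

Calling `dump_threads(<mongos pid>, "mongos-threads.txt")` a few minutes after the failures start would produce a file that can be attached to the ticket. Attaching pauses the process briefly while the backtraces are collected.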

Comment by Luc Suryo [ 20/Jun/11 ]

10gen team

I will take over from Joachim, so please ask me anything you need
thanks

Comment by Joachim [ 20/Jun/11 ]

I've attached a gzipped copy of the mongos log.

Comment by Joachim [ 20/Jun/11 ]

No, I don't think this was right after a primary/secondary switch. This has been happening continuously, with errors every few seconds on each server running mongos, both previously with 1.8.1 and currently with 1.8.2.

Example of error count samples taken every 2 seconds:
$ while true;do grep -c "AssertionException in process: setShardVersion" mongos.log; sleep 2;done
64692
64692
64692
64692
64693
64693
64693
64693
64693
64693
64693
64694
64694
64694
64694
64695
64695

Output of connPoolStats: http://pastebin.com/Q0mGFgRw
Will include mongos log later.
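
As a rough illustration, the samples above can be converted into an error rate. This is only a sketch: `error_rate` is a hypothetical helper, and the 2-second spacing is approximate because each grep run adds its own time to the loop.

```python
def error_rate(samples, interval_s):
    """Average increase per second across evenly spaced counter samples."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    elapsed = (len(samples) - 1) * interval_s
    return (samples[-1] - samples[0]) / elapsed


# The 17 counts listed above, taken roughly 2 seconds apart.
samples = [64692] * 4 + [64693] * 7 + [64694] * 4 + [64695] * 2
rate = error_rate(samples, 2.0)
print(f"{rate:.3f} errors/s, one every {1 / rate:.1f} s")
```

For this short window the counter rises by 3 over about 32 seconds, so this one log file gains roughly one matching line every 11 seconds.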

Comment by Eliot Horowitz (Inactive) [ 20/Jun/11 ]

Also, was this right after a primary/secondary switch?

Comment by Eliot Horowitz (Inactive) [ 20/Jun/11 ]

can you:

  • upload full mongos log
  • send output of db.adminCommand( "connPoolStats" ) while connected to mongos