[SERVER-6974] Get Error while reading "Database could not initialize cursor across all shards because : stale config detected for ns:" Created: 08/Sep/12  Updated: 15/Feb/13  Resolved: 01/Oct/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.0.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: jitendra Assignee: Greg Studer
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

linux OEL


Attachments: File mongod_10000.log_2     Text File mongos_35000.log
Issue Links:
Duplicate
duplicates SERVER-7135 reset per-connection shard info in mo... Closed
Operating System: ALL
Participants:

 Description   

Hi,
Setup: 1 mongos, 3 config servers, 10 mongods with journaling enabled.

While reading through mongos we got the error: "Fri Sep 7 16:50:08 [conn1121] ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 0"

Log :

Fri Sep 7 16:50:08 [conn1121] ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 0
Fri Sep 7 16:50:08 [conn1122] ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 0
Fri Sep 7 16:50:08 [conn1122] ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 1
Fri Sep 7 16:50:08 [conn1121] ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 1
Fri Sep 7 16:50:09 [conn1122] ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 2
Fri Sep 7 16:50:09 [conn1121] ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 2
Fri Sep 7 16:50:11 [conn1123] ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 0
Fri Sep 7 16:50:11 [conn1124] ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 0
Fri Sep 7 16:50:11 [conn1123] ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 1
Fri Sep 7 16:50:11 [conn1124] ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 1

I have the following questions:
1. Why does the above error occur?
2. Does it take more than 3 seconds?

How can we resolve this error?
Please help me.
Regards,
JRV



 Comments   
Comment by Greg Studer [ 01/Oct/12 ]

Opened a new ticket for the config refresh improvement, follow there for any new updates.

Comment by Greg Studer [ 21/Sep/12 ]

So, here's what happened (at least at :33) -

You had a pooled, incoming connection to mongos which validated a version to shard0000:

Fri Sep  7 11:41:35 [conn816]     setShardVersion  shard0000 192.168.11.103:10000  00070912.Database  { setShardVersion: "00070912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" } 0x1477ab0

Next, for some reason, shard0000 went down (also, you should really double-check your mongod NUMA settings, as that can wreak all kinds of havoc, and upgrade to a supported, non-rc release):

Fri Sep  7 11:53:40 dbexit: really exiting now
***** SERVER RESTARTED *****

Mongos cleanly detected this later and re-established a connection, but the version info for the connection on shard0000 got erased.

Fri Sep  7 11:45:56 [conn816] trying reconnect to 192.168.11.103:10000
Fri Sep  7 11:45:56 [conn816] reconnect 192.168.11.103:10000 ok

Mongos doesn't know the version info was erased, though, because it refreshes this info lazily. Therefore when you re-use the connection, mongos is confused and refreshes the version more frequently than it otherwise would:

Fri Sep  7 16:33:05 [conn816] ns: 00070912.Database could not initialize cursor across all shards because : stale config detected for ns: 00070912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 1
Fri Sep  7 16:33:06 [conn816] ns: 00070912.Database could not initialize cursor across all shards because : stale config detected for ns: 00070912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 2
Fri Sep  7 16:33:08 [conn816] warning: chunk manager reload forced for collection '00070912.Database', config version is 28|1

This all works as designed; the problem is that there is a backoff delay in mongos when refreshing a version multiple times (to avoid slamming the config servers and mongod). We're working to improve config refresh in general, but this type of staleness is inherent to the system.
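The retry-with-backoff behavior described above can be sketched in a few lines. This is a toy model: the delay values are illustrative, not mongos's actual hardcoded intervals, and `check_cursor_with_retry` / `fake_init` are hypothetical names.

```python
import time

# Illustrative per-attempt backoff delays (seconds); mongos's real values
# are hardcoded in its C++ source and are NOT these numbers.
BACKOFF = [0.0, 0.1, 0.2, 0.4]

def check_cursor_with_retry(init_cursor, max_attempts=4):
    """Retry cursor initialization when a stale shard version is detected,
    sleeping longer between attempts to avoid slamming the config servers."""
    for attempt in range(max_attempts):
        time.sleep(BACKOFF[attempt])          # back off before retrying
        ok, reason = init_cursor(attempt)
        if ok:
            return attempt
        print(f"stale config detected, attempt: {attempt} ({reason})")
    raise RuntimeError("could not initialize cursor across all shards")

# Simulate a shard whose version info was erased by a restart: the first
# three attempts see a stale version, the fourth succeeds after a refresh.
def fake_init(attempt):
    return (attempt >= 3, "stale shardVersion on shard0000")

succeeded_on = check_cursor_with_retry(fake_init)
print("succeeded on attempt", succeeded_on)
```

With delays on this order, the failed attempts alone add noticeable latency before the query ever runs, which is why a hard 3-second client read timeout can fire while mongos is still refreshing.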

Workarounds include the suggestions above (maxConns in particular will avoid the connection slamming if this does happen), running flushRouterConfig on mongoses after mongod failures (which will force version reloads), or if you're inclined to recompile your own mongo version you can change the hardcoded delays.

Comment by jitendra [ 21/Sep/12 ]

Hi, adding the mongos and mongod logs.

I had the problem at the following times:

1. Fri Sep 7 16:33:05
2. Fri Sep 7 16:50:08

Comment by Greg Studer [ 20/Sep/12 ]

So, a few things:

We really do need to get more details to drill down into where the problem you're having is coming from. We have one mongos log, but that log is not for the time period when you had this problem:

Fri Sep 7 16:50:09 [conn1121] ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 2

> cat mongos_35000.log | grep "attempt: 2" | wc -l
0
> cat mongos_35000.log | grep "Fri Sep 7" | wc -l
0

It's hard to say what the multiple retries were for here, which may or may not be causing delays, without the mongos logs during these periods. The mongos logs you provide don't have any similar multiple-retry portions. As you mention though, if you can get similar mongod and mongos logs during one of these time periods (and let me know the time of the problem) we're happy to look again.

Secondly, for your connection spikes and high fd use - you can prevent mongos from opening more than a certain number of incoming connections - if your app tries to exceed this limit, mongos will not allow it. You do this by setting --maxConns on the mongos command line.

mongos --maxConns 200 (some number greater than the maximum number of connections your app should ever open)

This should prevent the high-connection issues; the maximum number of connections open to each mongod will be (roughly) 200 * (# of mongoses), plus a few fds used for your database files.
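That rough bound is just arithmetic; as a sanity check it can be written out directly (a sketch: the function name and the extra-fd allowance are illustrative, not anything mongos computes).

```python
def max_mongod_connections(max_conns_per_mongos, num_mongoses, extra_fds=0):
    """Rough upper bound on connections each mongod can see when every
    mongos is capped with --maxConns, plus a few fds for database files."""
    return max_conns_per_mongos * num_mongoses + extra_fds

# Example: --maxConns 200 with 3 mongos routers
print(max_mongod_connections(200, 3))                 # 600 incoming connections
print(max_mongod_connections(200, 3, extra_fds=50))   # 650 with fd headroom for data files
```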

If there is some problem between the app and mongos causing the high connections, we'll see it much sooner and the mongods won't get overloaded and slow down everybody else.

Comment by jitendra [ 20/Sep/12 ]

Hi please update.

Thanks,
JRV

Comment by jitendra [ 20/Sep/12 ]

Hi ,

1. "Mongos logs at logLevel 2 would show the actual retry process."
> We start mongod and mongos with the --vv option, which means logLevel 2; the log we provided was at logLevel 2.
What logLevel do you require?

2 "See your out-of-file-descriptor messages - 35000 fds is quite a lot of fds to use."
> We had a 35000 file-descriptor limit. We have now increased the limit from 35000 to 60000.
please refer link : "https://groups.google.com/forum/?fromgroups=#!topicsearchin/mongodb-user/File$20Descriptor$20Configuration$20for$20MongoDB$20Deployment$20/mongodb-user/J5K0sM5sQZM"

Thanks,
JRV

Comment by jitendra [ 19/Sep/12 ]

Hi,
1. I had the problem at the following times:
1.1 Fri Sep 7 16:33:05
1.2 Fri Sep 7 16:50:08
2. We will provide mongostat output when the same issue occurs.
3. We will replicate the same issue at logLevel 2 and provide you that log.

Regards,
JRV

Comment by Greg Studer [ 14/Sep/12 ]

Looked at the mongod log, but was curious more about what mongos was doing - the mongos log seems to be from a different time? The mongos log (at logLevel 2) would tell us exactly the config refresh operations that were happening at this time. Is there another time period where you saw the problem?

I see your out-of-file-descriptor messages - 35000 fds is quite a lot of fds to use. Do you really have that many incoming connections? Mongostat will show you this.

> when the namespace is not in cache, a disk read is required to fetch the namespace into cache; this causes multiple retries for setting the shard version. Right??

No, all the config data is stored on the config server. When the version is not known, mongos has to fetch the information from the config server and retry. Mongos logs at logLevel 2 would show the actual retry process.

Comment by jitendra [ 14/Sep/12 ]

Hi,
1. We have 365 databases configured; this is what is being referred to as a namespace.
We insert only into today's namespace; reads can only be from old namespaces.
When we set the shard version for a namespace that is not in cache, a disk read is required
to fetch the namespace into cache; this causes the multiple retries for setting the shard version. Right??
2. I had the problem at Fri Sep 7 16:50:08.

Thanks,
JRV

Comment by Greg Studer [ 13/Sep/12 ]

> We have 365 databases configured. I think this is what you are referring to as a namespace in your first point. So 365 databases can cause a delay when re-establishing a new connection. Right??

Shard versions are only sent lazily, as needed, but if you use a lot of different databases over one connection that could be a factor. Once you've contacted a database once, the version should be cached for that connection.

> We have a socket timeout of 1 second; after the socket timeout we create a new connection again.
What should the socket timeout for connections to mongos be?

Socket timeouts can cause problems for the same reason query timeouts do - I'd recommend not setting this value. If something goes wrong that causes longer response times it's much easier to fix if retries don't mask the problem.

> I am attaching mongos and mongod logs as requested by you in second point.

Thanks for the logs, but can you tell us the exact time you had the problem? Otherwise it's very hard to determine what's normal behavior and what isn't.

Comment by Greg Studer [ 13/Sep/12 ]

> Now again a question. when does set ShardVersion?

Are you connecting directly from the shell, or through a driver? Drivers often cache connections for re-use. I'd suspect that somehow your connection wasn't being re-established.

Also, are you sure the queries went to the same mongod each time?

Comment by jitendra [ 13/Sep/12 ]

Hi,
I ran one test case:

1> I made a new connection to mongos and queried three different databases. I got the logs below on mongod:

Thu Sep 13 12:28:00 [conn9] command: { setShardVersion: "00211012.Database"
Thu Sep 13 12:28:38 [conn9] command: { setShardVersion: "00221012.Database"
Thu Sep 13 12:30:36 [conn9] command: { setShardVersion: "00231012.Database"

2> I disconnected the connection.

3> I again made a new connection to mongos and queried three different databases. This time I did not get setShardVersion logs on mongod.

As your previous reply you said "Shard versions need to be re-established for each new incoming connection and namespace - new connections in mongos".

Now again a question: when is setShardVersion sent?

Thanks,
JRV

Comment by jitendra [ 12/Sep/12 ]

Hi ,
Any updates

Thanks,
JRV

Comment by jitendra [ 12/Sep/12 ]

Attached mongos and mongod logs

Comment by jitendra [ 12/Sep/12 ]

Hi, thanks for the reply. Now I understand what was actually going on in my server.

We have 365 databases configured. I think this is what you are referring to as a namespace in your first point. So 365 databases can cause a delay when re-establishing a new connection. Right??

We have a socket timeout of 1 second; after the socket timeout we create a new connection again.
What should the socket timeout for connections to mongos be?

I am attaching the mongos and mongod logs as requested in your second point.

regards
JRV

Comment by Greg Studer [ 12/Sep/12 ]

I think this may be a bad interaction between your timeouts and the standard version-refresh timeouts. If you temporarily allow your timeouts to be 100 seconds, does the issue go away?

If so, one workaround would be to do an initial read from a database when creating a new connection with no timeout (you may already be doing this) - the config info should be loaded for the next read.

Another workaround would be to reuse connections as much as possible - it seems like new connections are triggering this problem?

1> Why does shards version change?

Shard versions need to be re-established for each new incoming connection and namespace - new connections in mongos are "expensive" in this way if you have many namespaces.
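This "expensive new connection" effect can be modeled as a per-(connection, namespace) cache: each pair pays one setShardVersion round-trip the first time, and nothing afterwards. A toy sketch, not mongos's real data structures; `MongosConnModel` is a hypothetical name.

```python
class MongosConnModel:
    """Toy model of mongos's lazy, per-connection shard-version handling.
    Not the real implementation: it only counts setShardVersion round-trips."""
    def __init__(self):
        self.versions = {}            # (conn_id, namespace) -> version sent
        self.round_trips = 0

    def query(self, conn_id, namespace):
        key = (conn_id, namespace)
        if key not in self.versions:  # first use: establish shard version
            self.versions[key] = "28|1"
            self.round_trips += 1
        # ...the actual query would run here...

m = MongosConnModel()
for ns in ("00060912.Database", "00070912.Database", "00211012.Database"):
    m.query(conn_id=816, namespace=ns)                   # 3 round-trips (cold)
m.query(conn_id=816, namespace="00060912.Database")      # cached: no round-trip
m.query(conn_id=817, namespace="00060912.Database")      # new connection: pays again
print(m.round_trips)  # 4
```

With 365 databases, a freshly pooled connection that touches many of them pays this cost once per namespace, which is why reusing connections helps.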

2>How many retry may be for "connection meta data too old - will retry ns00060912.Database) op:(query) client in sharded mode, but doesn't have version set for this collection: 00060912.Database myVersion: 28|1:"

There may be three round-trips as the connection discovers that the metadata needs to be refreshed, depending on the exact type of staleness here. Can you send a mongos log (at logLevel 2) covering a few hours before/after the period when you are experiencing this problem, along with the time your application experienced timeout errors? More context will let us be more helpful.

3> How much read timeout should be? (In my application 3 sec time out is not good)
We don't recommend setting timeouts unless absolutely necessary, because they have the potential to cause cluster instability: connections time out because the server is slow, then more connections are created, making the server slower, and so on. This may be occurring, as you mention you run out of file descriptors. I'd recommend removing or raising the read timeout and instead using one of the workarounds above until we're able to pinpoint the exact version-refresh codepath you're having trouble with.

Comment by jitendra [ 11/Sep/12 ]

Hi ,
Any updates

Thanks,
JRV

Comment by jitendra [ 10/Sep/12 ]

Hi, first, thanks for the reply.
We are using the C++ driver. We see this message in the mongos and mongod logs.

We create one database per day, and we do this setup activity once a year. When we try to read data from an old database (1 or 2 days before the present date), the read times out (read timeout is 3 seconds) on the first read. After the first read, it works fine.

This first-read timeout issue occurs at least 20-30 times daily.
We have a file descriptor max limit of 35000. (Sometimes we get "too many open files" as well.)

I want to know the following:
1> Why does the shard version change?
2> How many retries may there be for "connection meta data too old - will retry ns:(00060912.Database) op:(query) client in sharded mode, but doesn't have version set for this collection: 00060912.Database myVersion: 28|1:"?
3> What should the read timeout be? (In my application a 3 second timeout is not good.)

Thanks,
JRV

Comment by Greg Studer [ 10/Sep/12 ]

FYI - upgrading to the latest 2.0 minor version is recommended too, to be sure you have all the bugfixes.

Comment by Greg Studer [ 10/Sep/12 ]

> ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 0

This is not necessarily an error; it's just mongos attempting to correct a stale version. It's a little strange that you get up to three attempts, however. Are you dropping and re-creating collections with the same name? The delay is by design: with each additional attempt we wait longer to avoid spamming the config server. Do you see an error in your application, or just this message in the mongos logs?

> Fri Sep 7 16:50:11 [conn17] command admin.$cmd command: { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID:

This is mongos trying to set the shard version per-connection on mongod.

What driver are you using, and do you see this type of thing all the time or only when you create new connections or are under heavy migration load?

Comment by jitendra [ 10/Sep/12 ]

Hi,
the mongod logs from the same time are below:
Fri Sep 7 16:50:08 [conn19] connection meta data too old - will retry ns:(00060912.Database) op:(query) client in sharded mode, but doesn't have version set for this collection: 00060912.Database myVersion: 28|1
Fri Sep 7 16:50:08 [conn17] connection meta data too old - will retry ns:(00060912.Database) op:(query) client in sharded mode, but doesn't have version set for this collection: 00060912.Database myVersion: 28|1
Fri Sep 7 16:50:08 [conn17] connection meta data too old - will retry ns:(00060912.Database) op:(query) client in sharded mode, but doesn't have version set for this collection: 00060912.Database myVersion: 28|1
Fri Sep 7 16:50:08 [conn19] connection meta data too old - will retry ns:(00060912.Database) op:(query) client in sharded mode, but doesn't have version set for this collection: 00060912.Database myVersion: 28|1
Fri Sep 7 16:50:09 [conn17] connection meta data too old - will retry ns:(00060912.Database) op:(query) client in sharded mode, but doesn't have version set for this collection: 00060912.Database myVersion: 28|1
Fri Sep 7 16:50:09 [conn19] connection meta data too old - will retry ns:(00060912.Database) op:(query) client in sharded mode, but doesn't have version set for this collection: 00060912.Database myVersion: 28|1
Fri Sep 7 16:50:10 [conn1] insert 00070912.Database 0ms
Fri Sep 7 16:50:10 [conn1] runQuery called admin.$cmd { getlasterror: 1 }
Fri Sep 7 16:50:10 [conn1] run command admin.$cmd { getlasterror: 1 }
Fri Sep 7 16:50:10 [conn1] command admin.$cmd command: { getlasterror: 1 } ntoreturn:1 reslen:67 0ms
Fri Sep 7 16:50:11 [conn4] connection meta data too old - will retry ns:(00060912.Database) op:(query) client in sharded mode, but doesn't have version set for this collection: 00060912.Database myVersion: 28|1
Fri Sep 7 16:50:11 [conn16] connection meta data too old - will retry ns:(00060912.Database) op:(query) client in sharded mode, but doesn't have version set for this collection: 00060912.Database myVersion: 28|1
Fri Sep 7 16:50:11 [conn4] connection meta data too old - will retry ns:(00060912.Database) op:(query) client in sharded mode, but doesn't have version set for this collection: 00060912.Database myVersion: 28|1
Fri Sep 7 16:50:11 [conn16] connection meta data too old - will retry ns:(00060912.Database) op:(query) client in sharded mode, but doesn't have version set for this collection: 00060912.Database myVersion: 28|1
Fri Sep 7 16:50:11 [conn17] runQuery called admin.$cmd { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" }
Fri Sep 7 16:50:11 [conn17] run command admin.$cmd { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" }
Fri Sep 7 16:50:11 [conn17] command: { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" }
Fri Sep 7 16:50:11 [conn17] command admin.$cmd command: { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" } ntoreturn:1 reslen:57 0ms
Fri Sep 7 16:50:11 [conn17] runQuery called admin.$cmd { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" }
Fri Sep 7 16:50:11 [conn17] run command admin.$cmd { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" }
Fri Sep 7 16:50:11 [conn17] command: { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" }
Fri Sep 7 16:50:11 [conn17] command admin.$cmd command: { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" } ntoreturn:1 reslen:57 0ms
Fri Sep 7 16:50:11 [conn19] runQuery called admin.$cmd { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" }
Fri Sep 7 16:50:11 [conn19] run command admin.$cmd { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" }
Fri Sep 7 16:50:11 [conn19] command: { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" }
Fri Sep 7 16:50:11 [conn19] command admin.$cmd command: { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" } ntoreturn:1 reslen:57 0ms
Fri Sep 7 16:50:11 [conn19] runQuery called 00060912.Database { _sk: 1, _id: BinData }

Please help me.

Regards,
JRV

Comment by jitendra [ 10/Sep/12 ]

Hi Please update

Regards,
JRV

Generated at Thu Feb 08 03:13:16 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.