[SERVER-6974] Get Error while reading "Database could not initialize cursor across all shards because : stale config detected for ns:" Created: 08/Sep/12 Updated: 15/Feb/13 Resolved: 01/Oct/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.0.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | jitendra | Assignee: | Greg Studer |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
linux OEL |
||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Operating System: | ALL | ||||||||
| Participants: | |||||||||
| Description |
|
Hi, While reading from mongos we got error "Fri Sep 7 16:50:08 [conn1121] ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 0" Log : Fri Sep 7 16:50:08 [conn1121] ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 0 I have following question are: How to resolve this error |
| Comments |
| Comment by Greg Studer [ 01/Oct/12 ] | ||||||||
|
Opened a new ticket for the config refresh improvement, follow there for any new updates. | ||||||||
| Comment by Greg Studer [ 21/Sep/12 ] | ||||||||
|
So, here's what happened (at least at :33) - You had a pooled, incoming connection to mongos which validated a version to shard0000:
Next, for some reason, shard0000 went down (also, you should really double-check your mongod NUMA settings, as that can wreak all kinds of havok, and upgrade to a supported, non-rc release):
Mongos cleanly detected this later and re-established a connection, but the version info for the connection on shard0000 got erased.
Mongos doesn't know the version info is erased, though, because it refreshes this info lazily. Therefore when you re-use the conn, mongos is confused and refreshes the version more frequently than it otherwise would:
This is all works-as-designed, the problem comes in that there is a backoff delay in mongos when refreshing a version multiple times (to avoid slamming the config servers and mongod). We're working to improve config refresh in general, but this type of staleness is inherent to the system. Workarounds include the suggestions above (maxConns in particular will avoid the connection slamming if this does happen), running flushRouterConfig on mongoses after mongod failures (which will force version reloads), or if you're inclined to recompile your own mongo version you can change the hardcoded delays. | ||||||||
| Comment by jitendra [ 21/Sep/12 ] | ||||||||
|
Hi Adding mongos and mongod logs I had problem at following time 1. Fri Sep 7 16:33:05 | ||||||||
| Comment by Greg Studer [ 20/Sep/12 ] | ||||||||
|
So, a few things: We really do need to get more details to drill down into where the problem you're having is coming from. We have one mongos log, but that log is not for the time period when you had this problem: Fri Sep 7 16:50:09 [conn1121] ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 2 > cat mongos_35000.log | grep "attempt: 2" | wc -l It's hard to say what the multiple retries were for here, which may or may not be causing delays, without the mongos logs during these periods. The mongos logs you provide don't have any similar multiple-retry portions. As you mention though, if you can get similar mongod and mongos logs during one of these time periods (and let me know the time of the problem) we're happy to look again. Secondly, for your connection spikes and high fd use - you can prevent mongos from opening more than a certain number of incoming connections - if your app tries to exceed this limit, mongos will not allow it. You do this by setting --maxConns on the mongos command line. mongos --maxConns 200 (some number greater than the maximum number of connections your app should ever open) This should prevent the high connection issues, the maximum number of connections open to each mongod will be (roughly) 200 * (# of mongoses) (also a few fds will be used for your databases). If there is some problem between the app and mongos causing the high connections, we'll see it much sooner and the mongods won't get overloaded and slow down everybody else. | ||||||||
| Comment by jitendra [ 20/Sep/12 ] | ||||||||
|
Hi please update. Thanks, | ||||||||
| Comment by jitendra [ 20/Sep/12 ] | ||||||||
|
Hi , 1. "Mongos logs at logLevel 2 would show the actual retry process." 2 "See your out-of-file-descriptor messages - 35000 fds is quite a lot of fds to use." Thanks, | ||||||||
| Comment by jitendra [ 19/Sep/12 ] | ||||||||
|
hi, Regards, | ||||||||
| Comment by Greg Studer [ 14/Sep/12 ] | ||||||||
|
Looked at the mongod log, but was curious more about what mongos was doing - the mongos log seems to be from a different time? The mongos log (at logLevel 2) would tell us exactly the config refresh operations that were happening at this time. Is there another time period where you saw the problem? I see your out-of-file-descriptor messages - 35000 fds is quite a lot of fds to use. Do you really have that many incoming connections? Mongostat will show you this. > the namespace is not in cache than disk reading required No, all the config data is stored on the config server. When the version is not known, mongos has to fetch the information from the config server and retry. Mongos logs at logLevel 2 would show the actual retry process. | ||||||||
| Comment by jitendra [ 14/Sep/12 ] | ||||||||
|
Hi, Thanks, | ||||||||
| Comment by Greg Studer [ 13/Sep/12 ] | ||||||||
|
> We have 365 databases configured. I think this is what you are referring as namespace in your first point. So 365 databases can cause a delay during re-establishing new connection. Right?? The shard versions are only sent lazily, as-needed, - but if you use a lot of different databases over one connection it could be. Once you've contacted a database once, the version should be cached for that connection. > We have socket timeout 1 second ,after socket timeout we again create new connection. Socket timeouts can cause problems for the same reason query timeouts do - I'd recommend not setting this value. If something goes wrong that causes longer response times it's much easier to fix if retries don't mask the problem. > I am attaching mongos and mongod logs as requested by you in second point. Thanks for the logs, but can you tell us the exact time you had the problem? Otherwise it's very hard to determine what's normal behavior and what isn't. | ||||||||
| Comment by Greg Studer [ 13/Sep/12 ] | ||||||||
|
> Now again a question. when does set ShardVersion? Are you connecting directly from the shell, or through a driver? Drivers often cache connections for re-use. I'd suspect that somehow your connection wasn't being re-established. Also, are you sure the queries went to the same mongod each time? | ||||||||
| Comment by jitendra [ 13/Sep/12 ] | ||||||||
|
Hi, 1> I made new connect to mongos and did query on three different database. I got below logs on mongod: Thu Sep 13 12:28:00 [conn9] command: { setShardVersion: "00211012.Database" 2> I disconnect the connection. 3> Again made new connect to mongos and did query on three different database. I did not get logs on mongod for setShardVersion. As your previous reply you said "Shard versions need to be re-established for each new incoming connection and namespace - new connections in mongos". Now again a question. when does set ShardVersion? Thanks, | ||||||||
| Comment by jitendra [ 12/Sep/12 ] | ||||||||
|
Hi , Thanks, | ||||||||
| Comment by jitendra [ 12/Sep/12 ] | ||||||||
|
Attached mongos and mongod logs | ||||||||
| Comment by jitendra [ 12/Sep/12 ] | ||||||||
|
Hi, Thanks for the reply. Now I am able to understand what was actually going on in my server. We have 365 databases configured. I think this is what you are referring as namespace in your first point. So 365 databases can cause a delay during re-establishing new connection. Right?? We have socket timeout 1 second ,after socket timeout we again create new connection. I am attaching mongos and mongod logs as requested by you in second point. regards | ||||||||
| Comment by Greg Studer [ 12/Sep/12 ] | ||||||||
|
Think this may be a bad interaction of your timeouts with the standard version refresh timeouts - if you allow your timeouts to be 100s temporarily, does the issue go away? If so, one workaround would be to do an initial read from a database when creating a new connection with no timeout (you may already be doing this) - the config info should be loaded for the next read. Another workaround would be to reuse connections as much as possible - it seems like new connections are triggering this problem? 1> Why does shards version change? Shard versions need to be re-established for each new incoming connection and namespace - new connections in mongos are "expensive" in this way if you have many namespaces. 2>How many retry may be for "connection meta data too old - will retry ns00060912.Database) op:(query) client in sharded mode, but doesn't have version set for this collection: 00060912.Database myVersion: 28|1:" There may be three round-trips as the connection discovers that the metadata needs to be refreshed, depending on the exact type of staleness here. Can you send a mongos log (at logLevel 2) with a few hours before/after while you are experiencing this problem, along with the time you application experienced timeout errors? More context will let us be more helpful. 3> How much read timeout should be? (In my application 3 sec time out is not good) | ||||||||
| Comment by jitendra [ 11/Sep/12 ] | ||||||||
|
Hi , Thanks, | ||||||||
| Comment by jitendra [ 11/Sep/12 ] | ||||||||
|
Hi , Thanks, | ||||||||
| Comment by jitendra [ 10/Sep/12 ] | ||||||||
|
Hi first thanks for reply. We make 1 Database for one day. And we do this activity once a year. When we try to read data form old database (1 or 2 days before present date) then reading times out(Read timeout is 3 second) in the first read. After first reading, it works fine. This first time timeout issue replicates at least 20-30 times daily. I want to know following : Thanks, | ||||||||
| Comment by Greg Studer [ 10/Sep/12 ] | ||||||||
|
FYI - upgrading to the latest 2.0 minor version recommended too, to be sure you have all the bugfixes. | ||||||||
| Comment by Greg Studer [ 10/Sep/12 ] | ||||||||
|
> ns: 00060912.Database could not initialize cursor across all shards because : stale config detected for ns: 00060912.Database ClusteredCursor::_checkCursor @ 192.168.11.103:10000 attempt: 0 This is not necessarily an error, it's just mongos attempting to correct a stale version. It's a little strange that you get up to three attempts, however. Are you dropping and re-creating collections with the same name? The delay is by-design, as you have more attempts we wait longer to avoid spamming the config server. Do you see an in your application, or just this message in mongos logs. > Fri Sep 7 16:50:11 [conn17] command admin.$cmd command: { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: This is mongos trying to set the shard version per-connection on mongod. What driver are you using, and do you see this type of thing all the time or only when you create new connections or are under heavy migration load? | ||||||||
| Comment by jitendra [ 10/Sep/12 ] | ||||||||
|
hi Fri Sep 7 16:50:10 [conn1] run command admin.$cmd { getlasterror: 1 }Fri Sep 7 16:50:10 [conn1] command admin.$cmd command: { getlasterror: 1 } ntoreturn:1 reslen:67 0ms Fri Sep 7 16:50:11 [conn17] run command admin.$cmd { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" }Fri Sep 7 16:50:11 [conn17] command: { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" }Fri Sep 7 16:50:11 [conn17] command admin.$cmd command: { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" } ntoreturn:1 reslen:57 0ms Fri Sep 7 16:50:11 [conn17] run command admin.$cmd { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" }Fri Sep 7 16:50:11 [conn17] command: { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" }Fri Sep 7 16:50:11 [conn17] command admin.$cmd command: { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" } ntoreturn:1 reslen:57 0ms Fri Sep 7 16:50:11 [conn19] run command admin.$cmd { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" }Fri Sep 7 16:50:11 [conn19] command: { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" }Fri Sep 7 16:50:11 [conn19] command admin.$cmd command: { setShardVersion: "00060912.Database", configdb: "192.168.11.20:62000,192.168.11.20:65000,192.168.11.21:38000", version: Timestamp 28000|1, serverID: ObjectId('5048f7298a1292e3fc592c4d'), shard: "shard0000", shardHost: "192.168.11.103:10000" } ntoreturn:1 reslen:57 0ms
Please help me. Regards, | ||||||||
| Comment by jitendra [ 10/Sep/12 ] | ||||||||
|
Hi Please update Regards, |