[SERVER-4340] Query not returning data from specific shard when querying on anything but "_id" only Created: 21/Nov/11 Updated: 08/Mar/13 Resolved: 05/Sep/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying, Replication, Sharding |
| Affects Version/s: | 2.0.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Koen Calliauw | Assignee: | Spencer Brody (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | limit, replicaset, sharding, skip | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Debian Squeeze |
| Attachments: | |
| Operating System: | Linux |
| Participants: |
| Description |
|
Hi,

We've noticed that on one of our sharded collections we cannot do any non-specific queries. For example: db.log_service.find() returns nothing, while querying on "_id" only returns the document.

The environment is sharded over 3 replica sets, all of which currently only have a primary and an arbiter. The secondaries have been taken down due to issues unrelated to this. The configuration of the replica sets has been adjusted and rs.reconfig() has been run on all primaries. We have noticed, however, that mongos keeps complaining in the logs about not being able to reach these servers. Not sure if this is related to the issue.

Having previously discussed this on IRC, I've prepared a few pastes:
db.printShardingStatus():
mongos log extract:
rs.status() on the replSetB primary (the only shard that has data for that collection, since an automated script removed the data from the other shards but not from this one as apparently it couldn't find any):
explain() on the working query:
explain() on the failing query:

We first noticed the issue when we couldn't query data with skip() set to a number higher than what was available on replSetA. At that time, find() without limit and skip worked. Now replSetA is empty and these queries stopped working.

Best regards, |
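For reference, a minimal mongo-shell sketch of the behaviour described above; the collection name comes from the report, while the database name, document shape and ObjectId are assumptions:

    // Run from the mongos shell; database name and ObjectId below are hypothetical.
    use mydb

    // A generic query returns nothing, even though replSetB holds data for the collection:
    db.log_service.find().limit(10)

    // Querying on "_id" only does return the document:
    db.log_service.find({ _id: ObjectId("4ecb3f0a1a2b3c4d5e6f7a8b") })

    // explain() output was captured for both forms for comparison:
    db.log_service.find({ _id: ObjectId("4ecb3f0a1a2b3c4d5e6f7a8b") }).explain()
    db.log_service.find().explain()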
| Comments |
| Comment by Spencer Brody (Inactive) [ 20/Jun/12 ] | |
|
Hi Koen, | |
| Comment by Koen Calliauw [ 07/Feb/12 ] | |
|
Hi Spencer, This is still the case, however something changed, please see the explain() below, which was run from mongos. Notice how n: is now 0. We've upgraded everything to 2.0.2 in the meantime. explain() from mongos: http://pastie.org/private/pmwbd1ov1r7f8pzdjmucg mongos log (LogLevel 10): On the mongods I can't set the LogLevel to 10 (error: http://pastie.org/private/lhzxcjw6g1jlq8sqknsra). So, to answer your question, this is still happening, but we can only reproduce it on this collection with this specific data (it hasn't happened anywhere else yet). Kind regards, | |
| Comment by Spencer Brody (Inactive) [ 19/Jan/12 ] | |
|
Hi Koen. Is this still happening for you? Have you tried upgrading to 2.0.2? | |
| Comment by Spencer Brody (Inactive) [ 19/Dec/11 ] | |
|
This is the same paste you attached earlier in the ticket, taken at a log level of 5. Can you capture a new log of trying the query and having it fail while running with loglevel 10? I'd also like logs from the primary of every shard. You can attach files to this ticket, and if you zip/tar the log files you should be able to attach the full files directly. | |
| Comment by Koen Calliauw [ 16/Dec/11 ] | |
|
Hi Spencer, You can find the paste here: The actual query happened at 09:58:53 Best regards, | |
| Comment by Spencer Brody (Inactive) [ 09/Dec/11 ] | |
|
Actually, for good measure, please attach the logs from the primaries of each shard for the same period as well. To confirm, you're not doing slaveOk queries, right? | |
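For context, a sketch of how slaveOk reads are typically enabled from the shell (the reporter states elsewhere in the ticket that slaves are never preferred):

    // Sketch only: enabling reads from secondaries in the 2.0-era shell.
    rs.slaveOk()                  // shell helper for the current connection
    db.getMongo().setSlaveOk()    // equivalent call on the connection object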
| Comment by Spencer Brody (Inactive) [ 09/Dec/11 ] | |
|
I want the logs from the mongos running 2.0.1 at log level 10 for when you try to do a generic query and it fails. | |
| Comment by Koen Calliauw [ 09/Dec/11 ] | |
|
Hi Spencer, Can you tell me from which timeframe you need these logs and from which servers? Also, on 1.8.4 or 2.0.1? /Koen | |
| Comment by Spencer Brody (Inactive) [ 07/Dec/11 ] | |
|
Can you attach the logs from log level 10? There's one specific message I'm looking for. As for the "shard version not ok" error, that's from | |
| Comment by Koen Calliauw [ 07/Dec/11 ] | |
|
Believe it or not, the query works fine with a 1.8.4 mongos! (after waiting about 1 minute after mongos was started; any earlier it again didn't return anything). After reverting back to 2.0.1, the query stopped working again. I actually don't see any additional output in the logs when setting the loglevel to 10. I ran the command through mongos. What I've also noticed the last few days is that I get these messages: /Koen | |
| Comment by Spencer Brody (Inactive) [ 05/Dec/11 ] | |
|
Also, could you try it one more time on 2.0.1 after setting the log level to 10 (sorry for making you do that twice, I thought 5 was the highest log level but actually realized that there's some useful information at levels higher than that). After getting the log output with level 10 verbosity you can set the log level back to normal to avoid filling up your disk too quickly. | |
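A sketch of how the verbosity change could be made at runtime; the exact command used in this ticket was not preserved, and the setParameter/logLevel mechanism is assumed:

    // Run against the admin database of the mongos.
    use admin
    db.runCommand({ setParameter: 1, logLevel: 10 })   // maximum verbosity while reproducing the query

    // ...reproduce the failing find(), then restore normal verbosity to avoid filling the disk:
    db.runCommand({ setParameter: 1, logLevel: 0 })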
| Comment by Spencer Brody (Inactive) [ 05/Dec/11 ] | |
|
Could you try bringing up a 1.8.4 mongos and see if you see the same problem when querying through it or if any different errors are thrown when using 1.8.4? It should be fine to use mongos from 1.8.4 with a 2.0.1 cluster. | |
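A sketch of bringing up a separate 1.8.4 mongos against the existing cluster; the binary path, config server hostnames and port are placeholders, not values from the ticket:

    # Start a 1.8.4 mongos on a spare port, pointing at the same three config servers.
    /opt/mongodb-1.8.4/bin/mongos --port 27018 \
        --configdb cfg1.example.com,cfg2.example.com,cfg3.example.com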
| Comment by Koen Calliauw [ 05/Dec/11 ] | |
|
Hi, Anything else I can try? Data is still stuck on this server. /Koen PS: I also reset the loglevel back now. Disk filled up after a few hours. | |
| Comment by Koen Calliauw [ 29/Nov/11 ] | |
|
Hi Spencer, You can find the paste here: /Koen | |
| Comment by Spencer Brody (Inactive) [ 28/Nov/11 ] | |
|
Can you try upping the logging verbosity on the mongos by running
And then try querying and send the log output? | |
| Comment by Koen Calliauw [ 28/Nov/11 ] | |
|
Yes, everything has even been rebooted in the meantime, so all mongod, mongos and config servers have been restarted at least once or twice since the issue started showing. /Koen | |
| Comment by Eliot Horowitz (Inactive) [ 28/Nov/11 ] | |
|
Have you tried bouncing all mongod and mongos? | |
| Comment by Koen Calliauw [ 27/Nov/11 ] | |
|
Also, I doubt that removing the secondaries has anything to do with this issue, since we were able to query the 2 other shards perfectly, even after removing the replica set secondaries there. /Koen | |
| Comment by Koen Calliauw [ 27/Nov/11 ] | |
|
Hi Eliot, Right now we don't have an active secondary, no. Due to technical issues we had to take them offline temporarily. We won't be able to add them back in until this issue is fixed. We don't know precisely when it started happening. We had a script that emptied these collections and put the data on another server. The issue only became visible when the script had finished and we noticed that there was still data left on replSetB which we could not query in the usual way. The replica set secondaries were removed around the same time. Repairs happened on several occasions to free up some disk space. /Koen | |
| Comment by Eliot Horowitz (Inactive) [ 27/Nov/11 ] | |
|
You have a secondary for each shard though, right? When did this error start happening? Was there ever a repair()? Sorry for all the questions - just hard to understand exactly what's going on. | |
| Comment by Koen Calliauw [ 26/Nov/11 ] | |
|
Hi Eliot, The bets and log_service collections were sharded at the same time, if I remember correctly. Every shard is a replica set. As mentioned previously, every replica set only has a primary and an arbiter for the time being. This isn't an issue, I suppose? We only implemented the replica sets for redundancy; so far we have never queried with the option to prefer slaves. Besides the technical issues (disk space) we had no previous issues with the replica sets. /Koen | |
| Comment by Eliot Horowitz (Inactive) [ 26/Nov/11 ] | |
|
When did you shard bets and log_service? What about the replica sets? Any issues on those recently? | |
| Comment by Koen Calliauw [ 26/Nov/11 ] | |
|
Hi, The db.stats() printout of the primaries of the 3 shards can be found here: http://pastie.org/private/ketwnaca5ufrar3mwlfq A bit of history: we've been running this sharded setup in production for about 3 months now, give or take a few days, but due to disk space constraints we had to turn logappend off, so unfortunately I don't have logs going back further than yesterday.
We chose this type of config for this setup because we are quite constrained on budget and resources. This is running on a hosted environment based on KVM virtualization without overcommitment of CPU or memory. If you need more history, can you please be a bit more specific about what you mean by that? I'll gladly provide you with any and all information I can give you. /Koen | |
| Comment by Eliot Horowitz (Inactive) [ 26/Nov/11 ] | |
|
Can you run db.stats() on each of the shards? Can you explain a bit more of the history? When did you add sharding, for example? Do you have logs going back that far? | |
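A sketch of collecting db.stats() from each shard primary; hostnames and the database name are placeholders:

    // Connect directly to each shard's primary in turn, e.g.:
    //   mongo replSetB-primary.example.com:27017/mydb
    db.stats()                 // database-level statistics for that shard
    db.log_service.count()     // quick check of how many documents the shard actually holds
    db.log_service.stats()     // per-collection statistics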
| Comment by Koen Calliauw [ 26/Nov/11 ] | |
|
Yesterday we rebooted the app servers and all mongo servers for some hardware changes. All stayed the same afterward. Still cannot query data from that collection. | |
| Comment by Koen Calliauw [ 23/Nov/11 ] | |
|
mongod logs from replSet primaries | |
| Comment by Koen Calliauw [ 23/Nov/11 ] | |
|
From the mailing list: Everything (all mongos and mongod processes) is running 2.0.1. Primary of replSetA: http://pastie.org/2907612
| |
| Comment by Spencer Brody (Inactive) [ 22/Nov/11 ] | |
|
You said that replSetB is the shard with the data, and you also mentioned that replSetA is empty. What do you mean by that? The output of printShardingStatus() shows many chunks living on replSetA. Can you also attach the full mongos log as well as the full logs from the primaries of all 3 replSets? Log files generally compress very well, so if you zip/tar the files they should be quite small and can be attached to this ticket no problem. | |
| Comment by Koen Calliauw [ 22/Nov/11 ] | |
|
In the mongod log of the primary in shard replSetB (the one with the data) we see the queries: I've also noticed this warning. Don't know if it's important: |