[SERVER-32459] mongos incorrectly assumed collection was unsharded

Created: 26/Dec/17 | Updated: 11/Jan/23 | Resolved: 22/Jan/18

| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.2.17 |
| Fix Version/s: | None |
| Type: | Bug |
| Priority: | Critical - P2 |
| Reporter: | Sven Henderson |
| Assignee: | Randolph Tan |
| Resolution: | Duplicate |
| Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Ubuntu 14.04 |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | Unknown. |
| Participants: | |
| Description |
We have a sharded collection with approximately 170k chunks (spread across 11 replica sets) that a single mongos process incorrectly assumed was unsharded. This caused all reads and writes for the collection from that mongos process to go to the primary RS, resulting in data inconsistency/loss from the point of view of that mongos process and the rest of the mongos processes in our environment.

It is worth noting that we were manually splitting large chunks for collection_1 in a script with the split admin command at the time. These are the relevant logs from the instance running the script.
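For context, a manual chunk split like the ones the script was issuing is just an admin command run through a mongos. The sketch below is hypothetical: the database name, collection name, and shard-key values are placeholders, not the actual ones from this deployment.

```javascript
// Hypothetical sketch of a manual chunk split issued through a mongos.
// "mydb.collection_1" and the shard-key field/values are placeholders.
var admin = db.getSiblingDB("admin");

// Split the chunk at an explicit shard-key value...
admin.runCommand({ split: "mydb.collection_1", middle: { userId: 5000 } });

// ...or let the server split the chunk containing this value at its median.
admin.runCommand({ split: "mydb.collection_1", find: { userId: 5000 } });
```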
| Comments |
| Comment by Randolph Tan [ 22/Jan/18 ] |

After looking at the primary shard logs again, I believe you are experiencing
| Comment by Sven Henderson [ 17/Jan/18 ] |

Hi Randolph,

We did not issue any queries through the mongos that assumed the collection was unsharded, so I cannot completely confirm what it was doing. However, from the point of view of a working mongos process, the data being inserted into the unsharded collection on the primary RS (presumably by the confused mongos) was not visible; those ~20k documents were "missing" from the collection. To recover the lost documents we ran flushRouterConfig on the confused mongos (this stopped docs from being inserted), dumped the 20k docs directly from the mongod instance in the primary RS, and then restored those docs back through a working mongos.
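A rough sketch of that recovery sequence from the shell, in case it helps anyone hitting the same thing. All hostnames, the database, and the collection are placeholders, and mongodump/mongorestore against the shard and a healthy mongos would work equally well:

```javascript
// Rough sketch of the recovery described above; hostnames and namespace
// are placeholders.

// 1. Clear the stale routing table on the confused mongos so it stops
//    treating the collection as unsharded.
var confused = new Mongo("confused-mongos.example.net:27017");
confused.getDB("admin").runCommand({ flushRouterConfig: 1 });

// 2. Read the stranded documents directly from the primary shard's mongod
//    and re-insert them through a healthy mongos so they are routed to the
//    correct chunks.
var shardDb = new Mongo("primary-shard.example.net:27018").getDB("mydb");
var routerDb = new Mongo("healthy-mongos.example.net:27017").getDB("mydb");
shardDb.collection_1.find({ /* filter matching only the stranded docs */ })
    .forEach(function (doc) {
        routerDb.collection_1.insert(doc);
    });
```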
| Comment by Randolph Tan [ 17/Jan/18 ] |

Hi,

May I ask whether you actually saw inconsistencies or were just suspecting that there might be some? The reason I'm asking is that the "inconsistent chunk" logs just mean that either the config server metadata was somewhat corrupted or the mongos screwed up. If it's the former, you will keep hitting this error indefinitely. If it's the latter, mongos will simply clear the cache and reload again. The latter will not cause any data loss/inconsistency, but it can cause queries to be delayed because mongos will need to load the full metadata for that collection again instead of doing it incrementally. We fixed an issue (
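As a hypothetical illustration of the two cases (the namespace below is a placeholder): you can look at the collection metadata version a mongos currently has cached, and force a full reload if it looks wrong.

```javascript
// Run against the suspect mongos; "mydb.collection_1" is a placeholder.
var admin = db.getSiblingDB("admin");

// Reports the collection metadata version this mongos has cached.
admin.runCommand({ getShardVersion: "mydb.collection_1" });

// Drops the cached routing table; the next operation against the
// collection reloads it in full from the config servers.
admin.runCommand({ flushRouterConfig: 1 });
```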
| Comment by Sven Henderson [ 17/Jan/18 ] |

Hi Randolph,

Almost all of our queries use the default read/write concern and read preference, with a very small number reading only from the secondaries. For the collection in question, I can really only confirm that in the hour it was assumed to be unsharded, mongos inserted ~20k docs into the collection on the primary RS.
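As an illustration of that query mix from the shell (collection name and predicate are placeholders, not our actual queries):

```javascript
// Illustrative only; collection name and predicate are placeholders.

// The default: read preference "primary", default read/write concern.
db.collection_1.find({ status: "active" });

// The small minority of queries that read only from secondaries.
db.collection_1.find({ status: "active" }).readPref("secondary");
```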
| Comment by Randolph Tan [ 03/Jan/18 ] |

Hi,

What kind of reads and writes are you using? Most of the reads and writes are versioned, so even if the mongos was wrong, the shard will return a stale version error and cause the mongos to refresh the routing table (unless the shard just restarted or just became primary recently -
| Comment by Sven Henderson [ 26/Dec/17 ] |

Hi Kelsey,

Your quick response is much appreciated. To answer your questions...

Let me know if I can provide any additional information.

Sven
| Comment by Kelsey Schubert [ 26/Dec/17 ] |

Hi sven@trello.com,

Thank you for reporting this issue. So we can better understand what occurred, would you please provide some additional information?

I've created a secure upload portal for you to use to provide these files. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time.

Thank you again,
Kelsey