[SERVER-8870] mongos unaware of database move after movePrimary Created: 06/Mar/13 Updated: 10/Dec/14 Resolved: 01/Apr/13
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.2.1, 2.4.0-rc1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | James Blackburn | Assignee: | James Wahlin |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | move_primary.js |
| Issue Links: | Duplicates SERVER-8059 |
| Operating System: | ALL |
| Steps To Reproduce: | Note that the mongos containing the stale shard location can be refreshed with either a restart or by running the flushRouterConfig command. |
| Participants: | |
| Description |
We moved an (unsharded) database from one shard to another using the movePrimary command, following the instructions here: Having done that, users started complaining of unauthorized access. Sure enough, connecting to their local mongos showed that the database that had been moved, and the system.users collection within it, were empty, i.e. the mongos didn't pick up the fact that the database had moved.

This is somewhat worrying, and essentially required us to restart mongos processes across the cluster. It makes us worry that, if a process had auth (to admin, say), it would be writing to the wrong shard for that database - and we'd experience data loss. It's also worrying that mongos instances don't appear to automatically pick up changes like this.
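A minimal sketch of the kind of move described above, as issued from the mongo shell against a mongos. The database and shard names (tadat_live, rs0, rs2) appear elsewhere in this ticket and are used here purely for illustration:

```javascript
// Illustrative sketch only: move the primary shard of an unsharded database.
// Run against a mongos; names are taken from this ticket, not a real cluster.
db.adminCommand({ movePrimary: "tadat_live", to: "rs2" });

// The behaviour reported here: other mongos instances may keep routing reads
// and writes for tadat_live to the old primary shard (rs0) until they are
// restarted or told to reload their cached cluster metadata.
```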
| Comments |
| Comment by Remon van Vliet [ 13/Aug/13 ] |
We've seen this issue as well. Can someone explain how this is a duplicate of 8059? It seems both are symptoms of an as-yet-undefined mongos state propagation issue. @James Blackburn did you create an issue to fix movePrimary?
| Comment by James Wahlin [ 01/Apr/13 ] |
Attaching move_primary.js test script for reproduction of this issue. Will attach to
| Comment by Eliot Horowitz (Inactive) [ 01/Apr/13 ] |
Same cause as
| Comment by James Wahlin [ 11/Mar/13 ] |
Hi James, I have reproduced the stale mongos configuration in both MongoDB 2.2.1 and 2.4.0-rc1. Note that as an alternative to restarting your mongos instances, you can also run the flushRouterConfig command to bring all mongos instances up to date after a move. One word of caution: movePrimary should only be run on a static database. There should be no write traffic to an in-transit database, as those writes may be lost. I will pass this ticket on to the 10gen development team responsible for movePrimary. Thanks,
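For reference, a sketch of that workaround from the mongo shell. flushRouterConfig is issued against each mongos in turn, not against the shards or config servers:

```javascript
// Mark this mongos's cached cluster metadata as stale so it is reloaded from
// the config servers, picking up the new primary shard of the moved database.
db.adminCommand({ flushRouterConfig: 1 });
```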
| Comment by James Wahlin [ 08/Mar/13 ] |
Hi James, On further examination, this operation should work for your use case. Given that you are not writing to this collection during the move process, we would expect the move to succeed and for mongos instances to be aware of the change. Note that writes during the process may be lost. I am going to work to reproduce this today. Thanks,
| Comment by James Wahlin [ 08/Mar/13 ] |
I agree, James. I have linked the related issue. I would also like to encourage you to enter an additional SERVER ticket as type "New Feature", requesting that movePrimary be modified to allow for moving a database from one shard to another (outside of a shard decommission). If you do enter one, please post the ticket # here. Thanks,
| Comment by James Blackburn [ 07/Mar/13 ] |
Ok, I see, that's a shame. It would be nice if it were possible to movePrimary on a database after the fact. Otherwise one can't easily rebalance a cluster once the databases have been added.
| Comment by James Wahlin [ 07/Mar/13 ] |
Hi James, As per the movePrimary documentation page, there is a note stating "Only use movePrimary when removing a shard from a sharded cluster", as well as to only use it when "the database does not contain any collections with data". I would avoid using this command as it does not work for your purpose. To move the collection, your best bet is exporting the collection from MongoDB and then reimporting it. As you would guess, this means downtime for the given collection. Thanks,
| Comment by James Blackburn [ 07/Mar/13 ] |
Hi James, No, we didn't do removeShard, as we didn't want to remove the shard. We followed the instructions at: We added a new shard, for load balancing, and wanted to move two unsharded databases (as an initial test) to the new shard to spread the load. We didn't want to remove the source shard from the cluster altogether. Cheers,
| Comment by James Wahlin [ 07/Mar/13 ] |
Hi James, Can you confirm that you followed all steps in the Remove Shards from an Existing Sharded Cluster instructions? This should include each of the documented steps (see the sketch after this comment). I want to confirm the above to make sure nothing was missed and that you ran your commands against the mongos. If there were any missed steps, it could explain why the local mongos processes did not correct their configuration. For the dbstats against rs0, seeing data when connected locally is not unexpected. Data is removed from mongod processes post-migration in a lazy manner. When running a sharded cluster, you should be performing all of your CRUD operations against the mongos, which will know how to route requests properly. Thanks,
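A condensed sketch of the decommission sequence that the Remove Shards from an Existing Sharded Cluster procedure describes, run from the mongo shell against a mongos. The shard names match those mentioned in this ticket; "someDatabase" is a placeholder:

```javascript
// 1. Start draining the shard; chunks of sharded collections migrate off it.
db.adminCommand({ removeShard: "rs0" });

// 2. Re-run removeShard periodically and watch the "remaining" counts until
//    chunk draining finishes.
db.adminCommand({ removeShard: "rs0" });

// 3. For each unsharded database whose primary is the draining shard, move
//    its primary to another shard ("someDatabase" is a placeholder name).
db.adminCommand({ movePrimary: "someDatabase", to: "rs2" });

// 4. Run removeShard one final time; when it reports state "completed" the
//    shard has been removed from the cluster.
db.adminCommand({ removeShard: "rs0" });
```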
| Comment by James Blackburn [ 07/Mar/13 ] |
This seems like a pretty serious data-loss bug. It makes us worry that re-balancing while a cluster is in use is also a dangerous thing to do. Is it?
| Comment by James Blackburn [ 06/Mar/13 ] |
Worse than that, one of the databases we moved (which was unsharded) now has a split personality:

Connected directly to rs0 (where it used to live):

On rs2, where the data now lives:
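A sketch of the kind of check being described, from the mongo shell. Host names are illustrative placeholders; the database name comes from the log note further down:

```javascript
// Compare what each replica set reports directly with what a mongos routes to.
// Host names below are placeholders, not the reporter's actual servers.

// Directly against rs0, the old primary (stale data may still be visible here):
var rs0db = new Mongo("rs0-node.example.com:27018").getDB("tadat_live");
printjson(rs0db.stats());

// Directly against rs2, where the data now lives:
var rs2db = new Mongo("rs2-node.example.com:27018").getDB("tadat_live");
printjson(rs2db.stats());

// Through a mongos, which should route tadat_live to rs2 after the movePrimary:
var routed = new Mongo("mongos.example.com:27017").getDB("tadat_live");
printjson(routed.stats());
```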
| Comment by James Blackburn [ 06/Mar/13 ] |
Looks similar to:
| Comment by James Blackburn [ 06/Mar/13 ] |
There's nothing interesting in the mongos log:

Apart from the fact that it couldn't auth the user as the tadat_live.system.users collection is empty.
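A sketch of how that symptom would look from the shell when connected to a stale mongos (host name is a placeholder; with 2.2-era authentication, each database keeps its own system.users collection):

```javascript
// Through a mongos that still believes rs0 is the primary for tadat_live,
// the moved database and its system.users collection appear empty, so
// authentication against tadat_live fails.
var staleView = new Mongo("mongos.example.com:27017").getDB("tadat_live");
staleView.system.users.count(); // 0, even though the users now live on rs2
```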