[SERVER-11364] 1gb sharded collection has only 3 chunks and is not migrating data Created: 24/Oct/13 Updated: 11/Jul/16 Resolved: 04/Nov/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.4.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jeffrey Berger | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | CentOS |
| Attachments: | |
| Operating System: | Linux |
| Steps To Reproduce: | Create a new database and enable sharding on it, then create a collection sharded on a hash of the _id field. Insert a large number of documents; the chunk count stays at 3 even after the collection grows well beyond the maximum size of three chunks (a shell sketch of these steps follows this table). |
| Participants: |
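A minimal mongo shell sketch of the reproduction steps above, assuming a connection to a mongos. The database and collection names follow the ticket (testB.shardTest); the loop bound and padding size are placeholders chosen only to push the collection well past a single 64 MB chunk, not commands taken from the report.

```javascript
// Hedged repro sketch -- run from a mongo shell connected to a mongos.
sh.enableSharding("testB");
sh.shardCollection("testB.shardTest", { _id: "hashed" });

var coll = db.getSiblingDB("testB").shardTest;
var junk = new Array(4096).join("x");   // ~4 KB of filler per document
for (var i = 0; i < 500000; i++) {
    coll.insert({ junk: junk });        // grow the collection well past 1 GB
}

sh.status();   // with the reported bug, the collection stays at 3 chunks and never migrates
```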
| Description |
|
I created a 3-shard cluster and began to put data on it, and found that the data was not being distributed across all three shards: it ends up either on the first two (profilea and profileb) or on the third (profilec). Here is the sh.status() output: --- Sharding Status ---
To test this I created two test collections and inserted documents in order to force mongos to create chunks and move them across the shards. The shard key was a hash of the _id. The collection testB.shardTest has the following stats:
mongos> db.settings.findOne()
{ "_id" : "chunksize", "value" : 64 }
|
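A short shell sketch, assuming a mongos connection, of how one might confirm the symptom described above: the configured chunk size and the per-shard chunk counts for the test collection. The collection name comes from the description; everything else is standard config metadata.

```javascript
// Run from a mongos shell; "testB.shardTest" is the test collection above.
var configDB = db.getSiblingDB("config");

// Configured chunk size (the ticket shows the 64 MB default).
printjson(configDB.settings.findOne({ _id: "chunksize" }));

// Chunks per shard -- with the behaviour reported here, the total stays at 3
// no matter how large the collection grows.
configDB.chunks.distinct("shard", { ns: "testB.shardTest" }).forEach(function (s) {
    print(s + ": " + configDB.chunks.count({ ns: "testB.shardTest", shard: s }) + " chunk(s)");
});
```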
| Comments |
| Comment by Amalia Hawkins [ 04/Nov/13 ] |
|
Great! I'm glad to hear things are working for you. |
| Comment by Jeffrey Berger [ 04/Nov/13 ] |
|
Yes, as soon as I installed ntp and began running ntpd on all the machines in the cluster, it began rebalancing, and now everything is evenly distributed. |
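A small shell sketch, assuming a mongos connection, of how one might verify that the balancer recovered once the clocks were in sync; these are standard shell helpers of that era, not commands quoted from the ticket.

```javascript
// Once the clocks agree, the balancer can take the distributed lock again.
sh.getBalancerState();                                         // true when balancing is enabled
sh.isBalancerRunning();                                        // true while a balancing round is active
db.getSiblingDB("config").locks.findOne({ _id: "balancer" });  // who currently holds the balancer lock
sh.status();                                                   // chunk counts per shard should even out
```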
| Comment by Amalia Hawkins [ 04/Nov/13 ] |
|
Did syncing up the time change or fix the behavior? |
| Comment by Jeffrey Berger [ 25/Oct/13 ] |
|
All the nodes have slightly different times; the biggest difference seems to be about 90 seconds. We've informed the admins of the machines and will sync the time up next week. Thanks a bunch for all your help with this. |
| Comment by Eliot Horowitz (Inactive) [ 25/Oct/13 ] |
|
No downside; I would do that. |
| Comment by Jeffrey Berger [ 25/Oct/13 ] |
|
There is a shift of 50 seconds between two of the shards, and ntpd is not running. Would there be any downside to immediately syncing their times? |
| Comment by Eliot Horowitz (Inactive) [ 25/Oct/13 ] |
|
Ah, great, the clock skew log message is probably the root cause. Can you check the clocks on all of the machines (mongos and mongod) and see whether one (or more) of them is off? Are you running ntpd? |
| Comment by Jeffrey Berger [ 25/Oct/13 ] |
|
As of submitting the bug we had one mongos instance; there are now three different ones. The data is being inserted in the shell with the following code (the junk in there is just to make the document bigger to fill up space faster):

I can in fact attach such logs. I've included the mongos log and the mongod log from the primary on profilec. If we need any other logs from any of the other instances, let me know and I'll pull them.

I have noticed something in the mongos log:

[Balancer] caught exception while doing balance: error checking clock skew of cluster ec2ev-qaprofconf1.sailthru.pvt:27019,ec2ev-qaprofconf2.sailthru.pvt:27019,ec2ev-qaprofconf3.sailthru.pvt:27019 :: caused by :: 13650 clock skew of the cluster ec2ev-qaprofconf1.sailthru.pvt:27019,ec2ev-qaprofconf2.sailthru.pvt:27019,ec2ev-qaprofconf3.sailthru.pvt:27019 is too far out of bounds to allow distributed locking.

This has persisted even after we dropped the shard from the cluster and re-added it; there was no effect. If this is the root cause of us not being able to split and balance across the cluster, we have not been able to recover from this error. What are the causes of this and how would we bring the cluster back to operational? Thanks for all the help. |
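A rough shell sketch of one way to compare clocks across the cluster from a mongo shell, using the config server addresses quoted in the log message above (serverStatus().localTime reports each node's wall-clock time). The mongos and shard hosts would be added to the same list; this is an illustration, not a command from the ticket.

```javascript
// Hypothetical skew check -- extend the list with every mongos and mongod host.
var hosts = [
    "ec2ev-qaprofconf1.sailthru.pvt:27019",
    "ec2ev-qaprofconf2.sailthru.pvt:27019",
    "ec2ev-qaprofconf3.sailthru.pvt:27019"
];
hosts.forEach(function (h) {
    var t = new Mongo(h).getDB("admin").serverStatus().localTime;
    print(h + "  " + t.toISOString() +
          "  (offset vs. this shell: " + (new Date() - t) / 1000 + "s)");
});
```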
| Comment by Eliot Horowitz (Inactive) [ 25/Oct/13 ] |
|
You are definitely right that this is related to splitting.
|