[SERVER-15691] acquiring balancer lock may fail and get stuck with concurrent write traffic Created: 16/Oct/14 Updated: 25/Apr/16 Resolved: 22/Jan/15 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 2.4.12, 2.6.5, 2.7.8 |
| Fix Version/s: | 3.0.0-rc6 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Rui Zhang (Inactive) | Assignee: | Randolph Tan |
| Resolution: | Done | Votes: | 0 |
| Labels: | 28qa |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | |
| Participants: | |
| Description |
During concurrent insert testing, chunks stayed on the primary shard without being balanced, and there is an assertion error in the log file. |
| Comments |
| Comment by Randolph Tan [ 22/Jan/15 ] |

Background: Whenever a new lock document is created, it is initialized with a ts of OID(0).

Issue: When multiple threads try to create a new lock document, some of them will fail to insert it, because the config.locks collection has a unique index over the { ts: 1 } field. Depending on the interleaving of the threads, the unique-index violation may be triggered on the 2nd or 3rd config server. In that case, the lock ends up in a state where it can never be acquired without manual intervention. Note that lock documents are never deleted by the system; they are set to the unlocked state after they are released, so this particular issue can only happen when a lock is about to be used for the first time.

Fix: Make the index non-unique. Existing clusters need to manually drop their { ts: 1 } index and rebuild it with { unique: false }.
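The race described above can be sketched with a toy model (a hypothetical simulation, not MongoDB code: the class, field, and lock names are illustrative). Several threads race to insert the initial lock document with ts = 0 into three "config servers", each enforcing a unique index over ts; at most one thread can succeed on every server, and a thread that wins on one server but loses on another leaves the lock in the stuck state.

```python
import threading

# Toy model of the race (hypothetical: ConfigServer and its fields are
# illustrative, not MongoDB's API). Each "config server" enforces a unique
# index on ts, so only the first insert of the initial lock document
# (ts = 0) succeeds on any given server.
class ConfigServer:
    def __init__(self):
        self.locks = {}                # _id -> lock document
        self.ts_index = set()          # unique index over ts
        self.mutex = threading.Lock()

    def insert_initial_lock(self, lock_id, ts=0):
        with self.mutex:
            if ts in self.ts_index:    # unique-index (duplicate key) violation
                return False
            self.locks[lock_id] = {"_id": lock_id, "ts": ts}
            self.ts_index.add(ts)
            return True

servers = [ConfigServer() for _ in range(3)]
results = {}

def try_create(i):
    # Each thread attempts the insert on all three servers (no short-circuit),
    # mirroring a distributed lock that writes to every config server.
    results[i] = [s.insert_initial_lock("balancer") for s in servers]

threads = [threading.Thread(target=try_create, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each server accepts exactly one ts = 0 insert...
wins_per_server = [sum(r[j] for r in results.values()) for j in range(3)]
print(wins_per_server)                               # [1, 1, 1]
# ...so at most one thread can win on *all* servers; a thread that wins on
# server 1 but hits the duplicate-key error on server 2 or 3 leaves the
# lock document in an inconsistent, permanently stuck state.
print(sum(all(r) for r in results.values()) <= 1)    # True
```

The per-server mutex plus the ts set stand in for the unique index; the duplicate-key failure is what the ticket observed on the 2nd or 3rd config server.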
| Comment by Daniel Pasette (Inactive) [ 22/Jan/15 ] |

Randolph is creating a follow-on ticket for the work involved in having upgrading users drop their existing { ts: 1 } index in config.locks.
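The manual rebuild for upgrading users could look roughly like the following (a hedged sketch using PyMongo; the URI, port, and the default index name "ts_1" are assumptions, and the same operation can equally be run from the mongo shell against each config server):

```python
from pymongo import MongoClient

def rebuild_locks_ts_index(uri):
    """Drop the unique { ts: 1 } index on config.locks and recreate it
    non-unique (assumption: the index has its default name 'ts_1')."""
    locks = MongoClient(uri)["config"]["locks"]
    locks.drop_index("ts_1")                  # remove the unique index
    locks.create_index("ts", unique=False)    # rebuild without uniqueness

# Hypothetical usage: run against each config server in turn.
# for uri in ("mongodb://cfg1:27019", "mongodb://cfg2:27019", "mongodb://cfg3:27019"):
#     rebuild_locks_ts_index(uri)
```

This is an administrative fragment that requires a live config server, so it is shown as a sketch rather than a tested script.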
| Comment by Githook User [ 22/Jan/15 ] |

Author: Randolph Tan (renctan) <randolph@10gen.com>
Message: Make config.locks { ts: 1 } not unique
| Comment by Randolph Tan [ 14/Jan/15 ] |

Based on the latest config dump, the lock document is missing on the third config server. I still haven't figured out why the upsert was not sent to the third config server. The primary logs in the attachment appear to be from a different instance than the one in the config dump, as the process IDs for the lock do not match.
| Comment by Rui Zhang (Inactive) [ 13/Jan/15 ] |

renctan, here are three dumps from a new test. These are the errors from this run in the primary shard log:
| Comment by Julian Wissmann [ 15/Dec/14 ] |

I can confirm this also happens with fewer than 10 concurrent connections. I've seen it with 5 concurrent worker threads inserting from Java. I open one MongoClient and have each thread hold its own DBCollection. Is it advisable to move to each thread holding its own client instance?
| Comment by Randolph Tan [ 20/Oct/14 ] |

rui.zhang: Would you also be able to provide the test script that reproduces this, if you have one? Thanks!
| Comment by Rui Zhang (Inactive) [ 16/Oct/14 ] |

This is one chart of chunk distribution with 2.7.8-pre- [chart attachment]. Sometimes the primary shard could hold all ~60 chunks. The test was run with both 2.7.8-pre- and 2.6.5, and the issue can be seen with both. From the log:

1) 1-2 seconds after starting the insert traffic, I got a splitChunk failure.

2) Later, I saw an assertion backtrace like this.