[SERVER-13410] split does not install metadata under the dblock Created: 31/Mar/14 Updated: 11/Jul/16 Resolved: 01/Apr/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.6.0-rc2 |
| Fix Version/s: | 2.6.0-rc3 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Cailin Nelson | Assignee: | Greg Studer |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Issue Links: |
|
| Operating System: | ALL |
| Participants: | |
| Description |
|
We have observed three separate instances of the mongod process dying unexpectedly with no message in the log file. In each instance, the last message in the log file was "about to log metadata event". In each episode, the mongod in question was a member of a shard in a cluster. The episodes were observed on three separate physical servers. Log files are attached. In the third episode, dmesg said the following:
|
| Comments |
| Comment by Githook User [ 01/Apr/14 ] |
|
Author: Greg Studer (gregstuder) <greg@10gen.com>
Message: |
| Comment by Githook User [ 01/Apr/14 ] |
|
Author: Greg Studer (gregstuder) <greg@10gen.com>
Message: |
| Comment by Greg Studer [ 01/Apr/14 ] |
|
Tentative summary: it's important that mongod collection metadata changes don't happen outside the db write lock, because this lets metadata users make certain assumptions about when it is safe to write. This is (or was) true for all metadata changes except split. Split did not previously cause problems because the only information write operations used was (a copy of) the shard version, and split does not change the logical ranges tracked. In 2.6, we now use FieldRef information cached inside the collection metadata to validate updates. This ensures we don't change shard key fields, and updates assume the FieldRefs will remain valid for as long as the lock is held. On split this isn't the case, so I think this is what is causing the crashes (and possibly silent, incorrect update validation). |
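The locking invariant in this summary can be sketched as a small Python model. This is a minimal illustration under stated assumptions, not mongod's code: `CollectionMetadata`, `ShardedCollection`, `install_split`, and `validate_update` are hypothetical names, a `threading.Lock` stands in for the db write lock, and a tuple of field names stands in for the cached FieldRefs. Bumping only the minor version on split is also a simplification. The sketch shows the fixed behavior: split builds new, immutable metadata and swaps it in only while holding the same lock that update validation holds, so a writer's view of the shard key fields cannot change mid-operation.

```python
import threading
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CollectionMetadata:
    """Hypothetical, simplified stand-in for a shard's cached collection metadata."""
    shard_version: Tuple[int, int]        # (major, minor); simplified version scheme
    shard_key_fields: Tuple[str, ...]     # plays the role of the cached FieldRefs
    chunks: Tuple[Tuple[int, int], ...]   # [min, max) ranges owned by this shard

    def split(self, point: int) -> "CollectionMetadata":
        """Return new metadata with the chunk containing `point` split in two.

        The union of owned ranges is unchanged; only the chunk boundaries
        and the (assumed) minor version change.
        """
        new_chunks: List[Tuple[int, int]] = []
        for lo, hi in self.chunks:
            if lo < point < hi:
                new_chunks += [(lo, point), (point, hi)]
            else:
                new_chunks.append((lo, hi))
        major, minor = self.shard_version
        return replace(self, shard_version=(major, minor + 1), chunks=tuple(new_chunks))


class ShardedCollection:
    """Hypothetical holder of the current metadata, guarded by a stand-in db lock."""

    def __init__(self, metadata: CollectionMetadata) -> None:
        self.db_lock = threading.Lock()   # models the db write lock
        self._metadata = metadata

    def install_split(self, point: int) -> None:
        # Install the new metadata while holding the lock, so no writer
        # holding the lock ever observes the metadata being swapped.
        with self.db_lock:
            self._metadata = self._metadata.split(point)

    def validate_update(self, old_doc: Dict, new_doc: Dict) -> bool:
        # A writer holds the lock for the whole operation and reads the cached
        # shard key fields, trusting they stay valid until the lock is released.
        with self.db_lock:
            fields = self._metadata.shard_key_fields
            return all(old_doc.get(f) == new_doc.get(f) for f in fields)

    @property
    def metadata(self) -> CollectionMetadata:
        with self.db_lock:
            return self._metadata
```

For example, starting from one chunk `[0, 100)` keyed on `x`, `install_split(50)` yields chunks `(0, 50)` and `(50, 100)`, and `validate_update({"x": 1}, {"x": 2})` returns `False` because the shard key changed. Treating the metadata as immutable and swapping the whole object under the lock is one way to honor the assumption that anything a writer reads from it stays valid while the lock is held.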
| Comment by Greg Studer [ 31/Mar/14 ] |
|
The changelogs on the first two config servers agree: there are no changelog entries for any of the three crashes. This means the clusterWrite is never actually being sent, which should narrow things down considerably.