[SERVER-8518] Recovering slave with journal causes Invalid BSONObj size assertions Created: 12/Feb/13 Updated: 08/Mar/13 Resolved: 08/Mar/13 |
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Internal Code, Replication, Stability |
| Affects Version/s: | 2.2.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Juho Mäkinen | Assignee: | James Wahlin |
| Resolution: | Done | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | AWS EC2 m2.2xlarge instances. Instance has four IOPS volumes striped together. |
| Attachments: |
| Issue Links: |
| Operating System: | Linux |
| Steps To Reproduce: | 1) Create 2.2.3 slave, bootstrap it from an existing 2.0.5 replicaset |
| Participants: | |
| Description |
I'm upgrading our production cluster from mongodb 2.0.5 to 2.2.3. I set up a new slave (2.2.3) in the existing mongodb 2.0.5 replica set and let it bootstrap itself over the network. After this I snapshotted the mongodb storage volumes and created a new slave instance from these (to test recovery from backup). After the new instance booted it started to recover itself from the journal. Immediately after recovery completed, the slave started to get assertions about Invalid BSONObj size, which eventually killed the slave. I've done the entire job twice, only to get exactly the same results. The slave mongod.log is attached.

The snapshots were done with RightScale block_device cookbook scripts. The actual steps are:

My plan is to do a rolling upgrade: |
| Comments |
| Comment by James Wahlin [ 08/Mar/13 ] | |
Thanks Juho. I have linked the tickets you created to this one. I am resolving this ticket now as the linked tickets will track action items from this. Please feel free to reopen if you would like to continue discussion here for any reason. James | |
| Comment by Juho Mäkinen [ 08/Mar/13 ] | |
Tickets created:
"Mention the bug related to Linux, xfs and rsync on /manual/administration/backups/ page"
"Better error log explanation if server notices data file corruption"
And the one which I created a few days ago:
"Command to validate all data within a server"
Thanks, | |
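Until such a server-wide validate command exists, something along these lines can approximate it from the mongo shell. This is only a hedged sketch, not an existing MongoDB command; connection details are omitted and it simply loops a full validate over every database and collection:

    # Approximate a "validate everything" command by iterating over all
    # databases and collections; validate(true) performs a full validation.
    mongo --quiet --eval '
      db.adminCommand("listDatabases").databases.forEach(function(d) {
        var cur = db.getSiblingDB(d.name);
        cur.getCollectionNames().forEach(function(c) {
          var res = cur.getCollection(c).validate(true);
          print(d.name + "." + c + " valid=" + res.valid);
        });
      });'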
| Comment by James Wahlin [ 07/Mar/13 ] | |
Hi Juho,

Can you enter the documentation ticket as well? Given you have been living and breathing this, I think you may do a better job than me in describing how you would like the 10gen documentation to reflect it. Action items for you are then:
Please reference ticket #s here after entering. I will then link and close out this issue. Thanks, | |
| Comment by James Wahlin [ 06/Mar/13 ] | |
Hi Juho, Good work in tracking this down. For your suggestions, I would like to encourage you to enter a SERVER ticket for the improvement on the internal assertion warning (please post a link here). I will enter a separate ticket for our documentation team to add a note to our backups page. Thanks, | |
| Comment by Juho Mäkinen [ 06/Mar/13 ] | |
I've resolved this bug and fortunately it's not within MongoDB but in the snapshot system which I used. It turned out to be a kernel bug: in some cases the file mtime (last modification timestamp) is not updated correctly when the kernel flushes mmap()'ed data from memory to disk. Moreover, the bug does not appear on ext3, for example, but does occur when the XFS filesystem is used; some other filesystems are affected as well. The relevant kernel bug is: https://bugzilla.kernel.org/show_bug.cgi?id=2645

Because of this mtime inconsistency, rsync will not notice that the mongodb data files have changed, so rsync will not sync these files correctly unless the -c command line argument is used; but that causes a massive penalty, because rsync then needs to checksum all files to find differences. This means that using rsync with lvm when backing up mongodb requires extra care. I wrote a more detailed explanation on my blog at http://www.juhonkoti.net/2013/03/06/snapshot-recovery-problems-with-mongodb-mmap-xfs-filesystem-and-rsync

Suggestions:
| |
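To illustrate the rsync behaviour described above, a minimal sketch; the dbpath /data/db and the backup target host/path are placeholders, not values from this ticket:

    # Quick-check mode: rsync trusts size + mtime, so data files whose mtime
    # the XFS/mmap kernel bug left stale are silently skipped.
    rsync -av /data/db/ backup-host:/backup/mongodb/

    # Checksum mode: rsync reads and hashes every file, so changed files with
    # a stale mtime are still copied - at the cost of scanning the whole
    # (here roughly 180 GB) data set on each run.
    rsync -avc /data/db/ backup-host:/backup/mongodb/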
| Comment by James Wahlin [ 26/Feb/13 ] | |
Hi Juho,

Thanks for the update on this. I have also seen advice to re-mount your file system as read-only prior to the freeze in order to guarantee a flush. Note that this is not something I have tried or a 10gen best practice, just something I read in a forum.

Going forward on this ticket, let's leave it open to track your solution to the issue. Also please post a link to your pull request once it is in. If you would like, you are also free to enter a new Jira ticket for a global (every db & collection) validate command. If you do so please link it to this ticket. Thanks, | |
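For reference, the remount-read-only idea mentioned above might look roughly like this; the mount point is a placeholder and, as noted, this is untested forum advice rather than a 10gen procedure:

    # Force the filesystem read-only so no new writes can land during the snapshot.
    # Note: the kernel will refuse the remount ("mount point is busy") while any
    # process, including a running mongod, still holds files open for writing.
    mount -o remount,ro /mnt/mongodb
    # ... take the LVM/EBS snapshot here ...
    mount -o remount,rw /mnt/mongodb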
| Comment by Juho Mäkinen [ 26/Feb/13 ] | |
I'm now pretty sure that the problem was after all within the snapshot engine and not in mongodb at all. As I'm striping across two different EBS volumes, it seems that even after freezing the xfs file system and doing an LVM level snapshot, the OS write buffers weren't flushed to the EBS disks fast enough. So when the two EBS disks were snapshotted, some of the pending writes to the LVM snapshot weren't committed to these disks. This resulted in inconsistency within the LVM itself when the LVM snapshot was restored.

I've now done a few tests where I added some 'sync' and 'sleep' commands to the snapshot script, which tries to ensure that the write buffers are flushed to the EBS volumes. I've also got some 'iostat' data, which shows that there is indeed a bunch of IO operations both after the xfs freeze and after the LVM snapshot, which supports the conclusion that there are pending write operations to the underlying disks even after the corresponding command line command has returned.

Another thing is certain: I've experienced similar problems even when I fsyncLock'ed the mongodb, so even if this problem is not within the snapshots (which I now think is very unlikely), the problem is not within the mongodb journal.

What have I (we) learned?
| |
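As a rough illustration of the adjusted snapshot sequence described above - this is not the actual RightScale/Chef code; the mount point, volume group, snapshot size and sleep durations are placeholders, and the fsyncLock step is the optional mongod-level lock mentioned in the thread:

    # Optionally stop writes at the mongod level, then freeze the filesystem.
    mongo admin --eval 'db.fsyncLock()'
    xfs_freeze -f /mnt/mongodb

    # Give pending writes time to actually reach the underlying EBS volumes.
    sync; sleep 10

    # Take the LVM snapshot across the striped volumes, then flush again
    # before triggering the EBS snapshots of both volumes.
    lvcreate --snapshot --size 10G --name mongodb_snap /dev/vg_data/lv_mongodb
    sync; sleep 10
    # ... start the EBS snapshots of both striped volumes here ...

    # Thaw the filesystem and release the mongod lock.
    xfs_freeze -u /mnt/mongodb
    mongo admin --eval 'db.fsyncUnlock()'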
| Comment by Juho Mäkinen [ 22/Feb/13 ] | |
Thanks. I'll need to check on this one, but I'm starting to lean towards the actual problem being within the snapshots. I have a few ideas which I'll try out and do some tests. | |
| Comment by James Wahlin [ 19/Feb/13 ] | |
Hi Juho,

I agree that the snapshot process sounds suspect given your tests with a stopped DB. mongod does have an admin tool which will read your journal files and dump information on their contents to a log file (or to screen if no log is specified). It appears to dump the operation, database, db file #, write size and the first 8 bytes (looking into why we limit this) of the operation to apply, in hex. To perform, run mongod as follows:
Thanks, | |
| Comment by Juho Mäkinen [ 16/Feb/13 ] | |
Yes it is. | |
| Comment by Eliot Horowitz (Inactive) [ 16/Feb/13 ] | |
Is the journal file on the same filesystem as the data? | |
| Comment by Juho Mäkinen [ 16/Feb/13 ] | |
Both tests recovered just fine without any problems. Also the sha1sums matched perfectly, so the snapshots seem to work. I also tried to kill -9 a mongodb server and restart it right after without snapshotting quite a few times, and I couldn't reproduce the bug here.

But this also got me thinking. First, recall that when I tested my snapshot script which locked the database first, I noticed that there was a small journal file in the snapshot? That was probably because my snapshot script started mongodb again before the actual snapshot was able to complete. In all cases where the bug has manifested, there has been a journal file which was snapshotted. So maybe the issue is that the underlying filesystem and/or ebs snapshot is not really consistent, and the journal file gets corrupted somehow.

I tried to google what's actually inside the journal file, so that I could examine whether it's corrupted or not, but I couldn't find any details. I've saved this journal file (it's just about 800KiB) and I'll do more trials, both to get my backup script to work so that there's no journal file in the snapshot, and to get more data on this theory. I can also send the journal file to you via email, but I don't want to link it here directly as it might contain some private data. | |
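For the sha1sum comparison mentioned above, a hedged sketch of one way to do it; the dbpath is a placeholder. Run it on the source after a clean shutdown, repeat on the restored snapshot, and diff the two lists:

    # Checksum every data file under the dbpath into a sorted list.
    cd /data/db && find . -type f -print0 | sort -z | xargs -0 sha1sum > /tmp/datafiles.sha1

    # ... restore the snapshot on the other instance, repeat, copy the file over, then:
    diff /tmp/datafiles.sha1 /tmp/datafiles-restored.sha1 && echo "data files identical"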
| Comment by James Wahlin [ 14/Feb/13 ] | |
Hi Juho,

For the kill -9 test, if the instance does start up with no issues afterwards, it may be worth trying several times. I am not confident that this will occur 100% of the time at this point. For the fsync+lock, I am surprised as well that the journal files were present in your snapshot. If you have it, can you post your log from the fsync+lock period? I would be interested in seeing whether there were any errors during this operation.

As for the primary and secondary logs from your repro on the Chef snapshot, I don't see anything of note beyond the errors we have seen before in the secondary log. Otherwise your plan above to test stats3 and stats4 with a clean and unclean shutdown looks good to me. Let me know what you find from your test. Thanks, | |
| Comment by Juho Mäkinen [ 14/Feb/13 ] | |
I tried this with one fresh slave without being able to reproduce the problem. Here are the exact steps I took:

Being a bit frustrated as this didn't replicate the problem, I forgot to save the mongodb.log file, so I have nothing else to show from this.

Meanwhile I've worked on a Chef snapshot recipe which does the following to make the snapshot: admin command

I was sure that this would work just fine, so I snapshotted one of my replicas with this and relaunched it. To my surprise, the journal/ directory had one journal file of about 3MB. After I started mongodb on this machine the same bug appeared again, here are the logs: This is the secondary (aka. stats3.applifier.info) which has the problems: https://gist.github.com/garo/27f8678b6e9d93b29c84

I've now launched another slave, called stats4, to speed up working with this issue. To recap, this is our cluster combination:

Our primary receives about 500 queries and around 400 updates per second. On peak time this can climb up to 2000 queries per second. We don't currently use slaveOk in our queries. As we speak, both stats3 and stats4 are bootstrapping from stats2. Next, unless you have a better suggestion, I'm planning the following:

On stats4, I can save the journal files in case I get the bug reproduced, so you can take a look at them. | |
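Since every occurrence of the bug so far has involved a journal file ending up in the snapshot, a hedged pre-snapshot sanity check might look like this; the dbpath is a placeholder, and the check relies on MMAP journal files being named j._0, j._1 and so on:

    # After a clean shutdown the journal directory should hold no j._* files;
    # if any are present, mongod is still running or was not shut down cleanly.
    ls -lh /var/lib/mongodb/journal/
    if ls /var/lib/mongodb/journal/j._* >/dev/null 2>&1; then
        echo "journal files present - do not snapshot yet"
    fi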
| Comment by James Wahlin [ 13/Feb/13 ] | |
Hi Juho,

I am going to try to reproduce this on my end. In order to simplify reproduction, can you try to trigger this without the snapshot/resync step? If this is corruption caused by the journal then I would expect reproduction to be possible by:

Additionally, can you attach the log file for your primary, covering the period of time for the error above? We have seen in the past updates which can be handled by the primary but are fatal on the secondary. This does not look to be a replication issue given that a clean shutdown + snapshot works, but I would like to confirm. Thanks, | |
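A hedged sketch of the no-snapshot repro idea, assuming it amounts to the kill -9 test described elsewhere in this thread: kill mongod uncleanly while it is replicating, restart it so journal recovery runs, and watch for the assertion. The dbpath, replica set name and log path are placeholders:

    # Simulate a crash while the secondary is applying replicated writes.
    kill -9 $(pidof mongod)

    # Restart so the journal is replayed on startup.
    mongod --dbpath /data/db --replSet stats --fork --logpath /var/log/mongodb/mongod.log

    # After recovery, check the log for the assertion from this ticket.
    grep -i "Invalid BSONObj size" /var/log/mongodb/mongod.log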
| Comment by Juho Mäkinen [ 13/Feb/13 ] | |
I've verified that if I stop mongodb gracefully, so that the server commits the journal files into the data files and empties the journal directory, then there's no problem with snapshotting the drives and relaunching the EC2 server from these snapshots. It seems that only journal recovery triggers this problem. I haven't yet done any further analysis on the actual journal files as I'm not familiar with any such procedure. Our dataset is fairly large, about 180GB including indices, so the bug might be triggered only after a sufficient amount of data.

I've found another user who suffers from a similar problem with MongoDB 2.2.2, as described in this thread: https://groups.google.com/d/topic/mongodb-user/H-gYoCPcI_c/discussion

I'm happy to assist, produce more debugging information and analyze our data to find more details on this issue.
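For completeness, a sketch of the graceful shutdown described above (the dbpath is a placeholder); either form flushes the journal into the data files and leaves the journal directory empty before the volumes are snapshotted:

    # Via the server binary, which locates the running mongod through the dbpath:
    mongod --dbpath /data/db --shutdown

    # Or from the mongo shell:
    mongo admin --eval 'db.shutdownServer()'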