[SERVER-8518] Recovering slave with journal causes "Invalid BSONObj size" assertions Created: 12/Feb/13  Updated: 08/Mar/13  Resolved: 08/Mar/13

Status: Closed
Project: Core Server
Component/s: Internal Code, Replication, Stability
Affects Version/s: 2.2.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Juho Mäkinen Assignee: James Wahlin
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

AWS EC2 m2.2xlarge instances. Each instance has four provisioned-IOPS EBS volumes striped together.


Attachments: Text File mongod.log    
Issue Links:
Related
is related to DOCS-1212 Mention the bug related to Linux, xfs... Closed
is related to SERVER-8867 Command to validate all data within a... Closed
is related to SERVER-8908 Better error log explanation if serve... Closed
Operating System: Linux
Steps To Reproduce:

1) Create a 2.2.3 slave and bootstrap it from an existing 2.0.5 replica set
2) Snapshot the slave's live disks
3) Create a new slave from these snapshots
4) Let the slave recover with the journal

Participants:

 Description   

I'm upgrading our production cluster from MongoDB 2.0.5 to 2.2.3. I set up a new 2.2.3 slave in the existing 2.0.5 replica set and let it bootstrap itself over the network. After this I snapshotted the MongoDB storage volumes and created a new slave instance from those snapshots (to test recovery from backup).

After the new instance booted, it started to recover itself from the journal. Immediately after recovery completed, the slave started getting "Invalid BSONObj size" assertions, which eventually killed the slave.

I've done the entire procedure twice, with exactly the same results both times. The slave's mongod.log is attached.

The snapshots were taken with the RightScale block_device cookbook scripts. The actual steps are:
1) Lock (freeze) the underlying XFS filesystem
2) Create an LVM snapshot
3) Unlock the underlying XFS filesystem
4) Instruct each EBS stripe under the LVM to take an EBS snapshot
This procedure is well tested by RightScale and should ensure that the snapshot is atomic and physically intact after the stripes are rejoined. The LVM snapshot is used to restore the volume. A rough command-level sketch of this flow follows.
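Roughly, in commands (the mount point, volume group, LV name and snapshot size here are placeholders, not the actual cookbook code):

xfs_freeze -f /mnt/mongodb                                                   # 1) lock the XFS filesystem
lvcreate --snapshot --size 5G --name mongodb_snap /dev/vg_data/lv_mongodb    # 2) LVM snapshot
xfs_freeze -u /mnt/mongodb                                                   # 3) unlock the XFS filesystem
# 4) an EBS snapshot of each underlying stripe is then triggered via the EC2 API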

My plan is to do a rolling upgrade:
1) First add a second slave running 2.2.3
2) Replace the old slave with 2.2.3 by bootstrapping it from a snapshot of the already-created slave and letting it catch up after recovering from the journal
3) Step the old primary down and do the same for it.



 Comments   
Comment by James Wahlin [ 08/Mar/13 ]

Thanks Juho. I have linked the tickets you created to this one. I am resolving this ticket now as the linked tickets will track action items from this. Please feel free to reopen if you would like to continue discussion here for any reason.

James

Comment by Juho Mäkinen [ 08/Mar/13 ]

Tickets created:

"Mention the bug related to Linux, xfs and rsync on /manual/administration/backups/ page"
https://jira.mongodb.org/browse/DOCS-1212

"Better error log explanation if server notices data file corruption"
https://jira.mongodb.org/browse/SERVER-8908

And the one which I created a few days ago:

"Command to validate all data within a server"
https://jira.mongodb.org/browse/SERVER-8867

Thanks,
Juho

Comment by James Wahlin [ 07/Mar/13 ]

Hi Juho,

Can you enter the documentation ticket as well? Given that you have been living and breathing this, I think you may do a better job than me at describing how you would like the 10gen documentation to reflect it. Your action items are then:

  1. Enter a SERVER ticket asking for improvement of the warning issued on encountering possible file corruption.
  2. Enter a DOCS ticket asking for a reference to this issue in the 10gen backup documentation.

Please reference the ticket numbers here after entering them. I will then link them and close out this issue.

Thanks,
James

Comment by James Wahlin [ 06/Mar/13 ]

Hi Juho,

Good work tracking this down. For your suggestions, I would encourage you to enter a SERVER ticket for improving the internal assertion warning (please post a link here). I will enter a separate ticket for our documentation team to add a note to our backups page.

Thanks,
James

Comment by Juho Mäkinen [ 06/Mar/13 ]

I've resolved this bug, and fortunately it's not in MongoDB but in the snapshot system I used. It turned out to be a kernel bug: in some cases the file mtime (last modification timestamp) is not updated correctly when the kernel flushes mmap()'ed data from memory to disk. Furthermore, the bug does not appear on ext3, for example, but does appear when the XFS filesystem is used; some other filesystems are affected as well.

The relevant kernel bug is: https://bugzilla.kernel.org/show_bug.cgi?id=2645

Because of this mtime inconsistency, rsync does not notice that the MongoDB data files have changed, so it will not sync them correctly unless the -c (checksum) command line argument is used. That, however, carries a massive penalty, because rsync then has to read every file in full to find the differences. This means that using rsync together with LVM when backing up MongoDB requires extra care.
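The checksum-based workaround is roughly the following (source and destination paths are placeholders):

rsync -a --checksum /mnt/mongodb-snapshot/ /mnt/backup/    # compare file contents instead of trusting size+mtime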

I wrote a more detailed explanation to my blog at http://www.juhonkoti.net/2013/03/06/snapshot-recovery-problems-with-mongodb-mmap-xfs-filesystem-and-rsync

Suggestions:

Comment by James Wahlin [ 26/Feb/13 ]

Hi Juho,

Thanks for the update on this. I have also seen advice to re-mount your filesystem as read-only prior to the freeze in order to guarantee a flush. Note that this is not something I have tried, nor a 10gen best practice, just something I read in a forum.

Going forward on this ticket, let's leave it open to track your solution to the issue. Also, please post a link to your pull request once it's in.

If you would like, you are also free to enter a new Jira ticket for a global (every db & collection) validate command. If you do so, please link it to this ticket.

Thanks,
James

Comment by Juho Mäkinen [ 26/Feb/13 ]

I'm now pretty sure that the problem was, after all, in the snapshot engine and not in MongoDB at all.

As I'm striping across two different EBS volumes, it seems that even after locking the XFS filesystem and taking the LVM-level snapshot, the OS write buffers weren't flushed to the EBS disks fast enough. So when the two EBS disks were snapshotted, some of the pending writes to the LVM snapshot hadn't been committed to those disks. This resulted in an inconsistency within the LVM itself when the LVM snapshot was restored.

I've now done a few tests where I added some 'sync' and 'sleep' commands to the snapshot script, which try to ensure that the write buffers are flushed to the EBS volumes before the EBS snapshots are taken. I've also gathered some 'iostat' data, which shows that there is indeed a burst of IO operations both after the XFS freeze and after the LVM snapshot; this supports the theory that there are still pending write operations to the underlying disks even after the corresponding command line command has returned.
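The addition to the script is essentially the following (the 30-second sleep is an arbitrary value, not a verified bound):

sync        # flush dirty write buffers
sleep 30    # give the pending IO time to reach the EBS volumes
# ...then take the EBS snapshots of the stripes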

One other thing is certain: I've experienced similar problems even when I fsyncLock'ed MongoDB first, so even if this problem is not in the snapshots (which I now think is very unlikely), it is not in the MongoDB journal.

What have I(we) learned?

  • I'm not yet fully satisfied with the 'sync' and 'sleep' workarounds I have now. I'm hoping to find a reasonable solution which doesn't sleep any longer than absolutely necessary. I can add a warning note to the MongoDB documentation and make a pull request for the docs.
  • MongoDB has a nice db.collection.validate({full:true}) command, which I've used to verify that the data files had some corruption after the snapshots, but MongoDB seems to lack a command to validate every collection in every database. That would probably have sped up debugging this. Even better if it could report problems to the log file as they appear during the scan, allowing an early bail-out when corruption is found, for even faster iteration. A rough shell loop that covers all databases is sketched below.
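Something like the following covers every collection from the shell, though a built-in command that logs as it goes would be much nicer (host and port are placeholders):

mongo --quiet --host localhost --port 27017 --eval '
  db.getMongo().getDBNames().forEach(function(name) {
    var d = db.getSiblingDB(name);
    d.getCollectionNames().forEach(function(coll) {
      var res = d.getCollection(coll).validate({full: true});
      print(name + "." + coll + " valid=" + res.valid);
    });
  });
'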

Comment by Juho Mäkinen [ 22/Feb/13 ]

Thanks. I'll need to check on this one, but I'm starting to lean towards the actual problem being in the snapshots. I have a few ideas which I'll try out, and I'll run some tests.

Comment by James Wahlin [ 19/Feb/13 ]

Hi Juho,

I agree that the snapshot process sounds suspect given your tests with a stopped DB.

mongod does have an admin option which will read your journal files and dump information about their contents to a log file (or to the screen if no log is specified). It appears to dump the operation, database, db file #, write size, and the first 8 bytes (I'm looking into why we limit this) of the operation to apply, in hex. To do this, run mongod as follows:

mongod --journal --durOptions 7 --dbpath YOUR_DB_PATH_HERE

Thanks,
James

Comment by Juho Mäkinen [ 16/Feb/13 ]

Yes it is.

Comment by Eliot Horowitz (Inactive) [ 16/Feb/13 ]

Is the journal file on the same filesystem as the data?

Comment by Juho Mäkinen [ 16/Feb/13 ]

Both tests recovered just fine without any problems. The sha1sums also matched perfectly, so the snapshots seem to work. I also tried to kill -9 a MongoDB server and restart it right afterwards, without snapshotting, quite a few times, and I couldn't reproduce the bug that way.

But this also got me thinking. Recall that when I tested my snapshot script which locked the database first, I noticed there was a small journal file in the snapshot? That was probably because my snapshot script started MongoDB again before the actual snapshot was able to complete. In every case where the bug has manifested, there has been a journal file in the snapshot.

So maybe the issue is that the underlying filesystem and/or EBS snapshot is not really consistent, and the journal file somehow gets corrupted. I tried to google what's actually inside the journal file, so that I could examine whether it's corrupted or not, but I couldn't find any details. I've saved this journal file (it's only about 800 KiB) and I'll do more trials, both to get my backup script to work so that there's no journal file in the snapshot, and to gather more data on this theory. I can also send the journal file to you via email, but I don't want to attach it here directly as it might contain some private data.

Comment by James Wahlin [ 14/Feb/13 ]

Hi Juho,

For the kill -9 test: if the instance does start up with no issues afterwards, it may be worth trying several more times. I am not confident that this will occur 100% of the time at this point.

For the fsync+lock, I am surprised as well that the journal files were present in your snapshot. If you still have it, can you post your log from the fsync+lock period? I would be interested in seeing whether there were any errors during this operation.

As for the primary and secondary logs from your repro with the Chef snapshot, I don't see anything of note beyond the errors we have seen before in the secondary log.

Otherwise, your plan above to test stats3 and stats4 with a clean and an unclean shutdown looks good to me. Let me know what you find from your tests.

Thanks,
James

Comment by Juho Mäkinen [ 14/Feb/13 ]

I tried this with one fresh slave without being able to reproduce the problem. Here are the exact steps I took:
1) Set up a new slave: rs.add({_id:3, host:"ip-10xxxx.ec2.internal", priority:0, hidden:true, votes:0});
2) Waited until the slave was up. This took over eight hours during the night; the total data directory size was around 190 GB. The bootstrap had just finished when I got back to work.
3) If I recall correctly, the journal/ directory had two files: one at 1 GiB and one around 700 MiB. I kill -9'ed the mongod process and started it again. The recovery went without any problems and the system ran just fine.

Being a bit frustrated that this didn't reproduce the problem, I forgot to save the mongodb.log file, so I have nothing else to show from this.

Meanwhile I've worked on a Chef snapshot recipe which does the following to take the snapshot:
1) Lock the database with the {fsync:1, lock:true} admin command
2) Freeze the XFS filesystem
3) Snapshot the filesystem with an LVM snapshot
4) Unlock the database
5) Snapshot the underlying EBS volumes with the Amazon EBS snapshot tools. The idea is to restore the volume from the LVM snapshot, which is now captured in the EBS snapshot. A rough command-level sketch of the recipe follows.
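Roughly, in commands (the mount point, VG/LV names and snapshot size are placeholders, not the actual recipe; the XFS unfreeze presumably also happens around the unlock step):

mongo --eval 'printjson(db.fsyncLock())'                                     # 1) {fsync:1, lock:true}
xfs_freeze -f /mnt/mongodb                                                   # 2) freeze XFS
lvcreate --snapshot --size 5G --name mongodb_snap /dev/vg_data/lv_mongodb    # 3) LVM snapshot
mongo --eval 'printjson(db.fsyncUnlock())'                                   # 4) unlock the database
# 5) EBS-snapshot the underlying volumes with the Amazon tooling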

I was sure that this would work just fine, so I snapshotted one of my replicas with it and relaunched. To my surprise, the journal/ directory had one journal file of about 3 MB. After I started mongod on this machine the same bug appeared again; here are the logs:

This is the secondary (aka. stats3.applifier.info) which has problems: https://gist.github.com/garo/27f8678b6e9d93b29c84
This is the primary at the same time: https://gist.github.com/garo/400e6a500bc29e5d163c

I've now launched another slave, called stats4, to speed up working on this issue. To recap, this is our cluster composition:
stats1.applifier.info: our primary, runs 2.0.5, set as priority 2
stats2.applifier.info: our secondary, runs 2.0.5, set as priority 1
stats3.applifier.info: our new secondary, runs 2.2.3, set as hidden with priority 0. The plan is to back up the cluster by snapshotting it every hour.
stats4.applifier.info: another new secondary, which I launched to speed up testing this bug, runs 2.2.3, set as hidden, no votes, priority 0

Our primary receives about 500 queries and around 400 updates per second. At peak times this can climb up to 2000 queries per second. We don't currently use slaveOk in our queries.

As we speak, both stats3 and stats4 are bootstrapping from stats2. Next, unless you have a better suggestion, I'm planning the following:
1) Wait until both stats3 and stats4 have bootstrapped.

  • On stats3, shut down the database gracefully and verify that the journal/ directory has no files. After this, calculate a sha1sum for every file in the data directory. Then run the snapshot and try to relaunch. Verify with the sha1sums that the files are intact after snapshot recovery, to rule out the snapshot itself not working, and see whether the instance relaunches correctly.
  • On stats4, do the same, but instead of a clean shutdown, kill -9 the mongod.
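For the sha1sum comparison, something along these lines should do (the data path is a placeholder):

find /var/lib/mongodb -type f -exec sha1sum {} \; > /tmp/datafiles.sha1    # before the snapshot
sha1sum -c /tmp/datafiles.sha1                                             # after restoring from the snapshot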

On stats4 I can save the journal files in case I get the bug reproduced, so you can take a look at them.

Comment by James Wahlin [ 13/Feb/13 ]

Hi Juho,

I am going to try to reproduce this on my end. In order to simplify reproduction, can you try to trigger this without the snapshot/resync step? If this is corruption caused by the journal, then I would expect reproduction to be possible by:
1) Starting a new 2.2.3 replica, bootstrapping from the network
2) Once it has caught up, performing a hard kill on this replica
3) Restarting
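In commands, roughly (the replica set name and dbpath are placeholders, not an exact recipe):

kill -9 $(pidof mongod)                                       # hard kill once the replica has caught up
mongod --replSet YOUR_SET --dbpath YOUR_DB_PATH --journal     # restart and let journal recovery run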

Additionally, can you attach the log file for your primary, covering the period of time of the error above? We have seen in the past updates which can be handled by the primary but are fatal on the secondary. This does not look like a replication issue, given that a clean shutdown + snapshot works, but I would like to confirm.

Thanks,
James

Comment by Juho Mäkinen [ 13/Feb/13 ]

I've verified that if I stop MongoDB gracefully, so that the server commits the journal files into the data files and empties the journal directory, then there's no problem with snapshotting the drives and relaunching the EC2 server from those snapshots. It seems that only journal recovery triggers this problem.

I haven't yet done any further analysis of the actual journal files, as I'm not familiar with any such procedure. Our dataset is fairly large, about 180 GB including indices, so the bug might only be triggered after a sufficient amount of data. I've found another user who suffers from a similar problem with MongoDB 2.2.2, as described in this thread: https://groups.google.com/d/topic/mongodb-user/H-gYoCPcI_c/discussion

I'm happy to assist, produce more debugging information, and analyze our data to find more details on this issue.
