[SERVER-3367] Minor Data Loss for Slave (master/slave setup) Created: 05/Jul/11 Updated: 12/Jul/16 Resolved: 19/Jan/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Stability, Storage |
| Affects Version/s: | 1.8.0, 1.8.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Stone, Gao | Assignee: | Kristina Chodorow (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | dataloss |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Ubuntu Server 10.10 64-bit |
| Issue Links: | |
| Operating System: | Linux |
| Participants: | |
| Description |
|
Our current master DB server didn't have enough RAM, so I tried to manually switch one of the slaves to master. Here are the stats:
> db.printReplicationInfo()
> db.my_collection.count()
> db.my_collection.count() *on slave 2
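To quantify and localize the loss from a single shell (the hostnames and db/collection names below are placeholders), something like this works:

// connect to both nodes from one mongo shell session
var master = connect("master-host:27017/mydb");
var slave  = connect("slave-host:27017/mydb");
print("master: " + master.my_collection.count());
print("slave:  " + slave.my_collection.count());
// list _ids present on the master but absent on the slave
// (a full scan: slow on a huge collection, but exact)
master.my_collection.find({}, { _id: 1 }).forEach(function (doc) {
  if (slave.my_collection.findOne({ _id: doc._id }) === null) {
    print("missing on slave: " + doc._id);
  }
});

Comparing the counts is enough to detect the loss; the _id scan pinpoints which documents to recheck. |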
| Comments |
| Comment by Kristina Chodorow (Inactive) [ 19/Jan/12 ] |
|
Please open a new issue if you see this in 2.0.2. |
| Comment by Stone, Gao [ 17/Dec/11 ] |
|
We have an old slave running MongoDB 2.0.1, which has the initial-sync data loss bug, so some docs were lost during its initial sync. My point is: if this bug only happens when setting up a slave and doing the initial sync, then docs added after the initial sync should not get lost. For example, after the initial sync the master has 10000 docs, but the slave only got 9000 due to the bug. If the bug only affects the initial stage, (# docs on master) - (# docs on slave) should always equal 1000. Otherwise, if data loss also happens after the initial sync, the gap between master and slave will grow larger than (>) 1000. But last week I noticed that our old slave running 2.0.1 didn't pick up the docs added on the master after the initial sync. I will keep an eye on the data loss issue of the new slave (2.0.2). Hope it never happens.
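To keep that eye on it, one option is to log the count gap over time: if the bug is confined to initial sync, the gap should hold steady at the initial-sync deficit once the slave has caught up. A minimal sketch (hostnames and db/collection names are placeholders):

var master = connect("master-host:27017/mydb");
var slave  = connect("slave-host:27017/mydb");
while (true) {
  var gap = master.my_collection.count() - slave.my_collection.count();
  print(Date() + "  gap: " + gap);  // should stay at the initial-sync deficit (e.g. 1000)
  sleep(60 * 1000);                 // re-check once a minute
}

Transient growth while the slave applies a burst of writes is expected; a gap that grows and never shrinks back is the signal. |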
| Comment by Eliot Horowitz (Inactive) [ 17/Dec/11 ] |
|
What do you mean 2 docs got lost? |
| Comment by Stone, Gao [ 17/Dec/11 ] |
|
Thanks for your hard work. It seems that 2.0.2 fixed the initial sync data loss bug. Are you sure the data loss bug only affects initial sync? I noticed that two docs were lost again after the initial sync on our old slave (2.0.1). If that bug only happens at initial sync, follow-up docs should not disappear on the slave. |
| Comment by Kristina Chodorow (Inactive) [ 16/Dec/11 ] |
|
Can you please try 2.0.2 and make sure that it works correctly for you? |
| Comment by auto [ 30/Nov/11 ] |
|
Author: Kristina (kchodorow) <kristina@10gen.com>
Message: add tests for recloning missing docs |
| Comment by auto [ 14/Nov/11 ] |
|
Author: Kristina (kchodorow) <kristina@10gen.com>
Message: Generalize recloning docs on initial oplog application
Conflicts: db/repl.cpp |
| Comment by Eliot Horowitz (Inactive) [ 14/Nov/11 ] |
|
Stone,

Thanks for the feedback. You are correct that I missed this case in my statement about immediate fixes, and it should have been fixed much faster once the bug was found. Thanks for pointing it out. We appreciate pointers from the community when we screw up.

What issues with sharding are you referring to? Just count() being eventually consistent? It's only off during a migration, and you can always do an index scan to get the exact global number if you need it.

Are you still seeing problems with larger-than-RAM data sets in 2.0? There are lots of improvements in 2.0, with more coming in 2.2.
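For example (collection name is a placeholder), iterating the cursor rather than reading the chunk metadata that count() uses gives an exact number:

// count() on a sharded collection can over-count during a chunk migration;
// itcount() actually iterates the results, so it returns the exact number
> db.my_collection.find().itcount()

It is slower than count() on a large collection, since it touches every result, but it is not fooled by in-flight migrations.

-Eliot |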
| Comment by Stone, Gao [ 14/Nov/11 ] |
|
@Dwight thanks for your advice. Finally, after 4 months, I got the helpful, responsible response I expected! I followed the recent buzz about 'Don't use MongoDB', and although the original author confessed that it was a hoax, admit it or not, it did tell some truth about MongoDB; most of the article is true and it's not FUD.

I have some things to say, and I'm not sure whether this is the right place. We like MongoDB, and we thank you guys at 10gen for investing so much effort to keep MongoDB evolving so fast. But as a long-time MongoDB user (since 1.6.x) who has also built high-load, write-heavy products with MongoDB, I think it's time to give some feedback about MongoDB's problems.

We engineers are born to solve tech problems. There's no bug-free software. But when bugs do happen, we should just face the damn reality, try our best to find the bug, review the source code, and fix it ASAP. On the problem of data loss, @Eliot (maybe also 10gen) was not honest and tried to hide the truth that data loss did happen. I submitted this bug 4 months ago and @Eliot was involved. But @Eliot responded to the post with the following: "There has never been a case of a record disappearing that we either have not been able to trace to a bug that was fixed immediately, or other environmental issues. If you can link to a case number, we can at least try to understand or explain what happened. Clearly a case like this would be incredibly serious, and if this did happen to you I hope you told us and if you did, we were able to understand and fix immediately." And 10gen president Max Schireson said the following: "we do not believe there have been substantiated reports of data loss when the system is used in the recommended way." Maybe you guys finally figured out that data loss is critical for a database. But problems aren't solved by hiding the truth; they're solved by being honest, facing reality, and investing more effort in the hardest and most important problems.

Yes, MongoDB's API is sweet, and it's quick for prototyping and low-load usage. MongoDB's version number changes fast, but the hard and important issues remain unresolved. From 1.6.x to the latest 2.0.1, durability (journaling) is the only critical thing introduced, and sharding is still not very reliable. For most users, I think cool features x, y, z are not that important (they might be for certain use cases), but data integrity, sharding reliability, fine-grained locking, and storage efficiency are critical for all users. You guys are smart, but please reorder the MongoDB to-do list and fix the critical stuff before adding more cool features. Thank you! (We really don't want to migrate so much legacy data to other solutions because of the long-unresolved issues, but we are considering it.)

1) Sharding reliability.
2) Fine-grained locking. This is partially solved by putting each write-intensive collection in its own db, but we can go no further; I think document-level locking might be more helpful.
3) Efficiency of storage.
4) A more reliable storage and memory management engine. Performance degrades rapidly when our data can't fit in RAM. Sometimes we noticed that mongod stopped responding every 10s when faulting. If MongoDB can be as deterministic as MySQL when data doesn't fit in RAM, it will be pretty helpful for large data sets.

Above are just my personal feedback and opinions as a long-time MongoDB user. I hope they help make MongoDB better, bit by bit. Thanks again for your hard work to give us an alternative when choosing a database. Stone |
| Comment by auto [ 11/Nov/11 ] |
|
Author: Kristina (kchodorow) <kristina@10gen.com>
Message: Generalize recloning docs on initial oplog application |
| Comment by Dwight Merriman [ 11/Nov/11 ] |
|
@kristina you are right, the fix is not in master/slave, only in replica sets. 703ca00a5749c8660d7a975c9d03ae585d790ddb marks the line that needs work in the code. @stone kristina will fix. Possible solutions: (1) use a build which has the replica set fix mentioned above, and use replica sets instead of master/slave. If you are doing an initial sync, this is just as easy as a fresh slave instantiation anyway. This is a good idea, as future work on replication will mostly be on replica sets; master/slave is the "old" replication in mongo.
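Roughly, once both mongods are restarted with --replSet, initiating the set looks like this (the set name and hosts below are placeholders):

// run from the shell of the node that should become PRIMARY
> rs.initiate({
    _id: "myset",
    members: [
      { _id: 0, host: "master-host:27017" },
      { _id: 1, host: "slave-host:27017" }
    ]
  })
> rs.status()  // the second member runs an initial sync, then reports SECONDARY

Note that a two-member set cannot elect a primary if either node goes down; a third member or an arbiter is usually added for that. |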
| Comment by Kristina Chodorow (Inactive) [ 09/Nov/11 ] |
|
I'll let you know as soon as there's something to try. I got sick last week and just got back. I haven't fixed anything yet, but I'll try to figure it out tomorrow. |
| Comment by Stone, Gao [ 04/Nov/11 ] |
|
Tried again with 2.0.1 on a small collection (~1M docs); synced a new slave, and 1433 docs were lost. |
| Comment by Kristina Chodorow (Inactive) [ 01/Nov/11 ] |
|
I just realized you're using master/slave and the fixes we did only help replica sets. Let me look into master-slave and see if I can make an analogous fix. |
| Comment by Eliot Horowitz (Inactive) [ 30/Oct/11 ] |
|
Is it possible to give us a copy of your dataset? We can make this ticket private to do so. |
| Comment by Stone, Gao [ 30/Oct/11 ] |
|
Tried 2.0.1: the first time 679 docs were lost, the second time 116 docs. I pray to God every time I sync a slave from the big-collection db, hoping not to lose data. |
| Comment by Kristina Chodorow (Inactive) [ 19/Oct/11 ] |
|
Yes. We also had a couple of other people who ran into the bug try it on their own workloads, and we're working on a more deterministic testing framework for initial syncing. |
| Comment by Stone, Gao [ 06/Oct/11 ] |
|
Waiting for the 2.0.1 release; I hope it will fix this critical bug. Btw: how do you guys test for this bug? Do you have a big data set with a write (update) intensive master? |
| Comment by Kristina Chodorow (Inactive) [ 05/Oct/11 ] |
|
Any update on this? |
| Comment by Kristina Chodorow (Inactive) [ 23/Sep/11 ] |
|
These are the three major commits related to it: https://github.com/mongodb/mongo/commit/fa6ebc65bee94d2514f28afe5c6094f352dd28d3 Don't bother trying the dev release until Monday, though. The build was broken last night, so the last fix didn't make it into the binaries available for download. |
| Comment by Stone, Gao [ 23/Sep/11 ] |
|
Can you point out which git commit (or issue #) might have fixed the bug? |
| Comment by Kristina Chodorow (Inactive) [ 22/Sep/11 ] |
|
Can you try the Development Release (Unstable) Nightly from http://www.mongodb.org/downloads? We fixed a bug this week that could have caused data loss on initial sync. |
| Comment by Stone, Gao [ 22/Sep/11 ] |
|
Can you guys verify this by creating a write-intensive master with more than 100M docs, then syncing a slave and seeing whether it loses data? It happens so frequently that not losing data is the rare case.
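Even a crude single-threaded shell loop can stand in for such a workload during the slave's initial sync (the collection and field names below are made up; a multi-threaded writer through a driver hits the server much harder):

// run against the master while the slave performs its initial sync
for (var i = 0; i < 1000000; i++) {
  db.stress.insert({ _id: i, payload: "x", n: 0 });
  // update a random earlier doc so inserts and updates interleave
  db.stress.update({ _id: Math.floor(Math.random() * (i + 1)) }, { $inc: { n: 1 } });
}
// afterwards, compare db.stress.count() on the master and the slave

Interleaving updates matters, because docs that change or move while the clone is in progress are presumably the ones at risk. |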
| Comment by Stone, Gao [ 22/Sep/11 ] |
|
Will time affect replication? I noticed that the clocks on the master and slave are different. |
| Comment by Stone, Gao [ 22/Sep/11 ] |
|
I never tried deleting mongod.lock. Using 1.8.3 currently; I did the experiment just now, and 30k+ docs were lost again. |
| Comment by Kristina Chodorow (Inactive) [ 06/Sep/11 ] |
|
The slave is losing data either because of a bug in MongoDB or because of an operator error on your end; we are trying to figure out which. Attaching any and all logs you have would help us do this. Have you ever deleted a mongod.lock file before starting any of the servers? (Please do not restrict your comments to be viewable only to certain users.) |
| Comment by Stone, Gao [ 04/Sep/11 ] |
|
Tried with MongoDB 1.8.3. Again, 259 documents were lost in the aforementioned collection, and 3 documents were lost in another collection (of ~300k total documents). How come the slave loses data so frequently? |
| Comment by Eliot Horowitz (Inactive) [ 05/Jul/11 ] |
|
All logs |
| Comment by Stone, Gao [ 05/Jul/11 ] |
|
Which log do you need? The log for slave 1, or the master? |
| Comment by Eliot Horowitz (Inactive) [ 05/Jul/11 ] |
|
Different versions aren't an issue. |
| Comment by Stone, Gao [ 05/Jul/11 ] |
|
Actually, I didn't do the switch until I had waited a long time hoping the slaves would catch up, then gave up because there were no more oplogs to apply. The master (1.8.0) is running with no journaling and has had no unclean shutdown. I set up slave 1 (MongoDB 1.8.2) last night, and there was no unclean shutdown. Will different versions of MongoDB cause data loss with a master/slave setup?
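For reference, the shell helpers for judging whether a slave has caught up (run on the slave and the master respectively):

// on the slave: reports how far this node is behind the master
> db.printSlaveReplicationInfo()
// on the master: reports the oplog size and the time window it covers
> db.printReplicationInfo()

Once a slave's lag exceeds the window the master's oplog covers, it can no longer catch up incrementally and needs a full resync. |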
| Comment by Eliot Horowitz (Inactive) [ 05/Jul/11 ] |
|
You'll need to provide some history to understand what happened. |