[SERVER-3096] replica set data integrity fail Created: 16/May/11 Updated: 29/Aug/11 Resolved: 04/Aug/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | MartinS | Assignee: | Mathias Stearn |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | linux (openvz kernel without limit on vhosts, vserver kernel, native linux kernel), mongodb 1.8 |
| Operating System: | Linux |
| Participants: |
| Description |
|
Critical bug in replica set synchronisation. I ran repairDatabase() to compact the database files on my replica set's primary host (HOST1). When I checked http://ANYHOST:28017/_replSet I found that the optime on all nodes was the same. After logging into the mongo CLI and running "show dbs", I saw that the first two dbs were shown as "empty". |
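A minimal sketch of the sequence described above, assuming default ports (27017 for mongod, 28017 for the HTTP status interface) and placeholder host and database names (HOST1, mydb):

```sh
# Compact the data files on the primary (blocks the node while it runs).
mongo HOST1:27017/mydb --eval "printjson(db.repairDatabase())"

# Check replica set optimes via the built-in HTTP status page.
curl http://HOST1:28017/_replSet

# List databases and their sizes, as "show dbs" does in the interactive shell.
mongo HOST1:27017/admin --eval "printjson(db.runCommand({listDatabases: 1}))"
```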
| Comments |
| Comment by Mathias Stearn [ 04/Aug/11 ] |
|
Please reopen if you have more information to help us track this down. |
| Comment by Mathias Stearn [ 21/Jun/11 ] |
|
Do you still have the log file from when mongod crashed with --journal? Have you seen this with 1.8.2? |
| Comment by MartinS [ 18/May/11 ] |
|
Oh, and one more thing: after the crash, the failcounter in the openvz UBC was not incremented. Bug in |
| Comment by MartinS [ 16/May/11 ] |
|
> Which version of mongo exactly?

1.8.1

> Also, there are some known issues on openvz with data sets larger than ram due to an openvz issue.

I know something about this; I've tested it. That bug affects only virtual hosts with a memory limit set, and I don't use those limits. The server had completely exhausted the host's RAM and swap. I also know that the memory-limit bug doesn't affect vserver virtualisation (where memory limits are above 4GB). |
| Comment by Eliot Horowitz (Inactive) [ 16/May/11 ] |
|
Which version of mongo exactly? |
| Comment by MartinS [ 16/May/11 ] |
|
Maybe. Unfortunately I can't force this behaviour again. Oh, and this could be important: the first two databases, admin and x, were empty (both showed 200MB in 'show dbs;'), but I had run repairDatabase() on the third database, which is bigger than those two. I regret that I didn't make a snapshot of the filesystem after the crash :/ |
| Comment by Kristina Chodorow (Inactive) [ 16/May/11 ] |
|
Sounds like a journaling issue? |
| Comment by Kristina Chodorow (Inactive) [ 16/May/11 ] |
|
> I'm using --journal already and this crash took place in exactly this configuration. After restart, two databases are empty.

Oh, okay, that's not good. |
| Comment by MartinS [ 16/May/11 ] |
|
> In the logfile there is only: nothing else.

After restart, the logfile contains lines like:

[conn$ID] assertion 10057 unauthorized db:dbname lock type:-1 client:IP ns:database.system.namespaces query:{}

>> Server should never become the primary host in this situation.

I'm using --journal already, and this crash took place in exactly this configuration. After restart, two databases are empty. |
| Comment by Kristina Chodorow (Inactive) [ 16/May/11 ] |
|
> I'd rather report a bug. The MongoDB server should throw an exception and handle it instead of crashing.

Agreed.

> In the logfile I didn't find any messages about running out of resources.

Can you paste the end of your log here? (The last couple hundred lines or so; the more the better.)

> Server should never become the primary host in this situation.

MongoDB cannot tell whether its data is corrupt without checking each record, which is why you can't restart it without deleting the mongod.lock file first. If you delete the lock file and restart, you're essentially telling MongoDB, "Even though you think this data might be corrupt, I know it isn't. Use it anyway." If you don't want to be in this situation, don't delete the mongod.lock file, or use --journal. |
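A rough illustration of the two alternatives above, assuming a placeholder data directory of /data/db and a replica set named myset:

```sh
# Preferred: run with journaling so mongod can recover cleanly after a crash.
mongod --dbpath /data/db --replSet myset --journal

# What deleting the lock file implies: after an unclean shutdown, removing
# mongod.lock and restarting tells mongod to trust possibly corrupt data files.
rm /data/db/mongod.lock
mongod --dbpath /data/db --replSet myset
```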
| Comment by MartinS [ 16/May/11 ] |
|
> This is generally a good sign, they should be the same or very close...

I'm not looking for a temporary solution to this problem. I did those 3 steps immediately after the crash. I'd rather report a bug: the MongoDB server should throw an exception and handle it instead of crashing. The server should never become the primary host in this situation. |
| Comment by Kristina Chodorow (Inactive) [ 16/May/11 ] |
|
> When I checked http://ANYHOST:28017/_replSet I found that the optime on all nodes was the same.

This is generally a good sign, they should be the same or very close... It sounds like HOST1 is corrupt, probably from running out of resources during the repair (did it crash?). On the plus side, it's very unlikely that replication will replicate corruption, so the best thing would be to: 1) connect to HOST2