[SERVER-9339] rollback files missing after rollback Created: 12/Apr/13 Updated: 11/Jul/16 Resolved: 23/Apr/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.0 |
| Fix Version/s: | 2.2.7, 2.4.4, 2.5.0 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Dmitry Klionsky | Assignee: | Eric Milkie |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Linux CentOS 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux |
||
| Attachments: |
|
||||||||||||
| Issue Links: |
|
||||||||||||
| Operating System: | ALL | ||||||||||||
| Steps To Reproduce: | Steps to reproduce (documents are simplified to give you the idea): 1. 10.10.0.222 is Primary, 10.10.0.223 is Secondary, and 10.10.0.155 is Arbiter. 2. Start writing 5000 requests and response documents , {$set:{req:ReqValue}}, true, false) , {$set:{resp:RespValue}}, true, false) 3. Kill 10.10.0.222 which is Primary 8. Start 10.10.0.223 without replSet and make export 10. Start 10.10.0.223 with replSet Now let's see the data. kelly:PRIMARY> db.mt_messages.count() kelly:PRIMARY> db.mt_messages.find({rqt:{$exists:false}}).count() kelly:PRIMARY> db.mt_messages.find({rps:{$exists:false}}).count() kelly:PRIMARY> db.mt_messages.find({rqt:{$exists:false}}) { "_id" : ObjectId("5167c605efb160392cf4fbc2"), "imi" : "11589144", "omi" : "11582862", "ri" : "25dfe2d6-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.745Z") } { "_id" : ObjectId("5167c604efb160392cf4fbb6"), "imi" : "11589157", "omi" : "11582851", "ri" : "25e77082-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.486Z") } { "_id" : ObjectId("5167c604efb160392cf4fbbb"), "imi" : "11589154", "omi" : "11582857", "ri" : "25e6a206-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.583Z") } { "_id" : ObjectId("5167c604efb160392cf4fbbc"), "imi" : "11589151", "omi" : "11582859", "ri" : "25e4cc38-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.623Z") } { "_id" : ObjectId("5167c604efb160392cf4fbba"), "imi" : "11589153", "omi" : "11582854", "ri" : "25e6233a-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.522Z") } { "_id" : ObjectId("5167c604efb160392cf4fbbe"), "imi" : "11589143", "omi" : "11582858", "ri" : "25dee606-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.647Z") } { "_id" : ObjectId("5167c604efb160392cf4fbb8"), "imi" : "11589152", "omi" : "11582853", "ri" : "25e57e44-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.502Z") } { "_id" : ObjectId("5167c604efb160392cf4fbbf"), "imi" : "11589137", "omi" : "11582865", "ri" : "25dd3b76-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.673Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe6"), "imi" : "11589102", "omi" : "11582905", "ri" : "25cf1b72-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.173Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe7"), "imi" : "11589105", "omi" : "11582906", "ri" : "25d0161c-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.169Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe5"), "imi" : "11589113", "omi" : "11582897", "ri" : "25d3438c-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.167Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe2"), "imi" : "11589114", "omi" : "11582895", "ri" : "25d371d6-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.151Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe0"), "imi" : "11589112", "omi" : "11582896", "ri" : "25d2f706-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.153Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe1"), "imi" : "11589118", "omi" : "11582892", "ri" : "25d4d4fe-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.148Z") } { "_id" : ObjectId("5167c605efb160392cf4fbde"), "imi" : "11589115", "omi" : "11582893", "ri" : "25d3be5c-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.133Z") } { "_id" : ObjectId("5167c605efb160392cf4fbdf"), "imi" : "11589119", "omi" : "11582891", "ri" : "25d560b8-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.127Z") } { "_id" : ObjectId("5167c605efb160392cf4fbda"), "imi" : "11589116", "omi" : "11582889", "ri" : "25d43594-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.114Z") } { "_id" : ObjectId("5167c605efb160392cf4fbd9"), "imi" : "11589127", "omi" : "11582885", "ri" : "25d8f070-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.113Z") } { "_id" : ObjectId("5167c605efb160392cf4fbd7"), "imi" : "11589122", "omi" : "11582882", "ri" : "25d77d3a-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.093Z") } { "_id" : ObjectId("5167c605efb160392cf4fbd5"), "imi" : "11589131", "omi" : "11582869", "ri" : "25db5f2c-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.090Z") }Type "it" for more We miss 31 updates. on 10.10.0.222 the response is registered. update: { $set: { omi: "11582862", rpt: new Date(1357149773745), rps: "success" }} nscanned:0 nupdated:1 fastmodinsert:1 keyUpdates:0 locks(micros) w:148 0ms on 10.10.0.223 the request is registered. update: { $set: { ct: "funnel", ci: "493b3678-9dc8-11e2-8cce-00269e42f7a5", ui: "user", gi: "7dc235d0-c938-4b66-8f8c-c9037c7eace7", t: "regular", e: "default", b: "User", sa: { a: "375296660002", t: 1, n: 1 }, da: { a: "375293615363", t: 1, n: 1 }, rd: false, rqt: new Date(1357149779760) } } nscanned:1 nmoved:1 nupdated:1 keyUpdates:0 locks(micros) w:433 0ms on 10.10.0.223 whole document is available according to the export file , "b" : "User", "ci" : "493b3678-9dc8-11e2-8cce-00269e42f7a5", "ct" : "funnel", "da" : { "a" : "375293615363", "t" : 1, "n" : 1 }, "e" : "default", "gi" : "7dc235d0-c938-4b66-8f8c-c9037c7eace7", "imi" : "11589144", "omi" : "11582862", "rd" : false, "ri" : "25dfe2d6-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : { "$date" : 1357149773745 }, "rqt" : { "$date" : 1357149779760 }, "sa" : { "a" : "375296660002", "t" : 1, "n" : 1 }, "t" : "regular", "ui" : "user" } on 10.10.0.223 |
||||||||||||
| Participants: | |||||||||||||
| Description |
|
I have mongo replica set made up from 3 members: one Primary, one Secondary, and one Arbiter. There are located on 10.10.222, 10.10.0.223, and 10.10.0.155. kelly:PRIMARY> rs.status() , , ], They are started in interactive mode and killed by Ctrl-C, so unclear shutdown is not the case. Mongo version is 2.4.0 Linux 64bit with default settings. Database servers time is synchronized using NTP. All updatess are upserts with `safe' mode, i.e. {getLastError:1}which as documentation states also means {w:1}and all writes go to the Primary. By data loss I mean non-complete documents after applying rollbacks, for example {key:Value, req:ReqValue}. After analyzing many logs (application, mongod, mongostat) I realized that this happens when, for example, the request has been sent to 10.10.0.222 it was Primary initially, and the response has been sent to 10.10.0.223. The whole document is stored in the mongo export file. But after rollback on Y this response is gone. I can to reproduce this issue 10 out of 10 times. Client code is Erlang and it's able to sustain temporary connection failures. If there's a write failure then the write is repeated later. I can see that missing updates are registered in mongo log files and acknowledged by current Primary (at some poing I traced the erlang driver and saw the replies in Wireshark). The problem seems to be in the rollback mechanics. |
| Comments |
| Comment by auto [ 29/Aug/13 ] |
|
Author: {u'username': u'milkie', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}Message: |
| Comment by auto [ 29/Aug/13 ] |
|
Author: {u'username': u'milkie', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}Message: Conflicts: |
| Comment by auto [ 01/May/13 ] |
|
Author: {u'date': u'2013-04-30T15:58:46Z', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}Message: |
| Comment by auto [ 30/Apr/13 ] |
|
Author: {u'date': u'2013-04-30T15:58:46Z', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}Message: |
| Comment by auto [ 30/Apr/13 ] |
|
Author: {u'date': u'2013-04-22T18:48:02Z', u'name': u'Eric Milkie', u'email': u'milkie@10gen.com'}Message: Conflicts: |
| Comment by Eric Milkie [ 25/Apr/13 ] |
|
commit: 30b7ed6470b3c97b04e32b53137ece4314ee8ce8 |
| Comment by Dmitry Klionsky [ 12/Apr/13 ] |
|
My use case was to make sure that my app survives multiple failovers and continues to function as expected. >>The rollback directory generally shouldn't be a part of your normal workflow as you would have to deal with merge conflicts on the data, etc... >>It sounds like you should be using w=2 write concerns. |
| Comment by Eliot Horowitz (Inactive) [ 12/Apr/13 ] |
|
While we work on that, what is the use case you are working on? The rollback directory generally shouldn't be a part of your normal workflow as you would have to deal with merge conflicts on the data, etc... It sounds like you should be using w=2 write concerns. |
| Comment by Dmitry Klionsky [ 12/Apr/13 ] |
|
Right, it should, but it's not there. Double checked on two machines. Linux Mint 32 bit CentOS 64 bit |
| Comment by Eliot Horowitz (Inactive) [ 12/Apr/13 ] |
|
There should be a rollback directory in db2. Will see if I can reproduce it not being there. |
| Comment by Dmitry Klionsky [ 12/Apr/13 ] |
|
Here is it. Open two terminal tabs Tab1 and Tab2 Tab1 $ mkdir db1 db2 db3 Tab2 $ mongo localhost:1111/test test:PRIMARY> rs.addArb("mint-ws:3333") { "ok" : 1 }test:PRIMARY> rs.status() , , ], , {$set:{req:"req1"}}, true) Tab 1 $ kill %1 Tab 2 > db = connect("localhost:2222/test") , {$set:{rsp:"resp1"}}, true) Tab 1 $ kill %2 wait 1 min $ mongod --replSet test --dbpath ./db2 --logpath mongodb2.log --port 2222 & Tab 2 test:SECONDARY> db = connect("localhost:1111/test") Tab 1 $ cat ./db2/mongodb.log There's no `rollback' directory! The second upsert is gone! Am I missing something? |
| Comment by Eliot Horowitz (Inactive) [ 12/Apr/13 ] |
|
Do you have a script to do this or is it manual? Can you send code? I'm a little confused by some of the steps, will look again, but have you considered the following: T0: write to primary A have you checked the rollback file? Have you tried this with w=2? |
| Comment by Dmitry Klionsky [ 12/Apr/13 ] |
|
Step 5 should be `Kill 10.10.0.223 which is Primary' |