Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-9339

rollback files missing after rollback

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Critical - P2 Critical - P2
    • 2.2.7, 2.4.4, 2.5.0
    • Affects Version/s: 2.4.0
    • Component/s: Replication
    • None
    • Environment:
      Linux CentOS 2.6.32-358.2.1.el6.x86_64 #1 SMP Wed Mar 13 00:26:49 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
    • ALL
    • Hide

      Steps to reproduce (documents are simplified to give you the idea):

      1. 10.10.0.222 is Primary, 10.10.0.223 is Secondary, and 10.10.0.155 is Arbiter.

      2. Start writing 5000 requests and response documents
      Requests are like:
      db.mt_messages.update(

      {key:Value}

      , {$set:{req:ReqValue}}, true, false)
      Responses are like:
      db.mt_messages.update(

      {key:Value}

      , {$set:{resp:RespValue}}, true, false)
      Both requests and responses are upserts to when they are both received the documents should look like:

      {key:Value, req:ReqValue, resp:RespValue}

      3. Kill 10.10.0.222 which is Primary
      4. 10.10.0.223 becomes Primary
      5. 10.10.0.223 Y which is Primary
      6. Start 10.10.0.222 with replSet
      7. 10.10.0.222 becomes Primary

      8. Start 10.10.0.223 without replSet and make export
      $ mongoexport -d kelly_2013_01_01 -c mt_messages -o mt_messages.json
      9. Stop 10.10.0.223

      10. Start 10.10.0.223 with replSet
      11. 10.10.0.223 rollbacks as expected and becomes Secondary
      12. After applying all rollbacks I end up with lost data even though the missing data exist in the export file mt_messages.json.

      Now let's see the data.

      kelly:PRIMARY> db.mt_messages.count()
      5000

      kelly:PRIMARY> db.mt_messages.find({rqt:{$exists:false}}).count()
      31

      kelly:PRIMARY> db.mt_messages.find({rps:{$exists:false}}).count()
      5000

      kelly:PRIMARY> db.mt_messages.find({rqt:{$exists:false}})

      { "_id" : ObjectId("5167c605efb160392cf4fbc2"), "imi" : "11589144", "omi" : "11582862", "ri" : "25dfe2d6-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.745Z") } { "_id" : ObjectId("5167c604efb160392cf4fbb6"), "imi" : "11589157", "omi" : "11582851", "ri" : "25e77082-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.486Z") } { "_id" : ObjectId("5167c604efb160392cf4fbbb"), "imi" : "11589154", "omi" : "11582857", "ri" : "25e6a206-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.583Z") } { "_id" : ObjectId("5167c604efb160392cf4fbbc"), "imi" : "11589151", "omi" : "11582859", "ri" : "25e4cc38-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.623Z") } { "_id" : ObjectId("5167c604efb160392cf4fbba"), "imi" : "11589153", "omi" : "11582854", "ri" : "25e6233a-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.522Z") } { "_id" : ObjectId("5167c604efb160392cf4fbbe"), "imi" : "11589143", "omi" : "11582858", "ri" : "25dee606-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.647Z") } { "_id" : ObjectId("5167c604efb160392cf4fbb8"), "imi" : "11589152", "omi" : "11582853", "ri" : "25e57e44-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.502Z") } { "_id" : ObjectId("5167c604efb160392cf4fbbf"), "imi" : "11589137", "omi" : "11582865", "ri" : "25dd3b76-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.673Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe6"), "imi" : "11589102", "omi" : "11582905", "ri" : "25cf1b72-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.173Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe7"), "imi" : "11589105", "omi" : "11582906", "ri" : "25d0161c-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.169Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe5"), "imi" : "11589113", "omi" : "11582897", "ri" : "25d3438c-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.167Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe2"), "imi" : "11589114", "omi" : "11582895", "ri" : "25d371d6-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.151Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe0"), "imi" : "11589112", "omi" : "11582896", "ri" : "25d2f706-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.153Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe1"), "imi" : "11589118", "omi" : "11582892", "ri" : "25d4d4fe-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.148Z") } { "_id" : ObjectId("5167c605efb160392cf4fbde"), "imi" : "11589115", "omi" : "11582893", "ri" : "25d3be5c-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.133Z") } { "_id" : ObjectId("5167c605efb160392cf4fbdf"), "imi" : "11589119", "omi" : "11582891", "ri" : "25d560b8-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.127Z") } { "_id" : ObjectId("5167c605efb160392cf4fbda"), "imi" : "11589116", "omi" : "11582889", "ri" : "25d43594-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.114Z") } { "_id" : ObjectId("5167c605efb160392cf4fbd9"), "imi" : "11589127", "omi" : "11582885", "ri" : "25d8f070-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.113Z") } { "_id" : ObjectId("5167c605efb160392cf4fbd7"), "imi" : "11589122", "omi" : "11582882", "ri" : "25d77d3a-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.093Z") } { "_id" : ObjectId("5167c605efb160392cf4fbd5"), "imi" : "11589131", "omi" : "11582869", "ri" : "25db5f2c-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.090Z") }

      Type "it" for more
      kelly:PRIMARY>

      We miss 31 updates.
      Search the first "imi":"11589144" which is unique enough for this purpose.

      on 10.10.0.222 the response is registered.
      $ grep 11589144 /var/log/mongodb.log
      Fri Apr 12 11:29:57.026 [conn29] update kelly_2013_01_01.mt_messages query:

      { ri: "25dfe2d6-a34b-11e2-994b-000c29a75b4d", imi: "11589144" }

      update: { $set:

      { omi: "11582862", rpt: new Date(1357149773745), rps: "success" }

      } nscanned:0 nupdated:1 fastmodinsert:1 keyUpdates:0 locks(micros) w:148 0ms

      on 10.10.0.223 the request is registered.
      $ grep 11589144 /var/log/mongodb.log
      Fri Apr 12 11:30:12.765 [conn18] update kelly_2013_01_01.mt_messages query:

      { ri: "25dfe2d6-a34b-11e2-994b-000c29a75b4d", imi: "11589144" }

      update: { $set: { ct: "funnel", ci: "493b3678-9dc8-11e2-8cce-00269e42f7a5", ui: "user", gi: "7dc235d0-c938-4b66-8f8c-c9037c7eace7", t: "regular", e: "default", b: "User", sa:

      { a: "375296660002", t: 1, n: 1 }

      , da:

      { a: "375293615363", t: 1, n: 1 }

      , rd: false, rqt: new Date(1357149779760) } } nscanned:1 nmoved:1 nupdated:1 keyUpdates:0 locks(micros) w:433 0ms

      on 10.10.0.223 whole document is available according to the export file
      $ grep 11589144 mt_messages.json
      { "_id" :

      { "$oid" : "5167c605efb160392cf4fbc2" }

      , "b" : "User", "ci" : "493b3678-9dc8-11e2-8cce-00269e42f7a5", "ct" : "funnel", "da" :

      { "a" : "375293615363", "t" : 1, "n" : 1 }

      , "e" : "default", "gi" : "7dc235d0-c938-4b66-8f8c-c9037c7eace7", "imi" : "11589144", "omi" : "11582862", "rd" : false, "ri" : "25dfe2d6-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" :

      { "$date" : 1357149773745 }

      , "rqt" :

      { "$date" : 1357149779760 }

      , "sa" :

      { "a" : "375296660002", "t" : 1, "n" : 1 }

      , "t" : "regular", "ui" : "user" }

      on 10.10.0.223
      $ bsondump kelly_2013_01_01.mt_messages.2013-04-12T08-31-25.0.bson 2>/dev/null | grep 11589144
      IT'S NOT HERE! IT'S GONE!

      Show
      Steps to reproduce (documents are simplified to give you the idea): 1. 10.10.0.222 is Primary, 10.10.0.223 is Secondary, and 10.10.0.155 is Arbiter. 2. Start writing 5000 requests and response documents Requests are like: db.mt_messages.update( {key:Value} , {$set:{req:ReqValue}}, true, false) Responses are like: db.mt_messages.update( {key:Value} , {$set:{resp:RespValue}}, true, false) Both requests and responses are upserts to when they are both received the documents should look like: {key:Value, req:ReqValue, resp:RespValue} 3. Kill 10.10.0.222 which is Primary 4. 10.10.0.223 becomes Primary 5. 10.10.0.223 Y which is Primary 6. Start 10.10.0.222 with replSet 7. 10.10.0.222 becomes Primary 8. Start 10.10.0.223 without replSet and make export $ mongoexport -d kelly_2013_01_01 -c mt_messages -o mt_messages.json 9. Stop 10.10.0.223 10. Start 10.10.0.223 with replSet 11. 10.10.0.223 rollbacks as expected and becomes Secondary 12. After applying all rollbacks I end up with lost data even though the missing data exist in the export file mt_messages.json. Now let's see the data. kelly:PRIMARY> db.mt_messages.count() 5000 kelly:PRIMARY> db.mt_messages.find({rqt:{$exists:false}}).count() 31 kelly:PRIMARY> db.mt_messages.find({rps:{$exists:false}}).count() 5000 kelly:PRIMARY> db.mt_messages.find({rqt:{$exists:false}}) { "_id" : ObjectId("5167c605efb160392cf4fbc2"), "imi" : "11589144", "omi" : "11582862", "ri" : "25dfe2d6-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.745Z") } { "_id" : ObjectId("5167c604efb160392cf4fbb6"), "imi" : "11589157", "omi" : "11582851", "ri" : "25e77082-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.486Z") } { "_id" : ObjectId("5167c604efb160392cf4fbbb"), "imi" : "11589154", "omi" : "11582857", "ri" : "25e6a206-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.583Z") } { "_id" : ObjectId("5167c604efb160392cf4fbbc"), "imi" : "11589151", "omi" : "11582859", "ri" : "25e4cc38-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.623Z") } { "_id" : ObjectId("5167c604efb160392cf4fbba"), "imi" : "11589153", "omi" : "11582854", "ri" : "25e6233a-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.522Z") } { "_id" : ObjectId("5167c604efb160392cf4fbbe"), "imi" : "11589143", "omi" : "11582858", "ri" : "25dee606-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.647Z") } { "_id" : ObjectId("5167c604efb160392cf4fbb8"), "imi" : "11589152", "omi" : "11582853", "ri" : "25e57e44-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.502Z") } { "_id" : ObjectId("5167c604efb160392cf4fbbf"), "imi" : "11589137", "omi" : "11582865", "ri" : "25dd3b76-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:53.673Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe6"), "imi" : "11589102", "omi" : "11582905", "ri" : "25cf1b72-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.173Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe7"), "imi" : "11589105", "omi" : "11582906", "ri" : "25d0161c-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.169Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe5"), "imi" : "11589113", "omi" : "11582897", "ri" : "25d3438c-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.167Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe2"), "imi" : "11589114", "omi" : "11582895", "ri" : "25d371d6-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.151Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe0"), "imi" : "11589112", "omi" : "11582896", "ri" : "25d2f706-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.153Z") } { "_id" : ObjectId("5167c605efb160392cf4fbe1"), "imi" : "11589118", "omi" : "11582892", "ri" : "25d4d4fe-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.148Z") } { "_id" : ObjectId("5167c605efb160392cf4fbde"), "imi" : "11589115", "omi" : "11582893", "ri" : "25d3be5c-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.133Z") } { "_id" : ObjectId("5167c605efb160392cf4fbdf"), "imi" : "11589119", "omi" : "11582891", "ri" : "25d560b8-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.127Z") } { "_id" : ObjectId("5167c605efb160392cf4fbda"), "imi" : "11589116", "omi" : "11582889", "ri" : "25d43594-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.114Z") } { "_id" : ObjectId("5167c605efb160392cf4fbd9"), "imi" : "11589127", "omi" : "11582885", "ri" : "25d8f070-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.113Z") } { "_id" : ObjectId("5167c605efb160392cf4fbd7"), "imi" : "11589122", "omi" : "11582882", "ri" : "25d77d3a-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.093Z") } { "_id" : ObjectId("5167c605efb160392cf4fbd5"), "imi" : "11589131", "omi" : "11582869", "ri" : "25db5f2c-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : ISODate("2013-01-02T18:02:54.090Z") } Type "it" for more kelly:PRIMARY> We miss 31 updates. Search the first "imi":"11589144" which is unique enough for this purpose. on 10.10.0.222 the response is registered. $ grep 11589144 /var/log/mongodb.log Fri Apr 12 11:29:57.026 [conn29] update kelly_2013_01_01.mt_messages query: { ri: "25dfe2d6-a34b-11e2-994b-000c29a75b4d", imi: "11589144" } update: { $set: { omi: "11582862", rpt: new Date(1357149773745), rps: "success" } } nscanned:0 nupdated:1 fastmodinsert:1 keyUpdates:0 locks(micros) w:148 0ms on 10.10.0.223 the request is registered. $ grep 11589144 /var/log/mongodb.log Fri Apr 12 11:30:12.765 [conn18] update kelly_2013_01_01.mt_messages query: { ri: "25dfe2d6-a34b-11e2-994b-000c29a75b4d", imi: "11589144" } update: { $set: { ct: "funnel", ci: "493b3678-9dc8-11e2-8cce-00269e42f7a5", ui: "user", gi: "7dc235d0-c938-4b66-8f8c-c9037c7eace7", t: "regular", e: "default", b: "User", sa: { a: "375296660002", t: 1, n: 1 } , da: { a: "375293615363", t: 1, n: 1 } , rd: false, rqt: new Date(1357149779760) } } nscanned:1 nmoved:1 nupdated:1 keyUpdates:0 locks(micros) w:433 0ms on 10.10.0.223 whole document is available according to the export file $ grep 11589144 mt_messages.json { "_id" : { "$oid" : "5167c605efb160392cf4fbc2" } , "b" : "User", "ci" : "493b3678-9dc8-11e2-8cce-00269e42f7a5", "ct" : "funnel", "da" : { "a" : "375293615363", "t" : 1, "n" : 1 } , "e" : "default", "gi" : "7dc235d0-c938-4b66-8f8c-c9037c7eace7", "imi" : "11589144", "omi" : "11582862", "rd" : false, "ri" : "25dfe2d6-a34b-11e2-994b-000c29a75b4d", "rps" : "success", "rpt" : { "$date" : 1357149773745 } , "rqt" : { "$date" : 1357149779760 } , "sa" : { "a" : "375296660002", "t" : 1, "n" : 1 } , "t" : "regular", "ui" : "user" } on 10.10.0.223 $ bsondump kelly_2013_01_01.mt_messages.2013-04-12T08-31-25.0.bson 2>/dev/null | grep 11589144 IT'S NOT HERE! IT'S GONE!

      I have mongo replica set made up from 3 members: one Primary, one Secondary, and one Arbiter. There are located on 10.10.222, 10.10.0.223, and 10.10.0.155.

      kelly:PRIMARY> rs.status()
      {
      "set" : "kelly",
      "date" : ISODate("2013-04-12T08:28:41Z"),
      "myState" : 1,
      "members" : [
      {
      "_id" : 0,
      "name" : "10.10.0.222:27017",
      "health" : 1,
      "state" : 1,
      "stateStr" : "PRIMARY",
      "uptime" : 34,
      "optime" :

      { "t" : 1365755246, "i" : 2767 }

      ,
      "optimeDate" : ISODate("2013-04-12T08:27:26Z"),
      "self" : true
      },
      {
      "_id" : 1,
      "name" : "10.10.0.223:27017",
      "health" : 1,
      "state" : 2,
      "stateStr" : "SECONDARY",
      "uptime" : 24,
      "optime" :

      { "t" : 1365755246, "i" : 2767 }

      ,
      "optimeDate" : ISODate("2013-04-12T08:27:26Z"),
      "lastHeartbeat" : ISODate("2013-04-12T08:28:39Z"),
      "lastHeartbeatRecv" : ISODate("2013-04-12T08:28:40Z"),
      "pingMs" : 0,
      "lastHeartbeatMessage" : "syncing to: 10.10.0.222:27017"
      },

      { "_id" : 2, "name" : "10.10.0.155:27017", "health" : 1, "state" : 7, "stateStr" : "ARBITER", "uptime" : 34, "lastHeartbeat" : ISODate("2013-04-12T08:28:39Z"), "lastHeartbeatRecv" : ISODate("2013-04-12T08:28:40Z"), "pingMs" : 0 }

      ],
      "ok" : 1
      }

      They are started in interactive mode and killed by Ctrl-C, so unclear shutdown is not the case. Mongo version is 2.4.0 Linux 64bit with default settings. Database servers time is synchronized using NTP.

      All updatess are upserts with `safe' mode, i.e.

      {getLastError:1}

      which as documentation states also means

      {w:1}

      and all writes go to the Primary.

      By data loss I mean non-complete documents after applying rollbacks, for example

      {key:Value, req:ReqValue}

      . After analyzing many logs (application, mongod, mongostat) I realized that this happens when, for example, the request has been sent to 10.10.0.222 it was Primary initially, and the response has been sent to 10.10.0.223. The whole document is stored in the mongo export file. But after rollback on Y this response is gone.

      I can to reproduce this issue 10 out of 10 times. Client code is Erlang and it's able to sustain temporary connection failures. If there's a write failure then the write is repeated later. I can see that missing updates are registered in mongo log files and acknowledged by current Primary (at some poing I traced the erlang driver and saw the replies in Wireshark). The problem seems to be in the rollback mechanics.

        1. rollback5.js
          3 kB
        2. mt_messages.json
          907 kB
        3. mongostat-10.10.0.223.log
          67 kB
        4. mongostat-10.10.0.222.log
          66 kB
        5. mongodb-10.10.0.223.log
          2.44 MB
        6. mongodb-10.10.0.222.log
          9.62 MB
        7. kelly_2013_01_01.mt_messages.2013-04-12T08-31-25.0.bson
          367 kB

            Assignee:
            milkie@mongodb.com Eric Milkie
            Reporter:
            ten0s Dmitry Klionsky
            Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

              Created:
              Updated:
              Resolved: