[SERVER-4248] Replica Set gets out of sync Created: 10/Nov/11 Updated: 11/Jul/16 Resolved: 11/Nov/11 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Ben Wyrosdick | Assignee: | Kristina Chodorow (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | oplog, replication | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Ubuntu 11.04, AWS m2.2xlarge |
||
| Operating System: | Linux |
| Participants: |
| Description |
|
We have seen this issue a few times on the same replica: the secondary is applying roughly 30-40k updates/sec and still reports state SECONDARY, but it never catches up to the primary; it stays roughly the same amount behind (normally just a few minutes). The primary is doing about 10 or fewer updates/sec, so it does not immediately look like the secondary is being asked to keep up with too much. Stopping and restarting the secondary has fixed it twice; after a restart it normally does catch up. Also, when it happens, db.printReplicationInfo() shows a very small oplog window (< 30 min). The strange part is that if you keep checking that status over time, the window keeps growing, but the first event time stays the same until the oplog grows large enough, almost as if the oplog had been reset.

PRIMARY> db.printReplicationInfo()
PRIMARY> db.printReplicationInfo() |
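For illustration, a minimal shell sketch of the checks discussed in this ticket (all of these are standard mongo shell helpers):

```javascript
// On the primary: configured oplog size and the time window it currently covers.
db.printReplicationInfo()

// On the primary: each secondary's last synced time, i.e. how far it is lagging.
db.printSlaveReplicationInfo()

// On any member: state, optime and heartbeat information for every member.
rs.status()
```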
| Comments |
| Comment by Scott Hernandez (Inactive) [ 14/Nov/11 ] |
|
That is expected. In the oplog, every update is recorded per document by _id, and the entries are idempotent. You will see a fan-out like this for any multi-document operation on the primary. |
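For illustration, a minimal shell sketch of the kind of multi-document update that produces this fan-out. The collection and field names come from the oplog entries quoted in the next comment; the exact client code is an assumption:

```javascript
// One logical operation issued on the primary: remove a chat id from every
// user's unread_chats array (upsert = false, multi = true).
db.users.update(
  { unread_chats: ObjectId("4ebfa07565ee19070b007eb6") },
  { $pull: { unread_chats: ObjectId("4ebfa07565ee19070b007eb6") } },
  false,
  true
)

// The oplog does not record the multi-update itself; it records one idempotent
// "u" entry per matched _id, so an update matching thousands of users writes
// thousands of oplog entries in a single burst.
```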
| Comment by Ben Wyrosdick [ 14/Nov/11 ] |
|
I think we can see the op that causes it. It looks like a single query is translated into thousands of ops in the oplog, which causes the secondary to fall behind. Is this expected?

== Ruby Snippet ==
)

== Oplog Afterwards ==
{ "ts" : ..., "h" : NumberLong("7367950549734090104"), "op" : "u", "ns" : "xxxx_production.users", "o2" : { "_id" : ObjectId("4e9521f6722f7f0006e4e1e4") }, "o" : { "$pullAll" : { "unread_chats" : [ ObjectId("4ebfa07565ee19070b007eb6") ] } } }
{ "ts" : ..., "h" : NumberLong("-4941163361012609057"), "op" : "u", "ns" : "xxxx_production.users", "o2" : { "_id" : ObjectId("4e9521f6722f7f0006e4e1e5") }, "o" : { "$pullAll" : { "unread_chats" : [ ObjectId("4ebfa07565ee19070b007eb6") ] } } }
... 18 more entries of the same form, one per matched user, with consecutive "_id" values from ObjectId("4e9521f6722f7f0006e4e1e6") through ObjectId("4e9521f6722f7f0006e4e1f7"), each applying the same "$pullAll" of ObjectId("4ebfa07565ee19070b007eb6") from "unread_chats". |
| Comment by Eliot Horowitz (Inactive) [ 11/Nov/11 ] |
|
Resolving, as the main issue seems done. |
| Comment by Eliot Horowitz (Inactive) [ 11/Nov/11 ] |
|
Yes, use the local database. |
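For illustration, a minimal shell sketch of looking at recent oplog entries (oplog.rs is the replica-set oplog collection in the local database):

```javascript
use local
// Newest entries first: reversed $natural order walks the capped oplog backwards.
db.oplog.rs.find().sort({ $natural: -1 }).limit(5)
```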
| Comment by Ben Wyrosdick [ 11/Nov/11 ] |
|
Is there an easy way to query the oplog to see what the queries look like? |
| Comment by Ben Wyrosdick [ 11/Nov/11 ] |
|
Yeah ... I agree, after seeing it happen on staging. The thing that was throwing me off was that the oplog start time seemed to hang for hours, and I didn't realize it was simply a burst of so many writes at once that kept it from moving. I will increase the oplog size. |
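For illustration, a hedged sketch of setting the oplog size. The set name, data path, and 10 GB figure are hypothetical; on a 2.0-era server the option only takes effect when the oplog is first created, so resizing an existing oplog means rebuilding it:

```
# Start the member with a 10 GB oplog (--oplogSize is in megabytes).
mongod --replSet rs0 --dbpath /data/db --oplogSize 10240
```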
| Comment by Eliot Horowitz (Inactive) [ 11/Nov/11 ] |
|
Sounds like: a large burst of writes filled the oplog in a short period, shrinking the oplog window, and the secondary could not stay within it.
If that's the case, seems like pretty normal behavior. |
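For illustration, a hedged way to check whether a secondary has fallen off the end of the oplog (rs.status() can be run on any member; the oplog query should be run on the primary):

```javascript
// Last applied optime and state for each member of the set.
rs.status().members.forEach(function (m) {
  print(m.name + "  " + m.stateStr + "  " + tojson(m.optime));
})

// Oldest entry still present in the primary's oplog. If a secondary's optime
// is older than this, it can no longer catch up by tailing the oplog and
// needs a full resync.
db.getSiblingDB("local").oplog.rs.find({}, { ts: 1 }).sort({ $natural: 1 }).limit(1).forEach(printjson)
```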
| Comment by Ben Wyrosdick [ 11/Nov/11 ] |
|
Just to note ... the occurrence I posted about on staging acted a bit differently: the first event in the oplog did not stay the same the way it had on the production box. Having just run printReplicationInfo again on production, the first event time seems to have moved only 4 seconds in the past few hours, which may still point to some sort of huge burst of oplog writes.

PRIMARY> db.printReplicationInfo() |
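For illustration, a hedged sketch of watching the same numbers in script form rather than re-reading the printed report (run on the primary; the fields shown are those returned by the shell's db.getReplicationInfo() helper):

```javascript
// Structured form of db.printReplicationInfo(): oplog size, usage, window and
// first/last event times. If tFirst barely moves for hours, the head of the
// oplog was written in one large burst.
var info = db.getReplicationInfo();
printjson({ logSizeMB: info.logSizeMB, usedMB: info.usedMB,
            tFirst: info.tFirst, tLast: info.tLast, timeDiffHours: info.timeDiffHours });
```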
| Comment by Ben Wyrosdick [ 11/Nov/11 ] |
|
It stays SECONDARY ... one of them is actually acting up right now.

PRIMARY> db.printReplicationInfo()
SECONDARY> rs.status()

mongostat on primary:
mongostat on secondary:

The client says there was nothing going on at the time ... this one is their staging box. It looks like it is resyncing, perhaps because the oplog rolled over, but they claim they are not making that many changes, and you can see from the primary that it does not do much. Can we easily inspect the oplog and look for large bursts of updates?

While I was typing this out it got too stale:

PRIMARY> rs.status()

So it looks like it might just be an issue of the oplog rolling over ... even though the update rate looks low. |
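For illustration, a hedged answer to "can we inspect the oplog for large blasts of updates": a small shell loop that buckets recent oplog entries by the time component of their timestamp (the 10000-entry sample and the 1000-ops threshold are arbitrary):

```javascript
use local
// Count oplog entries per timestamp over the most recent ~10000 entries;
// a burst shows up as a single time value with a very large count.
var counts = {};
db.oplog.rs.find().sort({ $natural: -1 }).limit(10000).forEach(function (op) {
  var t = op.ts.t;                 // time component of the oplog Timestamp
  counts[t] = (counts[t] || 0) + 1;
});
for (var t in counts) {
  if (counts[t] > 1000) {          // arbitrary threshold for "a blast of updates"
    print(t + " : " + counts[t] + " ops");
  }
}
```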
| Comment by Eliot Horowitz (Inactive) [ 10/Nov/11 ] |
|
Does the secondary go into recovering or stay secondary throughout? |