[SERVER-510] more real-time replication Created: 30/Dec/09  Updated: 12/Jul/16  Resolved: 02/Apr/10

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 1.5.0

Type: Improvement Priority: Major - P3
Reporter: Eliot Horowitz (Inactive) Assignee: Dwight Merriman
Resolution: Done Votes: 5
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-467 getLastError option to wait/block unt... Closed
Participants:

 Comments   
Comment by Eliot Horowitz (Inactive) [ 26/Apr/10 ]

in a release

Comment by Eliot Horowitz (Inactive) [ 14/Jan/10 ]

The master and slave are under the same write load, since the slave has to apply all the writes if you want it current.

Comment by Raj Kadam [ 14/Jan/10 ]

I cannot read from the master as it is under write load and locked. The reads also incur locks, so that means the writes back up.

Comment by Eliot Horowitz (Inactive) [ 14/Jan/10 ]

Those are fine if they show active: 0.

The slave has a certain capacity, more writes on the master means more writes on the slave.

Why are you reading from the slave and not just from the master?

The problem isn't with replication, just with load.
So there are 2 options:

  • optimize queries so load goes down
  • add more read slaves (see the sketch below)
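As a sketch of the second option (assuming the 1.x mongo shell's Mongo() and setSlaveOk() helpers; the hostname is a placeholder, not from this ticket), reads can be pointed at a slave like this:

var conn = new Mongo("slave1.example.com:27017"); // hypothetical read slave
conn.setSlaveOk();                                // allow queries against a non-master
var mdb = conn.getDB("microblog");
printjson(mdb.twitter.find().limit(5).toArray()); // these reads no longer hit the master
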
Comment by Raj Kadam [ 14/Jan/10 ]

I posted this on the discussion group but I thought I would put a more extensive log here since this is dealing with replication. Version 1.2.0:

currentOp (unlocked):

{ opid: 0, active: 0, op: -1, ns: "", query: "", inLock: 1, client: "0.0.0.0:0" }


currentOp (unlocked):

{ opid: 0, active: 0, op: 0, ns: "", query: "", inLock: 1, client: "48.49.51.59:12848" }


currentOp (unlocked):

{ opid: 276, active: 0, op: "insert", ns: "microblog.twitter", query: "", inLock: 1, client: "192.168.8.32:51194" }


currentOp (unlocked):

{ opid: 503, active: 0, op: "insert", ns: "microblog.twitter", query: "", inLock: 1, client: "192.168.8.32:51080" }


currentOp (unlocked):

{ opid: 0, active: 0, op: 640429508, ns: "", query: "", inLock: 1, client: "0.0.0.0:0" }


currentOp (unlocked):

{ opid: 565, active: 0, op: "insert", ns: "microblog.twitter", query: "", inLock: 1, client: "192.168.8.32:50870" }


currentOp (unlocked):

{ opid: 0, active: 0, op: 1634427745, ns: "ges", query: "",&", inLock: 1, client: "59.38.35.51:50" }


currentOp (unlocked):

{ opid: 0, active: 0, op: 1754712282, ns: "(", query: "", inLock: 1, client: "1.0.0.0:0" }


currentOp (unlocked):

{ opid: 0, active: 0, op: 15068368, ns: "", query: "s", inLock: 1, client: "116.112.58.47:26740" }


currentOp (unlocked):

{ opid: 495, active: 0, op: "insert", ns: "microblog.twitter", query: "", inLock: 1, client: "192.168.8.32:51074" }


currentOp (unlocked):

{ opid: 546, active: 0, op: "insert", ns: "microblog.twitter", query: "", inLock: 1, client: "192.168.8.32:50960" }


currentOp (unlocked):

{ opid: 0, active: 0, op: 1714893363, ns: "-92b4-c36cf2979231", query: "rzt", inLock: 1, client: "99.97.114.105:24942" }


currentOp (unlocked):

{ opid: 0, active: 0, op: 2048, ns: "", query: "1608;&", inLock: 1, client: "8.0.0.0:13824" }


currentOp (unlocked):

{ opid: 543, active: 0, op: "insert", ns: "microblog.twitter", query: "", inLock: 1, client: "192.168.8.32:51140" }


currentOp (unlocked):

{ opid: 0, active: 0, op: 0, ns: "", query: "", inLock: 1, client: "0.0.0.0:57600" }


currentOp (unlocked):

{ opid: 628, active: 0, op: "insert", ns: "microblog.twitter", query: "", inLock: 1, client: "192.168.8.32:51038" }


currentOp (unlocked):

{ opid: 694, active: 0, op: "insert", ns: "microblog.twitter", query: "", inLock: 1, client: "192.168.8.32:51044" }


currentOp (unlocked):

{ opid: 637, active: 0, op: "insert", ns: "microblog.twitter", query: "", inLock: 1, client: "192.168.8.32:50948" }


currentOp (unlocked):

{ opid: 690, active: 0, op: "insert", ns: "microblog.twitter", query: "", inLock: 1, client: "192.168.8.32:50954" }


currentOp (unlocked):

{ opid: 0, active: 0, op: 19278902, ns: "", query: "", inLock: 1, client: "0.0.0.0:0" }


currentOp (unlocked):

{ opid: 0, active: 0, op: 926100787, ns: "99-811e-0bf0a7dc26d0", query: "desypratiwi", inLock: 1, client: "0.109.115.107:0" }


currentOp (unlocked):

{ opid: 703, active: 0, op: "insert", ns: "microblog.twitter", query: "", inLock: 1, client: "192.168.8.32:51020" }


currentOp (unlocked):

{ opid: 0, active: 0, op: -1, ns: "5;&", query: "lished_-1 864 { : "??????...", inLock: 1, client: "48.50.58.51:13344" }


currentOp (unlocked):

{ opid: 0, active: 0, op: 14792896, ns: "", query: "", inLock: 1, client: "3.0.0.0:25856" }


currentOp (unlocked):

{ opid: 502, active: 0, op: "insert", ns: "microblog.twitter", query: "", inLock: 1, client: "192.168.8.32:51092" }


currentOp (unlocked):

{ opid: 0, active: 0, op: 0, ns: "", query: "ebowman Yeah, no bloody laptop sockets in economy hey!?", inLock: 1, client: "116.112.58.47:26740" }


currentOp (unlocked):

{ opid: 395, active: 0, op: "insert", ns: "microblog.twitter", query: "", inLock: 1, client: "192.168.8.32:51128" }


currentOp (unlocked):

{ opid: 288, active: 0, op: "insert", ns: "microblog.twitter", query: "", inLock: 1, client: "192.168.8.32:51248" }


currentOp (unlocked):

{ opid: 0, active: 0, op: 0, ns: "", query: "", inLock: 1, client: "0.0.0.0:0" }

Could all these bogus repl entries be impacting my performance in replication? Anything that does not have 192.168.8.32 is bogus.
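
A hedged shell sketch for filtering out the noise, assuming db.currentOp() in a shell of this era returns an { inprog: [ ... ] } document with the fields shown above (only the 192.168.8.32 address is taken from the log; the rest is illustrative):

// Print only in-progress ops that come from the known application server.
var ops = db.currentOp().inprog;
ops.forEach(function (o) {
    if (o.client && o.client.indexOf("192.168.8.32") == 0)
        printjson(o); // the genuine inserts into microblog.twitter
});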

Comment by Raj Kadam [ 14/Jan/10 ]

That might be true, but the reverse also applies. If there are writes that lock the master db a lot (I tried it as low as 200 writes a second), the slave seems to fall behind. Right now I am running a test where the slave is about 8 hours behind.

Thu Jan 14 01:53:55 connection accepted from 192.168.8.38:42875 #2082
Thu Jan 14 01:53:56 end connection 192.168.8.38:42875
Thu Jan 14 01:55:01 repl: checkpoint applied 825 operations
Thu Jan 14 01:55:01 repl: syncedTo: Wed Jan 13 15:27:08 2010 4b4e56cc:1e
Thu Jan 14 01:56:02 repl: checkpoint applied 1005 operations
Thu Jan 14 01:56:02 repl: syncedTo: Wed Jan 13 15:28:05 2010 4b4e5705:7

It is Thursday, Jan 14th, 2:05 AM. Unless I turn off the writes, I do not see how the slave will ever catch up. I am hoping, though I could be wrong, that the read/write locks in 1.3.0 solve this so the slave can at least read the oplog while the master holds the write lock. My guess is that right now that is what is causing this.
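
As a rough lag check from the slave side, here is a sketch under the assumption that a master/slave node of this era records its replication source in local.sources with a syncedTo timestamp whose t field is seconds since the epoch (some builds stored milliseconds, so treat the arithmetic as approximate):

// Run against the slave. Assumes local.sources.syncedTo.t is in seconds;
// if your build stores milliseconds, divide it by 1000 first.
var src = db.getSisterDB("local").sources.findOne();
if (src && src.syncedTo) {
    var nowSecs = Math.floor(new Date().getTime() / 1000);
    print("slave is roughly " + (nowSecs - src.syncedTo.t) + " seconds behind");
}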

Comment by Eliot Horowitz (Inactive) [ 13/Jan/10 ]

There isn't really any way to make that better unless we prioritize replication over regular queries, which is not what most people want.

So in your case you probably just need 2 slaves for more read capacity. (After making sure things are indexed correctly, etc...)
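
On the indexing point, a quick hedged check (the field name "user" is illustrative, not from this ticket): explain() on a hot query against microblog.twitter should show a BtreeCursor rather than a BasicCursor, and if it does not, the index is missing:

// Run against the microblog database.
var plan = db.twitter.find({ user: "someone" }).explain();
printjson(plan);                         // "cursor" : "BasicCursor" means a full scan
if (plan.cursor == "BasicCursor")
    db.twitter.ensureIndex({ user: 1 }); // build the missing index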

Comment by Raj Kadam [ 10/Jan/10 ]

Yes, it does catch up, but it takes a while to do it. Does replication use the entire line? It seems to do it in bursts. In any case, yes, the system is doing a lot of queries at a time, so under heavy load it can fall behind, but if you stop the load it does catch back up. I just feel like that could be better isolated, because a fall-behind is not only lag, it's lag plus the time it takes to catch back up. :/

Comment by Dwight Merriman [ 10/Jan/10 ]

There's no explicit prioritization of replication over other queries on the master, so if it is overloaded it would fall behind.

It's also possible with v1.2 and lower that an extremely long-running query is making it fall behind.

If there are no queries happening, I take it things keep up?

In general we find replication keeps up in most deployments, so we're trying to figure out the scenario.

Comment by Raj Kadam [ 07/Jan/10 ]

If you lock the db a lot by doing many read queries with cursors, or maybe it is just load from the high volume of reads, the slave falls behind by 2000 seconds or more.

Comment by Dwight Merriman [ 06/Jan/10 ]

Can you provide more detail?

Are you saying the slave lags by, say, 3 seconds, or that it falls way, way behind?

Comment by Raj Kadam [ 06/Jan/10 ]

I hope this fixes the issue where replication falls behind when slaves are under high read load. At this point, with 1.2.0, Mongo really has no replication.
