[SERVER-2464] master slave inconsistencies when running master in dur mode and restarting
Created: 02/Feb/11  Updated: 31/Jan/12  Resolved: 31/Jan/12

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Aaron Staple
Assignee: Dwight Merriman
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: test.js
Issue Links: depends on SERVER-2506 "reduce group commit latency" (Closed)
Operating System: ALL
Participants:

 Description   

If there is heavy write locking on the master, the slave may read operations from the oplog before they are committed on the master. Two problems can follow:

1) Data that the master loses on restart (because it was never committed) may still be present on the slave. I have observed this in testing.
2) The slave may record a syncedTo point that is uncommitted on the master. On restart the master will not return the expected op time to the slave, and the slave will raise a sync exception that forces a reclone. I haven't seen this in testing, but it may be possible.
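
The two failure modes can be illustrated with a toy model. This is only a sketch and not MongoDB code: the Master and Slave classes, groupCommit, crashAndRestart, and the integer op times are all hypothetical names invented for the illustration. It assumes that oplog entries become readable by the slave as soon as they are written, while durability only advances at a separate group commit.

```javascript
// Toy model of the race described above. All names are hypothetical;
// nothing here is taken from the MongoDB source.
class Master {
  constructor() {
    this.oplog = [];       // entries are readable by slaves as soon as written
    this.committedTs = 0;  // highest op time made durable by a group commit
    this.nextTs = 1;
  }
  write(doc) {
    const op = { ts: this.nextTs++, doc };
    this.oplog.push(op);
    return op.ts;
  }
  groupCommit() {
    // Everything written so far becomes durable.
    if (this.oplog.length > 0) {
      this.committedTs = this.oplog[this.oplog.length - 1].ts;
    }
  }
  readOplogAfter(ts) {
    // No check against committedTs: uncommitted ops are handed out too.
    return this.oplog.filter((op) => op.ts > ts);
  }
  crashAndRestart() {
    // Only committed operations survive the restart.
    this.oplog = this.oplog.filter((op) => op.ts <= this.committedTs);
    this.nextTs = this.committedTs + 1;
  }
}

class Slave {
  constructor() {
    this.docs = [];
    this.syncedTo = 0;
  }
  pull(master) {
    for (const op of master.readOplogAfter(this.syncedTo)) {
      this.docs.push(op.doc);
      this.syncedTo = op.ts;
    }
  }
}

const master = new Master();
const slave = new Slave();

master.write("a");
master.groupCommit();     // "a" is durable
master.write("b");        // "b" is in the oplog but not yet committed
slave.pull(master);       // slave applies both "a" and "b"
master.crashAndRestart(); // master loses "b"

const masterDocs = master.oplog.map((op) => op.doc);
console.log("master docs:", masterDocs);
console.log("slave docs:", slave.docs);
console.log("slave syncedTo:", slave.syncedTo, "master committedTs:", master.committedTs);
```

After the restart the slave still holds "b", which the master no longer has (case 1), and the slave's syncedTo is ahead of the last op time the master can return (case 2).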



 Comments   
Comment by Aaron Staple [ 09/Feb/11 ]

Haven't analyzed the repl sets or sharding cases at all. Can do that if you want.

Comment by Dwight Merriman [ 09/Feb/11 ]

for master/slave we could perhaps wait to replicate until the data is committed (master/slave only, not repl sets)

but we shouldn't make that change until we do some other optimizations that reduce the group commit latency.
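
A rough sketch of what "wait to replicate until the data is committed" could look like, in the same toy style. This is an assumption about the shape of the idea, not the actual fix: GatedMaster, readCommittedOplogAfter, and the integer op times are hypothetical. The only change from an ungated master is that reads of the oplog are capped at the commit point.

```javascript
// Hypothetical sketch: the master only exposes oplog entries at or below
// its last group commit. Not MongoDB code.
class GatedMaster {
  constructor() {
    this.oplog = [];
    this.committedTs = 0; // highest op time made durable by a group commit
    this.nextTs = 1;
  }
  write(doc) {
    const op = { ts: this.nextTs++, doc };
    this.oplog.push(op);
    return op.ts;
  }
  groupCommit() {
    if (this.oplog.length > 0) {
      this.committedTs = this.oplog[this.oplog.length - 1].ts;
    }
  }
  readCommittedOplogAfter(ts) {
    // Uncommitted ops stay invisible to slaves until the next group commit.
    return this.oplog.filter((op) => op.ts > ts && op.ts <= this.committedTs);
  }
  crashAndRestart() {
    this.oplog = this.oplog.filter((op) => op.ts <= this.committedTs);
    this.nextTs = this.committedTs + 1;
  }
}

const gatedMaster = new GatedMaster();
const gatedSlave = { docs: [], syncedTo: 0 };

function pullCommitted(slaveState, source) {
  for (const op of source.readCommittedOplogAfter(slaveState.syncedTo)) {
    slaveState.docs.push(op.doc);
    slaveState.syncedTo = op.ts;
  }
}

gatedMaster.write("a");
gatedMaster.groupCommit();
gatedMaster.write("b");                 // written but not yet committed
pullCommitted(gatedSlave, gatedMaster); // slave only sees "a"
gatedMaster.crashAndRestart();          // "b" is lost on the master, never reached the slave

const gatedMasterDocs = gatedMaster.oplog.map((op) => op.doc);
console.log("master docs:", gatedMasterDocs);
console.log("slave docs:", gatedSlave.docs);
```

The trade-off is that the slave now lags by up to one group commit interval, which is why the change is tied to reducing group commit latency first (SERVER-2506).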

Comment by Dwight Merriman [ 09/Feb/11 ]

so i suppose with replica sets it's ok: the secondary, having fresher data, will take over?

so the problem is classic master/slave only. (right?)

Comment by Aaron Staple [ 08/Feb/11 ]

Here is the script I used to trigger this.
