[SERVER-35786] lastDurable optime should be updated after batch application on non-durable storage engines Created: 25/Jun/18  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 0
Labels: former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-22728 if journaling is disabled, update dur... Closed
related to SERVER-37943 Enable replica set transactions for t... Closed
related to SERVER-38685 Startup warning if In-Memory SE is us... Closed
Assigned Teams:
Replication
Operating System: ALL
Sprint: Repl 2018-07-30, Repl 2018-08-13, Repl 2018-08-27, Repl 2018-09-10
Participants:
Linked BF Score: 17

 Description   

For secondary batch application, the ApplyBatchFinalizer is used to advance optimes after application of an oplog batch completes. We currently use the ApplyBatchFinalizerForJournal for durable storage engines and the base ApplyBatchFinalizer, which only updates the lastApplied optime, for non-durable storage engines. On primaries running non-durable storage engines, the replication system keeps the lastDurable optime up to date with the lastApplied optime, since the lastDurable optime has no functional meaning on a non-durable storage engine. This behavior should be consistent between primaries and secondaries, so we should also update the lastDurable optime after batch application on non-durable storage engines.
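
A minimal self-contained sketch of the proposed behavior (the names ReplCoord, finalizeBatch, and engineIsDurable are illustrative stand-ins, not the actual server API):

    #include <algorithm>

    struct OpTime {
        long long ts = 0;  // stand-in for the real (Timestamp, term) pair
        bool operator<(const OpTime& other) const { return ts < other.ts; }
    };

    struct ReplCoord {
        OpTime lastApplied, lastDurable;
        void setMyLastAppliedOpTimeForward(OpTime t) { lastApplied = std::max(lastApplied, t); }
        void setMyLastDurableOpTimeForward(OpTime t) { lastDurable = std::max(lastDurable, t); }
    };

    // Finalizer for non-durable engines (hypothetical fix): advance lastDurable
    // together with lastApplied, mirroring what primaries already do.
    void finalizeBatch(ReplCoord& coord, OpTime batchEnd, bool engineIsDurable) {
        coord.setMyLastAppliedOpTimeForward(batchEnd);
        if (!engineIsDurable) {
            // Nothing will ever be journaled on this engine, so treat the
            // applied optime as durable, as the primary path already does.
            coord.setMyLastDurableOpTimeForward(batchEnd);
        }
    }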



 Comments   
Comment by Spencer Brody (Inactive) [ 29/Aug/18 ]

Spent some time investigating the current behavior here and it's a bit interesting. The following describes the behavior of a single insert with various configurations and writeConcerns specified.

Primary inMemory, Secondary WT, writeConcernMajorityJournalDefault: true

  1. No writeConcern specified: No error, lastApplied updated but not lastDurable on the primary; both lastApplied and lastDurable updated on the secondary.
  2. w:majority without 'j' specified: No error, both lastApplied and lastDurable advance on both nodes.
  3. w:majority with j:true specified explicitly: Error "cannot use 'j' option when a host does not have journaling enabled" returned. No write is performed, so no optimes advance.
  4. w:majority with j:false specified: Write is successful, writeConcern times out. lastApplied is updated but not lastDurable on the primary; both are updated on the secondary.

I believe #2 can be explained by this line, which waits for durability and then unconditionally sets lastOpDurable to the lastApplied. This seems like problematic behavior, however: since j wasn't specified and writeConcernMajorityJournalDefault is true, I'd expect this to error in some way.
#4 above is pretty surprising, and I haven't yet looked into why it behaves this way.
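
Paraphrased, the suspected logic behind #2 looks roughly like this (illustrative only, reusing the ReplCoord sketch from the description; not the actual server source):

    struct StorageEngine {
        void waitUntilDurable() {
            // No-op on an in-memory engine: there is no journal to flush,
            // so this returns immediately.
        }
    };

    void waitForDurabilityThenAdvance(StorageEngine& engine, ReplCoord& coord) {
        engine.waitUntilDurable();
        // Unconditional: lastDurable is bumped to lastApplied even though
        // nothing was actually made durable.
        coord.setMyLastDurableOpTimeForward(coord.lastApplied);
    }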

 
Primary WT, Secondary inMemory, writeConcernMajorityJournalDefault: true

  1. No writeConcern specified: No error, both OpTimes advanced on primary, lastApplied but not lastDurable updated on secondary
  2. w:majority without 'j' specified: Write is successful, writeConcern times out. Both OpTimes advanced on primary, lastApplied but not lastDurable updated on secondary
  3. w:majority with j: true specified explicitly: Write is successful, writeConcern times out. Both OpTimes advanced on primary, lastApplied but not lastDurable updated on secondary
  4. w:majority with j:false specified: Write is successful, writeConcern times out. Both OpTimes advanced on primary, lastApplied but not lastDurable updated on secondary

All 4 cases behave the same. It's a bit surprising that #4 still times out waiting for writeConcern even though j:false is specified.

 
Primary WT, Secondary inMemory, writeConcernMajorityJournalDefault: false

  1. No writeConcern specified: No error, both OpTimes advanced on primary, lastApplied but not lastDurable updated on secondary
  2. w:majority without 'j' specified: No error, both OpTimes advanced on primary, lastApplied but not lastDurable updated on secondary
  3. w:majority with j: true specified explicitly: Write is successful, writeConcern times out. Both OpTimes advanced on primary, lastApplied but not lastDurable updated on secondary
  4. w:majority with j:false specified: No error, both OpTimes advanced on primary, lastApplied but not lastDurable updated on secondary

No surprises here.

 
Primary inMemory, Secondary WT, writeConcernMajorityJournalDefault: false

  1. No writeConcern specified: No error, lastApplied updated but not lastDurable on the primary, both lastApplied and lastDurable updated on secondary
  2. w:majority without 'j' specified: No error, lastApplied updated but not lastDurable on the primary, both lastApplied and lastDurable updated on secondary
  3. w:majority with j: true specified explicitly: Error "cannot use 'j' option when a host does not have journaling enabled" returned.  No write is performed so no optimes advance.
  4. w:majority with j:false specified: No error, lastApplied updated but not lastDurable on the primary, both lastApplied and lastDurable updated on secondary

No surprises here.
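
A rough model of why writeConcernMajorityJournalDefault flips the outcomes across these four tables (illustrative only; optimeToWaitOn is a hypothetical helper, not the server's actual API):

    // Which per-node optime a w:"majority" waiter tracks, as a function of the
    // write concern's j field and writeConcernMajorityJournalDefault.
    enum class WaitOn { Applied, Durable };

    WaitOn optimeToWaitOn(bool jExplicit, bool jValue, bool wcMajorityJournalDefault) {
        if (jExplicit) {
            return jValue ? WaitOn::Durable : WaitOn::Applied;
        }
        // j unspecified: fall back to the replica set default.
        return wcMajorityJournalDefault ? WaitOn::Durable : WaitOn::Applied;
    }

Under this model, a secondary whose lastDurable never advances causes every WaitOn::Durable wait to time out, which matches the two inMemory-secondary tables above except for case #4 of the writeConcernMajorityJournalDefault: true table, the surprise flagged earlier.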

Comment by Tess Avitabile (Inactive) [ 03/Jul/18 ]

We should investigate whether having an in-memory primary node keep lastDurable up to date with lastApplied causes us to incorrectly confirm majority (durable) writes.
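
To make the concern concrete, a toy model of majority-durable confirmation (majorityDurableOpTime is a hypothetical stand-in for the server's committed-optime calculation):

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // A j:true majority write is confirmed once a majority of nodes report a
    // lastDurable at or past the write's optime. If an in-memory node reports
    // lastDurable == lastApplied, it counts toward that majority even though
    // its data was never journaled.
    long long majorityDurableOpTime(std::vector<long long> lastDurablePerNode) {
        std::sort(lastDurablePerNode.begin(), lastDurablePerNode.end());
        const std::size_t n = lastDurablePerNode.size();
        const std::size_t majority = n / 2 + 1;
        // The (n - majority)th smallest value is durable on at least a majority.
        return lastDurablePerNode[n - majority];
    }

If the in-memory node's lastDurable tracks its lastApplied, a j:true majority write could be confirmed without a majority of nodes actually having the write on disk.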

Comment by William Schultz (Inactive) [ 25/Jun/18 ]

milkie I came across this behavior when diagnosing a build failure that occurred specifically on the ephemeralForTest storage engine. What I observed was that the lastDurable optime on a secondary was not advancing during normal steady state replication, but an update to it was later triggered by another (internal) write happening in the system; in this case it was the writing of our "last vote" document to storage. This then seemed to cause the durable optime to advance, and because of this, we triggered an updatePosition request to our sync source, which ended up interfering with other commands in an unintended way (SERVER-35766). That isn't explicitly related to this issue, but that is how I discovered this.

When I noticed that we weren't updating our lastDurable optime during batch application, it seemed incorrect. Perhaps the behavior I was observing could also be due to an ephemeralForTest engine bug? I wasn't entirely sure.

Maybe the existing behavior is acceptable, but I suppose we should at least decide what we want the behavior to be, since it certainly appears that we try to keep lastDurable optimes up to date with lastApplied optimes on the primary.

Comment by Eric Milkie [ 25/Jun/18 ]

I'm not sure we should make this change unless there's a clear advantage to it. I suspect it won't be trivial to change this behavior, and it would change the use of the writeConcernMajorityJournalDefault parameter, since you would no longer need to change it when setting up a replica set with non-durable nodes.
