[SERVER-32129] Investigate if initial sync workload is actually timing initial sync or write load
Created: 22/Nov/17  Updated: 30/Oct/23  Resolved: 11/Jul/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.1.1

Type: Task
Priority: Major - P3
Reporter: Judah Schvimer
Assignee: David Daly
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
Backwards Compatibility: Fully Compatible
Participants:

 Description   

The write load takes up almost the entirety of the test. It's very likely that the test is timing the write load and that the initial sync finished much earlier.

[2017/11/21 11:33:24.538] Adding last node
[2017/11/21 11:33:24.538] Write load started
[2017/11/21 11:33:24.538] connecting to: mongodb://10.2.0.200:27017/
[2017/11/21 11:33:24.538] MongoDB server version: 3.7.0-63-ga7064e0
[2017/11/21 11:36:55.415] Write load finished
...
[2017/11/21 11:36:58.419] Third node now successfully in SECONDARY state.
[2017/11/21 11:36:58.419] Tue Nov 21 2017 11:36:58 GMT+0000 (UTC)
[2017/11/21 11:36:58.419] Total time for initial sync: 213.919 seconds, or 213.915 seconds
[2017/11/21 11:36:58.460] >>> initialsync_dbs_1_colls_1_writeload_true : 24508.715915837303 1
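
The overlap can be checked with a little arithmetic on the timestamps in the log excerpt above. This is only a rough sketch: it assumes the log timestamps reflect when each event happened (several lines share the same 11:33:24.538 stamp, so they may be flush times rather than event times), and the variable names are made up for illustration.

```python
from datetime import datetime

# Timestamps and the reported total are copied from the log excerpt above.
FMT = "%Y/%m/%d %H:%M:%S.%f"
write_load_started = datetime.strptime("2017/11/21 11:33:24.538", FMT)
write_load_finished = datetime.strptime("2017/11/21 11:36:55.415", FMT)
secondary_logged = datetime.strptime("2017/11/21 11:36:58.419", FMT)
reported_sync_s = 213.919  # "Total time for initial sync" from the log

# Duration of the write load according to the log timestamps.
write_load_s = (write_load_finished - write_load_started).total_seconds()
# Time between the write load finishing and the SECONDARY log line.
tail_s = (secondary_logged - write_load_finished).total_seconds()
# Fraction of the reported initial sync time covered by the write load.
share = write_load_s / reported_sync_s

print(f"write load: {write_load_s:.3f} s")
print(f"write load finished -> SECONDARY logged: {tail_s:.3f} s")
print(f"write load share of reported initial sync time: {share:.1%}")
```

Under those assumptions the write load accounts for about 210.9 of the 213.9 reported seconds, which is consistent with the timer mostly measuring the write load.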



 Comments   
Comment by David Daly [ 11/Jul/18 ]

Yes, I think so. Thanks for poking. I'll resolve.

Comment by Tess Avitabile (Inactive) [ 11/Jul/18 ]

david.daly, is this work complete?

Comment by Spencer Brody (Inactive) [ 29/May/18 ]

david.daly, yeah that sounds good. Thanks!

Comment by David Daly [ 09/May/18 ]

spencer, let's figure out what the work is, and then decide. I'd propose moving the insert workload into a separate thread, and adding an assert that waitForInitialSyncFinish actually has to wait (i.e., it's called before the initial sync is done). Does that sound like a plan? If so, I think that could be a solitary perf ticket.
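
A minimal sketch of that proposal, written in Python rather than in the workload's own language and not taken from the actual test. The names `time_initial_sync`, `sync_done` and `write_once` are hypothetical stand-ins: `sync_done` plays the role of "the new node has reached SECONDARY", and `write_once` stands in for a single insert.

```python
import threading
import time


def time_initial_sync(sync_done: threading.Event, write_once, timeout_s: float = 60.0) -> float:
    """Time a (simulated) initial sync while a background thread keeps writing.

    Hypothetical helper: sync_done is set by whatever detects the end of the
    sync; write_once performs one write against the primary.
    """
    stop_writes = threading.Event()

    def write_load():
        # Keep inserting until the main thread says the measurement is over.
        while not stop_writes.is_set():
            write_once()

    writer = threading.Thread(target=write_load, daemon=True)
    start = time.monotonic()
    writer.start()
    try:
        # The proposed assert: if the sync is already done when we start
        # waiting, the wait is a no-op and the measurement is meaningless.
        if sync_done.is_set():
            raise AssertionError("initial sync finished before the wait began")
        if not sync_done.wait(timeout_s):
            raise TimeoutError("initial sync did not finish in time")
        return time.monotonic() - start
    finally:
        # Always stop and join the writer, even if the assert fires.
        stop_writes.set()
        writer.join()
```

Because the writer runs until the sync completes, the number of inserts varies from run to run in this sketch.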

Comment by Spencer Brody (Inactive) [ 09/May/18 ]

david.daly, we are planning work to improve initial sync performance in 4.2. This seems like something we likely want to address before we do that. Is this something the replication team should plan to budget into our 'faster initial sync' project, or can the PERF team pick this up?

Comment by Judah Schvimer [ 19/Dec/17 ]

A separate thread could work, though it would have a variable number of inserts. Just making the number of inserts smaller might be better.

Comment by David Daly [ 27/Nov/17 ]

Which w:3 write are you referring to? The workload uses the "waitForMemberState" function to determine when initial sync finishes. I notice now that we wait for replication to finish even after we're in "SECONDARY" state but before we stop the initial sync timer, which is not technically part of initial sync (though likely adds very little time). We should move that waiting to after we stop the timer.

Sorry – I was referring to a w:3 write that was replaced about 1.5 years ago.

Comment by Judah Schvimer [ 27/Nov/17 ]

https://evergreen.mongodb.com/task/sys_perf_linux_3_node_replSet_initialsync_initialsync_WT_a7064e087959693534b771962a1b3ad413144240_17_11_20_13_40_31##%257B%2522perftab%2522%253A1%257D

Is it possible that with the background load the initial sync never completes until the write load completes?

It's certainly possible, and I'd have to spend more time digging into the mongod logs to investigate, but I think that would mean even a slight improvement in initial sync performance would make the initial sync complete before the write load finished and we'd miss the performance improvement due to this bug. Since the write load is insert-only, once we finish cloning all of the data, the initial sync should not need to apply any new operations that come in, and should only need to apply the write operations it's already seen.

Or alternatively that the w:3 write gets completely starved by the background load?

Which w:3 write are you referring to? The workload uses the "waitForMemberState" function to determine when initial sync finishes. I notice now that we wait for replication to finish even after we're in "SECONDARY" state but before we stop the initial sync timer, which is not technically part of initial sync (though likely adds very little time). We should move that waiting to after we stop the timer.
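
The reordering suggested here can be sketched as follows. This is an illustrative Python outline, not the workload's code: `wait_for_secondary` and `wait_for_replication` are hypothetical callables standing in for the "waitForMemberState" check and the later wait for replication, and `clock` is injectable only so the ordering can be checked deterministically.

```python
import time


def time_until_secondary(wait_for_secondary, wait_for_replication, clock=time.monotonic):
    """Return the seconds spent reaching SECONDARY, excluding the replication wait."""
    start = clock()
    wait_for_secondary()       # blocks until the new node reports SECONDARY
    elapsed = clock() - start  # stop the initial sync timer here
    wait_for_replication()     # still performed, but no longer timed
    return elapsed
```

With a fake clock where reaching SECONDARY takes 5 time units and the replication wait takes 2, the function returns 5 rather than 7.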

Comment by David Daly [ 27/Nov/17 ]

judah.schvimer could you add a link to the run?

Is it possible that with the background load the initial sync never completes until the write load completes? Or alternatively that the w:3 write gets completely starved by the background load?
