[SERVER-26753] Minor speed regression (13%) and 'choppy' performance in 3.4 vs 3.2 Created: 24/Oct/16  Updated: 23/Nov/16  Resolved: 07/Nov/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.4.0-rc1
Fix Version/s: 3.2.11, 3.4.0-rc3

Type: Task Priority: Minor - P4
Reporter: Nick Judson Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 0
Labels: code-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 3.2.10.metrics.2016-10-26T15-56-24Z-00000     File 3.2.9-high-cpu.metrics.2016-11-04T16-46-06Z-00000     File 3.2.metrics.2016-10-24T19-17-12Z-00000     File 3.4.metrics.2016-10-24T19-06-15Z-00000     PNG File Custom Build.png     File high-cpu.metrics.2016-11-04T16-20-56Z-00000     File high-cpu2.metrics.2016-11-04T16-37-01Z-00000     PNG File mongo.3.2.10.png     PNG File mongo.3.2.png     PNG File mongo3.4.png    
Backwards Compatibility: Fully Compatible
Backport Completed:
Sprint: Storage 2016-11-21
Participants:

 Description   

OS is Win2012R2.

When comparing my workload on 3.4.0-rc1 vs 3.2.10, there is a minor speed regression (13%), which appears to be mainly due to some intermittent activity (eviction?).

I see the CPU spike to 100% every ~10-20 seconds with a corresponding drop in throughput. In 3.2, the CPU stays relatively flat.

This is minor, but I thought you might like to know.



 Comments   
Comment by Githook User [ 07/Nov/16 ]

Author: Ramon Fernandez <ramon@mongodb.com>

Message: Import wiredtiger: 2e2d5fe23badac038944bdc6639eb62d77136f1a from branch mongodb-3.2

ref: b11ed312ce..2e2d5fe23b
for: 3.2.11

SERVER-26753 Minor speed regression (13%) and 'choppy' performance in 3.4 vs 3.2
WT-2415 Add option for join cursors to return false positives from a Bloom filter
WT-2945 Occasional hang running reconfigure fuzz test
WT-2947 test suite populate functions need refactoring
WT-2953 Test format failure - stuck cache with 1/42MB dirty
WT-2961 Add a version drop-down to the web version of the docs
WT-2964 WTPERF parallel-pop-btree hung in populate phase
WT-2968 backup cursor opens every file
WT-2971 Add documentation for raw-compression to the tune_compression page in WT docs
WT-2975 Reduce the memory allocated for statistics
WT-2991 Coverity issue 1365198
WT-2998 Add error messages to error returns that might be confusing.
WT-2999 Memory leak in join WT_CURSOR with custom extractor
WT-3001 WT_EXTENSION_API references are named inconsistently.
WT-3003 log generation should not auto generate the record and op types
WT-3005 Add top-level .gitignore file.
WT-3007 Remove duplicated word in comment
WT-3008 Run wtperf stress testing against all maintained branches
WT-3009 Test format hung/aborted with small cache (sub 10mB)
Branch: v3.2
https://github.com/mongodb/mongo/commit/7c626977cf052314d09648d1061e07c44e964264

Comment by Michael Cahill (Inactive) [ 07/Nov/16 ]

nick@innsenroute.com, thanks for the update: this is a little puzzling.

We can see clearly in the stats that the read-lock spinning has been resolved: in 3.2.9 your data shows occasional spikes in this statistic of 150M/s, and in 3.2.10 there were many more spikes, going up to 450M/s. With the patch applied, this statistic never goes above 166K/s (i.e., roughly 1000x lower).

I'm not sure what to make of the lower performance: we don't track system statistics such as CPU and disk in MongoDB 3.2, so I can't correlate performance with CPU use from this data.

I am going to close this ticket because I believe the underlying issue has been resolved by changes that will be in the next release of MongoDB. If you want more followup, please reopen or file a new ticket.

Comment by Githook User [ 07/Nov/16 ]

Author: Michael Cahill (michaelcahill) <michael.cahill@mongodb.com>

Message: Import wiredtiger: 74430da40c96247c0a0d92f011ccfb2b175f1f32 from branch mongodb-3.4

ref: ef9a7983ea..74430da40c
for: 3.4.0

SERVER-26753 Minor speed regression (13%) and 'choppy' performance in 3.4 vs 3.2
WT-2415 Add option for join cursors to return false positives
WT-2945 Occasional hang running reconfigure fuzz test
WT-2947 test suite populate functions need refactoring
WT-2953 Test format failure - stuck cache with 1/42MB dirty
WT-2961 Add a version drop-down to the web version of the docs
WT-2964 WTPERF parallel-pop-btree hung in populate phase
WT-2968 backup cursor opens every file
WT-2971 Add documentation for raw compression to the WT docs
WT-2975 Reduce the memory allocated for statistics
WT-2991 Coverity issue 1365198
WT-2998 Add error messages to error returns that might be confusing.
WT-2999 Memory leak in join WT_CURSOR with custom extractor
WT-3001 WT_EXTENSION_API references are named inconsistently.
WT-3003 log generation should not auto generate the record and op types
WT-3005 Add top-level .gitignore file.
WT-3007 Remove duplicated word in comment
WT-3008 Run wtperf stress testing against all maintained branches
WT-3009 Test format hung/aborted with small cache (sub 10mB)
Branch: master
https://github.com/mongodb/mongo/commit/0f0db45770f45eab89bd7c2530a259f5ed5b6334

Comment by Nick Judson [ 04/Nov/16 ]

...and after rebooting the MongoDB host, it appears to have reverted to the behavior I'm used to...

Comment by Nick Judson [ 04/Nov/16 ]

It must be something to do with my environment, since I see similar behavior in 3.2.9. Interestingly, 3.2.9 still shows better performance / lower CPU usage. I'm attaching the 3.2.9 run.

Comment by Nick Judson [ 04/Nov/16 ]

Run from scratch using custom build.

Comment by Nick Judson [ 04/Nov/16 ]

Michael Cahill - I came in this morning to run another test and re-tried this one (wiping everything clean this time). I don't know what has changed, but I'm now seeing different behavior, with long blocks of 100% CPU usage. I'm going to attach some more diagnostic files for you to look at. So... this may not be completely resolved yet.

Comment by Githook User [ 04/Nov/16 ]

Author: Michael Cahill (michaelcahill) <michael.cahill@mongodb.com>

Message: SERVER-26753 Add a call to spin on rwlocks with no active writers (#3126)
Branch: mongodb-3.2
https://github.com/wiredtiger/wiredtiger/commit/f418cea41b61f47fda740751495f9f5d39d814fd

Comment by Githook User [ 04/Nov/16 ]

Author: Michael Cahill (michaelcahill) <michael.cahill@mongodb.com>

Message: SERVER-26753 Don't spin on a read-lock in a tight loop. (#3113)

We could be starving a thread that we are waiting on of CPU.

  • Switch to yielding on rwlocks rather than sleeping.
  • Revert the attempt to be clever when spinning on rwlocks.
Comment by Githook User [ 04/Nov/16 ]

Author: Michael Cahill (michaelcahill) <michael.cahill@mongodb.com>

Message: SERVER-26753 Add a call to spin on rwlocks with no active writers (#3126)
Branch: mongodb-3.4
https://github.com/wiredtiger/wiredtiger/commit/f418cea41b61f47fda740751495f9f5d39d814fd

Comment by Githook User [ 04/Nov/16 ]

Author: Michael Cahill (michaelcahill) <michael.cahill@mongodb.com>

Message: SERVER-26753 Don't spin on a read-lock in a tight loop. (#3113)

We could be starving a thread that we are waiting on of CPU.

  • Switch to yielding on rwlocks rather than sleeping.
  • Revert the attempt to be clever when spinning on rwlocks.
Comment by Githook User [ 04/Nov/16 ]

Author: Michael Cahill (michaelcahill) <michael.cahill@mongodb.com>

Message: SERVER-26753 Add a call to spin on rwlocks with no active writers (#3126)
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/f418cea41b61f47fda740751495f9f5d39d814fd

Comment by Ramon Fernandez Marina [ 04/Nov/16 ]

Hi nick@innsenroute.com, thanks for testing the patch build and glad to hear it addressed this issue.

If you have other test cases where you see unexpected behavior, please open a new SERVER ticket for each so we can take a look. Remember that all these issues are eligible for Bug Hunt prizes, so be sure to set the "Affects Version/s" field to the specific release candidate you're using!

Thanks,
Ramón.

Comment by Nick Judson [ 03/Nov/16 ]

michael.cahill - yes, performance is flat as a pancake now. Thanks for looking into it! I have another test which simply uses the bulk-insert API instead of single inserts (same test as above). That test appears to be really taxing on Mongo/WT, and performance is often worse than in single-insert mode.

Would you recommend posting this to JIRA or perhaps to the google groups instead?

Thanks!

Comment by Michael Cahill (Inactive) [ 03/Nov/16 ]

nick@innsenroute.com, there is a build here:

https://evergreen.mongodb.com/version/5819599d3ff1221cb4000075

The Windows binaries are:

https://s3.amazonaws.com/mciuploads/mongodb-mongo-v3.2/windows-64-2k8-ssl/819946a376ba70cb6451a61695dcce42084f3863/binaries/mongo-mongodb_mongo_v3.2_windows_64_2k8_ssl_819946a376ba70cb6451a61695dcce42084f3863_16_11_02_03_14_24.zip

Please let me know if this resolves the spiky performance issues.

Comment by Nick Judson [ 02/Nov/16 ]

@Michael Cahill - sure. Send me a link and I'll try it out. Thanks!

Comment by Michael Cahill (Inactive) [ 02/Nov/16 ]

nick@innsenroute.com, we've figured out at least part of what is going on here: WiredTiger was spinning in a tight loop under some circumstances, leading to CPU starvation. A change has been merged into the WiredTiger tree, and will now go through more testing before being merged into MongoDB.

If you'd like to try out the change in the meantime to confirm that it improves things for you, please let me know and I'll prepare a build.
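
As an illustration of the failure mode described above (a reader spinning in a tight loop and starving the thread it is waiting on), here is a minimal C sketch. It is not the actual WiredTiger locking code; the toy_* names and the single writer flag are made up for this example, and the "fixed" variant only mirrors the spirit of the "yield rather than sleep" change from the commits on this ticket.

    #include <sched.h>       /* sched_yield() */
    #include <stdatomic.h>
    #include <stdbool.h>

    /* Hypothetical reader-writer lock state: a single atomic writer flag. */
    typedef struct {
        atomic_bool writer_active;
    } toy_rwlock;

    /*
     * Problematic pattern: burn CPU in a tight loop until the writer is done.
     * On a busy host this can starve the very thread (eviction, checkpoint)
     * that needs CPU to release the lock, producing periodic 100% CPU spikes.
     */
    static void toy_read_lock_busy(toy_rwlock *l) {
        while (atomic_load(&l->writer_active))
            ;  /* never gives up the CPU */
    }

    /*
     * Friendlier pattern: spin briefly, then yield the CPU so the blocking
     * thread can make progress.
     */
    static void toy_read_lock_yield(toy_rwlock *l) {
        int spins = 0;
        while (atomic_load(&l->writer_active)) {
            if (++spins < 100)
                continue;          /* short optimistic spin */
            spins = 0;
            (void)sched_yield();   /* let the writer run */
        }
    }

    int main(void) {
        toy_rwlock lock = { .writer_active = false };
        toy_read_lock_busy(&lock);    /* both return immediately: no writer held */
        toy_read_lock_yield(&lock);
        return 0;
    }

On Windows (the reporter's platform) the equivalent of sched_yield() would be SwitchToThread(); the point here is only the shape of the loop, not the exact primitives WiredTiger uses.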

Comment by Githook User [ 02/Nov/16 ]

Author: Michael Cahill (michaelcahill) <michael.cahill@mongodb.com>

Message: SERVER-26753 Don't spin on a read-lock in a tight loop. (#3113)

We could be starving a thread that we are waiting on of CPU.

  • Switch to yielding on rwlocks rather than sleeping.
  • Revert the attempt to be clever when spinning on rwlocks.
Comment by Nick Judson [ 27/Oct/16 ]

@Michael Cahill, I can set you up with my workload - I've sent you an email.

Comment by Michael Cahill (Inactive) [ 27/Oct/16 ]

nick@innsenroute.com, thanks for the updated data. What we see in the diagnostic data is that there's nothing obviously going wrong when these spikes occur: sometimes they correlate with checkpoints, but sometimes they happen independently of checkpoints; operations aren't waiting on I/O, the cache isn't full, etc.

There is a strong correlation with spikes in read-lock operations – in other words, it looks like we're spinning trying to get shared access to something. I have some ideas about what could be causing this, because we made a change to the way we access the data structure representing the transaction table. If it's easy for me to run your workload, that would be great; otherwise we'll try to reproduce this ourselves.
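
To illustrate why a per-operation shared lock shows up this way in the statistics, here is a small, purely hypothetical C sketch of operations snapshotting a shared "transaction table" under a read lock. The toy_* names are invented and do not reflect WiredTiger's actual data structures; the sketch only shows that every operation performs a read-lock call, so the read-lock statistic tracks throughput and spikes when readers have to retry to get shared access.

    #include <pthread.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical shared state that every operation must read. */
    typedef struct {
        pthread_rwlock_t lock;
        uint64_t current_txn_id;
    } toy_txn_table;

    static toy_txn_table table = {
        .lock = PTHREAD_RWLOCK_INITIALIZER,
        .current_txn_id = 0,
    };

    /* Each operation begins by taking a snapshot under a shared lock. */
    static uint64_t toy_txn_snapshot(void) {
        uint64_t id;
        pthread_rwlock_rdlock(&table.lock);  /* one read-lock call per operation */
        id = table.current_txn_id;
        pthread_rwlock_unlock(&table.lock);
        return id;
    }

    int main(void) {
        /* A burst of operations produces a matching burst of read-lock calls. */
        for (int i = 0; i < 1000; i++)
            (void)toy_txn_snapshot();
        printf("last snapshot id: %llu\n",
               (unsigned long long)toy_txn_snapshot());
        return 0;
    }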

Comment by Nick Judson [ 26/Oct/16 ]

@Alexander Gorrod:

  • Replication is not enabled
  • Journaling is enabled
  • I'm using the C# driver defaults
  • The write load should be very consistent, as is the read load.

I can provide a copy of my workload, although I think someone over there already has it.

Comment by Nick Judson [ 26/Oct/16 ]

Attaching the same run for 3.2.10. I'm seeing similar behavior in 3.2.10 to what I saw in 3.4.0-rc1, although perhaps the spikes are not happening quite as often.

Comment by Alexander Gorrod [ 26/Oct/16 ]

Hi nick@innsenroute.com, thanks for the report and sorry about the performance regression in your workload. We have done some analysis on the data you uploaded, and the cause of the behavior difference isn't obvious.

We did notice that the 3.2 statistics are from MongoDB 3.2.9, whereas you said it was running 3.2.10. There are significant changes to WiredTiger between 3.2.9 and 3.2.10 - could you re-run the test with 3.2.10 and let us know the results?

It would also be useful if you can describe the configuration and workload, including:

  • Whether replication is enabled.
  • Whether journaling is enabled.
  • Whether you are doing j:true writes.
  • How stable the incoming requests are (for example, it appears as though there are periodic increases in query load).

You asked: "I also noticed that 3.4 was using less memory. With WT capped at 3GB, it was sitting at ~2.7GB vs ~3.2GB for 3.2. Are you using the more aggressive decommit in 3.4?"

We have made a lot of changes to how WiredTiger manages the cache recently; some of those changes are aimed at reducing the amount of memory allocator fragmentation we see (i.e., allocator overhead). It is good to know that the changes have had the desired effect. To explain why the usage falls below 3GB overall: WiredTiger attempts to maintain used cache at 80% of the configured maximum (2.4GB for your configuration), and I can see from the statistics that cache usage is being maintained close to that. The other ~300MB you observe is used for allocations outside of WiredTiger and fragmentation in the allocator cache.
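
To make the arithmetic above concrete, here is a tiny C sketch using the figures quoted in this comment (3GB configured cache, the 80% eviction target, and the ~2.7GB process memory reported for 3.4); the variable names are mine:

    #include <stdio.h>

    int main(void) {
        const double cache_max_gb    = 3.0;   /* configured WiredTiger cache size */
        const double eviction_target = 0.80;  /* fraction of cache WT tries to stay at */
        const double observed_rss_gb = 2.7;   /* process memory reported for 3.4 */

        double target_cache_gb = cache_max_gb * eviction_target;    /* 2.4 GB */
        double non_cache_gb    = observed_rss_gb - target_cache_gb; /* ~0.3 GB */

        printf("cache kept resident:    %.1f GB\n", target_cache_gb);
        printf("outside-cache overhead: ~%.1f GB\n", non_cache_gb);
        return 0;
    }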

Comment by Daniel Pasette (Inactive) [ 24/Oct/16 ]

Thanks for the report Nick. We'll take a look at the diagnostic data.

Comment by Nick Judson [ 24/Oct/16 ]

I also noticed that 3.4 was using less memory. With WT capped at 3GB, it was sitting at ~2.7GB vs ~3.2GB for 3.2. Are you using the more aggressive decommit in 3.4?
