[SERVER-31905] in 3.6, __wt_update_obsolete_check uses significant CPU Created: 10/Nov/17  Updated: 12/Dec/17  Resolved: 21/Nov/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.6.0-rc3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jim Van Fleet Assignee: Keith Bostic (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File ba357b7.diff     File diag_log_3,4.tar.gz     File diag_log_3_6rc3.tar.gz     File jira.tar    
Issue Links:
Duplicate
duplicates WT-3752 WiredTiger can walk long update chain... Closed
Related
related to WT-3754 serial updates to the oldest should c... Closed
Operating System: ALL
Steps To Reproduce:

This regularly occurs after a few minutes of running the YCSB benchmark. It happens on both x86 and Power.

Participants:

 Description   

The calls to __wt_update_obsolete_check from __col_instantiate and __row_instantiate were not present in 3.4. Between 3.4 and 3.6-rc2 these calls were added and, as a result, __wt_update_obsolete_check CPU usage builds to over 20% of the system.
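For context on the mechanism: WiredTiger keeps per-key chains of updates, and the obsolete check walks a chain to find and discard entries no longer visible to any reader, so its cost grows with chain length. The standalone C sketch below is purely illustrative (the types and names update_t, trim_obsolete, and oldest_reader_txn are hypothetical, not WiredTiger internals); it only shows why running a linear chain walk on every instantiate/modify becomes expensive once chains grow long.

/*
 * Hypothetical sketch, not WiredTiger source: a per-key update chain and an
 * obsolete check that walks it.  The walk is linear in chain length, so if
 * the check runs on every instantiate/modify while chains keep growing, its
 * share of CPU grows with them.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef struct update {
    uint64_t txnid;       /* transaction that wrote this update */
    struct update *next;  /* next-older update for the same key */
} update_t;

/* Free every update older than the newest one all readers can see. */
static size_t
trim_obsolete(update_t *head, uint64_t oldest_reader_txn)
{
    update_t *upd, *obsolete, *next;
    size_t freed = 0;

    /* Linear walk: find the newest globally visible update. */
    for (upd = head; upd != NULL; upd = upd->next)
        if (upd->txnid < oldest_reader_txn)
            break;
    if (upd == NULL)
        return (0);

    /* Anything older can never be read again: unlink and free it. */
    obsolete = upd->next;
    upd->next = NULL;
    for (; obsolete != NULL; obsolete = next) {
        next = obsolete->next;
        free(obsolete);
        ++freed;
    }
    return (freed);
}

int
main(void)
{
    update_t *head = NULL, *upd, *next;
    uint64_t id;

    /* Build a long chain: newest update (highest txn ID) at the head. */
    for (id = 1; id <= 100000; ++id) {
        if ((upd = malloc(sizeof(*upd))) == NULL)
            return (1);
        upd->txnid = id;
        upd->next = head;
        head = upd;
    }

    /* With the oldest reader pinned at txn 90000, most of the chain is dead. */
    printf("freed %zu obsolete updates\n", trim_obsolete(head, 90000));

    for (; head != NULL; head = next) {
        next = head->next;
        free(head);
    }
    return (0);
}

A real engine also has to coordinate such a trim with concurrent readers and writers; the sketch deliberately ignores that and only models the cost of the walk itself.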



 Comments   
Comment by Keith Bostic (Inactive) [ 21/Nov/17 ]

Fixed by WT-3752.

Comment by Keith Bostic (Inactive) [ 21/Nov/17 ]

Thank you, jvf!

I'll go ahead and close this report, then; please don't hesitate to re-open it if there are any follow-on issues.

Comment by Jim Van Fleet [ 21/Nov/17 ]

Keith – I just tried this fix – I think it takes care of the problem that I reported. I want to do some more testing and will finish up tomorrow – nice job!

Comment by Keith Bostic (Inactive) [ 20/Nov/17 ]

jvf, there's a change we put into place last night that I believe may fix this problem; would you be able to test whether it fixes the problem you're seeing?

I've attached the diff to this ticket (ba357b7.diff), and the patch would be applied in the directory mongo/src/third_party/wiredtiger/src/btree/, to the file __row_modify.c.

Comment by Jim Van Fleet [ 15/Nov/17 ]

I slightly lied - we're using workload "mixed", not workloada (they are both 50% read, 50% write).

I included another snippet of perf top, the start script, and the YCSB workloads used by the script in the attached file.

The script is generally started with something like:

sh tesCursor.sh localhost 27017 20 3_6fix

Jim

Comment by Keith Bostic (Inactive) [ 15/Nov/17 ]

jvf, would you please post a copy of the workloada that you're running, and any command-line arguments?

Thanks!

Comment by Jim Van Fleet [ 15/Nov/17 ]

Hi Keith –

My bad – what I sent was a snippet from perf top, and somehow I put "localhost....*" where __wt_update_obsolete_check should have been – it is the 16%+ item.

My apologies

Jim

Comment by Keith Bostic (Inactive) [ 15/Nov/17 ]

Hi, jvf. You said:

The calls to __wt_update_obsolete_check from __col_instantiate and __row_instantiate were not present in 3.4. Between 3.4 and 3.6-rc2 these calls were added and, as a result, __wt_update_obsolete_check CPU usage builds to over 20% of the system.

But I'm not seeing __row_instantiate or __wt_update_obsolete_check in the perf report? I'm sure there's something obvious I'm missing – it makes sense to me that if the lookaside file is getting involved they could be there, but I don't see them?

alexander.gorrod, this is a zipfian distribution with modify, so yes, I suspect it's similar to WT-3739.
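To make the "zipfian distribution with modify" point concrete: under a zipfian request distribution a small number of hot keys receive a large share of the updates, so their update chains grow far faster than average, which is exactly where a linear obsolete check hurts. The standalone sketch below is not YCSB's generator; the key count, operation count, and skew constant (THETA = 0.99, commonly cited as YCSB's default) are assumed values chosen only to show how concentrated the accesses are.

/*
 * Illustrative only: sample a zipfian distribution over NKEYS keys and count
 * how many of NOPS operations land on the hottest keys.  The constants are
 * assumptions, not taken from the ticket's workload files.
 */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define NKEYS 100000
#define NOPS  1000000
#define THETA 0.99

int
main(void)
{
    static double cdf[NKEYS];
    static long hits[NKEYS];
    double norm = 0.0, acc = 0.0, u;
    long top10 = 0;
    int i, lo, hi, mid, op;

    /* Precompute the cumulative distribution of a zipfian over NKEYS keys. */
    for (i = 0; i < NKEYS; ++i)
        norm += 1.0 / pow(i + 1, THETA);
    for (i = 0; i < NKEYS; ++i) {
        acc += 1.0 / pow(i + 1, THETA) / norm;
        cdf[i] = acc;
    }

    /* Draw NOPS operations and tally which key each one touches. */
    srand(42);
    for (op = 0; op < NOPS; ++op) {
        u = (double)rand() / RAND_MAX;
        lo = 0;
        hi = NKEYS - 1;
        while (lo < hi) {              /* binary search of the CDF */
            mid = (lo + hi) / 2;
            if (cdf[mid] < u)
                lo = mid + 1;
            else
                hi = mid;
        }
        ++hits[lo];
    }

    for (i = 0; i < 10; ++i)
        top10 += hits[i];
    printf("hottest key: %ld ops, hottest 10 keys: %ld of %d ops\n",
           hits[0], top10, NOPS);
    return (0);
}

With these assumed constants the single hottest key receives roughly 8% of all operations, so its update chain accumulates entries far faster than a typical key's does.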

Comment by Alexander Gorrod [ 15/Nov/17 ]

keith.bostic, I've assigned this ticket to you because I believe it may be a different manifestation of the issue outlined in WT-3739, since it is running the same workload. I haven't had a chance to look at the diagnostic data yet, but I thought pairing the tickets might help circle around to the underlying cause of the performance regression.

Comment by Jim Van Fleet [ 13/Nov/17 ]

[attachment "diag_log_3_6rc3.tar.gz" deleted by Jim Van
Fleet/Austin/Contr/IBM]
[attachment "diag_log_3,4.tar.gz" deleted by Jim Van
Fleet/Austin/Contr/IBM]

Comment by Jim Van Fleet [ 13/Nov/17 ]

Sorry, I used 3.6.0-rc3 – what I have seen is similar behavior in localhost_27017_20171113.

YCSB workloada – 50% read, 50% write.

FYI – throughput for rc3 is 163173; for 3.4 it is 300110.

perf top output – rc3

  23.67%  [kernel]                 [k] snooze_loop
  16.33%  mongod                   [.] __wt_update_obsolete_check
   7.88%  [kernel]                 [k] _raw_spin_lock
   1.13%  mongod                   [.] __wt_row_search
   0.72%  mongod                   [.] __rec_txn_read
   0.63%  mongod                   [.] __wt_row_modify
   0.56%  mongod                   [.] tc_deletearray_nothrow
   0.55%  [kernel]                 [k] find_busiest_group
   0.54%  mongod                   [.] __wt_txn_get_snapshot

perf top output – 3.4

  20.27%  [kernel]                 [k] snooze_loop
   1.71%  [kernel]                 [k] _raw_spin_lock
   1.58%  mongod                   [.] snappy::internal::CompressFragment
   1.57%  mongod                   [.] __wt_row_search
   1.07%  libc-2.24.so             [.] __memcpy_power7
   0.94%  mongod                   [.] tc_deletearray_nothrow
   0.81%  mongod                   [.] operator new[]
   0.71%  mongod                   [.] __wt_txn_get_snapshot
   0.59%  libpthread-2.24.so       [.] pthread_mutex_lock
   0.58%  libpthread-2.24.so       [.] pthread_mutex_unlock
   0.49%  [kernel]                 [k] _raw_spin_lock_irqsave
   0.48%  mongod                   [.] mongo::BSONElement::size

Comment by Alexander Gorrod [ 13/Nov/17 ]

jvf, thanks for trying out the release candidate and sending feedback.

Could you please:

  • Give us details of the YCSB workload you were running
  • Give us details of the MongoDB configuration and setup you are using
  • Upload the content of the diagnostic.data subdirectories and mongod log files from a run with 3.4 and 3.6-rc2