[SERVER-21666] Small stall during concurrent insert/update Created: 25/Nov/15  Updated: 08/Feb/23  Resolved: 05/Feb/16

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.1, 3.0.2, 3.0.3, 3.0.4, 3.0.5, 3.0.6
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: 아나 하리 Assignee: David Hows
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File SmallStall.tar.gz     Text File mongo_config_table_stats.txt     PNG File slow-queries.png    
Operating System: ALL
Steps To Reproduce:

Run the attached Java test program

Participants:

 Description   

In some test cases, a non-periodic small stall is observed.

At first I thought the WiredTiger checkpoint caused this small stall, but it has nothing to do with checkpoints (after some testing and code evaluation, I found that the WiredTiger table lock could cause this stall).
This small stall produces a lot of slow queries (over 150ms).
During the stall period, __wt_spin_lock (pthread_mutex_lock) took 4~8 milliseconds or more.

__wt_schema_open_table :: elapsed 21 us
__wt_spin_lock :: elapsed:8394 us (pthread_mutex_lock :: session.open_cursor -> table creation(0))
__wt_schema_add_table :: elapsed 8436 us (need table lock)
__wt_schema_get_table :: elapsed 8437 us
__wt_curtable_open :: elapsed 8439 us
__wt_open_cursor :: elapsed 8456 us

My test case is really simple: just create 1,000 client threads that issue updates (upsert: true) against a total of about 400,000 documents.
The server processed about 6,000~7,000 updates/sec, and system load was under 1.0.

12-core CPU, 32GB memory, kernel 2.6.32-573.7.1.el6.centos.plus.x86_64.
The disk appears to have nothing to do with this issue (I've tested NVMe, SAS+RAID, and SATA).
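
For reference, a minimal Java sketch of the workload described above (not the attached SmallStall.tar.gz program itself; the host, database/collection names, field names, class name, and run duration here are assumptions, loosely based on the sample slow queries further down, and the driver calls assume the MongoDB Java driver 3.x API):

import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.UpdateOptions;
import com.mongodb.client.model.Updates;
import org.bson.Document;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

public class SmallStallRepro {
    private static final int THREADS = 1000;      // client threads, as described above
    private static final int DOCUMENTS = 400_000; // approximate total document count

    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        MongoCollection<Document> coll = client.getDatabase("DB").getCollection("COLL");

        long deadline = System.currentTimeMillis() + TimeUnit.MINUTES.toMillis(10);
        ExecutorService pool = Executors.newFixedThreadPool(THREADS);
        for (int t = 0; t < THREADS; t++) {
            pool.submit(() -> {
                while (System.currentTimeMillis() < deadline) {
                    int id = ThreadLocalRandom.current().nextInt(DOCUMENTS);
                    long start = System.nanoTime();
                    // upsert: true, matching the workload in the description
                    coll.updateOne(
                            Filters.eq("_id", id),
                            Updates.combine(
                                    Updates.set("id", id),
                                    Updates.currentDate("updated_at")),
                            new UpdateOptions().upsert(true));
                    long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                    if (elapsedMs > 100) {
                        System.out.println("slow update: " + elapsedMs + " ms");
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        client.close();
    }
}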

In the current implementation, the table lock is implemented with WT_SPINLOCK.
What do you think about changing it to WT_RWLOCK?
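
To illustrate the idea (this is only a minimal Java analogy of the proposal, not WiredTiger code; the class and method names are hypothetical): lookups of an already-open table could take a shared lock, so concurrent session.open_cursor calls would not serialize behind a single exclusive lock, and the exclusive lock would only be needed when a table actually has to be added to the schema list.

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch: guard the schema/table list with a read-write lock
// instead of an exclusive lock. Opening a cursor on a table that is already
// in the list only takes the shared (read) lock; the exclusive (write) lock
// is only taken while a table is added to the list.
public class TableRegistry {
    private final ReentrantReadWriteLock schemaLock = new ReentrantReadWriteLock();
    private final Map<String, Object> openTables = new HashMap<>();

    public Object getTable(String name) {
        schemaLock.readLock().lock();   // shared: the common case after warm-up
        try {
            Object table = openTables.get(name);
            if (table != null) {
                return table;
            }
        } finally {
            schemaLock.readLock().unlock();
        }

        schemaLock.writeLock().lock();  // exclusive: only while adding the table
        try {
            // re-check: another thread may have added the table in the meantime
            return openTables.computeIfAbsent(name, n -> openTable(n));
        } finally {
            schemaLock.writeLock().unlock();
        }
    }

    private Object openTable(String name) {
        // stand-in for the work done by __wt_schema_open_table / __wt_schema_add_table
        return new Object();
    }
}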



 Comments   
Comment by Ramon Fernandez Marina [ 05/Feb/16 ]

matt.lee, we haven't heard back from you for some time, so we're closing this ticket. If this is still an issue for you please provide the information requested above so we can investigate further.

Thanks,
Ramón.

Comment by David Hows [ 11/Jan/16 ]

Hi Matt,

Sorry for the delay in getting back to you over the holiday period.

Would you be able to run the following generic MongoDB diagnostics script? It collects a number of MongoDB-related pieces of information and system diagnostics. I'd like to confirm what filesystem the data storage is hosted on, but having all the other details available would be valuable.

To execute this script please run the following and email me the resulting output (please don't attach it to the ticket).

wget https://raw.githubusercontent.com/mongodb/support-tools/master/mdiag/mdiag.sh
sudo bash mdiag.sh SERVER-21666

Additionally, would you be able to run the following small script to gather system statistics during another run of your test, and then attach the resulting ss.log file to the ticket along with the server logfile?

mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:true}))); sleep(1000)}" >ss.log &

This should give me an insight as to what actions are being undertaken by the server when the delays are seen.

Thanks,
David

Comment by 아나 하리 [ 15/Dec/15 ]

Hi.

I attached the slow query graph.
You will see the spike of the red line (the slow query counter) in the graph.

And here are some sample slow queries (I changed some parts of the queries):

2015-12-15T17:17:22.510+0900 I COMMAND [conn15072] command DB.$cmd command: update { update: "COLL", ordered: true, updates: [ { q: { _id: 160228 }, u: { $set: { id: 140738, ... }, $currentDate: { updated_at: true } }, upsert: true } ] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:55 locks:{ Global: { acquireCount: { w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 3 } } } 135ms
2015-12-15T17:17:22.511+0900 I COMMAND [conn14547] command DB.$cmd command: update { update: "COLL", ordered: true, updates: [ { q: { _id: 183208 }, u: { $set: { id: 133677, ...}, $currentDate: { updated_at: true } }, upsert: true } ] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:55 locks:{ Global: { acquireCount: { w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 3 } } } 154ms
2015-12-15T17:17:22.510+0900 I COMMAND [conn15119] command DB.$cmd command: update { update: "COLL", ordered: true, updates: [ { q: { _id: 228814 }, u: { $set: { id: 4016, ... }, $currentDate: { updated_at: true } }, upsert: true } ] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:55 locks:{ Global: { acquireCount: { w: 3 } }, Database: { acquireCount: { w: 3 } }, Collection: { acquireCount: { w: 3 } } } 159ms

Comment by 아나 하리 [ 15/Dec/15 ]

Slow query graph

Comment by 아나 하리 [ 15/Dec/15 ]

Hi David Hows.

The slow query threshold was 100ms in my test MongoDB. Did you also set a 100ms threshold for slow query logging?
Regarding your comment above, it looks like some slow processing is shown for "2015-12-15 12:36:40 ~ 2015-12-15 12:36:41".
I think there would have been a lot of slow queries at that time (if the slow query threshold was 100ms).

By "small stall" I mean a 50~100ms-level processing stop, not a 1~5 second-level stop.
When this issue happens on our service, the system load increases to 20 (according to the Linux uptime command) and CPU usage increases for a short time.

I will attach mongostat output and a log file which includes the slow query logs.

Comment by David Hows [ 15/Dec/15 ]

matt.lee, I have run your workload for an hour, and only seen a few long operations within the MongoDB logfiles.

Would you be able to provide me with the exact details of how you categorise a "small stall"? What specific indicators do you use? What outputs are you using to see the stalls?

I'd like to ensure I am looking for the same issue that you have stated.

Comment by David Hows [ 15/Dec/15 ]

matt.lee, is the following indicative of the "small stalls" you are seeing?

2015-12-15 12:36:39            6561          6561             0
2015-12-15 12:36:40            2507          2507             0
2015-12-15 12:36:41            2328          2328             0
2015-12-15 12:36:42            6710          6710             0

Comment by Alexander Gorrod [ 15/Dec/15 ]

david.hows Could you get the reproducer for this running, and turn it into a js test script?

Comment by Keith Bostic (Inactive) [ 14/Dec/15 ]

matt.lee, I have not yet run your test program – I'm working on that, and will be in touch as soon as I have more information.

Comment by 아나 하리 [ 11/Dec/15 ]

Hi Keith.

No, it happens continuously during the test.
I don't know the detailed WiredTiger internals, but I saw a bunch of WiredTiger sessions being created internally during the test.
And at that time, WiredTiger's table lock contention was getting high.

Have you run the test program I attached?

Comment by Keith Bostic (Inactive) [ 10/Dec/15 ]

Hi, matt.lee, thanks for your report.

It looks to me like the stalls you're seeing are on the open-cursor path, that is, they happen when the collections are first opened. Is that the case? Do the stalls disappear once the collections have been opened?

Comment by 아나 하리 [ 25/Nov/15 ]

Mongo config & table stats
