Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Fixed
Priority: Critical - P2
Fix Version/s: 4.2.14, 4.4.6, WT10.0.1, 5.0.0-rc0
Affects Version/s: None
Component/s: None
Labels:
None

Sprint:
None
Story Points:
None
Case:

Backport Requested:

v4.4, v4.2

Random cursors can be quite slow on a multi-GB table with very large records.

A customer has experienced slow startup times in 4.2 that they didn't see in 4.0. Based on their logs mongod is spending a lot of time iterating a random cursor though the oplog — in one case it takes 15 minutes to perform 993 cursor->next() calls on a ~26GB oplog. The oplog had only 62590 records, so the average record size is 100s of KB.

A simple python script, which I'll attach, shows similar behavior on a large file. I populate a table with 60K 500 KB records, assigning them increasing keys to mimic the oplog which adds records with increasing timestamps. When that's done, I create a random cursor and call next() 1000 times. It takes almost 3 minutes on my virtual workstation.

Additional notes:

Oddly increasing the WT cache size from the default (100MB, I think) to 4GB made the cursor iteration increase from ~130 sec to ~170 sec.

This isn't nearly as slow on a larger machine. On ocelot-aws (16 cores, 64GB RAM) it only takes 20 sec.

I've been reproducing this on the develop branch and haven't tried running specifically against 4.2.

~~WT-5068~~ was a change to the random cursor implementation that went into 4.2. So it's possible this has something to do with that change. (UPDATE: See my comment, below. ~~WT-5068~~ definitely introduced this.)

There is a lot of variability in the different cursor->next() calls. Some take a couple milliseconds. Some take half a second. I guess it's not surprising that a random cursor behaves unpredictably.

- - Sort By Name
  - Sort By Date
  - Ascending
  - Descending
  - Thumbnails
  - List
  - Download All

00755353.tgz
8.09 MB
Apr 20 2021 01:08:25 AM UTC
Logs_758841.tgz
5.82 MB
Apr 19 2021 07:15:38 PM UTC
test_foo.py
1 kB
Mar 26 2021 12:27:21 AM UTC
test_foo2.py
3 kB
Mar 31 2021 04:31:34 PM UTC

related to

SERVER-55821 remove next_random_sample_size=1000 configuration in the oplog sampling code

Closed

Assignee:: Keith Bostic (Inactive)
Reporter:: Keith Smith
Votes:: 1 Vote for this issue
Watchers:: 33 Start watching this issue

Created:: Mar 25 2021 11:59:19 PM UTC
Updated:: Oct 29 2023 04:42:12 PM UTC
Resolved:: Apr 20 2021 07:30:42 PM UTC

Details

Description

Attachments

Attachments

Issue Links

Activity

People

Dates