Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 4.4.9
Component/s: None
Labels: None
Assigned Teams: Server Triage
Operating System: ALL
Confidence Status: None
Work Order: 3
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name: None
Goal Link: None

We have a sharded cluster with 2 shards, Shard A and Shard B, each configured as a primary plus 2 replicas.
Shard A and Shard B are in two different zones, so that Shard A holds recent data up to a cutoff and Shard B holds historical data.
Shard A runs on a machine with an SSD; Shard B has a normal disk, since it is expected to see lighter usage.

Initially there was only Shard A, which grew to almost 7GB and around 600k chunks at the standard maximum chunk size.
We then attached Shard B, hoping to move all the historical data relatively quickly.

Since Shard B was attached, the balancer has been moving chunks from A to B. The migration speed was poor from the start, and it has become much worse over time, as the attached charts show.

At this speed it would probably take years to move the amount of data that we need to move.
Note that we have also tried the following:
1. Completely stopping traffic to the cluster. This improved things a bit but did not make a huge difference.
2. Merging chunks to increase their size. This has not helped. The last few very bad data points in the charts come from a series of chunks currently being moved that average 200 MB each. Once those chunks are done, we hope things will get slightly better.
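
As a rough sketch of the arithmetic behind these numbers: at a constant rate, migration time is simply size divided by throughput. The snippet below assumes the chunk size and transfer rate mentioned in this ticket mean 200 MB and 200 kB/s (decimal units); the 1 TB total is a hypothetical placeholder, not a measurement from this cluster, and `transfer_seconds` is an illustrative helper name.

```python
def transfer_seconds(size_bytes: float, rate_bytes_per_s: float) -> float:
    """Seconds needed to move size_bytes at a constant rate."""
    if rate_bytes_per_s <= 0:
        raise ValueError("rate must be positive")
    return size_bytes / rate_bytes_per_s


KB = 1_000
MB = 1_000_000
TB = 1_000_000_000_000

# Assumption: one chunk of about 200 MB moving at about 200 kB/s.
chunk_seconds = transfer_seconds(200 * MB, 200 * KB)

# Hypothetical total data size, used only to show the scale of the problem.
total_days = transfer_seconds(1 * TB, 200 * KB) / 86_400

print(f"one 200 MB chunk: {chunk_seconds / 60:.1f} min")
print(f"1 TB in total:    {total_days:.1f} days")
```

Under these assumptions every terabyte adds roughly two months of migration time, which is why the rate itself, rather than the chunk size, looks like the thing to investigate.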

A transfer rate of about 200 kB/s seems extremely low, so we would like to know what options we have.
Have we hit some kind of intrinsic bottleneck? How can we debug this issue?
Any help would be appreciated.
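
One possible starting point for debugging is the `config.changelog` collection, where migrations are recorded. The sketch below assumes that `moveChunk.from` entries carry per-step durations in milliseconds under keys such as `step 1 of 6` inside their `details` sub-document; that layout should be verified against the actual 4.4.9 cluster before relying on it. The helper names and the sample document are invented for illustration.

```python
from typing import Any, Mapping, Optional, Tuple


def step_timings_ms(details: Mapping[str, Any]) -> dict:
    """Collect the 'step N of M' -> milliseconds pairs from a details document."""
    return {
        key: float(value)
        for key, value in details.items()
        if key.startswith("step ") and isinstance(value, (int, float))
    }


def slowest_step(details: Mapping[str, Any]) -> Optional[Tuple[str, float]]:
    """Return (step name, milliseconds) for the longest step, or None if absent."""
    timings = step_timings_ms(details)
    if not timings:
        return None
    name = max(timings, key=timings.get)
    return name, timings[name]


# Synthetic sample shaped like the assumed layout; not real cluster output.
sample_details = {
    "step 1 of 6": 0,
    "step 2 of 6": 12,
    "step 3 of 6": 950_000,
    "step 4 of 6": 30,
    "from": "shardA",
    "to": "shardB",
    "note": "success",
}

print(slowest_step(sample_details))

# Against a live cluster this could be fed from pymongo (host is a placeholder):
# from pymongo import MongoClient
# client = MongoClient("mongodb://<mongos-host>:27017")
# query = {"what": "moveChunk.from"}
# for doc in client.config.changelog.find(query).sort("time", -1).limit(20):
#     print(doc["time"], slowest_step(doc["details"]))
```

If one step dominates consistently, that narrows the search: for example, a long cloning step would point at disk or network throughput, whereas a long final step would point at waiting on replication or cleanup.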


Attachments:
Migrated bytes per second.png (26 kB, Oct 30 2021 05:12:39 AM UTC)
Migrated bytes per second-1.png (26 kB, Oct 30 2021 05:14:59 AM UTC)
Migrated docs per second.png (29 kB, Oct 30 2021 05:14:34 AM UTC)

Assignee: [HELP ONLY] Backlog - Triage Team
Reporter: Daniele Tessaro
Participants: [HELP ONLY] Backlog - Triage Team, Daniele Tessaro, Edwin Zhou
Votes: 0
Watchers: 1
Created: Oct 30 2021 05:28:13 AM UTC
Updated: Dec 06 2022 12:49:31 AM UTC
Resolved: Nov 01 2021 07:50:45 PM UTC
