[SERVER-70012] Getting 'Sort exceeded memory limit' error when trying to reshard a collection Created: 27/Sep/22  Updated: 08/Nov/22  Resolved: 17/Oct/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 5.0.12, 5.0.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Alexey Zarubin Assignee: Yuan Fang
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-68139 Resharding command fails if the proje... Closed
Operating System: ALL
Steps To Reproduce:

Unfortunately, the only way to reproduce the problem is:

  1. Generate a collection with 2.4B documents sharded over 10 shards
  2. Try resharding the collection using {_id: 'hashed'} (a rough setup sketch follows below)
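
A minimal mongosh sketch of this setup, for illustration only: the database/collection names come from the report, while the original (pre-reshard) shard key and the loading step are assumptions, since the real data set was produced by the application.

// Hypothetical setup sketch; the original shard key below is an assumption
sh.enableSharding('database')
sh.shardCollection('database.messages', { channelId: 1 })  // assumed original shard key
// ... bulk-load ~2.4B small (~220 B) documents so they spread across 10 shards ...
sh.reshardCollection('database.messages', { _id: 'hashed' })  // triggers the error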
Participants:

 Description   

When trying to reshard a collection, we get an error message saying that the sort memory limit was exceeded.

The resharding command is run as follows:

[direct: mongos] admin> sh.reshardCollection('database.messages', {_id: 'hashed'})
MongoServerError: Sort exceeded memory limit of 104857600 bytes, but did not opt in to external sorting.

The collection contains approx. 2.4B documents split across 10 shards.

Shard distribution looks like this:

[direct: mongos] database> db.messages.getShardDistribution()
Shard rs06 at rs06/mongo-pm-repl06-1:27018,mongo-pm06:27018
{
  data: '78.1GiB',
  docs: 360057505,
  chunks: 5669,
  'estimated data per chunk': '14.1MiB',
  'estimated docs per chunk': 63513
}
---
Shard shard0001 at rs02/DS5033:27018,mongo-pm-repl02-1:27018
{
  data: '49.25GiB',
  docs: 245035457,
  chunks: 5670,
  'estimated data per chunk': '8.89MiB',
  'estimated docs per chunk': 43216
}
---
Shard shard0000 at rs01/DS5032:27018,mongo-pm-repl01-1:27018
{
  data: '43.17GiB',
  docs: 217760699,
  chunks: 5669,
  'estimated data per chunk': '7.79MiB',
  'estimated docs per chunk': 38412
}
---
Shard rs09 at rs09/mongo-pm-repl09-1:27018,mongo-pm09:27018
{
  data: '47.01GiB',
  docs: 220116142,
  chunks: 5669,
  'estimated data per chunk': '8.49MiB',
  'estimated docs per chunk': 38828
}
---
Shard rs03 at rs03/mongo-pm-repl03-1:27018,mongo-pm03:27018
{
  data: '37.81GiB',
  docs: 190704378,
  chunks: 5670,
  'estimated data per chunk': '6.82MiB',
  'estimated docs per chunk': 33633
}
---
Shard rs05 at rs05/DS5085:27018,mongo-pm-repl05-1:27018
{
  data: '36.7GiB',
  docs: 185097348,
  chunks: 5669,
  'estimated data per chunk': '6.63MiB',
  'estimated docs per chunk': 32650
}
---
Shard rs07 at rs07/mongo-pm-repl07-1:27018,mongo-pm07:27018
{
  data: '88.01GiB',
  docs: 413919369,
  chunks: 5669,
  'estimated data per chunk': '15.89MiB',
  'estimated docs per chunk': 73014
}
---
Shard rs04 at rs04/mongo-pm-repl04-1:27018,mongo-pm04:27018
{
  data: '38.33GiB',
  docs: 193215609,
  chunks: 5677,
  'estimated data per chunk': '6.91MiB',
  'estimated docs per chunk': 34034
}
---
Shard rs08 at rs08/mongo-pm-repl08-1:27018,mongo-pm08:27018
{
  data: '45.07GiB',
  docs: 207666324,
  chunks: 5668,
  'estimated data per chunk': '8.14MiB',
  'estimated docs per chunk': 36638
}
---
Shard rs10 at rs10/mongo-pm-repl10-1:27018,mongo-pm10:27018
{
  data: '37.04GiB',
  docs: 178733333,
  chunks: 2332,
  'estimated data per chunk': '16.26MiB',
  'estimated docs per chunk': 76643
}
---
Totals
{
  data: '7.81030509569951e+100GiB',
  docs: 2412306164,
  chunks: 53362,
  'Shard rs06': [
    '0 % data',
    '14.92 % docs in cluster',
    '232B avg obj size on shard'
  ],
  'Shard shard0001': [
    '0 % data',
    '10.15 % docs in cluster',
    '215B avg obj size on shard'
  ],
  'Shard shard0000': [
    '0 % data',
    '9.02 % docs in cluster',
    '212B avg obj size on shard'
  ],
  'Shard rs09': [
    '0 % data',
    '9.12 % docs in cluster',
    '229B avg obj size on shard'
  ],
  'Shard rs03': [ '0 % data', '7.9 % docs in cluster', '212B avg obj size on shard' ],
  'Shard rs05': [
    '0 % data',
    '7.67 % docs in cluster',
    '212B avg obj size on shard'
  ],
  'Shard rs07': [
    '0 % data',
    '17.15 % docs in cluster',
    '228B avg obj size on shard'
  ],
  'Shard rs04': [ '0 % data', '8 % docs in cluster', '213B avg obj size on shard' ],
  'Shard rs08': [ '0 % data', '8.6 % docs in cluster', '233B avg obj size on shard' ],
  'Shard rs10': [ '0 % data', '7.4 % docs in cluster', '222B avg obj size on shard' ]
}

The only explanation I see for this behavior is that the reshardCollection command runs some preliminary queries/aggregations that involve sorting (for example, to count the documents to copy), and during this phase the sort memory limit gets exceeded because of the large number of documents or chunks in the collection.

Those sort operations probably need to be run with the allowDiskUse option set to true.
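
For reference, a user-level aggregation opts in to external sorting like this; it is only an illustration of the option in question, not the server's internal query:

// Illustration only: allow the blocking sort to spill to disk instead of
// failing once it exceeds the 100 MB in-memory limit
db.messages.aggregate(
  [ { $sort: { _id: 1 } } ],
  { allowDiskUse: true }
)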



 Comments   
Comment by Alexey Zarubin [ 08/Nov/22 ]

Hi @Yuan Fang,

The problem still occurs in 5.0.13. However, after studying the original issue (SERVER-68139), we managed to work around it by adding a numInitialChunks argument to the adminCommand (numInitialChunks: 1000 in our case).
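
For anyone hitting the same error, the workaround is roughly the following; the namespace and key are the ones from this report, and numInitialChunks: 1000 is simply the value that worked for our cluster:

// Workaround sketch: issue reshardCollection as a raw admin command
// so that numInitialChunks can be supplied
db.adminCommand({
  reshardCollection: 'database.messages',
  key: { _id: 'hashed' },
  numInitialChunks: 1000
})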

Comment by Yuan Fang [ 17/Oct/22 ]

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Comment by Yuan Fang [ 30/Sep/22 ]

alexey.zarubin@gmail.com, sounds good. FYI, MongoDB 5.0.13 has been released! I'm going to leave this SERVER ticket open just in case the issue isn't solved by upgrading to it.

Comment by Alexey Zarubin [ 29/Sep/22 ]

Greetings,

Thanks for the information, and sorry for the duplicate - my bad, I could not find the referenced issue. Since testing this is quite complex for us, we will wait for the 5.0.13 release, thanks!

Comment by Yuan Fang [ 28/Sep/22 ]

Hi alexey.zarubin@gmail.com ,

Thank you for your report. The issue looks very similar to SERVER-68139 and has been fixed by its backport, which will be in the upcoming MongoDB 5.0.13 release. MongoDB 5.0.13-rc0 (not for production) has been released for testing and can be downloaded here. Would you like to test with 5.0.13-rc0 and see if it solves the issue?

Regards,

Yuan
