MongoDB Database Tools / TOOLS-1310

Mongodump Cursor Timeout on Sharded Clusters with --gzip option

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 3.2.6
    • Component/s: mongodump
    • Labels: None
    • Environment:
      AWS EC2 Environment
      Ubuntu 12.04 LTS
      6 x i2.4xlarge instances, in 2 groups of 3
      MongoDB 3.0.12
      18 Shards, 9 shards per group
      1 Primary, 1 Secondary, 1 Hidden Secondary per Group
      MongoDump host running local MongoS
    • Platforms 2016-08-26
    • v3.2

      NOTE: I am aware of TOOLS-902 and that the cause was not found. The workaround of increasing the cursor timeout globally across every MongoD only masks the underlying issue.
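
      For context, the global workaround amounts to raising the idle-cursor timeout on every mongod in the cluster. A rough sketch (the timeout value and host are placeholders; cursorTimeoutMillis defaults to 600000 ms, the 10-minute window discussed below):

          # At startup, on each shard mongod:
          mongod --setParameter cursorTimeoutMillis=3600000 ...

          # Or at runtime, against each shard:
          mongo shard-host:27018 --eval 'db.adminCommand({setParameter: 1, cursorTimeoutMillis: 3600000})'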

      When using the --gzip option with Mongodump, which compresses each collection as it pulls them down from the Cluster, it is possible to encounter "getMore: cursor didn't exist on server" failures during the dump process against sharded collections.
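
      For illustration, an invocation of the failing shape (host, database, and output path are placeholders):

          # Dump a sharded database through the local mongos, compressing as it goes:
          mongodump --host localhost:27017 --db mydb --gzip --out /backup/mydb

          # Against large sharded collections, this can abort partway through with the
          # "getMore: cursor didn't exist on server" failure described above.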

      There are a number of moving parts involved in recreating the failure (some of which are still under investigation). However, one fact is present in every failure: the collection is sharded and distributed across all 18 shards.

      Initially, we suspected an unbalanced shard distribution, as our first failures were against collections that had uneven document distribution (caused by wide size variation of the documents). However, a more recent failure was against a collection which was very evenly balanced, and contained ~150 million documents.

      I believe the core issue is that MongoS opens a cursor against each shard for a given collection, and then depends on Mongodump iterating through the collection fast enough that every cursor on every shard is advanced at least once every 10 minutes (the default idle-cursor timeout). If Mongodump doesn't iterate through the collection fast enough (i.e. gzip compression is slowing it down, or host CPU is consumed by other processes, or one shard holds a large number of the documents vs other shards), the cursors opened on a given shard may be closed due to the timeout.
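
      One way to check this hypothesis (a diagnostic sketch; the shard host/port is a placeholder): each mongod counts the cursors it has timed out, so a counter that climbs on a shard while the dump is running would support the theory.

          # Run against each shard's primary during the dump and watch
          # metrics.cursor.timedOut for increases:
          mongo shard-host:27018 --eval 'printjson(db.serverStatus().metrics.cursor)'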

      This would also mean that MongoDump with --gzip could be very sensitive to the performance of the host it runs on. It would likewise suggest that the likelihood of encountering this issue increases with the number of shards in the cluster.

      Additional notes for consideration:

      • When the --gzip option is excluded, the mongodump works fine (but takes 3-5x longer).
      • When the --gzip option is used AND --numParallelCollections is set to 1, the mongodump succeeds (but again, takes 3-5x longer). Both passing invocations are sketched below.
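
      Sketches of both passing invocations (hosts and paths are placeholders):

          # Without compression:
          mongodump --host localhost:27017 --db mydb --out /backup/mydb

          # With compression, but dumping one collection at a time:
          mongodump --host localhost:27017 --db mydb --gzip --numParallelCollections 1 --out /backup/mydb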

      It makes sense that, in TOOLS-902, increasing the global cursor timeout helped resolve the issue for one customer. However, I don't believe this is a genuine fix for the problem.

      Suggestions:

      • Setting the cursor batch size to a smaller number (e.g. the default divided by the number of shards)? Or making it an option for the end user to specify when they hit this issue? (See the sketch after this list.)
      • Encouraging the SERVER team to implement per-cursor timeout values to allow a custom longer window (vs using the global method)?
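
      To illustrate the batch-size suggestion (a sketch only; mongodump itself exposes no batch-size option, and the host/collection names are placeholders): a smaller batch means the client issues getMore commands more often, touching the per-shard cursors more frequently.

          # In the shell, the analogous knob is cursor.batchSize(); each
          # exhausted batch triggers a getMore back through mongos:
          mongo mongos-host:27017/mydb --eval '
              var c = db.mycoll.find().batchSize(100);
              while (c.hasNext()) { c.next(); }
          '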

            Assignee:
            gabriel.russell@mongodb.com Gabriel Russell (Inactive)
            Reporter:
            dave.muysson@360pi.com Dave Muysson
            Votes:
            0
            Watchers:
            1
