[SERVER-3569] Find operation does not return all documents that are previously inserted into a sharded environment while chunks are moved. Created: 10/Aug/11  Updated: 10/Dec/14  Resolved: 28/May/13

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 1.8.2, 1.9.1, 2.0.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Michael Wagner Assignee: Randolph Tan
Resolution: Duplicate Votes: 5
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 10.10 x86_64


Attachments: Zip Archive mongo-sharding-test.zip    
Issue Links:
Duplicate
is duplicated by SERVER-8741 mrShardedOutput.js failing on multipl... Closed
Operating System: ALL
Participants:

 Description   

To test MongoDB's sharding capabilities, we continuously inserted data into a two-shard environment. Each shard consists of one replica set (1 Master, 1 Secondary and 1 Arbiter). At a given point we perform find operations to verify that all data is returned. The problem is that the result sometimes contains gaps, e.g. it contains documents 5200 and 5202 but not document 5201. We are using the MongoDB Java driver 2.6.2 but have also seen this behavior with the PHP driver.

These gaps occur regardless of the write concern used.

A Java Maven project that reproduces the problem is attached to this issue:

  • Follow the steps in the README.txt to run the test

Details:

  • The test application provides 3 classes with a main method located in src\main\java\com\seitenbau\testing\mongo\runner: Inserter, VerifierSharding, VerifierFinding
  • The Inserter inserts documents that contain a field storing a counter (the counter is incremented each time a document is inserted). The script generates the file "counter-state" which stores the current value of the counter.
  • The VerifierSharding logs some information on sharding and counting (verifier-sharding-info.log).
  • The VerifierFinding performs a find operation to find all documents that are saved in the target collection at the moment.
    The result set is sorted and then logged in a file with the format <timestamp>-node-data. Additionally the script generates the log file "verifier-finding-info.log".
  • To configure the application, check the "config.properties" file.
  • You have to stop the scripts manually because they do their work inside a while loop.
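
The gap check described above (sort the counters returned by the find, then look for missing values such as 5201 between 5200 and 5202) can be sketched roughly as follows. This is a minimal, self-contained illustration and not code from the attached project: the class name GapFinder and the method findGaps are hypothetical, and the counters are assumed to be plain long values already read from the result set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of mongo-sharding-test.zip.
public class GapFinder {

    /**
     * Returns every counter value that is missing between the smallest
     * and the largest value in the given list. Duplicates are tolerated.
     */
    public static List<Long> findGaps(List<Long> found) {
        // Work on a sorted copy so the caller's list is left untouched.
        List<Long> sorted = new ArrayList<>(found);
        Collections.sort(sorted);

        List<Long> missing = new ArrayList<>();
        for (int i = 1; i < sorted.size(); i++) {
            long prev = sorted.get(i - 1);
            long cur = sorted.get(i);
            // Any value strictly between two neighbours was not returned.
            for (long m = prev + 1; m < cur; m++) {
                missing.add(m);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Counters as they might come back from a find during a chunk move.
        List<Long> found = Arrays.asList(5200L, 5202L, 5203L);
        System.out.println(findGaps(found)); // prints [5201]
    }
}
```

A check like this only detects gaps inside the returned range. To also detect documents missing at the upper end, the largest returned counter would have to be compared with the value stored in the "counter-state" file.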


 Comments   
Comment by Scott Hernandez (Inactive) [ 22/Sep/11 ]

The issue is the chunk distribution (based on the shard key) and not the number of shards. If there are few chunks, like the single initial chunk you get when you shard the collection before there is much data, the balancer will have to do lots of migrations until the chunks are evenly distributed among all shards. Even then, if the shard key isn't well distributed there can be hot spots, and chunks will require more splits, which will cause the balancer to move them around and cause more of these issues. With a loaded and well-distributed system this will be much, much less likely over time.

You could manage this a bit by only doing balancing in maintenance windows, which will make this behavior more predictable.

Comment by Michael Wagner [ 22/Sep/11 ]

We could reproduce this issue with MongoDB version 2.0.0. To create the test environment we stopped all mongo processes, replaced the old binaries, cleaned up all data and finally started all mongo processes again.

Comment by Michael Wagner [ 22/Sep/11 ]

The issue also appears if both shards are already activated at the beginning of our test.

Comment by Scott Hernandez (Inactive) [ 22/Sep/11 ]

That pretty much makes the issue one of chunk migration with the queued write-backs happening at the end of the migration.

With a better balanced system, or a more stable distribution of chunks, this should be less of an issue than in a simple test starting from an empty data set.

Comment by Michael Wagner [ 22/Sep/11 ]

The test environment is the same as the one used to reproduce SERVER-3568.

Comment by Michael Wagner [ 22/Sep/11 ]

When balancing is turned off, the described issue does not appear.

Comment by Scott Hernandez (Inactive) [ 22/Sep/11 ]

If balancing is turned off before running the tests, does the issue still appear?
