[SERVER-20388] Running MongoDB 3.0 in a sharded system (storage engine: WiredTiger), the config servers always become inconsistent Created: 14/Sep/15  Updated: 09/Dec/15  Resolved: 16/Nov/15

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.0.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: qianyiyong Assignee: Sam Kleinman (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OS: SLES 11 SP3, Linux version 3.0.101-0.40-default (geeko@buildhost) (gcc version 4.3.4 [gcc-4_3-branch revision 152973] (SUSE Linux) ) #1 SMP Thu Sep 18 13:09:38 UTC 2014 (44b8c95)
Memory: 96GB
MongoDB: mongodb-linux-x86_64-suse11-3.0.6.tgz
Storage Engine: WiredTiger


Operating System: ALL
Participants:

 Description   

I used three Linux servers to create a sharded cluster with 3 shard servers, 3 config servers, and 3 mongos routers, then used YCSB to import 10,000,000 records and ran a mixed YCSB read/write workload (50%/50%) against it. The config servers sometimes stop responding and YCSB shows 0 ops/sec.
I tested with MongoDB 3.0.4 and 3.0.6; both versions have this problem.

How to create the shard system:

  1. create folder

    mkdir -p /opt/qyy/data/mongodb/mongos/log
    mkdir -p /opt/qyy/data/mongodb/config/data
    mkdir -p /opt/qyy/data/mongodb/config/log
    mkdir -p /opt/qyy/data/mongodb/shard1/data
    mkdir -p /opt/qyy/data/mongodb/shard1/log
    mkdir -p /opt/qyy/data/mongodb/shard2/data
    mkdir -p /opt/qyy/data/mongodb/shard2/log
    mkdir -p /opt/qyy/data/mongodb/shard3/data
    mkdir -p /opt/qyy/data/mongodb/shard3/log
    

  2. start config server on 3 linux servers

    numactl --interleave=all ./mongod --configsvr --dbpath /opt/qyy/data/mongodb/config/data --port 27019 --logpath /opt/qyy/data/mongodb/config/log/config.log --fork --smallfiles --maxConns=65535 --storageEngine wiredTiger
    

  3. start router server on 3 linux servers

    numactl --interleave=all ./mongos  --configdb 10.68.160.95:27019,10.68.160.96:27019,10.68.160.100:27019  --port  27017  --logpath  /opt/qyy/data/mongodb/mongos/log/mongos.log  --fork
    

  4. start shard server on 3 linux servers

    numactl --interleave=all ./mongod --shardsvr --replSet shard1 --port 27020 --dbpath /opt/qyy/data/mongodb/shard1/data  --logpath /opt/qyy/data/mongodb/shard1/log/shard1.log --fork --oplogSize 10240 --storageEngine wiredTiger 
    numactl --interleave=all ./mongod --shardsvr --replSet shard2 --port 27021 --dbpath /opt/qyy/data/mongodb/shard2/data  --logpath /opt/qyy/data/mongodb/shard2/log/shard2.log --fork --oplogSize 10240 --storageEngine wiredTiger
    numactl --interleave=all ./mongod --shardsvr --replSet shard3 --port 27022 --dbpath /opt/qyy/data/mongodb/shard3/data  --logpath /opt/qyy/data/mongodb/shard3/log/shard3.log --fork --oplogSize 10240 --storageEngine wiredTiger
    

  5. init Replica set

    ./mongo 10.68.160.95:27020
    use admin
    config = { _id:"shard1", members:[{_id:0,host:"10.68.160.95:27020",priority:2},{_id:1,host:"10.68.160.96:27020",priority:1},{_id:2,host:"10.68.160.100:27020",priority:0.7}]}
    rs.initiate(config);
    exit
     
    ./mongo  10.68.160.96:27021
    use admin
    config = { _id:"shard2", members:[{_id:0,host:"10.68.160.95:27021",priority:0.7},{_id:1,host:"10.68.160.96:27021",priority:2},{_id:2,host:"10.68.160.100:27021",priority:1}]}
    rs.initiate(config);
    exit
     
    ./mongo    10.68.160.100:27022
    use admin
    config = { _id:"shard3", members:[{_id:0,host:"10.68.160.95:27022",priority:1},{_id:1,host:"10.68.160.96:27022",priority:0.7},{_id:2,host:"10.68.160.100:27022",priority:2}]}
    rs.initiate(config);
    exit
    

  6. add replica set to shard

    ./mongo  10.68.160.95:27017
    use  admin
    db.runCommand( { addshard:"shard1/10.68.160.95:27020,10.68.160.96:27020,10.68.160.100:27020"});
    db.runCommand( { addshard : "shard2/10.68.160.95:27021,10.68.160.96:27021,10.68.160.100:27021"});
    db.runCommand( { addshard : "shard3/10.68.160.95:27022,10.68.160.96:27022,10.68.160.100:27022"});
    

  7. shard collection

    db.runCommand( { enablesharding :"mydb"});
    db.runCommand( { shardcollection : "mydb.usertable", key:{_id: 'hashed'} } );
    

  8. use YCSB to import records into mydb
  9. use YCSB to test reads and writes (50% read / 50% write, or 80% read / 20% write) through the mongos routers, as sketched below
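
A minimal sketch of the YCSB invocations, run from the YCSB install directory (the reporter's exact commands and the full workloada property file appear in the comments below; the 32 client threads are taken from those comments):

    # load 10,000,000 records, then run the mixed read/update workload with 32 threads
    ./bin/ycsb load mongodb -P workloads/workloada -s
    ./bin/ycsb run mongodb -P workloads/workloada -threads 32 -s > 50_50.txt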


 Comments   
Comment by qianyiyong [ 17/Nov/15 ]

Thanks.

Comment by Ramon Fernandez Marina [ 16/Nov/15 ]

qianyiyong@huawei.com, I haven't been able to get the config servers out of sync using the reproduction steps you provided, so I'm going to resolve this ticket.

I am able to see periods where there are 0 ops/sec, which may not be unexpected under certain circumstances. For example, if you're using the EXT3/4 filesystem these periods may correspond to WiredTiger checkpoints – see DOCS-5506 for more information.
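
One rough way to check whether the stalls line up with checkpoints (a sketch only; the counter names below are assumed to be present in 3.0's serverStatus output) is to poll a shard's mongod from the mongo shell while YCSB reports 0 ops/sec:

    // sketch: print WiredTiger checkpoint counters from serverStatus()
    var t = db.serverStatus().wiredTiger.transaction;
    print("checkpoints so far:      " + t["transaction checkpoints"]);
    print("checkpoint running now:  " + t["transaction checkpoint currently running"]);
    print("most recent checkpoint:  " + t["transaction checkpoint most recent time (msecs)"] + " ms");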

Regards,
Ramón

Comment by qianyiyong [ 16/Nov/15 ]

I tested 3.0.7, whose bug list says these problems were solved, but I still had this problem.

Comment by qianyiyong [ 16/Nov/15 ]

The real problem I care about is that during the test there are periods of 0 current ops/sec.

Comment by Ramon Fernandez Marina [ 13/Nov/15 ]

qianyiyong@huawei.com, we still haven't been able to reproduce this behavior, but we're still trying. Is this still an issue for you?

Comment by qianyiyong [ 15/Sep/15 ]

ycsb version 0.1.4

Comment by qianyiyong [ 15/Sep/15 ]

The YCSB workload file workloada's content is like this:

# Yahoo! Cloud System Benchmark
# Workload A: Update heavy workload
#   Application example: Session store recording recent actions
#   Read/update ratio: 50/50
#   Default data size: 1 KB records (10 fields, 100 bytes each, plus key)
#   Request distribution: zipfian

mongodb.url=mongodb://localhost:27017/mydb
mongodb.database=mydb
mongodb.writeConcern=normal
recordcount=10000000
operationcount=8000000

workload=com.yahoo.ycsb.workloads.CoreWorkload

readallfields=true

readproportion=0.5
updateproportion=0.5
scanproportion=0
insertproportion=0

requestdistribution=uniform

First, run this shell command from the YCSB directory to insert 10,000,000 records:
#./bin/ycsb load mongodb -P workloads/workloada -s

When that has finished, test the read and write performance of MongoDB with WiredTiger by running 32 client threads (workloads is a directory; workloada is the configuration file whose content is shown above):
#./bin/ycsb run mongodb -P workloads/workloada -threads 32 -s > 50_50.txt

Comment by Sam Kleinman (Inactive) [ 14/Sep/15 ]

qianyiyong@huawei.com, thanks for this report. We will attempt to reproduce this on our end but it might be helpful to have a copy of the ycsb invocation that you used to produce this issue. Can you provide the YCSB configuration and invocation that you used?

Regards,
sam

Comment by Michael Cahill (Inactive) [ 14/Sep/15 ]

I've moved this ticket to SERVER-20388 – that is the correct project for MongoDB issues.

Comment by qianyiyong [ 14/Sep/15 ]

Maybe this is the problem: when ops/sec drops to 0, the config servers always time out.
mongos router log:
2015-09-14T09:23:32.699+0800 W SHARDING [conn195] will not perform auto-split because config servers are inconsistent
2015-09-14T09:23:32.718+0800 W SHARDING [conn190] will not perform auto-split because config servers are inconsistent
2015-09-14T09:23:32.736+0800 W SHARDING [conn225] will not perform auto-split because config servers are inconsistent
2015-09-14T09:23:32.752+0800 W SHARDING [conn185] will not perform auto-split because config servers are inconsistent
2015-09-14T09:23:32.769+0800 W SHARDING [conn198] will not perform auto-split because config servers are inconsistent
....
2015-09-14T10:08:15.695+0800 W SHARDING [Balancer] Skipping balancing round because data inconsistency was detected amongst the config servers.
2015-09-14T10:08:15.740+0800 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 10.68.160.95:27020, reason: errno:111 Connection refused
2015-09-14T10:08:25.743+0800 W NETWORK  [ReplicaSetMonitorWatcher] Failed to connect to 10.68.160.95:27020, reason: errno:111 Connection refused
2015-09-14T10:08:25.997+0800 I SHARDING [Balancer] distributed lock 'balancer/controller-10-68-160-96:27017:1442051371:1804289383' unlocked.
2015-09-14T10:08:26.566+0800 I SHARDING [Balancer] distributed lock 'balancer/controller-10-68-160-96:27017:1442051371:1804289383' acquired, ts : 55f62c19c513d08f5ad3333a
2015-09-14T10:08:26.566+0800 W SHARDING [Balancer] Skipping balancing round because data inconsistency was detected amongst the config servers.
....
2015-09-14T10:14:44.059+0800 W SHARDING config servers 10.68.160.95:27019 and 10.68.160.96:27019 differ
2015-09-14T10:14:44.060+0800 W SHARDING config servers 10.68.160.95:27019 and 10.68.160.96:27019 differ
2015-09-14T10:14:44.060+0800 W SHARDING config servers 10.68.160.95:27019 and 10.68.160.96:27019 differ
2015-09-14T10:14:44.061+0800 W SHARDING config servers 10.68.160.95:27019 and 10.68.160.96:27019 differ
....
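
A rough way to see which config metadata actually differs when mongos prints these warnings would be to compare dbHash output for the config database on each config server (a sketch; the hosts are the three config servers from step 2):

    # sketch: print the 'config' database hashes on each SCCC config server;
    # differing md5/collection hashes correspond to the "config servers ... differ" warnings
    for h in 10.68.160.95:27019 10.68.160.96:27019 10.68.160.100:27019; do
        echo "== $h =="
        ./mongo --quiet $h/config --eval 'printjson(db.runCommand({dbHash: 1}))'
    done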

Comment by qianyiyong [ 14/Sep/15 ]

YCSB run results:

mongo connection created with mongodb://localhost:27017/mydb
mongo connection created with mongodb://localhost:27017/mydb
mongo connection created with mongodb://localhost:27017/mydb
mongo connection created with mongodb://localhost:27017/mydb
 10 sec: 146845 operations; 14659.58 current ops/sec; [UPDATE AverageLatency(us)=4747.28] [READ AverageLatency(us)=3100.76] 
 20 sec: 380205 operations; 23333.67 current ops/sec; [UPDATE AverageLatency(us)=3834.08] [READ AverageLatency(us)=1609.79] 
 30 sec: 633633 operations; 25342.8 current ops/sec; [UPDATE AverageLatency(us)=3607.15] [READ AverageLatency(us)=1461.35] 
 40 sec: 704089 operations; 7044.9 current ops/sec; [UPDATE AverageLatency(us)=15884.34] [READ AverageLatency(us)=1463.28] 
 50 sec: 922236 operations; 21814.7 current ops/sec; [UPDATE AverageLatency(us)=4636.76] [READ AverageLatency(us)=1389.38] 
 60 sec: 1157238 operations; 23497.85 current ops/sec; [UPDATE AverageLatency(us)=3116.63] [READ AverageLatency(us)=1379.11] 
 70 sec: 1157238 operations; 0 current ops/sec;   
 80 sec: 1157238 operations; 0 current ops/sec;   
 90 sec: 1157238 operations; 0 current ops/sec;   
 100 sec: 1157250 operations; 1.2 current ops/sec; [UPDATE AverageLatency(us)=7847005.17] [READ AverageLatency(us)=7810344.17] 
 110 sec: 1157250 operations; 0 current ops/sec;   
 120 sec: 1157250 operations; 0 current ops/sec;   
 130 sec: 1157250 operations; 0 current ops/sec;   
 140 sec: 1218963 operations; 6171.3 current ops/sec; [UPDATE AverageLatency(us)=159958.33] [READ AverageLatency(us)=1372.61] 
 150 sec: 1344541 operations; 12557.8 current ops/sec; [UPDATE AverageLatency(us)=3646.02] [READ AverageLatency(us)=1386.71] 
 160 sec: 1345138 operations; 59.69 current ops/sec; [UPDATE AverageLatency(us)=1982612.67] [READ AverageLatency(us)=330.68] 
 170 sec: 1364026 operations; 1888.8 current ops/sec; [UPDATE AverageLatency(us)=49501.3] [READ AverageLatency(us)=679.23] 
 180 sec: 1364026 operations; 0 current ops/sec;   
 190 sec: 1365371 operations; 134.5 current ops/sec; [UPDATE AverageLatency(us)=923235.23] [READ AverageLatency(us)=336362.98] 
 200 sec: 1365371 operations; 0 current ops/sec;   
 210 sec: 1544439 operations; 17906.8 current ops/sec; [UPDATE AverageLatency(us)=25266.54] [READ AverageLatency(us)=1623.46] 
 220 sec: 1759550 operations; 21508.95 current ops/sec; [UPDATE AverageLatency(us)=4383.39] [READ AverageLatency(us)=1562.55] 
 230 sec: 1996878 operations; 23732.8 current ops/sec; [UPDATE AverageLatency(us)=3824.85] [READ AverageLatency(us)=1557.5] 
 240 sec: 2204633 operations; 20773.42 current ops/sec; [UPDATE AverageLatency(us)=4612.64] [READ AverageLatency(us)=1551.19] 
 250 sec: 2421334 operations; 21670.1 current ops/sec; [UPDATE AverageLatency(us)=4314.12] [READ AverageLatency(us)=1577.61] 
 260 sec: 2635022 operations; 21368.8 current ops/sec; [UPDATE AverageLatency(us)=4246.97] [READ AverageLatency(us)=1530.38] 
 270 sec: 2843109 operations; 20806.62 current ops/sec; [UPDATE AverageLatency(us)=4144.63] [READ AverageLatency(us)=1412.55] 
 280 sec: 2843109 operations; 0 current ops/sec;   
 290 sec: 2843109 operations; 0 current ops/sec;   
 300 sec: 2843117 operations; 0.8 current ops/sec; [UPDATE AverageLatency(us)=34872596] [READ AverageLatency(us)=4981678.57] 
 310 sec: 2843117 operations; 0 current ops/sec;   
 320 sec: 2843140 operations; 2.3 current ops/sec; [UPDATE AverageLatency(us)=3775644.57] [READ AverageLatency(us)=328.22] 
 330 sec: 2843140 operations; 0 current ops/sec;   
 340 sec: 2843140 operations; 0 current ops/sec;   
 350 sec: 2843140 operations; 0 current ops/sec;   
 360 sec: 2843140 operations; 0 current ops/sec;   
 370 sec: 2843140 operations; 0 current ops/sec;   
 380 sec: 2843140 operations; 0 current ops/sec;   
 390 sec: 2954567 operations; 11142.7 current ops/sec; [UPDATE AverageLatency(us)=134201.87] [READ AverageLatency(us)=1387.26] 
 400 sec: 3096444 operations; 14187.7 current ops/sec; [UPDATE AverageLatency(us)=3384.65] [READ AverageLatency(us)=1187.05] 
 410 sec: 3096444 operations; 0 current ops/sec;   
 420 sec: 3096514 operations; 7 current ops/sec; [UPDATE AverageLatency(us)=6168657.37] [READ AverageLatency(us)=2523704.16] 
 430 sec: 3276702 operations; 18018.8 current ops/sec; [UPDATE AverageLatency(us)=20454.62] [READ AverageLatency(us)=1555.96] 
 440 sec: 3504301 operations; 22757.62 current ops/sec; [UPDATE AverageLatency(us)=4043.21] [READ AverageLatency(us)=1575.67] 
 450 sec: 3721778 operations; 21747.7 current ops/sec; [UPDATE AverageLatency(us)=4314.27] [READ AverageLatency(us)=1559.69] 
 460 sec: 3940607 operations; 21882.9 current ops/sec; [UPDATE AverageLatency(us)=4265.45] [READ AverageLatency(us)=1570.39] 
 470 sec: 4159443 operations; 21881.41 current ops/sec; [UPDATE AverageLatency(us)=4294.3] [READ AverageLatency(us)=1549.09] 
 480 sec: 4340655 operations; 18121.2 current ops/sec; [UPDATE AverageLatency(us)=5445.95] [READ AverageLatency(us)=1615.16] 
 490 sec: 4566847 operations; 22619.2 current ops/sec; [UPDATE AverageLatency(us)=4095.7] [READ AverageLatency(us)=1457.86] 
 500 sec: 4793051 operations; 22618.14 current ops/sec; [UPDATE AverageLatency(us)=3767.56] [READ AverageLatency(us)=1379.35] 
 510 sec: 4809251 operations; 1620 current ops/sec; [UPDATE AverageLatency(us)=70471.63] [READ AverageLatency(us)=301.99] 
 520 sec: 4824522 operations; 1527.1 current ops/sec; [UPDATE AverageLatency(us)=74272.87] [READ AverageLatency(us)=289.31] 
 530 sec: 4844614 operations; 2009 current ops/sec; [UPDATE AverageLatency(us)=56400.28] [READ AverageLatency(us)=463.67] 
 540 sec: 4844619 operations; 0.5 current ops/sec; [UPDATE AverageLatency(us)=13392784.5] [READ AverageLatency(us)=338]
