[SERVER-40065] spike latency issue on shard cluster Created: 11/Mar/19  Updated: 22/Apr/19  Resolved: 22/Apr/19

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Adun Assignee: Eric Sedor
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 3.4_primary.config.diganostic.data.tgz     File 3.4_primary.shard1.diganostic.data.tgz     File 3.4_primary.shard2.diganostic.data.tgz     File 4.0_primary.config.diagnostic.data.tgz     File 4.0_primary.shard1.diagnostic.data.tgz     File 4.0_primary.shard2.diagnostic.data.tgz    
Issue Links:
Duplicate
duplicates SERVER-40707 Secondary couldn't signal OplogWaiter... Closed
Operating System: ALL
Participants:

 Description   

On MongoDB 3.4.19, sharded cluster with 2 shards.

Collection `foo` is sharded with the hashed key "UID":

> sh.enableSharding("foo")
> db.adminCommand({"shardCollection": "foo.foo", key: {"UID": "hashed"}})
> use foo
> db.foo.createIndex({"name": 1})

We wrote a test case with the mongo-cxx-driver to measure the latency:

// time a single upsert on one UID
auto nStart = std::chrono::high_resolution_clock::now();
oDB[DB_TABLE_NAME].update_one(document{} << "UID" << nUID << finalize, oValue.view(), opt);
auto nEnd = std::chrono::high_resolution_clock::now();
std::chrono::duration<double, std::milli> diff = nEnd - nStart;
printf("uid %d timediff %f\n", nUID, diff.count());

 

The output is as below:

uid 10030116 timediff 1.249126
uid 10000124 timediff 1.021864
uid 10050127 timediff 1.172118
uid 10020116 timediff 1.223791
uid 10040115 timediff 1.408828
uid 10070114 timediff 1.526046

 

What we did:

  1. The test case runs with 8 threads; each thread runs a loop of 10000 upserts against the collection (a rough sketch of the full loop follows this list).
  2. Each upsert operation targets a different UID.
  3. Except for the UID/_id, all documents are identical.
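
For reference, here is a minimal sketch of the full benchmark loop described above. It is an assumption-filled reconstruction, not the original test program: the mongos URI, the document payload, and the exact UID scheme (inferred from the output above) are all placeholders.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>
#include <vector>

#include <bsoncxx/builder/stream/document.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/options/update.hpp>
#include <mongocxx/pool.hpp>
#include <mongocxx/uri.hpp>

using bsoncxx::builder::stream::document;
using bsoncxx::builder::stream::finalize;
using bsoncxx::builder::stream::open_document;
using bsoncxx::builder::stream::close_document;

int main() {
    mongocxx::instance inst{};
    // Assumed mongos address; the real test points at the sharded cluster's mongos.
    mongocxx::pool pool{mongocxx::uri{"mongodb://mongos-host:27017"}};

    auto worker = [&pool](int nThread) {
        auto oClient = pool.acquire();
        auto oColl = (*oClient)["foo"]["foo"];
        mongocxx::options::update opt;
        opt.upsert(true);  // every operation is an upsert keyed on UID

        for (int i = 0; i < 10000; ++i) {
            // Different UID per operation; the per-thread range is inferred from the output.
            int nUID = 10000000 + nThread * 10000 + i;
            // Apart from UID/_id the documents are identical; this payload is a placeholder.
            auto oValue = document{} << "$set" << open_document
                                     << "name" << "fixed-value"
                                     << close_document << finalize;

            auto nStart = std::chrono::high_resolution_clock::now();
            oColl.update_one(document{} << "UID" << nUID << finalize, oValue.view(), opt);
            auto nEnd = std::chrono::high_resolution_clock::now();

            std::chrono::duration<double, std::milli> diff = nEnd - nStart;
            printf("uid %d timediff %f\n", nUID, diff.count());
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t) threads.emplace_back(worker, t);
    for (auto& th : threads) th.join();
    return 0;
}
```

Each thread acquires its own client from the pool, so the timed section covers only the round trip of a single update_one call.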

What we found:

1. The latency has many spikes, e.g. 4000+ ms for some update_one operations, but there is no slow-op log in the mongod server (we set slowms to 10 ms).
2. As in 1, the total timediff is unstable (FROM: ./testmongo | awk '{s+=$NF}END{print s}'):

149726
129175
124993
219767
137422
156674
162410
119684
117086
116885

3. After we run `rs.stepDown()` on the config server primary, the spikes go away (FROM: ./testmongo | awk '{s+=$NF}END{print s}'):

106808
107225
107228
108055
106660
108690
105993
107037
106226
104789
106494
105178
107428
108650
106535

 

What we tried:

  1. Initialize the test DB without sharding but create the indexes (UID as hashed, name as 1): the latency is normal and stable (see the index-creation sketch after this list);
  2. Run the benchmark against each replica set directly: the latency is normal and stable;
  3. Try stepping down a shard replica set primary: the latency may return to normal (not always);
  4. Try stepping down the config server primary again: the latency may become abnormal/normal again (not always);
  5. Replace the WiredTiger storage engine with MMAPv1: the same situation (it can return to normal after stepping something down);
  6. Try stopping the balancer: it does not help;
  7. Downgrade to MongoDB 3.2.22: the latency is normal and stable;
  8. No abnormal logs; we only found split point lookup and network connection log messages.
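
For item 1, a hedged sketch of how the equivalent indexes can be created on the non-sharded comparison collection with the mongo-cxx-driver; the connection URI and the database/collection names are assumptions:

```cpp
#include <bsoncxx/builder/stream/document.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/uri.hpp>

using bsoncxx::builder::stream::document;
using bsoncxx::builder::stream::finalize;

int main() {
    mongocxx::instance inst{};
    // Assumed address of the non-sharded deployment used for the comparison run.
    mongocxx::client oClient{mongocxx::uri{"mongodb://mongod-host:27017"}};
    auto oColl = oClient["foo"]["foo"];

    // Same key patterns as the sharded setup, created as ordinary indexes.
    oColl.create_index(document{} << "UID" << "hashed" << finalize);
    oColl.create_index(document{} << "name" << 1 << finalize);
    return 0;
}
```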


 Comments   
Comment by Eric Sedor [ 22/Apr/19 ]

Hi stutiredboy@gmail.com; we are closing this ticket as a duplicate of SERVER-40707 and encourage you to watch that ticket for updates.

Comment by Eric Sedor [ 18/Apr/19 ]

Thanks for pointing this out, stutiredboy@gmail.com; we'll re-open this ticket, check on this relationship, and also see how we might have diagnosed this sooner.

Comment by Adun [ 18/Apr/19 ]

@Eric

issue https://jira.mongodb.org/browse/SERVER-40707 is the root cause of this issue

Comment by Eric Sedor [ 19/Mar/19 ]

Thanks stutiredboy@gmail.com; unfortunately we're not seeing anything to indicate a bug. We would expect a server-side problem introducing this latency would result in logged slow ops. Without that, there are other possibilities such as network latency or a driver-side configuration issue such as connection pool size.

As this ticket does not appear to be a bug, I will now close it. If you need further assistance troubleshooting, please post on the mongodb-user group or on Stack Overflow with the mongodb tag.

If you are able to narrow this experiment down to a reproducible bug, we'd be interested in that.

Comment by Eric Sedor [ 18/Mar/19 ]

We appreciate your patience stutiredboy@gmail.com, and are continuing to review and discuss internally.

Comment by Adun [ 14/Mar/19 ]

We ran the test on MongoDB 3.4 again:

$ ./mongotest_cpp34 | tee 1.log | awk '{s+=$NF}END{print s}'
107833
$ ./mongotest_cpp34 | tee 2log | awk '{s+=$NF}END{print s}'
109542
$ ./mongotest_cpp34 | tee 3.log | awk '{s+=$NF}END{print s}'
109615
$ ./mongotest_cpp34 | tee 4.log | awk '{s+=$NF}END{print s}'
145013
$ ./mongotest_cpp34 | tee 5.log | awk '{s+=$NF}END{print s}'
126852
$ ./mongotest_cpp34 | tee 6.log | awk '{s+=$NF}END{print s}'
108585
$ ./mongotest_cpp34 | tee 7.log | awk '{s+=$NF}END{print s}'
149275

$ sort -k4 -n -r 7.log

uid 10044767 timediff 4279.467329
uid 10031864 timediff 3968.625955
uid 10021842 timediff 3941.699426
uid 10011782 timediff 3871.421174
uid 10033760 timediff 3579.863499
uid 10052268 timediff 3087.380755
uid 10062160 timediff 3071.985371
uid 10013761 timediff 2878.962980
uid 10075283 timediff 2830.154280
uid 10023997 timediff 2735.049440
uid 10064571 timediff 2127.151757
uid 10002514 timediff 1921.079368
uid 10004980 timediff 1776.391891
uid 10060000 timediff 75.260478
uid 10040000 timediff 73.680193
uid 10030000 timediff 70.422951
uid 10010000 timediff 69.084409
uid 10020000 timediff 68.970113
uid 10070000 timediff 68.486014
uid 10050000 timediff 68.443622
uid 10000000 timediff 64.418417
uid 10012865 timediff 14.042705
uid 10079510 timediff 13.110099

......

uid 10064410 timediff 0.944810
uid 10050058 timediff 0.943896
uid 10031833 timediff 0.942367
uid 10061742 timediff 0.941366
uid 10062108 timediff 0.940045
uid 10064453 timediff 0.939963
uid 10069467 timediff 0.937904
uid 10030796 timediff 0.937348
uid 10075503 timediff 0.933544
uid 10077199 timediff 0.932703
uid 10058929 timediff 0.932643
uid 10054640 timediff 0.932521
uid 10034238 timediff 0.931722
uid 10062967 timediff 0.929753
uid 10041824 timediff 0.928643
uid 10037802 timediff 0.928581
uid 10067611 timediff 0.927270
uid 10073309 timediff 0.927019

The slowms is 100 ms, but we can't get any slow-op log in mongod.

> db.getProfilingStatus()


{ "was" : 0, "slowms" : 100 }

The diagnostic.data from 3.4.19 was attached too.

Comment by Adun [ 14/Mar/19 ]

Yes, awk drops the fractional part (its default numeric output format is %.6g); the real total value may be, for example, 135259.646036:

```

uid 10035044 timediff 1.547164
uid 10003822 timediff 2966.525757
uid 10023620 timediff 4055.941557
uid 10035045 timediff 1.479359
uid 10053296 timediff 4798.573361
uid 10073303 timediff 4346.899593
uid 10003823 timediff 1.621379
uid 10035046 timediff 1.416700
uid 10023621 timediff 2.065020
uid 10013522 timediff 3663.769046
uid 10063516 timediff 3771.117739
uid 10053297 timediff 1.573778
uid 10043645 timediff 3873.324383
uid 10035047 timediff 1.342320
uid 10073304 timediff 1.823686

```

 

I have attached the diagnostic.data from 4.0.6.

 

Comment by Eric Sedor [ 13/Mar/19 ]

It is very strange to hear about 4,000 ms of latency when no slow operations are logged. Can you also clarify the numbers in the awk output above? Is there a decimal point missing? If not, and these are milliseconds, then the lack of slow logged ops is very unexpected.

For the 4.0.6 deployment, would you please run this test, archive (tar or zip) the $dbpath/diagnostic.data directory (described here) and attach it to this ticket?

Comment by Eric Sedor [ 13/Mar/19 ]

Hello and thank you for the detail you've provided so far. We are investigating and will let you know if we have additional questions.

Comment by Adun [ 12/Mar/19 ]

What we tried:

10. On MongoDB 3.6.5, compared to the normal results on 3.4.19 (about 10xxxx), it has slightly higher latency, but it is stable:

109008
113226
111779
113159
111060
110741
111174
109748
111290
110055

11. On MongoDB 4.0.6, compared to the normal results on 3.4.19 (about 10xxxx), it has higher latency, but it is stable:

131276
130429
127002
129441
130086
129963
129655
131308
131948
131162
131298
131438

From our benchmarks we know that 3.6 and 4.0 have lower performance than earlier versions, so we consider the results from 3.6 and 4.0 normal, without high spikes.

Comment by Adun [ 11/Mar/19 ]

149726
129175
124993
219767
137422
156674
162410
119684
117086
116885

The output above is from:

while true ; do ./testmongo | awk '{s+=$NF}END{print s}' ; sleep 2 ; done

What we tried:

9. Wrote the test with mongo-cxx-driver r3.3.1 and r3.4.0; on MongoDB 3.4 they behave the same: unstable latency with high spikes.
