[SERVER-23487] Issue with Sharding Collection SplitVector timeout Created: 04/Apr/16  Updated: 24/May/16  Resolved: 24/May/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Pierre MOELO [X] Assignee: Spencer Brody (Inactive)
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-23784 Don't use 30 second network timeout o... Closed
Operating System: ALL
Sprint: Sharding 15 (06/03/16)
Participants:

 Description   

Hi Team,

My problem is that when I run the shardCollection command, nothing happens, and in some cases I get a timeout:

{ "code" : 50, "ok" : 0, "errmsg" : "Operation timed out" }

I run:

mongos> use admin
 
mongos> db.runCommand( { shardCollection : "Dev.link" , key : { urlHash : 1 } , unique : true } );

In the logfile :

#1# 2016-04-02T23:55:38.286+0200 I NETWORK  [initandlisten] connection accepted from 192.168.0.101:39391 #14 (6 connections now open)
 
#2# 2016-04-02T23:58:51.179+0200 I COMMAND  [conn9] command admin.$cmd command: checkShardingIndex { checkShardingIndex: "link.link", keyPattern: { urlHash: 1.0 } } keyUpdates:0 writeConflicts:0 numYields:541544 reslen:134 locks:{ Global: { acquireCount: { r: 1083090 } }, Database: { acquireCount: { r: 541545 } }, Collection: { acquireCount: { r: 541545 } } } protocol:op_command 231824ms
 
#3# 2016-04-02T23:58:51.256+0200 I SHARDING [conn14] request split points lookup for chunk repolink2.link { : MinKey } -->> { : MaxKey }
2016-04-02T23:59:23.096+0200 W SHARDING [conn14] Finding the split vector for repolink2.link over { urlHash: 1.0 } keyCount: 187454 numSplits: 369 lookedAt: 135495 took 31839ms
 
#4# 2016-04-02T23:59:23.096+0200 I COMMAND  [conn14] command admin.$cmd command: splitVector { splitVector: "repolink2.link", keyPattern: { urlHash: 1.0 }, min: { urlHash: MinKey }, max: { urlHash: MaxKey }, maxChunkSizeBytes: 1073741824, maxSplitPoints: 0, maxChunkObjects: 0 } keyUpdates:0 writeConflicts:0 numYields:541456 reslen:32528 locks:{ Global: { acquireCount: { r: 1082914 } }, Database: { acquireCount: { r: 541457 } }, Collection: { acquireCount: { r: 541457 } } } protocol:op_command 31839ms
 
#5# 2016-04-02T23:59:23.096+0200 I NETWORK  [conn14] end connection 192.168.0.101:39391 (5 connections now open)

I have 3 config servers, 1 mongos, and 2 mongod shards.
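
For context, the rough end-to-end sequence is the following (a sketch only: sharding on the Dev database is assumed to be enabled already and the unique index on urlHash created beforehand; the failing call is the last line):

mongos> sh.enableSharding("Dev")                                        // assumed already done
mongos> use Dev
mongos> db.link.createIndex( { urlHash : 1 } , { unique : true } )      // unique index on the shard key
mongos> use admin
mongos> db.runCommand( { shardCollection : "Dev.link" , key : { urlHash : 1 } , unique : true } )   // this is the call that times out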



 Comments   
Comment by Kelsey Schubert [ 17/May/16 ]

Hi PM,

We still need the logs Spencer requested to diagnose the problem. If this is still an issue for you, would you please increase the logging verbosity to level 2 on both the mongos and the shard mongods (by running db.setLogLevel(2)) and then execute the shardCollection command? Afterwards, please return the log verbosity to level 0 and upload the logs to the upload portal.

Thank you,
Thomas

Comment by Spencer Brody (Inactive) [ 29/Apr/16 ]

Hi PM,
Can you please confirm that when you upgraded to 3.2.6 you upgraded both your mongos processes and your mongod processes, not just one or the other? Assuming you did, I'll need some extra information to diagnose what's happening - namely, logs from the mongos running the shardCollection command and from the primary mongod of each shard. If possible, please increase the log verbosity to level 2 (by running "db.setLogLevel(2)" in the shell) on both the mongos and the shard mongods before running the shardCollection command. Make sure to turn the log level back to 0 after the shardCollection command finishes, as log level 2 generates a lot of messages and can quickly cause the log file to grow quite large.
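
Concretely, the sequence would look something like this (just a sketch; db.setLogLevel() is run in a shell connected to each process, while the shardCollection command itself is only issued on the mongos):

// in a shell connected to the mongos and to each shard primary
db.setLogLevel(2)

// on the mongos only
use admin
db.runCommand( { shardCollection : "Dev.link" , key : { urlHash : 1 } , unique : true } )

// afterwards, on every process where the verbosity was raised
db.setLogLevel(0)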

If you don't want to attach the log files to this (publicly visible) ticket, feel free to open a new ticket in our Community Private (SUPPORT) project and attach them there. If you do so, just make sure to leave a comment on this ticket with the id of the new SUPPORT ticket so I know to go look at it.

Cheers,
-Spencer

Comment by Pierre MOELO [X] [ 29/Apr/16 ]

Hi @Thomas,

I upgraded to MongoDB 3.2.6, but the issue is still there.

Comment by Kelsey Schubert [ 26/Apr/16 ]

Hi PM,

Thank you for the additional information. I recommend that you upgrade to MongoDB 3.2.6-rc0, the release candidate for MongoDB 3.2.6, which includes SERVER-23784. This issue may explain the time outs that you are observing.

Would you please upgrade and let us know if the issue persists?

Kind regards,
Thomas

Comment by Pierre MOELO [X] [ 15/Apr/16 ]

Done, let me know if something is missing.

Comment by Kelsey Schubert [ 15/Apr/16 ]

Hi PM,

We are still investigating this issue and exploring explanations for the slow query you are observing. To assist our investigation, can you please reproduce this issue and provide the following information:

  • an archive (tar or zip) of the $dbpath/diagnostic.data directory
  • the complete logs of the affected shard (to correlate the diagnostic data to logged events)

Please use the same portal as before.
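
In case it is useful, the dbpath of the affected shard (under which the diagnostic.data directory lives) can usually be read from a shell connected to that mongod; this is only a sketch and the exact field depends on how the process was started:

db.serverCmdLineOpts().parsed.storage.dbPath   // diagnostic.data sits directly under this path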

Thank you,
Thomas

Comment by Pierre MOELO [X] [ 13/Apr/16 ]

Hi @Thomas,

Do you have any news?

Thanks in advance

Comment by Pierre MOELO [X] [ 07/Apr/16 ]

Hi @Thomas,

I just uploaded the output of currentOp and stats.

Comment by Kelsey Schubert [ 06/Apr/16 ]

Hi PM,

Thank you for uploading the logs.

  1. Can you please provide the output of db.currentOp() while the explain query is executing? (A sketch of one way to capture this is below.)
  2. Can you post the output of db.link.stats()?
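
For example, something along these lines in a second shell connected to the shard holding Dev.link should capture both (the secs_running filter is only there to narrow the output to long-running operations):

use Dev
db.currentOp( { secs_running : { $gt : 60 } } )   // the long-running explain should show up here
db.link.stats()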

Thanks again for your help,
Thomas

Comment by Pierre MOELO [X] [ 06/Apr/16 ]

Hi @Thomas,

I added the logs yesterday.
I executed the command shown above, but it is still running (90000 sec so far).

Comment by Kelsey Schubert [ 04/Apr/16 ]

Hi PM,

Additionally, if the Dev.link collection is still not sharded, can you please execute and post the output of the following command on the shard containing the collection:

use Dev
db.link.find().min({ urlHash: MinKey }).max({ urlHash: MaxKey }).explain(true)

Thank you,
Thomas

Comment by Kelsey Schubert [ 04/Apr/16 ]

Hi PM,

Thank you for clarifying that you are using config servers as a replica set. So we can continue to investigate, can you please upload the complete logs of your config servers, mongos, and shards? I have created a secure upload portal for you to use here. Files you upload to this portal will only be visible to MongoDB employees investigating this issue.

Kind regards,
Thomas

Comment by Pierre MOELO [X] [ 04/Apr/16 ]

Hi @Thomas,

Actually, I am using config servers as a replica set.
I have this problem on two ~200 GB databases.
I tried on an empty database and it worked.

csReplSetDev:PRIMARY> rs.status()
{
        "set" : "csReplSetDev",
        "date" : ISODate("2016-04-04T14:23:11.590Z"),
        "myState" : 1,
        "term" : NumberLong(12),
        "configsvr" : true,
        "heartbeatIntervalMillis" : NumberLong(2000),
        "members" : [
                {
                        "_id" : 0,
                        "name" : "DEV01:27019",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 145861,
                        "optime" : {
                                "ts" : Timestamp(1459779789, 5),
                                "t" : NumberLong(12)
                        },
                        "optimeDate" : ISODate("2016-04-04T14:23:09Z"),
                        "electionTime" : Timestamp(1459633943, 1),
                        "electionDate" : ISODate("2016-04-02T21:52:23Z"),
                        "configVersion" : 9,
                        "self" : true
                },
                {
                        "_id" : 1,
                        "name" : "DEV02:27019",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 145854,
                        "optime" : {
                                "ts" : Timestamp(1459779789, 5),
                                "t" : NumberLong(12)
                        },
                        "optimeDate" : ISODate("2016-04-04T14:23:09Z"),
                        "lastHeartbeat" : ISODate("2016-04-04T14:23:11.319Z"),
                        "lastHeartbeatRecv" : ISODate("2016-04-04T14:23:10.372Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "DEV01:27019",
                        "configVersion" : 9
                },
                {
                        "_id" : 2,
                        "name" : "DEV03:27019",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 145849,
                        "optime" : {
                                "ts" : Timestamp(1459779789, 5),
                                "t" : NumberLong(12)
                        },
                        "optimeDate" : ISODate("2016-04-04T14:23:09Z"),
                        "lastHeartbeat" : ISODate("2016-04-04T14:23:11.213Z"),
                        "lastHeartbeatRecv" : ISODate("2016-04-04T14:23:10.268Z"),
                        "pingMs" : NumberLong(0),
                        "syncingTo" : "DEV01:27019",
                        "configVersion" : 9
                }
        ],
        "ok" : 1

csReplSetDev:PRIMARY> rs.printSlaveReplicationInfo()
source: DEV02:27019
        syncedTo: Mon Apr 04 2016 16:31:06 GMT+0200 (CEST)
        0 secs (0 hrs) behind the primary
source: DEV03:27019
        syncedTo: Mon Apr 04 2016 16:31:06 GMT+0200 (CEST)
        0 secs (0 hrs) behind the primary

Comment by Kelsey Schubert [ 04/Apr/16 ]

Hi PM,

Thank you for opening this ticket. Can you please clarify if you are using the legacy mirrored config servers or config servers as a replica set?

If you are using config servers as a replica set, could you please connect to the primary config server and provide the output of rs.status() and rs.printSlaveReplicationInfo()?

Thank you,
Thomas
