[SERVER-31995] Logged initial sync statistics may exceed 16mb causing fassert Created: 16/Nov/17  Updated: 20/Dec/23  Resolved: 02/May/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.2
Fix Version/s: 3.4.16, 3.6.6, 4.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Benoit Bui Assignee: Benety Goh
Resolution: Fixed Votes: 0
Labels: initialSync
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File clust-users-2-shard3-2.log     Text File mongod-logs.txt    
Issue Links:
Backports
Related
is related to SERVER-84324 replSetGetStatus could asserts silent... Open
is related to SERVER-25125 Add initial sync progress information... Closed
is related to SERVER-27052 Add asynchronous operation support to... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.6, v3.4
Participants:
Linked BF Score: 11

 Description   

Hi Team,

We have a 10 shards (Primary / Secondary / Arbiter) sharded cluster which hosts 70k databases.

Here's the repartition on the shards:
(NB: some of our databases are not sharded)

mongos> db.databases.aggregate({$group:{_id: '$primary', count: {$sum:1}}})
{ "_id" : "clust-users-2-shard10", "count" : 4594 }
{ "_id" : "clust-users-2-shard9", "count" : 8945 }
{ "_id" : "clust-users-2-shard8", "count" : 8624 }
{ "_id" : "clust-users-2-shard1", "count" : 8084 }
{ "_id" : "clust-users-2-shard7", "count" : 4505 }
{ "_id" : "clust-users-2-shard2", "count" : 4769 }
{ "_id" : "clust-users-2-shard6", "count" : 9370 }
{ "_id" : "clust-users-2-shard4", "count" : 4717 }
{ "_id" : "clust-users-2-shard3", "count" : 10217 }
{ "_id" : "clust-users-2-shard5", "count" : 5953 }

We're currently experiencing issues to resync this shard from scratch with the following error:

2017-11-16T05:49:33.245+0100 I -        [replication-115] Assertion: 10334:BSONObj size: 32985739 (0x1F7528B) is invalid. Size must be between 0 and 16793600(16MB) First element: databasesCloned: 10191 src/mongo/bson/bsonobj.cpp 58

On another cluster with the same architecture but less databases per shards, we do not encounter this issue.

We plan to upgrade from version 3.4.4 to 3.4.10 but we haven't found anything related to this issue in changelog.
Is this a known issue or do you have more information about this?

Thanks.

Regards,
Benoit



 Comments   
Comment by Githook User [ 22/May/18 ]

Author:

{'username': 'benety', 'name': 'Benety Goh', 'email': 'benety@mongodb.com'}

Message: SERVER-31995 omit cloner stats from initial sync stats if too large

(cherry picked from commit 1a605a99bdc242fd957afdb6e9fe9b8f9c32c862)
Branch: v3.4
https://github.com/mongodb/mongo/commit/fa7b069d47d83c17789ec4c0796ec2eb99e422f0

Comment by Githook User [ 22/May/18 ]

Author:

{'username': 'benety', 'name': 'Benety Goh', 'email': 'benety@mongodb.com'}

Message: SERVER-31995 added characterization test for initial sync stats on a large database

(cherry picked from commit 6fdc0e7d43a3a6550f8d93ec2cdfa25e23a0bdba)
Branch: v3.4
https://github.com/mongodb/mongo/commit/083586da37b1258989f544a789b928c6b8683f88

Comment by Githook User [ 22/May/18 ]

Author:

{'username': 'benety', 'name': 'Benety Goh', 'email': 'benety@mongodb.com'}

Message: SERVER-31995 remove BSONObj validity check for 3.4 from InitialSyncer::getInitialSyncProgress()

(cherry picked from commit 880de296d30736f6005e283aa7737bb2f335dc61)
Branch: v3.6
https://github.com/mongodb/mongo/commit/e9add0a0cde6d9f9e6d5a5c994d76d3358cd1e00

Comment by Githook User [ 22/May/18 ]

Author:

{'username': 'benety', 'name': 'Benety Goh', 'email': 'benety@mongodb.com'}

Message: SERVER-31995 omit cloner stats from initial sync stats if too large

(cherry picked from commit 1a605a99bdc242fd957afdb6e9fe9b8f9c32c862)
Branch: v3.6
https://github.com/mongodb/mongo/commit/832f3f75dcb625d4d742ce4d23b121e5b6c6e508

Comment by Githook User [ 22/May/18 ]

Author:

{'username': 'benety', 'name': 'Benety Goh', 'email': 'benety@mongodb.com'}

Message: SERVER-31995 added characterization test for initial sync stats on a large database

(cherry picked from commit 6fdc0e7d43a3a6550f8d93ec2cdfa25e23a0bdba)
Branch: v3.6
https://github.com/mongodb/mongo/commit/3237d779ac7af34b3e809b70bf4897ca9a0f34de

Comment by Githook User [ 02/May/18 ]

Author:

{'email': 'benety@mongodb.com', 'name': 'Benety Goh', 'username': 'benety'}

Message: SERVER-31995 remove BSONObj validity check for 3.4 from InitialSyncer::getInitialSyncProgress()
Branch: master
https://github.com/mongodb/mongo/commit/880de296d30736f6005e283aa7737bb2f335dc61

Comment by Githook User [ 02/May/18 ]

Author:

{'email': 'benety@mongodb.com', 'name': 'Benety Goh', 'username': 'benety'}

Message: SERVER-31995 omit cloner stats from initial sync stats if too large
Branch: master
https://github.com/mongodb/mongo/commit/1a605a99bdc242fd957afdb6e9fe9b8f9c32c862

Comment by Githook User [ 02/May/18 ]

Author:

{'email': 'benety@mongodb.com', 'name': 'Benety Goh', 'username': 'benety'}

Message: SERVER-31995 added characterization test for initial sync stats on a large database
Branch: master
https://github.com/mongodb/mongo/commit/6fdc0e7d43a3a6550f8d93ec2cdfa25e23a0bdba

Comment by Systems [ 27/Apr/18 ]

Is there a current workaround to this?

When is the fix expected to be out for this issue?

Comment by Systems [ 27/Apr/18 ]

I am experiencing same issue here.

After loss of 2 of the 3 total replica members we attempt a re-sync from the remaining primary

After initial clean sync from primary completes mongod crashes (complete trace attached)

[replication-100] DBException::toString(): 10334 BSONObj size: 38694152 (0x24E6D08) is invalid. Size must be between 0 and 16793600(16MB) First element: databasesCloned: 56584
Actual exception type: mongo::MsgAssertionException

Upon mongod restart mongod attempts a full re-sync and fails

[replication-1] dropAllDatabasesExceptLocal 56592

db version v3.4.10
git version: 078f28920cb24de0dd479b5ea6c66c644f6326e9
OpenSSL version: OpenSSL 1.0.1t 3 May 2016
allocator: tcmalloc
modules: none
build environment:
distmod: debian81
distarch: x86_64
target_arch: x86_64

mongod-logs.txt

Comment by Spencer Brody (Inactive) [ 25/Apr/18 ]

The log message that causes this issue appears to have been added in https://github.com/mongodb/mongo/commit/c911be4e42994ad6106d12ca6a760c255e5d0452#diff-64adfcffbeb50a1887b7aa86d2689bfcR740, which landed in 3.3.12. So I don't expect 3.4.1 to be any better than 3.4.2 at avoiding this issue. This is a problem in all versions of 3.4.

Comment by Kelsey Schubert [ 16/Feb/18 ]

Hi BenoitSIB,

Sorry for the delay getting back to you, we've confirmed this issue and I've updated the ticket summary to better describe this issue and marked this ticket to be scheduled against currently planned work. Please continue to watch for updates.

Kind regards,
Kelsey

Comment by Anthony Brodard [ 13/Feb/18 ]

Hi Kelsey,

Are you able to reproduce the issue on your side ?

Regards,
Anthony

Comment by Anthony Brodard [ 01/Dec/17 ]

Hi Andy,

We have upgraded our cluster in the latest version (3.4.10). We still have the same issue.
Do you think a fix will be available in the next minor release ?

Thank you,
Anthony

Comment by Benoit Bui [ 16/Nov/17 ]

clust-users-2-shard3-2.log
Hi,
See the log attached.

Thanks.
Benoit

Comment by Andy Schwerin [ 16/Nov/17 ]

Can you upload the entire log? Or at least from the beginning of the extract you provided above to the end of the log file?

Comment by Anthony Brodard [ 16/Nov/17 ]

Hello,

The error seems to come from https://github.com/mongodb/mongo/blob/r3.4.4/src/mongo/db/repl/initial_syncer.cpp#L1073
This line seems used only for logging purposes. Can you confirm we can drop it and recompile mongo without impacting the synchronization process ?

Thank you,
Anthony

Comment by Benoit Bui [ 16/Nov/17 ]

Hi,

A more complete extract from the log:

2017-11-16T05:49:33.133+0100 I REPL     [replication-115] Finished fetching oplog during initial sync: CallbackCanceled: Callback cance
led. Last fetched optime and hash: { ts: Timestamp 1510807773000|11, t: 11 }[-3242675976573290350]
2017-11-16T05:49:33.133+0100 I REPL     [replication-115] Initial sync attempt finishing up.
2017-11-16T05:49:33.245+0100 I -        [replication-115] Assertion: 10334:BSONObj size: 32985739 (0x1F7528B) is invalid. Size must be between 0 and 16793600(16MB) First element: databasesCloned: 10191 src/mongo/bson/bsonobj.cpp 58
2017-11-16T05:49:33.268+0100 I CONTROL  [replication-115]
 0x7fb1aef20bc1 0x7fb1aeeaba17 0x7fb1ae1bb8e5 0x7fb1ae1bb9c4 0x7fb1ae255798 0x7fb1ae245b28 0x7fb1ae24b6c2 0x7fb1ae256292 0x7fb1ae256566
 0x7fb1aecbffad 0x7fb1ae8ec8fb 0x7fb1ae8f1ce5 0x7fb1ae229736 0x7fb1ae8f08ff 0x7fb1ae93108a 0x7fb1ae9311f7 0x7fb1ae936fa3 0x7fb1ae2beaf0
 0x7fb1aecae27a 0x7fb1aecb1683 0x7fb1aecb1b5b 0x7fb1aee98795 0x7fb1aee992c0 0x7fb1aee99e69 0x7fb1af98f690 0x7fb1ac5f2064 0x7fb1ac32762d
----- BEGIN BACKTRACE -----

So it finished the initial sync and failed (and crashed) right after.

Regards,
Benoit

Generated at Thu Feb 08 04:28:50 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.