[SERVER-18430] SECONDARY cannot catch up with the oplog during upgrade process Created: 12/May/15  Updated: 12/Jun/15  Resolved: 12/Jun/15

Status: Closed
Project: Core Server
Component/s: Admin, Replication
Affects Version/s: 2.6.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jiangcheng Wu Assignee: Sam Kleinman (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongodb.log.2     File mongodb.log.2.filterd    
Operating System: ALL
Participants:

 Description   

Hi, I tried to upgrade my mixed 2.4.11/2.4.12 sharded cluster to 2.6.9, but failed because a SECONDARY (hidden, priority=0) cannot catch up with the oplog.

CLUSTER STRUCTURE:
3 shards; each mongod instance listens on port 27018
3 config servers; each listens on port 27019
many mongos instances deployed on app servers, listening on port 27020

PREPAREDNESS:
Following the guide at http://docs.mongodb.org/manual/release-notes/2.6-upgrade/#preparedness, we ran some checks and found the problems described in https://jira.mongodb.org/browse/SERVER-17746 .
We added the argument --setParameter failIndexKeyTooLong=false to avoid write failures, and ignored the other errors.

UPGRADE PROCESS:
Following the guide at http://docs.mongodb.org/manual/release-notes/2.6-upgrade/#upgrade-a-sharded-cluster-to-2-6:
1. Disabled write operations to the cluster metadata; done with no errors.
2. Disabled the balancer; done with no errors.
3. Upgraded the metadata; done with no errors.
4. Upgraded all mongos instances to 2.6.9; done with no errors.
5. Stopped the ARBITER member of each shard (3 in total), because each ARBITER runs on the same box as a config server, and installing mongodb-org-server (2.6.9) removes the mongodb-10gen (2.4.x) package.
6. Upgraded all 3 config servers in reverse order of the mongos --configdb argument, leaving the first system listed in --configdb to upgrade last; done with no errors.
7. Picked a HIDDEN member of one shard and tried to upgrade it; this failed because the SECONDARY cannot catch up with the oplog.
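The ordering rule in step 6 can be sketched as follows (a minimal sketch; the hostnames are hypothetical placeholders, not this cluster's servers):

```python
# Hypothetical config server list, as it would appear in the mongos
# --configdb argument (comma-separated, order matters).
configdb = "cfg1.example.net:27019,cfg2.example.net:27019,cfg3.example.net:27019"

# Upgrade in reverse order, so the first-listed server is upgraded last.
upgrade_order = list(reversed(configdb.split(",")))
for host in upgrade_order:
    print("upgrade", host)  # i.e. stop mongod, install 2.6.9, restart
```

The first-listed config server going last matches the sequencing described in the 2.6 sharded-cluster upgrade guide linked above.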
[screenshot attachment: replica set status of the upgraded member; the original image link has expired]
The other PRIMARY and SECONDARY members show:

Tue May 12 02:27:51.581 [rsHealthPoll] replset info mgsh43.avoscloud.com:27018 thinks that we are down

It looks like this member cannot contact the other members.

Log of this member:

1. Lots of:
2015-05-12T02:26:32.025+0800 [IndexRebuilder] opening db {db names}.
2. Lots of:
2015-05-12T02:34:53.153+0800 [rsHealthPoll] 0x1219651 0x11ba9d9 0x119d71e 0x11a8c6b 0xe4a6a4 0x11a8a1e 0x11a0882 0x125e419 0x7fbeff525182 0x7fbefe82a47d 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x1219651]
 /usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x11ba9d9]
 /usr/bin/mongod(_ZN5mongo9wassertedEPKcS1_j+0x16e) [0x119d71e]
 /usr/bin/mongod(_ZN5mongo4task6Server4sendEN5boost8functionIFvvEEE+0x19b) [0x11a8c6b]
 /usr/bin/mongod(_ZN5mongo21ReplSetHealthPollTask6doWorkEv+0x544) [0xe4a6a4]
 /usr/bin/mongod(_ZN5mongo4task4Task3runEv+0x1e) [0x11a8a1e]
 /usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEv+0xd2) [0x11a0882]
 /usr/bin/mongod() [0x125e419]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x8182) [0x7fbeff525182]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fbefe82a47d]

I've attached the full log and a filtered log of this member; mgcfg3.avoscloud.com (10.10.17.10) in the log is the ARBITER.

And the mongodb-org-server version:
db version v2.6.9
git version: df313bc75aa94d192330cb92756fc486ea604e64

I don't know why this happens. Maybe it is a bug in 2.6.9?



 Comments   
Comment by Sam Kleinman (Inactive) [ 12/Jun/15 ]

Thanks for this information. In 2.6, mongod must open each database on startup to clean up aborted index builds. Just to be clear, the string "IndexRebuilder" refers to the name of the thread doing this work and not to the action it's performing. I'm going to go ahead and close this ticket, as the SERVER project is for bugs with MongoDB itself, and questions about performance and operations are better addressed to one of our support channels.

Regards,
sam

Comment by Jiangcheng Wu [ 10/Jun/15 ]

Hi,
1. Time to start up and listen on the port: 2.4: 22 minutes; 2.6.10: 20 minutes.
IndexRebuilder time: 2.4: none; 2.6.10: 13 minutes.
Time to become SECONDARY: 2.4: 0 seconds; 2.6.10: 34 minutes.
In total, 2.4 takes 22 minutes and 2.6.10 takes 67 minutes.
2. Yes.
3. Yes.
4. There are 10,930 databases (and more than 70,000 collections with indexes; nearly 90% of the collections may not store any records). There are 33,053 files in the dbPath.
5. SSD and 15k SAS disks; startup time on SSD is half that on the 15k SAS disks.
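As a quick arithmetic check on the totals in answer 1 (a sketch; the phase labels are paraphrased from the numbers above):

```python
# Reported per-phase startup times, in minutes.
phases_26 = {"listen on port": 20, "IndexRebuilder": 13, "become SECONDARY": 34}
phases_24 = {"listen on port": 22, "IndexRebuilder": 0, "become SECONDARY": 0}

total_26 = sum(phases_26.values())  # 67 minutes, as reported
total_24 = sum(phases_24.values())  # 22 minutes, as reported
print(f"2.4: {total_24} min, 2.6.10: {total_26} min, "
      f"ratio ~{total_26 / total_24:.1f}x")
```

The ~3x ratio is consistent with the "about triple" startup time mentioned in an earlier comment.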

The IndexRebuilder phase costs a lot of time even though we set --noIndexBuildRetry, and the number of occurrences of

[IndexRebuilder] opening db: ...

in the startup log is almost equal to the database count. From the log alone I cannot tell whether it is rebuilding indexes or just checking them.
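That per-database count can be recovered by filtering the startup log for the IndexRebuilder marker; a minimal sketch using a hypothetical log excerpt (the real logs are in the attached files, and the database names here are placeholders):

```python
# Hypothetical excerpt in the same format as the attached startup log.
sample_log = """\
2015-05-12T02:26:32.025+0800 [IndexRebuilder] opening db: app1
2015-05-12T02:26:32.148+0800 [IndexRebuilder] opening db: app2
2015-05-12T02:26:33.002+0800 [rsHealthPoll] replset info ...
2015-05-12T02:26:33.410+0800 [IndexRebuilder] opening db: app3
"""

# Count how many databases the IndexRebuilder thread opened on startup.
opened = [line for line in sample_log.splitlines()
          if "[IndexRebuilder] opening db" in line]
print(len(opened), "databases opened by the IndexRebuilder thread")
```

Comparing this count against the total database count (10,930 here) shows whether every database was opened during the IndexRebuilder phase.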

Thank you.

Comment by Sam Kleinman (Inactive) [ 09/Jun/15 ]

I'm glad to hear that you've successfully upgraded now. I just want to ask a few questions about your deployment to better understand the difference in start up times.

  1. In absolute time how long does it take for a mongod to start up with 2.6.10 versus 2.4?
  2. Is the start up time consistent across all members of the set?
  3. Are start up times consistent between restarts?
  4. Exactly how many databases does your data set have? How many files are in your dbPath?
  5. What kind of underlying storage system are you using (SSD? Local storage? Network storage?)

Thanks,
Sam

Comment by Jiangcheng Wu [ 03/Jun/15 ]

Hi

  • Networking is good. Lots of connections to the arbiter (10.10.17.10) were refused because the arbiter was stopped at that time.
  • Yes, the ARBITER was stopped at that time.
  • Maybe; I'm not sure whether any index build was in progress before the upgrade. And yes, my instances' workload includes a large number of database and collection creation operations (including indexes). But during that upgrade window, applications did not create any databases or collections; I'm sure of that because we deployed the applications without the ability to change metadata before the upgrade. Or maybe there is a mistake/problem with the indexes?

We upgraded our production mongo cluster to 2.6.10 a few days ago. Everything looks OK, except that the startup time of a mongod instance is about triple that of 2.4.

Comment by Sam Kleinman (Inactive) [ 27/May/15 ]

As I look through these logs and reports, a couple of things stand out at me:

  • The logs contain a large number of interprocess mongod connection errors. Has anything changed in the networking configuration?
  • From your notes it doesn't look like you restarted the arbiter after upgrading the instance. Is this correct? Was the arbiter available and a member of the set when you were attempting to upgrade the shard? If the arbiters are not running when you attempt to upgrade the hidden member, could you attempt the upgrade with them running?
  • It appears from the log as if the members of the replica set are attempting to build indexes. Was an index build in progress before the upgrade? Does your instance's workload include a large number of database and collection creation operations?

If you could provide answers to these questions it will help us understand what's going on here.

Regards,
sam

Comment by Jiangcheng Wu [ 22/May/15 ]

Hi,
We copied one shard's data and tested the upgrade procedure, and it succeeded. The upgraded process takes much more time (about 3x) to start up than 2.4.
We will upgrade our production cluster later.
Thank you, guys.

Comment by Jiangcheng Wu [ 12/May/15 ]

Added an image of the db.printSlaveReplicationInfo() output, with a weird timestamp.

Finally, the upgrade was canceled, and mgsh43 was downgraded to 2.4.14; this member can now catch up with the other members.

MongoDB shell version: 2.4.14
connecting to: localhost:27018/test
SHARD04:SECONDARY> db.printSlaveReplicationInfo()
source:   mgsh43.avoscloud.com:27018
	 syncedTo: Tue May 12 2015 19:09:15 GMT+0800 (CST)
		 = 1 secs ago (0hrs)
source:   mgsh44.avoscloud.com:27018
	 syncedTo: Tue May 12 2015 19:09:14 GMT+0800 (CST)
		 = 2 secs ago (0hrs)
source:   rtm42.avoscloud.com:27018
	 syncedTo: Tue May 12 2015 19:09:14 GMT+0800 (CST)
		 = 2 secs ago (0hrs)
source:   mgcfg3.avoscloud.com:27018
	 no replication info, yet.  State: ARBITER
source:   rtm33.avoscloud.com:27018
	 syncedTo: Tue May 12 2015 19:09:14 GMT+0800 (CST)
		 = 2 secs ago (0hrs)

Generated at Thu Feb 08 03:47:39 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.