[SERVER-9125] Unable to upgrade config metadata from v3 to v4 - 13127 getMore: cursor didn't exist on server, possible restart or timeout? Created: 25/Mar/13  Updated: 11/Jul/16  Resolved: 02/Apr/13

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.4.0, 2.4.1
Fix Version/s: 2.4.2, 2.5.0

Type: Bug Priority: Major - P3
Reporter: Nimi Wariboko Jr. Assignee: Alberto Lerner
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Config Server 1:
CentOS release 6.2 (Final)
db version v2.2.2, pdfile version 4.5
2GB Ram
1.5G in data directory

Config Server 2:
CentOS release 6.3 (Final)
db version v2.2.0, pdfile version 4.5
1GB Ram
1.5G in data directory

Config Server 3:
Ubuntu 10.04 Lucid (32 bit machine)
db version v2.2.2, pdfile version 4.5
512MB RAM
593 MB in data directory


Issue Links:
Related
Operating System: ALL
Steps To Reproduce:

1.) Install mongos 2.4.1 on a machine
2.) Run mongos --configdb dbconf1:27019,dbconf2:27019,dbconf3:27019 --upgrade
3.) 20 minutes later, receive timeout.

Participants:

 Description   

When trying to perform an upgrade on the config servers, we get a timeout issue.

[code]
ERROR: error upgrading config database to v4 :: caused by :: error upgrading config database from v3 to v4 :: caused by :: could not copy config.chunks to config.chunks-upgrade-514cfc77fc5d397af47be301 :: caused by :: could not copy data into new collection :: caused by :: 13127 getMore: cursor didn't exist on server, possible restart or timeout?
[/code]

Our database consists of 18312 chunks, and of those, 16790 belong to a single collection.

I have attempted to repeat the upgrade many times, and the issue continues to occur.



 Comments   
Comment by Alberto Lerner [ 13/Apr/13 ]

Nimi,

Thanks for the feedback.

Could you also report the size of your chunks and collections collections in config, and how long the config migration process took?

That should all be in the log of the mongos you ran with --upgrade.

Alberto.

Comment by Nimi Wariboko Jr. [ 12/Apr/13 ]

Successfully upgraded to v4 with 2.4.2 rc0

Comment by auto [ 09/Apr/13 ]

Author: Alberto Lerner <alerner@10gen.com> (2013-04-09T16:49:06Z)

Message: SERVER-9125 Prevent cursor time out in the case of very slow config servers
Branch: v2.4
https://github.com/mongodb/mongo/commit/6edf4a7f081a4d2dc7c515a29908b79af5e94771

Comment by auto [ 09/Apr/13 ]

Author: Alberto Lerner <alerner@10gen.com> (2013-04-09T16:28:03Z)

Message: SERVER-9125 Fixed batch size computation if first doc is large.
Branch: v2.4
https://github.com/mongodb/mongo/commit/9f073b6cae48457c6adb81cf399694da7b6d85e7

Comment by auto [ 09/Apr/13 ]

Author: Alberto Lerner <alerner@10gen.com> (2013-04-09T16:49:06Z)

Message: SERVER-9125 Prevent cursor time out in the case of very slow config servers
Branch: master
https://github.com/mongodb/mongo/commit/694ad4140ad4433cf05f321b4ea348e54d00c46a

Comment by auto [ 09/Apr/13 ]

Author: Alberto Lerner <alerner@10gen.com> (2013-04-09T16:28:03Z)

Message: SERVER-9125 Fixed batch size computation if first doc is large.
Branch: master
https://github.com/mongodb/mongo/commit/b8a78720e2d1bdbd2b6afe2f710131b19011de7e

Comment by auto [ 08/Apr/13 ]

Author: Alberto Lerner <alerner@10gen.com> (2013-04-08T23:32:36Z)

Message: SERVER-9125 Fix memory ownership during large collection copy
Branch: v2.4
https://github.com/mongodb/mongo/commit/a3dae84adeeff7b062c0a969d5122d17b838bd4b

Comment by auto [ 08/Apr/13 ]

Author: Alberto Lerner <alerner@10gen.com> (2013-04-08T23:32:36Z)

Message: SERVER-9125 Fix memory ownership during large collection copy
Branch: master
https://github.com/mongodb/mongo/commit/b5aa82c6783029da7fd21480d0271731c0c770f3

Comment by auto [ 02/Apr/13 ]

Author: Alberto Lerner <alerner@10gen.com> (2013-04-01T18:11:19Z)

Message: SERVER-9125 Copy collections faster in the config upgrade procedure
Branch: v2.4
https://github.com/mongodb/mongo/commit/3bc93285244eaafe0f5f1d019a8eb74609134239

Comment by auto [ 01/Apr/13 ]

Author: Alberto Lerner <alerner@10gen.com> (2013-04-01T18:11:19Z)

Message: SERVER-9125 Copy collections faster in the config upgrade procedure
Branch: master
https://github.com/mongodb/mongo/commit/7c79805aa5ed14c573a07da705c888a22bf8853a

Comment by Alberto Lerner [ 27/Mar/13 ]

TL;DR: fix is upcoming, expected within 2.4.2 time frame.

====

Here's a little bit more clarification.

There's a special internal protocol for writing to config servers. Part of this protocol involves checks after writing every document. We want these checks: they are what allow a cluster to continue taking reads and writes if one of the config servers is down, because the checks guarantee that the config servers always agree in content.

The checks have a cost though. For all the operations against the configs we've done so far, that cost was not an issue.

For the config upgrade procedure, though, we make copies (two: a working copy and a backup) of each collection that we're changing. (Recall that the 2.4 config collection layout is slightly different from 2.2's. The upgrade process is what converts one layout into the other.) The checks that we do on the config servers end up being too heavy for an entire collection copy, especially for the chunks collection.

So the upgrade takes a long time, because each collection gets copied a single document at a time. The timeout occurs because, in some cases, the upgrade process (run at the start of a 2.4 mongos with --upgrade) takes longer to issue the next getMore on a cursor than it takes for that cursor to time out.
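To make the timing argument concrete, here is a rough back-of-the-envelope model in Python. It is only a sketch: the function name is hypothetical, and the 600-second timeout, the 101-document cursor batch, and the per-write costs are illustrative assumptions, not values taken from this ticket or from the mongos source.

```python
def next_getmore_survives(docs_in_cursor_batch: int,
                          seconds_per_checked_write: float,
                          cursor_timeout_seconds: float) -> bool:
    """Return True if the copy loop gets back to the cursor before it expires.

    While the copier writes out the documents it already fetched, the source
    cursor sits idle. If that idle time reaches the server's cursor timeout,
    the server discards the cursor and the next getMore fails.
    """
    idle_seconds = docs_in_cursor_batch * seconds_per_checked_write
    return idle_seconds < cursor_timeout_seconds


if __name__ == "__main__":
    # Assumed numbers, for illustration only.
    timeout = 600.0  # assumed idle cursor timeout, in seconds
    for cost in (0.01, 1.0, 6.0):  # assumed seconds per checked write
        ok = next_getmore_survives(101, cost, timeout)
        print(f"{cost:>5} s/write -> next getMore survives: {ok}")
```

The point of the model is that the failure depends on the product of the fetched batch size and the per-document write cost, which is why slow config servers hit it and fast ones do not.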

The upcoming fix will keep the special checks that a config write incurs, but we'll batch documents when copying the collection so that the checks are executed once per batch rather than once per document.
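The batching idea can be sketched in a few lines of Python. This is not the actual mongos code (which is C++); `batched`, `copy_collection`, `count_checks`, and the batch size of 500 are hypothetical names and values chosen for illustration. The sketch only shows how grouping documents reduces the number of check rounds.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Doc = Dict[str, int]


def batched(docs: Iterable[Doc], batch_size: int) -> Iterator[List[Doc]]:
    """Yield consecutive lists of at most batch_size documents."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    batch: List[Doc] = []
    for doc in docs:
        batch.append(doc)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final, possibly short, batch
        yield batch


def copy_collection(source: Iterable[Doc],
                    write_batch: Callable[[List[Doc]], None],
                    run_checks: Callable[[], None],
                    batch_size: int) -> int:
    """Copy source in batches, running the checks once per batch.

    Returns the number of check rounds that were executed.
    """
    check_rounds = 0
    for batch in batched(source, batch_size):
        write_batch(batch)
        run_checks()  # stands in for the config-server consistency checks
        check_rounds += 1
    return check_rounds


def count_checks(num_docs: int, batch_size: int) -> int:
    """Copy num_docs dummy documents in memory and count the check rounds."""
    copied: List[Doc] = []
    source = ({"_id": i} for i in range(num_docs))
    rounds = copy_collection(source, copied.extend, lambda: None, batch_size)
    assert len(copied) == num_docs  # every document must arrive
    return rounds


if __name__ == "__main__":
    print(count_checks(16790, 1))    # one check round per document
    print(count_checks(16790, 500))  # one check round per batch
```

With a batch size of 1 this degenerates to the per-document behavior described above; larger batches trade finer-grained checking for far fewer round trips.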

Comment by Alberto Lerner [ 26/Mar/13 ]

We have identified the problem. The fix is coming shortly, and so is an explanation of what's causing what you are observing.

Comment by Nimi Wariboko Jr. [ 26/Mar/13 ]

Sorry, I thought I had posted it.

http://pastebin.com/YvrG7f6Y

Comment by Gianfranco Palumbo [ 26/Mar/13 ]

Can you please upload the full log of the mongos?
