[SERVER-1880] sharding issue after upgrading to 1.6.3 Created: 01/Oct/10  Updated: 12/Jul/16  Resolved: 11/Nov/10

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 1.6.3
Fix Version/s: 1.7.3

Type: Bug Priority: Major - P3
Reporter: Leo Zhadanovsky Assignee: Alberto Lerner
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu lucid lynx on EC2 - 1 router and config db instance, 10 shards on separate instances


Attachments: Text File mongos.log    
Operating System: Linux
Participants:

 Description   

Hi,

I upgraded mongo to 1.6.3 from 1.6.1 (using the apt repo) and after I restarted the config db, router and all the shards, I now get this in the mongo log:

Fri Oct 1 15:41:21 [Balancer] lastPing is empty! this could be bad:

{ _id: "balancer", process: "domU-12-31-39-18-04-BA:1283870654:1804289383", state: 1, ts: ObjectId('4c8a8e9473876ed6c88ea3a6'), when: new Date(1284148884286), who: "domU-12-31-39-18-04-BA:1283870654:1804289383:Balancer:846930886", why: "doing balance round" }

What does this mean and how do I fix it?

Thanks,

  • Leo


 Comments   
Comment by auto [ 11/Nov/10 ]

Author:

{'login': 'alerner', 'name': 'Alberto Lerner', 'email': 'alerner@10gen.com'}

Message: SERVER-1880 do not clean lockpings, even old ones, if the lock is still taken
/mongodb/mongo/commit/6af410dde7ede94643e9cd291ac43ee873729c69

Comment by Alberto Lerner [ 11/Nov/10 ]

We managed to reproduce the problem. A fix is upcoming

Comment by Alberto Lerner [ 07/Oct/10 ]

Leo, what's the status on this?

Comment by Alberto Lerner [ 01/Oct/10 ]

Can you send me the config server log? I'd like to make sure there's no problem on that side.

If it clears, we may want to bounce mongos again with a higher verbose level.

Eliot and I added more logging to drill down on what's going on; that would be available in tonight's 1.7 nightly branch, if that was a possibility for you.

Comment by Leo Zhadanovsky [ 01/Oct/10 ]

I've run the repair and that query is now working, however, I am still seeing these errors in the mongos.log:

Fri Oct 1 22:14:01 [Balancer] lastPing is empty! this could be bad:

{ _id: "balancer", process: "domU-12-31-39-18-04-BA:1283870654:1804289383", state: 1, ts: ObjectId('4c8a8e9473876ed6c88ea3a6'), when: new Date(1284148884286), who: "domU-12-31-39-18-04-BA:1283870654:1804289383:Balancer:846930886", why: "doing balance round" }
Comment by Alberto Lerner [ 01/Oct/10 ]

Can you track down the shard and run a repair?

Comment by Leo Zhadanovsky [ 01/Oct/10 ]

Also, I've found this in the log file for one of the shards:

Fri Oct 1 15:19:41 [conn4] assertion 13440 bad offset:0 accessing file: /mongodb/db/dnc_voters.0 - consider repairing database ns:dnc_voters.people query:

{ emails.email: "HORACE.L.WINDLEY@KUWAIT.SWA.ARMY.MIL" }

Should I run a repair on it?

Thanks,

  • Leo
Comment by Leo Zhadanovsky [ 01/Oct/10 ]

This happens when I run the following query:

> db.people.find(

{"emails.email": "HORACE.L.WINDLEY@KUWAIT.SWA.ARMY.MIL" }

)
error: {
"$err" : "bad offset:0 accessing file: /mongodb/db/dnc_voters.0 - consider repairing database",
"code" : 13440
}

Comment by Alberto Lerner [ 01/Oct/10 ]

Leo, the log that you attached has this:

Fri Oct 1 15:18:08 [conn1] DBException in process: bad offset:0 accessing file: /mongodb/db/dnc_voters.0 - consider repairing database

Can you check the shards' logs to so what could have happened around that time. Did you shutdown the shards before upgrading?

Also, a side note, if this is a production environment, please use three config servers.

Comment by Leo Zhadanovsky [ 01/Oct/10 ]

I bounced it and it appears to be doing the same thing:

Fri Oct 1 19:02:21 /usr/bin/mongos db version v1.6.3, pdfile version 4.5 starting (--help for usage)
Fri Oct 1 19:02:21 git version: 278bd2ac2f2efbee556f32c13c1b6803224d1c01
Fri Oct 1 19:02:21 sys info: Linux domU-12-31-39-06-79-A1 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_41
Fri Oct 1 19:02:21 [websvr] web admin interface listening on port 28017
Fri Oct 1 19:02:21 waiting for connections on port 27017
Fri Oct 1 19:02:21 [Balancer] lastPing is empty! this could be bad:

{ _id: "balancer", process: "domU-12-31-39-18-04-BA:1283870654:1804289383", state: 1, ts: ObjectId('4c8a8e9473876ed6c88ea3a6'), when: new Date(1284148884286), who: "domU-12-31-39-18-04-BA:1283870654:1804289383:Balancer:846930886", why: "doing balance round" }
Comment by Alberto Lerner [ 01/Oct/10 ]

Thanks. One last thing. Can you try to bounce the mongos?

Comment by Leo Zhadanovsky [ 01/Oct/10 ]

I have attached the file.

  • Leo
Comment by Alberto Lerner [ 01/Oct/10 ]

Could you send the mongos log?

Comment by Leo Zhadanovsky [ 01/Oct/10 ]

I am still getting the message:

Here is the command output:

> db.lockpings.find()

{ "_id" : "domU-12-31-39-18-1A-35:1283874817:1377189145", "ping" : "Fri Oct 01 2010 15:08:25 GMT+0000 (UTC)" } { "_id" : "domU-12-31-39-18-12-14:1283874202:1809101922", "ping" : "Fri Oct 01 2010 15:12:57 GMT+0000 (UTC)" } { "_id" : "domU-12-31-39-18-22-1B:1283879479:1230130915", "ping" : "Fri Oct 01 2010 15:07:07 GMT+0000 (UTC)" } { "_id" : "domU-12-31-39-18-33-AA:1283874601:1013128169", "ping" : "Fri Oct 01 2010 15:10:04 GMT+0000 (UTC)" } { "_id" : "domU-12-31-39-18-13-14:1283872978:1411183427", "ping" : "Fri Oct 01 2010 15:10:41 GMT+0000 (UTC)" } { "_id" : "domU-12-31-39-18-2E-6C:1283874277:1611283491", "ping" : "Fri Oct 01 2010 15:14:01 GMT+0000 (UTC)" } { "_id" : "domU-12-31-39-08-1C-5E:1283873502:72033941", "ping" : "Fri Oct 01 2010 14:56:21 GMT+0000 (UTC)" } { "_id" : "domU-12-31-39-18-01-1F:1283873734:849797026", "ping" : "Fri Oct 01 2010 14:57:10 GMT+0000 (UTC)" } { "_id" : "domU-12-31-39-18-04-BA:1285944895:1804289383", "ping" : "Fri Oct 01 2010 14:57:55 GMT+0000 (UTC)" } { "_id" : "domU-12-31-39-18-04-BA:1285946086:1804289383", "ping" : "Fri Oct 01 2010 15:17:16 GMT+0000 (UTC)" } { "_id" : "domU-12-31-39-18-04-BA:1285946271:1804289383", "ping" : "Fri Oct 01 2010 18:26:54 GMT+0000 (UTC)" }

> db.locks.find()

{ "_id" : "balancer", "process" : "domU-12-31-39-18-04-BA:1283870654:1804289383", "state" : 1, "ts" : ObjectId("4c8a8e9473876ed6c88ea3a6"), "when" : "Fri Sep 10 2010 20:01:24 GMT+0000 (UTC)", "who" : "domU-12-31-39-18-04-BA:1283870654:1804289383:Balancer:846930886", "why" : "doing balance round" }

{ "_id" : "dnc_voters.people", "process" : "domU-12-31-39-18-33-AA:1283874601:1013128169", "state" : 0, "ts" : ObjectId("4c8a5b647276c2c096171a59"), "when" : "Fri Sep 10 2010 16:23:00 GMT+0000 (UTC)", "who" : "domU-12-31-39-18-33-AA:1283874601:1013128169:conn18:1648965130", "why" : "migrate-

{ lname: \"CLEVELAND\", fname: \"CALVIN\" }

" }

Comment by Alberto Lerner [ 01/Oct/10 ]

Leo,

Did you get this message one time or are you still getting it?

Can you open a shell to the mongos and issue the following and send the result?
> use config
> db.lockpings().find()
> db.locks().find

Alberto.

Generated at Thu Feb 08 02:58:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.