[SERVER-1880] sharding issue after upgrading to 1.6.3 Created: 01/Oct/10 Updated: 12/Jul/16 Resolved: 11/Nov/10 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 1.6.3 |
| Fix Version/s: | 1.7.3 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Leo Zhadanovsky | Assignee: | Alberto Lerner |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Ubuntu Lucid Lynx on EC2 - 1 router and config db instance, 10 shards on separate instances |
| Attachments: |
|
| Operating System: | Linux |
| Participants: |
| Description |
|
Hi, I upgraded mongo to 1.6.3 from 1.6.1 (using the apt repo), and after I restarted the config db, router, and all the shards, I now get this in the mongos log:

Fri Oct 1 15:41:21 [Balancer] lastPing is empty! this could be bad: { _id: "balancer", process: "domU-12-31-39-18-04-BA:1283870654:1804289383", state: 1, ts: ObjectId('4c8a8e9473876ed6c88ea3a6'), when: new Date(1284148884286), who: "domU-12-31-39-18-04-BA:1283870654:1804289383:Balancer:846930886", why: "doing balance round" }

What does this mean and how do I fix it? Thanks,
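For context, the warning refers to the balancer's distributed lock, which is stored in the locks collection of the config database, with heartbeat entries in the lockpings collection. A minimal sketch of inspecting both from a mongo shell (the mongos hostname is a placeholder):

$ mongo router-host:27017/config
> db.locks.find({ _id: "balancer" })
> db.lockpings.find()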
|
| Comments |
| Comment by Alberto Lerner [ 11/Nov/10 ] |
|
We managed to reproduce the problem. A fix is upcoming. |
| Comment by Alberto Lerner [ 07/Oct/10 ] |
|
Leo, what's the status on this? |
| Comment by Alberto Lerner [ 01/Oct/10 ] |
|
Can you send me the config server log? I'd like to make sure there's no problem on that side. If it comes up clean, we may want to bounce mongos again with a higher verbosity level. Eliot and I added more logging to drill down on what's going on; that will be available in tonight's 1.7 nightly branch, if that's a possibility for you. |
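A minimal sketch of bouncing mongos with more verbose logging; hostnames, ports, and paths are placeholders, and -v can be repeated for more detail:

$ mongos --configdb config-host:27019 --logpath /var/log/mongodb/mongos.log -vvv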
| Comment by Leo Zhadanovsky [ 01/Oct/10 ] |
|
I've run the repair and that query is now working; however, I am still seeing these errors in the mongos.log:

Fri Oct 1 22:14:01 [Balancer] lastPing is empty! this could be bad: { _id: "balancer", process: "domU-12-31-39-18-04-BA:1283870654:1804289383", state: 1, ts: ObjectId('4c8a8e9473876ed6c88ea3a6'), when: new Date(1284148884286), who: "domU-12-31-39-18-04-BA:1283870654:1804289383:Balancer:846930886", why: "doing balance round" } |
| Comment by Alberto Lerner [ 01/Oct/10 ] |
|
Can you track down the shard and run a repair? |
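A minimal sketch of what that repair could look like, assuming a mongo shell connected to the affected shard's mongod (database name and dbpath taken from the shard log quoted in the comments below):

> use dnc_voters
> db.repairDatabase()

Or, with that mongod shut down, offline via the command line:

$ mongod --repair --dbpath /mongodb/db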
| Comment by Leo Zhadanovsky [ 01/Oct/10 ] |
|
Also, I've found this in the log file for one of the shards:

Fri Oct 1 15:19:41 [conn4] assertion 13440 bad offset:0 accessing file: /mongodb/db/dnc_voters.0 - consider repairing database ns:dnc_voters.people query: { emails.email: "HORACE.L.WINDLEY@KUWAIT.SWA.ARMY.MIL" }

Should I run a repair on it? Thanks,
|
| Comment by Leo Zhadanovsky [ 01/Oct/10 ] |
|
This happens when I run the following query:

> db.people.find( {"emails.email": "HORACE.L.WINDLEY@KUWAIT.SWA.ARMY.MIL" }) |
| Comment by Alberto Lerner [ 01/Oct/10 ] |
|
Leo, the log that you attached has this:

Fri Oct 1 15:18:08 [conn1] DBException in process: bad offset:0 accessing file: /mongodb/db/dnc_voters.0 - consider repairing database

Can you check the shards' logs to see what could have happened around that time? Did you shut down the shards before upgrading? Also, as a side note, if this is a production environment, please use three config servers. |
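For reference, a minimal sketch of pointing a mongos at three config servers; the hostnames are placeholders, and the three servers are passed comma-separated to --configdb:

$ mongos --configdb cfg1.example.com:27019,cfg2.example.com:27019,cfg3.example.com:27019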
| Comment by Leo Zhadanovsky [ 01/Oct/10 ] |
|
I bounced it and it appears to be doing the same thing:

Fri Oct 1 19:02:21 /usr/bin/mongos db version v1.6.3, pdfile version 4.5 starting (--help for usage) |
| Comment by Alberto Lerner [ 01/Oct/10 ] |
|
Thanks. One last thing. Can you try to bounce the mongos? |
| Comment by Leo Zhadanovsky [ 01/Oct/10 ] |
|
I have attached the file.
|
| Comment by Alberto Lerner [ 01/Oct/10 ] |
|
Could you send the mongos log? |
| Comment by Leo Zhadanovsky [ 01/Oct/10 ] |
|
I am still getting the message. Here is the command output:

> db.lockpings.find()
{ "_id" : "domU-12-31-39-18-1A-35:1283874817:1377189145", "ping" : "Fri Oct 01 2010 15:08:25 GMT+0000 (UTC)" }
{ "_id" : "domU-12-31-39-18-12-14:1283874202:1809101922", "ping" : "Fri Oct 01 2010 15:12:57 GMT+0000 (UTC)" }
{ "_id" : "domU-12-31-39-18-22-1B:1283879479:1230130915", "ping" : "Fri Oct 01 2010 15:07:07 GMT+0000 (UTC)" }
{ "_id" : "domU-12-31-39-18-33-AA:1283874601:1013128169", "ping" : "Fri Oct 01 2010 15:10:04 GMT+0000 (UTC)" }
{ "_id" : "domU-12-31-39-18-13-14:1283872978:1411183427", "ping" : "Fri Oct 01 2010 15:10:41 GMT+0000 (UTC)" }
{ "_id" : "domU-12-31-39-18-2E-6C:1283874277:1611283491", "ping" : "Fri Oct 01 2010 15:14:01 GMT+0000 (UTC)" }
{ "_id" : "domU-12-31-39-08-1C-5E:1283873502:72033941", "ping" : "Fri Oct 01 2010 14:56:21 GMT+0000 (UTC)" }
{ "_id" : "domU-12-31-39-18-01-1F:1283873734:849797026", "ping" : "Fri Oct 01 2010 14:57:10 GMT+0000 (UTC)" }
{ "_id" : "domU-12-31-39-18-04-BA:1285944895:1804289383", "ping" : "Fri Oct 01 2010 14:57:55 GMT+0000 (UTC)" }
{ "_id" : "domU-12-31-39-18-04-BA:1285946086:1804289383", "ping" : "Fri Oct 01 2010 15:17:16 GMT+0000 (UTC)" }
{ "_id" : "domU-12-31-39-18-04-BA:1285946271:1804289383", "ping" : "Fri Oct 01 2010 18:26:54 GMT+0000 (UTC)" }

> db.locks.find()
{ "_id" : "balancer", "process" : "domU-12-31-39-18-04-BA:1283870654:1804289383", "state" : 1, "ts" : ObjectId("4c8a8e9473876ed6c88ea3a6"), "when" : "Fri Sep 10 2010 20:01:24 GMT+0000 (UTC)", "who" : "domU-12-31-39-18-04-BA:1283870654:1804289383:Balancer:846930886", "why" : "doing balance round" }
{ "_id" : "dnc_voters.people", "process" : "domU-12-31-39-18-33-AA:1283874601:1013128169", "state" : 0, "ts" : ObjectId("4c8a5b647276c2c096171a59"), "when" : "Fri Sep 10 2010 16:23:00 GMT+0000 (UTC)", "who" : "domU-12-31-39-18-33-AA:1283874601:1013128169:conn18:1648965130", "why" : "migrate- { lname: \"CLEVELAND\", fname: \"CALVIN\" }" } |
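For reference, a minimal sketch of how output like this is gathered, from a mongo shell opened against the mongos (or a config server):

> use config
> db.lockpings.find()
> db.locks.find()

Note that the balancer lock's process value (domU-12-31-39-18-04-BA:1283870654:1804289383) does not appear as an _id in the lockpings output, which would explain the "lastPing is empty" warning.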
| Comment by Alberto Lerner [ 01/Oct/10 ] |
|
Leo, did you get this message one time, or are you still getting it? Can you open a shell to the mongos, issue the following, and send the result?

Alberto. |