[SERVER-15798] Helpers::removeRange does not check if node is primary Created: 24/Oct/14  Updated: 24/Dec/14  Resolved: 22/Dec/14

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: 2.6.4, 2.8.0-rc0
Fix Version/s: 2.8.0-rc4

Type: Bug Priority: Critical - P2
Reporter: Alex Piggott Assignee: Randolph Tan
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-16115 Helpers::removeRange should check if ... Closed
Related
is related to SERVER-14551 Runner yield during migration cleanup... Closed
Tested
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

4 servers making up 2 replica sets, running Amazon Linux (~CentOS 6), all on 2.6.4:

[root@ip-10-60-18-179 mongo]# rpm -qa | grep mongo
mongodb-org-mongos-2.6.4-1.x86_64
mongodb-org-2.6.4-1.x86_64
mongodb-org-shell-2.6.4-1.x86_64
mongodb-org-tools-2.6.4-1.x86_64
mongodb-org-server-2.6.4-1.x86_64

Participants:

 Description   

Original title:

rs.stepDown during migration causes fassert in logOp

Original description:

At 3am, cron issues an rs.stepDown() as part of weekly compaction. Having read SERVER-14186, I assume the migrate chunk errors are not relevant, but I've left them in the logs just in case.
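For context, the weekly job is roughly equivalent to a shell sequence like the one below. This is only an approximate sketch: the actual cron script is not included in this ticket, the 60-second window is taken from the "stepping down as primary secs=60" log line, and the compacted database/collection (doc_metadata.metadata) is an assumption based on the namespace in the logs.

// Approximate sketch of the weekly maintenance step (assumptions noted above).
try {
    rs.stepDown(60);   // matches "replSet info stepping down as primary secs=60" in the log
} catch (e) {
    // the shell's connection is dropped when the node relinquishes primary,
    // so stepDown typically surfaces as a network error here
    print("stepDown: " + e);
}
// after reconnecting to the now-secondary node:
db.getSiblingDB("doc_metadata").runCommand({ compact: "metadata" });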

2014-10-04T03:04:57.225+0000 [migrateThread] doc_metadata.metadata  caught assertion addKeysToIndex doc_metadata.metadata.$url_1_sourceKey_1_id: ObjectId('519239f8e4b0be945d6b9fe6')
2014-10-04T03:04:57.225+0000 [migrateThread] scoped connection to replica_set1/10.40.131.234:27018,10.40.213.215:27018 not being returned to the pool
2014-10-04T03:04:57.225+0000 [migrateThread] about to log metadata event: { _id: "ip-10-60-18-179-2014-10-04T03:04:57-542f63d9417de929885b7e2e", server: "ip-10-60-18-179", clientAddr: ":27017", time: new Date(1412391897225), what: "moveChunk.to", ns: "doc_metadata.metadata", details: { min: { _id: ObjectId('5191d4f3e4b07927e24302bc') }, max: { _id: ObjectId('51926d79e4b056bb2e3823fc') }, step 1 of 5: 2, step 2 of 5: 5296, note: "aborted" } }
2014-10-04T03:04:57.258+0000 [migrateThread] ERROR: migrate failed: insertDocument :: caused by :: 17280 Btree::insert: key too large to index, failing doc_metadata.metadata.$url_1_sourceKey_1 1163 { : "http://www.shippingonline.cn/Yellowpages/world_list.asp?page=164&user_addr='''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''...", : "www.google.com.search.64.84." }
2014-10-04T03:05:01.748+0000 [conn528628] end connection 10.226.114.59:33504 (272 connections now open)
2014-10-04T03:05:01.748+0000 [initandlisten] connection accepted from 10.226.114.59:33525 #528632 (273 connections now open)
2014-10-04T03:05:01.887+0000 [conn478926] remotely refreshing metadata for doc_metadata.metadata based on current shard version 660|1||53da748318304c3808174ba3, current metadata version is 660|7||53da748318304c3808174ba3
2014-10-04T03:05:01.889+0000 [conn478926] metadata of collection doc_metadata.metadata already up to date (shard version : 660|1||53da748318304c3808174ba3, took 1ms)
2014-10-04T03:05:01.903+0000 [migrateThread] starting receiving-end of migration of chunk { _id: ObjectId('5191d4f3e4b07927e24302bc') } -> { _id: ObjectId('51926d79e4b056bb2e3823fc') } for collection doc_metadata.metadata from replica_set1/10.40.131.234:27018,10.40.213.215:27018 at epoch 53da748318304c3808174ba3
2014-10-04T03:05:01.985+0000 [conn528629] end connection 10.245.78.137:48391 (272 connections now open)
2014-10-04T03:05:01.987+0000 [initandlisten] connection accepted from 10.245.78.137:48402 #528633 (273 connections now open)
2014-10-04T03:05:05.819+0000 [initandlisten] connection accepted from 127.0.0.1:37058 #528634 (274 connections now open)
2014-10-04T03:05:05.824+0000 [conn528634] end connection 127.0.0.1:37058 (273 connections now open)
2014-10-04T03:05:05.900+0000 [initandlisten] connection accepted from 127.0.0.1:37059 #528635 (274 connections now open)
2014-10-04T03:05:05.929+0000 [conn528635] 10.226.114.59:27019 is now closest at 542f63e1:37
2014-10-04T03:05:05.929+0000 [conn528635] replSet info stepping down as primary secs=60
2014-10-04T03:05:05.939+0000 [conn528635] replSet relinquishing primary state
2014-10-04T03:05:05.939+0000 [conn528635] replSet SECONDARY
 
 
(lots of lines like the following)
2014-10-04T03:05:05.940+0000 [conn478632] end connection 10.40.213.215:59486 (273 connections now open)
 
 
2014-10-04T03:05:05.951+0000 [migrateThread] replSet error : logOp() but not primary
2014-10-04T03:05:05.951+0000 [migrateThread] doc_metadata.metadata Fatal Assertion 17405
2014-10-04T03:05:06.000+0000 [initandlisten] connection accepted from 10.40.131.234:33182 #528636 (17 connections now open)
2014-10-04T03:05:06.008+0000 [initandlisten] connection accepted from 127.0.0.1:37060 #528637 (18 connections now open)
2014-10-04T03:05:06.067+0000 [conn528635] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [127.0.0.1:37059]
2014-10-04T03:05:06.081+0000 [initandlisten] connection accepted from 10.226.114.59:33553 #528638 (18 connections now open)
2014-10-04T03:05:06.126+0000 [migrateThread] doc_metadata.metadata 0x11e6111 0x1187e49 0x116a97d 0xe577aa 0xe5000e 0xa4d139 0xfd5594 0xfda178 0xfc5839 0x122aaa9 0x7f72b0715851 0x7f72afab811d
 mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11e6111]
 mongod(_ZN5mongo10logContextEPKc+0x159) [0x1187e49]
 mongod(_ZN5mongo13fassertFailedEi+0xcd) [0x116a97d]
 mongod() [0xe577aa]
 mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_PbbPS3_+0xee) [0xe5000e]
 mongod(_ZN5mongo7Helpers11removeRangeERKNS_8KeyRangeEbbPNS0_11RemoveSaverEbb+0xb09) [0xa4d139]
 mongod(_ZN5mongo13MigrateStatus3_goEv+0x1644) [0xfd5594]
 mongod(_ZN5mongo13MigrateStatus2goEv+0x28) [0xfda178]
 mongod(_ZN5mongo13migrateThreadEv+0x59) [0xfc5839]
 mongod() [0x122aaa9]
 /lib64/libpthread.so.0(+0x7851) [0x7f72b0715851]
 /lib64/libc.so.6(clone+0x6d) [0x7f72afab811d]
2014-10-04T03:05:06.126+0000 [migrateThread]
 
***aborting after fassert() failure
 
 
2014-10-04T03:05:06.133+0000 [migrateThread] SEVERE: Got signal: 6 (Aborted).
Backtrace:0x11e6111 0x11e54ee 0x7f72afa02920 0x7f72afa028a5 0x7f72afa04085 0x116a9ea 0xe577aa 0xe5000e 0xa4d139 0xfd5594 0xfda178 0xfc5839 0x122aaa9 0x7f72b0715851 0x7f72afab811d
 mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11e6111]
 mongod() [0x11e54ee]
 /lib64/libc.so.6(+0x32920) [0x7f72afa02920]
 /lib64/libc.so.6(gsignal+0x35) [0x7f72afa028a5]
 /lib64/libc.so.6(abort+0x175) [0x7f72afa04085]
 mongod(_ZN5mongo13fassertFailedEi+0x13a) [0x116a9ea]
 mongod() [0xe577aa]
 mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_PbbPS3_+0xee) [0xe5000e]
 mongod(_ZN5mongo7Helpers11removeRangeERKNS_8KeyRangeEbbPNS0_11RemoveSaverEbb+0xb09) [0xa4d139]
 mongod(_ZN5mongo13MigrateStatus3_goEv+0x1644) [0xfd5594]
 mongod(_ZN5mongo13MigrateStatus2goEv+0x28) [0xfda178]
 mongod(_ZN5mongo13migrateThreadEv+0x59) [0xfc5839]
 mongod() [0x122aaa9]
 /lib64/libpthread.so.0(+0x7851) [0x7f72b0715851]
 /lib64/libc.so.6(clone+0x6d) [0x7f72afab811d]

Other issue: the primary node was running at 200% CPU, repeating the same log messages over and over again, ~10x per second:

2014-10-24T13:02:10.289+0000 [rsHealthPoll] warning: Failed to connect to 10.60.18.179:27019, reason: errno:111 Connection refused
2014-10-24T13:02:10.289+0000 [rsHealthPoll] couldn't connect to 10.60.18.179:27019: couldn't connect to server 10.60.18.179:27019 (10.60.18.179) failed, connection attempt failed



 Comments   
Comment by Githook User [ 22/Dec/14 ]

Author:

Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-15798 Helpers::removeRange does not check if node is primary
Branch: master
https://github.com/mongodb/mongo/commit/bf0cdcee577aa5a05ba65506d8bd47f33eac30d2

Comment by Randolph Tan [ 12/Nov/14 ]

Note: other write ops like insert, delete, and update perform this check after grabbing the exclusive lock.

Comment by Alex Piggott [ 12/Nov/14 ]

Thanks! I might leave that one running overnight....

Comment by Ramon Fernandez Marina [ 12/Nov/14 ]

Hi apiggott@ikanow.com, sorry to hear you experienced another crash. You can find the documents with large values using a query like the following:

db.foo.find({$where:"this.url.length > 1023"})

Replace the value of url above with the appropriate key path in your documents.
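For example, a slightly expanded version of that query (a sketch only; it assumes the offending field is url on doc_metadata.metadata, per the index name in the logs) will print the _id of each offending document so they can be fixed or removed individually. Note that $where cannot use indexes, so this is a full collection scan:

// Sketch: list documents whose indexed "url" value exceeds the 1024-byte key limit.
var coll = db.getSiblingDB("doc_metadata").metadata;
coll.find({ $where: "this.url && this.url.length > 1023" }).forEach(function (doc) {
    print(doc._id + "  url length: " + doc.url.length);
});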

I'm trying to reproduce the issue on this end, and a developer is looking at the code as well.

Comment by Alex Piggott [ 12/Nov/14 ]

Incidentally, is there a slick way I can identify which records in my DB have this too-long problem (apart from looking in the logs each time it crashes - it crashed again today!) and delete the specific one it cared about (forcing it to do a table-scan lookup against the URL somehow)?

Comment by Alex Piggott [ 28/Oct/14 ]

(Incidentally isn't failIndexKeyTooLong the opposite of what I want ... since it will allow documents with long index keys to be inserted moving forward. My problem is that I have existing docs with long index keys because 2.4- allowed the insertion of such fields. At least now no new problematic documents are getting inserted, I just need to fix the ones that are already in the DB)

[EDIT: oh or do you mean that the chunk migration won't fail any more if failIndexKeyTooLong is false? That would make sense! Though I think discarding any future "corrupt" docs is the better default]

Comment by Alex Piggott [ 28/Oct/14 ]

Thanks for looking into it. To clarify: I don't care about the first error at all; I just included it in case it was the root cause of the fassert. Preventing the insertion of long fields is certainly my problem to fix!

However, even if I am trying to insert keys that are too long, that obviously shouldn't crash the DB, so the purpose of this issue was to let you know that it was happening (or something along those lines - my guess is that any migration can cause the problem; it was just more likely to occur here given the repeated attempts due to the long-index failure).

Unfortunately the database on which this occurred is operational, so I can't mess about trying to make it fail. Presumably the steps would be (sketched below):
1) create a DB with multiple shards
2) insert a record with a long index
3) migrate the chunk containing the long index
4) confirm that it was repeatedly trying and failing to move the chunk in the logs
5) try the rs.stepDown()
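Roughly, in shell terms (completely untested; the database, collection, and destination shard names, the shard key, and the oversized field are all placeholders):

// 1) shard a test collection (run against a mongos)
sh.enableSharding("repro");
sh.shardCollection("repro.coll", { _id: 1 });
db.getSiblingDB("repro").coll.ensureIndex({ url: 1 });

// 2) insert a doc whose indexed field exceeds the 1024-byte key limit.
//    The donor shard's primary needs failIndexKeyTooLong=false for the insert to
//    be accepted (mimicking docs that 2.4 allowed in):
//    db.getSiblingDB("admin").runCommand({ setParameter: 1, failIndexKeyTooLong: false })
db.getSiblingDB("repro").coll.insert({ _id: 1, url: new Array(2000).join("x") });

// 3) move the chunk containing that doc to the other shard; the recipient
//    (with failIndexKeyTooLong left at its default of true) should fail the migration
db.getSiblingDB("admin").runCommand({ moveChunk: "repro.coll", find: { _id: 1 }, to: "replica_set2" });

// 4) watch the recipient primary's log for "migrate failed: ... key too large to index"

// 5) while the recipient keeps retrying, connect directly to the recipient
//    shard's primary and step it down
rs.stepDown(60);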

Though presumably that's what you've been trying without luck? Since this is the first time the DB has crashed in about 2 years, and each of 2 shards has been stepping down every week (and I think I've seen "index too long" errors since forever), I'm guessing it might be quite a low probability event...

Did you ping the dev who fixed SERVER-14186 to see if he just missed a case in his patch?

Comment by Ramon Fernandez Marina [ 28/Oct/14 ]

Hi apiggott@ikanow.com,

the first error I see in the log snippet you sent is related to a migration failure:

2014-10-04T03:04:57.258+0000 [migrateThread] ERROR: migrate failed: insertDocument :: caused by :: 17280 Btree::insert: key too large to index, failing doc_metadata.metadata.$url_1_sourceKey_1 1163 { : "http://www.shippingonline.cn/Yellowpages/world_list.asp?page=164&user_addr='''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''...", : "www.google.com.search.64.84." }

Please see the list of compatibility changes in MongoDB 2.6 for more details about this issue. In your particular case:

Chunk Migration will fail:

  • Migrations will fail for a chunk that has a document with an indexed field whose index entry exceeds the limit.
  • If left unfixed, the chunk will repeatedly fail migration, effectively ceasing chunk balancing for that collection. Or, if chunk splits occur in response to the migration failures, this would lead to an unnecessarily large number of chunks and an overly large config database.

You may need to set failIndexKeyTooLong to false and/or fix the offending documents first:

db.getSiblingDB('admin').runCommand( { setParameter: 1, failIndexKeyTooLong: false } )
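For example (a sketch only), you can check the current value with getParameter, and restore the default once the offending documents have been fixed:

// check the current value
db.getSiblingDB('admin').runCommand( { getParameter: 1, failIndexKeyTooLong: 1 } )

// once the offending documents are fixed, restore the default so that new
// documents with over-long index keys are rejected again
db.getSiblingDB('admin').runCommand( { setParameter: 1, failIndexKeyTooLong: true } )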

Then there's the fatal assertion message:

2014-10-04T03:05:05.951+0000 [migrateThread] replSet error : logOp() but not primary
2014-10-04T03:05:05.951+0000 [migrateThread] doc_metadata.metadata Fatal Assertion 17405

I'm trying to reproduce this error, unsuccessfully so far, so if you have a reproducing script it would be of great help. In the meantime, can you please review the issue described above with long keys and let us know if rs.stepDown() still triggers the assertion?

Comment by Alex Piggott [ 24/Oct/14 ]

(When I say "exactly the same issue", I mean "the same problem, except during migration rather than map/reduce"! Migration wasn't mentioned in SERVER-14186, so you probably just missed a few places in the original fix?)

Generated at Thu Feb 08 03:39:01 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.