[SERVER-28516] index corruption Created: 27/Mar/17  Updated: 18/Apr/17  Resolved: 29/Mar/17

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.2
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Brian Nelson Assignee: Eric Milkie
Resolution: Duplicate Votes: 0
Labels: uic
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 6.8


Issue Links:
Duplicate
duplicates SERVER-28546 Documents can erroneously be unindexe... Closed
Participants:

 Description   

We're experiencing regular index corruption on a table with a partialIndexFilter, though it is not always that index getting corrupted. The only relevant thing I could find near the time when the corruption is occurring is log rotation via SIGUSR1. Nothing else in the log file seemed suspect.

Unfortunately our data is very sensitive and cannot be shared. I can tell you we're using Ruby/Rails and the mongoid_paranoia gem and set up a unique partialFilterExpression with:

	{
		"v" : 1,
		"unique" : true,
		"key" : {
			"name" : 1
		},
		"name" : "unique_active_names",
		"ns" : "api_production.app_instances",
		"partialFilterExpression" : {
			"deleted_at" : null
		}
	},

The corruption appears to have only been exposed with this added but may have happened before and gone unnoticed. We found it when being unable to find documents with this index using

{deleted_at: null, name: "name"}



 Comments   
Comment by Eric Milkie [ 29/Mar/17 ]

Thanks for your report! Please follow the linked ticket for further updates.

Comment by Eric Milkie [ 29/Mar/17 ]

This bug does not look like it has an easy fix, so I expect it will take more than one week. After we have coded up a fix, we can expedite a minor version release and have that ready in two or three weeks after that. It's our top priority right now.

Comment by Brian Nelson [ 29/Mar/17 ]

Thanks! Looking forward to the fix. Is there an estimation as to how long this typically takes before there is a release with a fix?

Comment by Eric Milkie [ 29/Mar/17 ]

We've discovered a problem in the code when you use WiredTiger and have a unique index with a partialFilterExpression. If you delete a document that is not matched by the partialFilterExpression, all documents with matching keys for that particular index will be removed from the index.

Here is an example using your index spec provided in the Description above:

> use api_production
> db.app_instances.insert({_id: 1, name: 1, deleted_at: 1})
> db.app_instances.insert({_id: 2, name: 1})
// At this point, only document _id:2 is indexed, as per partial index filter.
> db.app_instances.remove({_id: 1})
// _id:2 is erroneously removed from the index.

We are actively working on a solution for this.

Comment by Brian Nelson [ 28/Mar/17 ]

No, it worked just fine with the reIndex. The names don't get updated very frequently and we check them elsewhere in the code, the index is a fallback for certain cases. But there were no failures on the reIndex.

Comment by Eric Milkie [ 28/Mar/17 ]

Do you run a rebuild by executing the "reIndex" command? If so, does it ever detect records that would violate the unique constraint? I would expect the reindex command to fail at that point, if it did.

Comment by Brian Nelson [ 27/Mar/17 ]

Additionally, running a rebuild of the indexes seems to resolve the issue temporarily, though it keeps happening. Also this only seems to affect the primary.

Comment by Brian Nelson [ 27/Mar/17 ]

I'll work on getting all the information to you, but in the meantime:

  • Using standard object IDs
  • Yes, we do many updates frequently on all the non-deleted objects
  • Yes, we have many indexes. Some are old and unused. I will work on getting them to you.

Total and in index counts:

count()
16052
find({deleted_at: null}).count()
8226

initandlisten / signalProcessingThread

2017-03-16T16:29:32.309+0000 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2017-03-16T16:29:32.309+0000 I NETWORK  [signalProcessingThread] shutdown: going to close listening sockets...
2017-03-16T16:29:32.327+0000 I NETWORK  [signalProcessingThread] closing listening socket: 7
2017-03-16T16:29:32.327+0000 I NETWORK  [signalProcessingThread] closing listening socket: 8
2017-03-16T16:29:32.327+0000 I NETWORK  [signalProcessingThread] removing socket file: /tmp/mongodb-27017.sock
2017-03-16T16:29:32.328+0000 I NETWORK  [signalProcessingThread] shutdown: going to flush diaglog...
2017-03-16T16:29:32.328+0000 I REPL     [signalProcessingThread] shutting down replication subsystems
2017-03-16T16:29:32.340+0000 I REPL     [signalProcessingThread] Stopping replication reporter thread
2017-03-16T16:29:32.340+0000 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to 10.26.1.50:27017: CallbackCanceled: Reporter no longer valid
2017-03-16T16:29:32.340+0000 I REPL     [signalProcessingThread] Stopping replication fetcher thread
2017-03-16T16:29:32.341+0000 I REPL     [signalProcessingThread] Stopping replication applier thread
2017-03-16T16:29:32.346+0000 I REPL     [signalProcessingThread] Stopping replication storage threads
2017-03-16T16:29:32.346+0000 W EXECUTOR [signalProcessingThread] killCursors command task failed: CallbackCanceled: Callback canceled
2017-03-16T16:29:32.374+0000 I -        [conn65119] end connection 10.26.1.50:60380 (16 connections now open)
2017-03-16T16:29:32.385+0000 I FTDC     [signalProcessingThread] Shutting down full-time diagnostic data capture
2017-03-16T16:29:32.420+0000 I STORAGE  [signalProcessingThread] WiredTigerKVEngine shutting down
2017-03-16T16:29:32.544+0000 I -        [conn65111] end connection 10.26.2.50:58316 (15 connections now open)
2017-03-16T16:29:32.544+0000 I -        [conn65117] end connection 10.26.2.50:58338 (14 connections now open)
2017-03-16T16:29:33.098+0000 I STORAGE  [signalProcessingThread] shutdown: removing fs lock...
2017-03-16T16:29:33.100+0000 I CONTROL  [signalProcessingThread] now exiting
2017-03-16T16:29:33.100+0000 I CONTROL  [signalProcessingThread] shutting down with code:0
2017-03-16T16:29:33.100+0000 I CONTROL  [initandlisten] shutting down with code:0
2017-03-16T16:30:33.787+0000 I CONTROL  [main] ***** SERVER RESTARTED *****
2017-03-16T16:30:33.829+0000 I CONTROL  [initandlisten] MongoDB starting : pid=28158 port=27017 dbpath=/mnt/mongo/data 64-bit host=REDACTED
2017-03-16T16:30:33.829+0000 I CONTROL  [initandlisten] db version v3.4.2
2017-03-16T16:30:33.830+0000 I CONTROL  [initandlisten] git version: 3f76e40c105fc223b3e5aac3e20dcd026b83b38b
2017-03-16T16:30:33.830+0000 I CONTROL  [initandlisten] OpenSSL version: OpenSSL 1.0.1e-fips 11 Feb 2013
2017-03-16T16:30:33.830+0000 I CONTROL  [initandlisten] allocator: tcmalloc
2017-03-16T16:30:33.830+0000 I CONTROL  [initandlisten] modules: none
2017-03-16T16:30:33.830+0000 I CONTROL  [initandlisten] build environment:
2017-03-16T16:30:33.830+0000 I CONTROL  [initandlisten]     distmod: rhel62
2017-03-16T16:30:33.830+0000 I CONTROL  [initandlisten]     distarch: x86_64
2017-03-16T16:30:33.830+0000 I CONTROL  [initandlisten]     target_arch: x86_64
2017-03-16T16:30:33.830+0000 I CONTROL  [initandlisten] options: { config: "/etc/mongod.conf", net: { port: 27017 }, processManagement: { fork: true, pidFilePath: "/var/run/mongodb/mongod.pid" }, replication: { replSetName: "REDACTED" }, security: { keyFile: "/etc/mongo/REDACTED.key" }, storage: { dbPath: "/mnt/mongo/data", journal: { enabled: true }, wiredTiger: { engineConfig: { cacheSizeGB: 4.0 } } }, systemLog: { destination: "file", logAppend: true, path: "/var/log/mongodb/mongod.log" } }
2017-03-16T16:30:33.852+0000 I STORAGE  [initandlisten] 
2017-03-16T16:30:33.852+0000 I STORAGE  [initandlisten] ** WARNING: Using the XFS filesystem is strongly recommended with the WiredTiger storage engine
2017-03-16T16:30:33.852+0000 I STORAGE  [initandlisten] **          See http://dochub.mongodb.org/core/prodnotes-filesystem
2017-03-16T16:30:33.852+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=4096M,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2017-03-16T16:30:33.975+0000 I FTDC     [initandlisten] Initializing full-time diagnostic data capture with directory '/mnt/mongo/data/diagnostic.data'
2017-03-16T16:30:33.988+0000 I REPL     [initandlisten] Did not find local voted for document at startup.
2017-03-16T16:30:33.988+0000 I REPL     [initandlisten] Did not find local replica set configuration document at startup;  NoMatchingDocument: Did not find replica set configuration document in local.system.replset
2017-03-16T16:30:33.989+0000 I NETWORK  [thread1] waiting for connections on port 27017

Comment by Eric Milkie [ 27/Mar/17 ]

Hi Brian,
I'm investigating your issue. Some initial questions:

  • Are you using custom _id values in your inserted documents, or are you using the default ObjectId's? If possible, can I see an example document inserted into this collection? You can munge the data as necessary; I'm more interested in the presence of fields and their types (and how those fields match the fields mentioned in the index spec).
  • For the documents that should be indexed but are not, do you know if they have been manipulated by updates? Or, are the documents only initially inserted and then not touched after that?
  • Do you have other indexes on this collection other than the one you provided and the _id index? If so, can you share their specs (munged to hide sensitive data, if necessary)?
  • I would be interested to know how many documents you have in the collection, how many are present in the index, and how many are expected to be present in the index.
  • Could you provide the initial log messages from the startup of the server? That is, all messages until the first "waiting for connections on port" message.
    Thanks,
    Eric
Comment by Brian Nelson [ 27/Mar/17 ]

Sorry meant "index corruption" and "collection" not "table". Cannot seem to edit this issue.

Generated at Thu Feb 08 04:18:21 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.