[SERVER-3330] Sporadically missing data from a query Created: 24/Jun/11  Updated: 29/Aug/11  Resolved: 27/Jun/11

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 1.8.1
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Michael D. Norman Assignee: Aaron Staple
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Driver: Java 2.6.3

CentOS release 5.6 (Final), running kernel 2.6.22.19-vs2.3.0.34.1 x86_64. The server is a VM running on the same physical hardware (though a different VM) as the web application querying the DB. The file system where the data is stored is ufs, with plenty of free space. Total system memory is 32GB, mostly used by mongo, with 1.5GB free.


Operating System: ALL
Participants:

 Description   

The behavior we are seeing is that when we run a query, most of the time it returns a result set that makes sense, but roughly once in every 1000 queries it is missing a single record. I've tried sorting the results to see if it's always the same record, using a snapshot query, setting the Java client batch size to Integer.MAX_VALUE, and using the toArray() method on the Java client DBCursor object to get it to pull the data quickly. I've also removed any indexes the queries might be using, so it doesn't appear to be an index issue. We are only querying the master now; we saw this when querying slaves as well, but the missing data there was very old, so it wouldn't have been a replication issue.
I am only able to reproduce this issue on our production servers (running Linux); even after copying all of the data files locally, I have been unable to reproduce it on my local machine (using replicas on Mac and Windows). All versions of the DB are 1.8.1.
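
For context, here is a minimal sketch of the kind of Java driver (2.x API) call involved. The query shape mirrors the profiler output below; the host, port, and class name are placeholders, not our actual code:

import java.util.Arrays;
import java.util.Date;
import java.util.List;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.DBObject;
import com.mongodb.Mongo;

public class PackageQuery {
    public static void main(String[] args) throws Exception {
        Mongo mongo = new Mongo("localhost", 27017); // placeholder host/port
        DB db = mongo.getDB("pacman");
        DBCollection packages = db.getCollection("packages");

        // Query shape reconstructed from the profiler output below.
        DBObject query = new BasicDBObject("cluster", "c1")
                .append("installableToEnvironment",
                        new BasicDBObject("$in", Arrays.asList("UAT", "STAGING", "PRODUCTION")))
                .append("deployments", new BasicDBObject("$elemMatch",
                        new BasicDBObject("environmentType", "STAGING")
                                .append("$or", Arrays.asList(
                                        new BasicDBObject("uninstallDate", null),
                                        new BasicDBObject("uninstallDate",
                                                new BasicDBObject("$gte", new Date()))))));

        // Things tried: sorting, a huge batch size, and toArray() to drain the
        // cursor quickly. (Snapshot mode was tried separately, since $snapshot
        // cannot be combined with a sort.)
        DBCursor cursor = packages.find(query)
                .sort(new BasicDBObject("_id", 1))
                .batchSize(Integer.MAX_VALUE);
        List<DBObject> results = cursor.toArray();
        System.out.println("nreturned: " + results.size());
        mongo.close();
    }
}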

I can reproduce the issue in the shell as well, though not as easily.

I turned on profiling logs and here is the output for the 3 queries (full with 523 results, missing with 522 results, full again with 523 results):

{ "ts" : ISODate("2011-06-24T02:31:23.438Z"), "info" : "query
pacman.packages ntoreturn:2147483647 reslen:619920 nscanned:7110
\nquery: { query: { cluster: \"c1\", installableToEnvironment:

{ $in: [ \"UAT\", \"STAGING\", \"PRODUCTION\" ] }

, deployments: { $elemMatch:
{ environmentType: \"STAGING\", $or: [

{ uninstallDate: null }

,
{ uninstallDate:

{ $gte: new Date(1308277883424) }

} ] } } }, orderby:

{ _id: 1 }

} nreturned:523 bytes:619904", "millis" : 17 }

{ "ts" : ISODate("2011-06-24T02:31:23.924Z"), "info" : "query
pacman.packages ntoreturn:2147483647 reslen:618733 nscanned:7110
\nquery: { query: { cluster: \"c1\", installableToEnvironment:

{ $in: [ \"UAT\", \"STAGING\", \"PRODUCTION\" ] }

, deployments: { $elemMatch:
{ environmentType: \"STAGING\", $or: [

{ uninstallDate: null }

,
{ uninstallDate:

{ $gte: new Date(1308277883909) }

} ] } } }, orderby:

{ _id: 1 }

} nreturned:522 bytes:618717", "millis" : 19 }

{ "ts" : ISODate("2011-06-24T02:31:24.354Z"), "info" : "query
pacman.packages ntoreturn:2147483647 reslen:619920 nscanned:7110
\nquery: { query: { cluster: \"c1\", installableToEnvironment:

{ $in: [ \"UAT\", \"STAGING\", \"PRODUCTION\" ] }

, deployments: { $elemMatch:
{ environmentType: \"STAGING\", $or: [

{ uninstallDate: null }

,
{ uninstallDate:

{ $gte: new Date(1308277884338) }

} ] } } }, orderby:

{ _id: 1 }

} nreturned:523 bytes:619904", "millis" : 19 }



 Comments   
Comment by Michael D. Norman [ 27/Jun/11 ]

I was finally able to reproduce the issue locally, and I have found the root cause: we were updating an item in an array by removing it and inserting a new one in its place. Because that cannot be done atomically, queries were arriving in the middle of the operation and seeing neither the old item nor the new one. In all of my repro scenarios, I never exercised the code that did these updates, because I believed those updates were rarer than they actually are.
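
To make the failure mode concrete, here is a sketch of the two-step update pattern and an atomic alternative, using the 2.x Java driver and the field names from the profiled query. The method names, the packageId value, and the replacement document are illustrative, not our actual code:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBObject;

public class DeploymentUpdates {

    // Non-atomic pattern (the root cause): two separate updates. A query that
    // runs between them sees the document with the old deployment removed but
    // the new one not yet inserted, so the $elemMatch clause no longer matches
    // and the record drops out of the result set.
    static void replaceDeploymentNonAtomic(DBCollection packages, Object packageId,
                                           DBObject newDeployment) {
        packages.update(new BasicDBObject("_id", packageId),
                new BasicDBObject("$pull", new BasicDBObject("deployments",
                        new BasicDBObject("environmentType", "STAGING"))));
        packages.update(new BasicDBObject("_id", packageId),
                new BasicDBObject("$push", new BasicDBObject("deployments", newDeployment)));
    }

    // Atomic alternative: replace the matched array element in place with a
    // single update, using the positional $ operator. Each individual update
    // is atomic, so a concurrent query sees either the old element or the new
    // one, never neither.
    static void replaceDeploymentAtomic(DBCollection packages, Object packageId,
                                        DBObject newDeployment) {
        packages.update(
                new BasicDBObject("_id", packageId)
                        .append("deployments.environmentType", "STAGING"),
                new BasicDBObject("$set", new BasicDBObject("deployments.$", newDeployment)));
    }
}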

Comment by Michael D. Norman [ 26/Jun/11 ]

I have been doing more testing, and I'm unable to reproduce this on the same server with a fresh mongo instance. Therefore, I don't think the hardware is necessarily the culprit; I suspect I'm simply not generating the same mix of writes and reads. It appears that the data changes happening in production are causing the issue, but the root cause is still unknown, especially since a snapshot query is supposed to prevent this type of problem.

Comment by Michael D. Norman [ 24/Jun/11 ]

I started this as a thread at http://groups.google.com/group/mongodb-user/browse_thread/thread/bd2d02f9f2129365#
