[SERVER-11407] aggregation results for same dataset/query differ on mongodb 2.4.4 vs 2.5.3 Created: 28/Oct/13  Updated: 10/Dec/14  Resolved: 12/Feb/14

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 2.4.4, 2.5.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ravi Shakya Assignee: Mathias Stearn
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 12.04 (64 bit)


Attachments: Text File aggregation_query.txt     PNG File results_244.png     PNG File results_253.png     File test9371.json    
Operating System: Linux
Participants:

 Description   

With reference to the attached dataset test9371.json imported into a collection 'zips' (on mongodb 2.4.4 and mongodb 2.5.3),
when we run the following query :

db.zips.aggregate( { $group:
                         { _id: { state: "$state", city: "$city" },
                           zipcount : {$sum : 1},
                           pop: { $sum: "$pop" } } },
                       { $sort: { zipcount: 1 } },
                       { $group:
                         { _id : "$_id.state",
                           biggestCity:  { $last: "$_id.city" },
                           population:   { $last: "$pop" },
                        zipcount : { $last : "$zipcount"}
                            } } ,{ $sort : {population : -1}})

different results are returned on mongodb 2.4.4 (pls refer to attached results_244.png) and mongodb 2.5.3.(pls refer results_253.png).
Could you please investigate whats causing this discrepancy? Thank you



 Comments   
Comment by Mathias Stearn [ 29/Oct/13 ]

The issue is that the first $sort is on zipcount and many cities have the same zipcount, which means that their order is unspecified. This causes a problem since the next operation depends on the order (due to the use of $last).

As a demonstration, here is your data through the first group and sorted on (state, zipcount) under 2.4 on my machine:

> db.test9371.aggregate( { $group: { _id: { state: "$state", city: "$city" }, zipcount : {$sum : 1}, pop: { $sum: "$pop" } } }, { $sort: { '_id.state':1, zipcount: 1 } }).result.forEach(printjsononeline)
{  "_id" : {  "state" : "AL",  "city" : "ADGER" },  "zipcount" : 1,  "pop" : 3205 }
{  "_id" : {  "state" : "AL",  "city" : "ACMAR" },  "zipcount" : 1,  "pop" : 6055 }
{  "_id" : {  "state" : "AL",  "city" : "ADAMSVILLE" },  "zipcount" : 1,  "pop" : 10616 }
{  "_id" : {  "state" : "CA",  "city" : "LOS ANGELES" },  "zipcount" : 3,  "pop" : 146408 }
{  "_id" : {  "state" : "FL",  "city" : "BRANFORD" },  "zipcount" : 1,  "pop" : 2439 }
{  "_id" : {  "state" : "FL",  "city" : "BRYCEVILLE" },  "zipcount" : 1,  "pop" : 1875 }
{  "_id" : {  "state" : "FL",  "city" : "CALLAHAN" },  "zipcount" : 1,  "pop" : 9111 }
{  "_id" : {  "state" : "IL",  "city" : "ANTIOCH" },  "zipcount" : 1,  "pop" : 18058 }
{  "_id" : {  "state" : "IL",  "city" : "ARLINGTON HEIGHT" },  "zipcount" : 2,  "pop" : 79689 }
{  "_id" : {  "state" : "ND",  "city" : "AMENIA" },  "zipcount" : 1,  "pop" : 321 }
{  "_id" : {  "state" : "ND",  "city" : "ABSARAKA" },  "zipcount" : 1,  "pop" : 124 }
{  "_id" : {  "state" : "ND",  "city" : "ALICE" },  "zipcount" : 1,  "pop" : 255 }
{  "_id" : {  "state" : "TX",  "city" : "ALLEN" },  "zipcount" : 1,  "pop" : 24151 }
{  "_id" : {  "state" : "TX",  "city" : "CARROLLTON" },  "zipcount" : 2,  "pop" : 92495 }
{  "_id" : {  "state" : "UT",  "city" : "AMERICAN FORK" },  "zipcount" : 1,  "pop" : 21864 }
{  "_id" : {  "state" : "UT",  "city" : "ALTAMONT" },  "zipcount" : 1,  "pop" : 146 }
{  "_id" : {  "state" : "UT",  "city" : "ALTONAH" },  "zipcount" : 1,  "pop" : 10 }

and 2.5:

> db.test9371.aggregate( { $group: { _id: { state: "$state", city: "$city" }, zipcount : {$sum : 1}, pop: { $sum: "$pop" } } }, { $sort: { '_id.state':1, zipcount: 1 } })
{ "_id" : { "state" : "AL", "city" : "ADGER" }, "zipcount" : 1, "pop" : 3205 }
{ "_id" : { "state" : "AL", "city" : "ADAMSVILLE" }, "zipcount" : 1, "pop" : 10616 }
{ "_id" : { "state" : "AL", "city" : "ACMAR" }, "zipcount" : 1, "pop" : 6055 }
{ "_id" : { "state" : "CA", "city" : "LOS ANGELES" }, "zipcount" : 3, "pop" : 146408 }
{ "_id" : { "state" : "FL", "city" : "BRANFORD" }, "zipcount" : 1, "pop" : 2439 }
{ "_id" : { "state" : "FL", "city" : "CALLAHAN" }, "zipcount" : 1, "pop" : 9111 }
{ "_id" : { "state" : "FL", "city" : "BRYCEVILLE" }, "zipcount" : 1, "pop" : 1875 }
{ "_id" : { "state" : "IL", "city" : "ANTIOCH" }, "zipcount" : 1, "pop" : 18058 }
{ "_id" : { "state" : "IL", "city" : "ARLINGTON HEIGHT" }, "zipcount" : 2, "pop" : 79689 }
{ "_id" : { "state" : "ND", "city" : "AMENIA" }, "zipcount" : 1, "pop" : 321 }
{ "_id" : { "state" : "ND", "city" : "ABSARAKA" }, "zipcount" : 1, "pop" : 124 }
{ "_id" : { "state" : "ND", "city" : "ALICE" }, "zipcount" : 1, "pop" : 255 }
{ "_id" : { "state" : "TX", "city" : "ALLEN" }, "zipcount" : 1, "pop" : 24151 }
{ "_id" : { "state" : "TX", "city" : "CARROLLTON" }, "zipcount" : 2, "pop" : 92495 }
{ "_id" : { "state" : "UT", "city" : "ALTONAH" }, "zipcount" : 1, "pop" : 10 }
{ "_id" : { "state" : "UT", "city" : "AMERICAN FORK" }, "zipcount" : 1, "pop" : 21864 }
{ "_id" : { "state" : "UT", "city" : "ALTAMONT" }, "zipcount" : 1, "pop" : 146 }

Both of these are equally correct as they are both ordered as per the $sort, but they differ on how cities from the same state with the zipcount are ordered. If you check your results, you will see that each version chose ONE OF the the valid cities as the one with the most zipcodes, however, they happened to choose different cities.

If you want reproducible results, I'd suggest making your sort be more selective so that each document has a unique sort key. You can do that by adding _id to the end of the $sort key so that ties always go to the city that comes first (or last) alphabetically.

Comment by Ravi Shakya [ 28/Oct/13 ]

Attaching the query for your perusal

Generated at Thu Feb 08 03:25:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.