[SERVER-33207] geo_borders.js fails in 2 shards sharded collections passthrough Created: 08/Feb/18  Updated: 29/Oct/23  Resolved: 12/Feb/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 3.7.2

Type: Bug Priority: Major - P3
Reporter: Charlie Swanson Assignee: David Storch
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Query 2018-02-12, Query 2018-02-26
Participants:
Linked BF Score: 0

 Description   

The index build at geo_borders.js:24, which is expected to fail, may fail on only one of the shards. This can happen by chance if the out-of-bounds points that cause the failure all get assigned to the same shard, leaving one shard that has no illegal points.
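
As a rough illustration of how often this can happen by chance, the sketch below computes the probability that every out-of-bounds point lands on the same shard. It assumes exactly two shards and that each document is placed on either shard independently with probability 1/2; the function name and its parameter are hypothetical, and the ticket does not state how many out-of-bounds points the test inserts.

```javascript
// Back-of-the-envelope model; assumes exactly two shards and that hashed
// sharding places each document on either shard independently with p = 1/2.
// The function name and parameter are hypothetical.
function probAllIllegalPointsOnOneShard(numIllegalPoints) {
    if (!Number.isInteger(numIllegalPoints) || numIllegalPoints < 1) {
        throw new RangeError("numIllegalPoints must be a positive integer");
    }
    // Two ways to pick the shard that receives every illegal point, each with
    // probability (1/2)^k, so the total is 2 * (1/2)^k = (1/2)^(k - 1).
    return Math.pow(0.5, numIllegalPoints - 1);
}

for (const k of [1, 2, 4, 8]) {
    console.log(`${k} illegal point(s): ${probAllIllegalPointsOnOneShard(k)}`);
}
```

Under these assumptions, a test with four out-of-bounds points would see the index build succeed on one shard in one run out of eight, which is consistent with an intermittent failure.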



 Comments   
Comment by Max Hirschhorn [ 12/Feb/18 ]

> It sounds like we're converging on leaving the sharded passthrough testing as is, i.e. not adding a new single-shard jsCore passthrough suite. Max Hirschhorn Charlie Swanson, can we consider this thread closed?

Sounds good to me.

Comment by Charlie Swanson [ 12/Feb/18 ]

Yep. I'm on the same page - I don't see a reason to add another passthrough suite.

Comment by David Storch [ 12/Feb/18 ]

> I'll reaffirm that I still don't know the codepath for these commands in mongos very well, and would defer to either of you or the Sharding team on whether it seems likely that the routing logic could have a bug in the single-shard vs unsharded case.

I think it's unlikely that, aside from the aggregation system, there would be a routing bug that manifests in the single-shard case, but not the multi-shard or unsharded cases.

> The failure observed with the geo_borders.js test seems to me more of an issue that our JavaScript tests make assertions that depend on the chunk distribution among the shards.

I'm not sure I'd characterize it that way. In my view, the problem is that the index build behavior presented to a client for the sharded case does not match the behavior presented to a client in the standalone case. The sharding team needs to do additional work in order to make failed index builds clean up properly in a sharded cluster in the way that they do on a standalone. In a way, this is much like having to blacklist a test for command x from the sharded collections passthrough because it does not function correctly against a sharded collection.

> Is there a better way we could detect these kinds of JavaScript tests without waiting for them to sometimes fail in the future?

I guess we could audit tests in jstests/core/ looking for those that make assertions about failed index builds? I would propose holding off on such an audit unless we start seeing more failures like this one, however.

It sounds like we're converging on leaving the sharded passthrough testing as is, i.e. not adding a new single-shard jsCore passthrough suite. max.hirschhorn charlie.swanson, can we consider this thread closed?

Comment by Githook User [ 12/Feb/18 ]

Author: David Storch <david.storch@10gen.com> (dstorch)

Message: SERVER-33207 Blacklist geo_borders.js from sharded_collections_jscore_passthrough.
Branch: master
https://github.com/mongodb/mongo/commit/fe1587bfc8e3507eb044721a0fa98c456659b629

Comment by Max Hirschhorn [ 12/Feb/18 ]

> 2. What's your opinion on whether we should maintain multiple jsCore passthroughs with sharded collections, as I describe above?

I'm not aware that commands such as "find" and "count" have special logic for when they target a single shard, so I think until we add anything like that it'd be fine to change sharded_collections_jscore_passthrough.yml to have >1 shards and not add any new test suites for running jsCore tests against a sharded cluster.

https://jira.mongodb.org/browse/SERVER-31785?focusedCommentId=1761459&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-1761459

I'll reaffirm that I still don't know the codepath for these commands in mongos very well, and would defer to either of you or the Sharding team on whether it seems likely that the routing logic could have a bug in the single-shard vs unsharded case. The failure observed with the geo_borders.js test seems to me more of an issue that our JavaScript tests make assertions that depend on the chunk distribution among the shards. (Prior to the changes from SERVER-30344, we would have created the index on all shards, even if they didn't own any data for the collection.) Is there a better way we could detect these kinds of JavaScript tests without waiting for them to sometimes fail in the future? It feels to me like an issue on the same level as our JavaScript tests depending on an implicit ordering of documents when not specifying a sort.

Comment by Charlie Swanson [ 09/Feb/18 ]

Would you care to make an argument? This failure doesn't convince me that we're missing coverage. To me, this looks like a success of replacing that suite, since we figured out something that does not work when the collection is sharded, but does when it's unsharded. I don't think there's much value in providing guarantees/coverage of things that work when your collection is sharded but only lives on one shard? That describes the unsharded configuration?

I think a motivation for a suite with one shard would look different. A bug that only manifested when all the data lived on a single shard, but worked fine in two shards would be more motivating.

Comment by David Storch [ 09/Feb/18 ]

charlie.swanson max.hirschhorn, my planned fix is to blacklist geo_borders.js from the sharded_collections_jscore_passthrough suite. This makes me wonder whether or not it would be wise to reintroduce a variant of sharded_collections_jscore_passthrough that uses a single shard (in other words, a jsCore variant of aggregation_one_shard_sharded_collections). Do we still believe that this wouldn't add valuable coverage beyond sharding_jscore_passthrough?

Comment by David Storch [ 09/Feb/18 ]

I spoke with kaloian.manassiev, and he confirmed that this falls within a known category of issues where we don't clean up properly on failure. Many related improvements are planned as future work. For now, we should change our testing to work around the problem.

Since the sharded_collections_jscore_passthrough implicitly shards by {_id: "hashed"}, there isn't a good way to guarantee that both shards have an invalid out-of-bounds point.
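
To make the dependence on document placement concrete, here is a toy model in plain JavaScript. It is only a sketch: none of these names are MongoDB APIs, the bounds are assumed for illustration, and each "shard" is just an array of documents that accepts or rejects the index build independently.

```javascript
// Toy model only: these names are hypothetical and are not MongoDB APIs.
const BOUNDS = {min: -1, max: 1};  // assumed index bounds, for illustration

// A point is legal if every coordinate falls inside the assumed bounds.
function inBounds(point) {
    return point.every((coord) => coord >= BOUNDS.min && coord <= BOUNDS.max);
}

// Each shard builds the index on its own documents and fails only if it
// holds at least one out-of-bounds point. Returns one boolean per shard.
function buildIndexPerShard(shards) {
    return shards.map((docs) => docs.every((doc) => inBounds(doc.loc)));
}

const legal = {loc: [0, 0]};
const illegalA = {loc: [2, 0]};
const illegalB = {loc: [0, -2]};

// Illegal points spread over both shards: the build fails everywhere.
const spread = buildIndexPerShard([[legal, illegalA], [illegalB]]);

// Illegal points all on shard 0: shard 1 builds the index successfully.
const clustered = buildIndexPerShard([[illegalA, illegalB], [legal]]);

console.log("spread:", spread);
console.log("clustered:", clustered);
```

In the second placement one shard ends up with the index even though the build was expected to fail, which is the situation the test's assertion cannot tolerate. Because the passthrough picks the placement via the hashed shard key, the test cannot choose the first placement over the second.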

Generated at Thu Feb 08 04:32:39 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.