Enable aggregation and grouping pushdown in the engine#581
Conversation
|
All existing tests pass with the new Multicorn code which is great news. Let's hold off on merging this until we have test coverage of the new groupby features though. |
| chown -R elasticsearch:elasticsearch /data && \ | ||
| echo 'path.data: /data' >> config/elasticsearch.yml && \ | ||
| echo 'discovery.type: "single-node"' >> config/elasticsearch.yml && \ | ||
| echo "xpack.security.enabled: false" >> config/elasticsearch.yml |
There was a problem hiding this comment.
I'd add some of these here too so that ES doesn't refuse to start with high disk usage
cluster.routing.allocation.disk.watermark.flood_stage: "99%"
cluster.routing.allocation.disk.watermark.high: "99%"| for row in result: | ||
| assert row == (gender, age) | ||
|
|
||
| age += 1 |
There was a problem hiding this comment.
This assertion is kind of difficult to mentally unroll. I recently started using pytest-snapshot (https://pypi.org/project/pytest-snapshot/, usage example: https://github.com/splitgraph/splitgraph/blob/master/test/splitgraph/cloud/project/test_merging.py#L21) that can store expected outputs as files and then easily regenerate them, so that you can essentially approve a change to a test through version control and see what changed. I think it's worth changing some of these (especially the bigger query outputs where you're asserting the whole output) to use pytest-snapshot.
There was a problem hiding this comment.
Fair point, checking out pytest-snapshot
| # DISTINCT queries are not going to be pushed down | ||
| result = get_engine().run_sql("EXPLAIN SELECT COUNT(DISTINCT city) FROM es.account") | ||
| assert len(result) > 2 | ||
| assert _extract_query_from_explain(result) == _bare_sequential_scan |
There was a problem hiding this comment.
Idea for a test: some kind of a JOIN here, e.g. a JOIN between two aggregation results on es.account (I think it won't be pushed down, but useful to test anyway)
There was a problem hiding this comment.
Another idea: a nested expression, e.g. SELECT avg(age * balance) FROM es.account GROUP BY state to make sure this doesn't crash
There was a problem hiding this comment.
Yet another: a window query that I think might be implemented as a sort/groupby, e.g. the balance of the oldest person in each state (haven't tested that the syntax is correct):
SELECT
age, balance, state
RANK () OVER (
PARTITION BY state
ORDER BY age DESC
) rank_number
FROM
es.accounts
WHERE rank_number = 1There was a problem hiding this comment.
Thanks, good ideas!
I'll add them to the test suites.
Extend the Dockerfile for PG debug image so that it optionally compiles Postgres with Valgrind support, and changes the container entrypoint so that PG gets run under Valgrind upon starting (`use_valgrind=1`).
Uh oh!
There was an error while loading. Please reload this page.