Merged
Instead of doing a pre-flight query to first figure out partition boundaries, then chunking the table by filtering on those boundaries (poor performance on tables without PKs and on foreign tables), use a PG server-side cursor and run FETCHes to add data to CStore files. Two changes to behaviour: objects are no longer created in parallel, and object hashing now happens after we've grabbed the data from the cursor into a temporary object (which we then rename). The latter might actually speed things up, since the hashing is done against the CStore object, which is faster to read. [still WIP, a couple of broken tests. Also needs some refactoring, since `create_base_fragment` is now a spaghettifest and I'm not sure whether some of its args are still used]
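The FETCH-based chunking loop can be sketched roughly as follows. This is a standalone simulation, not the real engine code: `FakeServerCursor` stands in for a PG server-side (named) cursor, and `chunk_table` is a hypothetical name for the draining loop.

```python
class FakeServerCursor:
    """Stand-in for a server-side cursor: rows are streamed in
    batches via fetchmany() (i.e. FETCH) instead of being pulled
    down as one big result set."""
    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0

    def fetchmany(self, size):
        batch = self._rows[self._pos:self._pos + size]
        self._pos += size
        return batch

def chunk_table(cursor, chunk_size):
    """Drain the cursor in fixed-size chunks, yielding one chunk per
    would-be CStore fragment. No partition-boundary pre-flight query
    is needed, so tables without PKs stream just as well."""
    while True:
        batch = cursor.fetchmany(chunk_size)
        if not batch:
            break
        yield batch

cursor = FakeServerCursor(range(10))
fragments = list(chunk_table(cursor, chunk_size=4))
# fragments == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The trade-off mentioned above follows from this shape: a single cursor is drained sequentially, so fragments can no longer be built in parallel.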
When renaming the object, do a pre-flight check to make sure it doesn't already exist, instead of setting up savepoints. If we do hit a race, it's fine: we'll just re-upsert the existing metadata with the same values.
We now detect those in the object manager, not the engine.
- remove now-unused `chunk_condition` args
- make `table_schema` mandatory (all code passes it in anyway)
If this `finally` is hit after an error, we've already rolled back the transaction, so the cursor no longer exists. Trying to close it then raises an error that masks the real one.
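The masking problem can be illustrated in miniature (all names here are hypothetical; the real code closes a PG cursor whose lifetime is tied to the transaction):

```python
class Cursor:
    """Toy cursor that, like a PG server-side cursor after a
    rollback, no longer exists when we try to close it."""
    def __init__(self):
        self.gone = False

    def close(self):
        if self.gone:
            raise RuntimeError("cursor does not exist")

def run(cursor, fail=False):
    try:
        if fail:
            # Simulate a query error: the transaction is rolled
            # back, destroying the server-side cursor.
            cursor.gone = True
            raise ValueError("real error")
    finally:
        # Calling cursor.close() unguarded here would raise
        # "cursor does not exist" and mask the ValueError.
        try:
            cursor.close()
        except RuntimeError:
            pass  # cursor already gone; let the real error propagate

try:
    run(Cursor(), fail=True)
except ValueError as e:
    print(e)  # the original error survives
```

Without the inner try/except in the `finally`, the `RuntimeError` raised during cleanup would replace the `ValueError` the caller actually needs to see.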
Use PG cursors + FETCH to partition tables on commit.
Instead of doing a pre-flight query to first figure out partition boundaries,
then chunking the table by filtering on those boundaries (poor performance on
tables without PKs and foreign tables), use a PG server-side cursor and run
FETCHes to add data to CStore files.
Two changes to behaviour: not making objects in parallel and moving the object
hashing to be after we've grabbed the data from the cursor into a temporary
object (after which we actually rename it).
Other simplifications to object creation:
When renaming the object, do a pre-flight check to make sure it doesn't
already exist, instead of setting up savepoints. If we do hit a race, it's
fine: we'll just re-upsert the existing metadata with the same values.
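The hash-after-write plus pre-flight-check flow can be sketched with a filesystem analogy (the real objects are CStore files plus metadata rows; `create_object` and the byte-string payload are illustrative only):

```python
import hashlib
import os
import tempfile

def create_object(data: bytes, store_dir: str) -> str:
    """Write data to a temp file, hash it, then rename the temp file
    to its content hash. A pre-flight existence check replaces the
    savepoint dance: if another writer won the race, the contents
    are identical anyway, so we just skip the rename."""
    fd, tmp_path = tempfile.mkstemp(dir=store_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    object_id = hashlib.sha256(data).hexdigest()
    final_path = os.path.join(store_dir, object_id)
    if os.path.exists(final_path):
        # Object already exists: drop the temp copy and let the
        # caller re-upsert the same metadata values.
        os.unlink(tmp_path)
    else:
        os.rename(tmp_path, final_path)
    return object_id

store = tempfile.mkdtemp()
oid1 = create_object(b"fragment", store)
oid2 = create_object(b"fragment", store)  # duplicate/race: same id, no error
assert oid1 == oid2
```

Hashing after the data has landed in the temporary object is what allows the hash to be computed from the (faster-to-read) columnar file rather than from the source rows.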
Also simplify `create_base_fragment` (delete now-unneeded chunking logic).

Benchmarking
Setup: I had a CSV file with some data after running `sgr cloud download`.

Without a primary key
Old sgr: started a tqdm progressbar with a high-variance ETA (because objects are committed in parallel, so we get 8 objects bumping the pbar at the same time). Terminated after 25 minutes with tqdm estimating a total time of 1h30m.
New sgr: 3 trials, 5m36s, 5m3s, 5m29s (avg 5m23s)
With a primary key
Old sgr: 5m32s, 6m14s, 5m49s (avg 5m52s)
New sgr: 3m23s, 3m39s, 3m15s (avg 3m26s)