Merged
Instead of doing a pre-flight query to first figure out partition boundaries, then chunking the table by filtering on those boundaries (poor performance on tables without PKs and on foreign tables), use a PG server-side cursor and run FETCHes to add data to CStore files. Two changes to behaviour: objects are no longer created in parallel, and object hashing now happens after we've grabbed the data from the cursor into a temporary object (which we then rename). The latter might actually speed things up, since the hashing is done against the CStore object, which is faster to read. [still WIP, a couple of broken tests. Also needs some refactoring, since `create_base_fragment` is now a spaghettifest and I'm not sure whether some of its args are still used]
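The FETCH-based chunking loop can be sketched roughly as follows. This is a standalone simulation, not the real engine code: `FakeServerCursor` stands in for a PG server-side (named) cursor, and `chunk_table` is a hypothetical name for the draining loop.

```python
class FakeServerCursor:
    """Stand-in for a server-side cursor: rows are streamed in
    batches via fetchmany() (i.e. FETCH) instead of being pulled
    down as one big result set."""
    def __init__(self, rows):
        self._rows = list(rows)
        self._pos = 0

    def fetchmany(self, size):
        batch = self._rows[self._pos:self._pos + size]
        self._pos += size
        return batch

def chunk_table(cursor, chunk_size):
    """Drain the cursor in fixed-size chunks, yielding one chunk per
    would-be CStore fragment. No partition-boundary pre-flight query
    is needed, so tables without PKs stream just as well."""
    while True:
        batch = cursor.fetchmany(chunk_size)
        if not batch:
            break
        yield batch

cursor = FakeServerCursor(range(10))
fragments = list(chunk_table(cursor, chunk_size=4))
# fragments == [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The trade-off mentioned above follows from this shape: a single cursor is drained sequentially, so fragments can no longer be built in parallel.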
When renaming the object, do a pre-flight check to make sure it doesn't already exist, instead of setting up savepoints. If we do hit a race, it's fine: we'll just re-upsert the existing metadata with the same values.
We now detect those in the object manager, not the engine.
- remove now-unused `chunk_condition` args
- make `table_schema` mandatory (all code passes it in anyway)
If this `finally` is hit after an error, we've already rolled back the transaction, so the cursor no longer exists. Trying to close it then raises an error that masks the real one.
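The masking problem can be illustrated in miniature (all names here are hypothetical; the real code closes a PG cursor whose lifetime is tied to the transaction):

```python
class Cursor:
    """Toy cursor that, like a PG server-side cursor after a
    rollback, no longer exists when we try to close it."""
    def __init__(self):
        self.gone = False

    def close(self):
        if self.gone:
            raise RuntimeError("cursor does not exist")

def run(cursor, fail=False):
    try:
        if fail:
            # Simulate a query error: the transaction is rolled
            # back, destroying the server-side cursor.
            cursor.gone = True
            raise ValueError("real error")
    finally:
        # Calling cursor.close() unguarded here would raise
        # "cursor does not exist" and mask the ValueError.
        try:
            cursor.close()
        except RuntimeError:
            pass  # cursor already gone; let the real error propagate

try:
    run(Cursor(), fail=True)
except ValueError as e:
    print(e)  # the original error survives
```

Without the inner try/except in the `finally`, the `RuntimeError` raised during cleanup would replace the `ValueError` the caller actually needs to see.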
Use PG cursors + FETCH to partition tables on commit.
Instead of doing a pre-flight query to first figure out partition boundaries,
then chunking the table by filtering on those boundaries (poor performance on
tables without PKs and foreign tables), use a PG server-side cursor and run
FETCHes to add data to CStore files.
Two changes to behaviour: not making objects in parallel and moving the object
hashing to be after we've grabbed the data from the cursor into a temporary
object (after which we actually rename it).
Other simplifications to object creation:
When renaming the object, do a pre-flight check to make sure it doesn't
already exist, instead of setting up savepoints. If we do hit a race, it's
fine: we'll just re-upsert the existing metadata with the same values.
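The hash-after-write plus pre-flight-check flow can be sketched with a filesystem analogy (the real objects are CStore files plus metadata rows; `create_object` and the byte-string payload are illustrative only):

```python
import hashlib
import os
import tempfile

def create_object(data: bytes, store_dir: str) -> str:
    """Write data to a temp file, hash it, then rename the temp file
    to its content hash. A pre-flight existence check replaces the
    savepoint dance: if another writer won the race, the contents
    are identical anyway, so we just skip the rename."""
    fd, tmp_path = tempfile.mkstemp(dir=store_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    object_id = hashlib.sha256(data).hexdigest()
    final_path = os.path.join(store_dir, object_id)
    if os.path.exists(final_path):
        # Object already exists: drop the temp copy and let the
        # caller re-upsert the same metadata values.
        os.unlink(tmp_path)
    else:
        os.rename(tmp_path, final_path)
    return object_id

store = tempfile.mkdtemp()
oid1 = create_object(b"fragment", store)
oid2 = create_object(b"fragment", store)  # duplicate/race: same id, no error
assert oid1 == oid2
```

Hashing after the data has landed in the temporary object is what allows the hash to be computed from the (faster-to-read) columnar file rather than from the source rows.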
Also simplify `create_base_fragment` (delete now-unneeded chunking logic).

Benchmarking
Setup: I had a CSV file with some data after running `sgr cloud download`.

Without a primary key
Old sgr: started a tqdm progressbar with a high-variance ETA (because objects are committed in parallel, so we get 8 objects bumping the pbar at the same time). Terminated after 25 minutes with tqdm estimating a total time of 1h30m.
New sgr: 3 trials, 5m36s, 5m3s, 5m29s (avg 5m23s)
With a primary key
Old sgr: 5m32s, 6m14s, 5m49s (avg 5m52s)
New sgr: 3m23s, 3m39s, 3m15s (avg 3m26s)