feat(v2): load da from csv and save to csv#1144
feat(v2): load da from csv and save to csv#1144anna-charlotte merged 21 commits intofeat-rewrite-v2from
Conversation
|
Will nested schemas work somehow? |
Yes, for nested fields you can give the access path as column names, e.g. 'image.url' to set the url of a field of type Image |
7f03c5e to
ed134c5
Compare
b21d518 to
5126056
Compare
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
27303af to
5babdae
Compare
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
tests/toydata/docs_nested.csv
Outdated
| @@ -0,0 +1,4 @@ | |||
| count,text,image,image2.url | |||
There was a problem hiding this comment.
| count,text,image,image2.url | |
| count,text,image,image2__url |
Not really sure here but I think we cannot/should not use . inside column name
There was a problem hiding this comment.
Ok, do you mean like by csv convention?
There was a problem hiding this comment.
Yeah kind of because, though it is more gut feeling that actual knowledge about CSV convention. But generally I think we should keep with a__b sintax rather than a.b one. Same for vector database I believe @JohannesMessner
There was a problem hiding this comment.
Ok, will adjust this
There was a problem hiding this comment.
Didn't we use "dot-notation" in other places, e.g. nested.field instead of nested__field? We need to make sure that this is uniform everywhere.
@JohannesMessner true, at other places we used the dot separation for the access paths (such as in traverse_flat(), e.g.da.traverse_flat(access_path='author.name'). @samsja mentioned that we should not use the . inside the column name. I am not sure how dangerous/against the rules this is
|
|
||
| def test_from_csv_without_schema_raise_exception(): | ||
| with pytest.raises(TypeError, match='no document schema defined'): | ||
| DocumentArray.from_csv(file_path=str(TOYDATA_DIR / 'docs_nested.csv')) |
There was a problem hiding this comment.
Makes sense that this raise an Expection for now. But what about implementing at this level the auto schema detection ?
There was a problem hiding this comment.
Yes, for now the exception, I think the auto detection or handing over a schema some other way then setting .document_type I would handle in a separate PR, if that's good with you.
There was a problem hiding this comment.
yes makes sense to do it in another PR
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
JohannesMessner
left a comment
There was a problem hiding this comment.
Some comments, but I think this will be a truly cool feature! We should promote this properly once it is out, I bet the data validation folks that already use pydantic would love it
docarray/array/array/io.py
Outdated
| """ | ||
| Check if a given access path is a valid path for a given Document class. | ||
| """ | ||
| field, _, remaining = access_path.partition('__') |
There was a problem hiding this comment.
Didn't we use "dot-notation" in other places, e.g. nested.field instead of nested__field? We need to make sure that this is uniform everywhere.
There was a problem hiding this comment.
Update: we will use __ instead of . separator for access paths
docarray/array/array/io.py
Outdated
| def _access_path_to_dict(access_path: str, value) -> Dict[str, Any]: | ||
| """ | ||
| Convert an access path and its value to a (potentially) nested dict. | ||
|
|
||
| EXAMPLE USAGE | ||
| .. code-block:: python | ||
| assert access_path_to_dict('image__url', 'img.png') == {'image': {'url': 'img.png'}} | ||
| """ | ||
| fields = access_path.split('__') | ||
| for field in reversed(fields): | ||
| result = {field: value} | ||
| value = result | ||
| return result | ||
|
|
||
|
|
||
| def _dict_to_access_paths(d: dict) -> Dict[str, Any]: | ||
| """ | ||
| Convert a (nested) dict to a Dict[access_path, value]. | ||
| Access paths are defines as a path of field(s) separated by "__". | ||
|
|
||
| EXAMPLE USAGE | ||
| .. code-block:: python | ||
| assert dict_to_access_paths({'image': {'url': 'img.png'}}) == {'image__url', 'img.png'} | ||
| """ | ||
| result = {} | ||
| for k, v in d.items(): | ||
| if isinstance(v, dict): | ||
| v = _dict_to_access_paths(v) | ||
| for nested_k, nested_v in v.items(): | ||
| new_key = '__'.join([k, nested_k]) | ||
| result[new_key] = nested_v | ||
| else: | ||
| result[k] = v | ||
| return result |
There was a problem hiding this comment.
We should consider making these methods global helpers, they will come in handy in other places of the code base as well, e.g. i already implemented something similar for document stores
There was a problem hiding this comment.
yes, sure, sounds good. What exactly do you mean by making these methods global helpers, put them somewhere in a helper file?
There was a problem hiding this comment.
yes exactly, in a helper.py or utils.py at the top level or something like that
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com> Signed-off-by: Charlotte Gerhaher <charlotte.gerhaher@jina.ai>
Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com> Signed-off-by: Charlotte Gerhaher <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
bc46938 to
00a9ea7
Compare
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai>
|
📝 Docs are deployed on https://ft-feat-from-to-csv--jina-docs.netlify.app 🎉 |
* feat: load from and to csv Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: from to csv Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * feat: add access path to dict Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: from to csv Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: clean up Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * docs: add docstring and update tmpdir in test Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: merge nested dicts Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: clean up Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: clean up Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * test: update test Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: apply samis suggestion from code review Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: apply suggestions from code review wrt access paths Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: apply johannes suggestion Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com> Signed-off-by: Charlotte Gerhaher <charlotte.gerhaher@jina.ai> * fix: apply johannes suggestion Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com> Signed-off-by: Charlotte Gerhaher <charlotte.gerhaher@jina.ai> * fix: apply suggestions from code review Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: apply suggestions from code review Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * fix: typos Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * refactor: move helper functions to helper file Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> * test: fix fixture Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> --------- Signed-off-by: anna-charlotte <charlotte.gerhaher@jina.ai> Signed-off-by: Charlotte Gerhaher <charlotte.gerhaher@jina.ai> Co-authored-by: Johannes Messner <44071807+JohannesMessner@users.noreply.github.com> Signed-off-by: Arnav Zutshi <arnzut1324@gmail.com>
Goals:
document_typeneeds to be set to determine schema.'image__url'None