Merged (74 commits)

ba1a4b6
feat: add tensorflow tensor
Jan 30, 2023
978dfe4
feat: wip add tf comp backend
Jan 30, 2023
712c950
fix: comp backend working of TensorFlowTensor, not tf tensor
Jan 30, 2023
fd88185
test: remove redundant print statements
Jan 30, 2023
17fe3d5
feat: add comp backend retrieval
Jan 31, 2023
b0fd980
fix: extract methods that overlap for np and tf backend
Jan 31, 2023
c2e6ab0
fix: revert poetry lock change
Jan 31, 2023
cf6a0ef
fix: introduce norm callables to transform tftensor
Jan 31, 2023
194ab9f
docs: clean up
Jan 31, 2023
6a2ecf1
fix: retrieval and add docstring
Jan 31, 2023
056db70
fix: add cosine sim for tf backend metrics
Feb 1, 2023
6c35902
fix: euclidean dist
Feb 1, 2023
2abe113
fix: add typevar to register proto
Feb 1, 2023
beb340e
fix: clean up
Feb 1, 2023
1817c44
fix: add tft to inits
Feb 2, 2023
a74daac
test: add tests for tensorflow tensor
Feb 2, 2023
ab7d153
fix: mypy checks
Feb 2, 2023
72744ad
fix: docarray from native
Feb 2, 2023
dfdff10
docs: add documentation and clean up
Feb 2, 2023
28a7291
fix: clean up
Feb 2, 2023
418ee37
fix: clean up
Feb 2, 2023
217c870
fix: stacked array with tf tensor
Feb 3, 2023
fd7a8e5
fix: stack with tftensor
Feb 3, 2023
26150b6
test: fix get item test
Feb 3, 2023
567b56a
fix: access by slice for tftensor
Feb 3, 2023
665f408
fix: add proto for tf
Feb 6, 2023
68bdd77
test: introduce pytest tensorflow marker
Feb 6, 2023
fcd5b74
fix: typo in ci.yml
Feb 6, 2023
1e8e240
fix: try tf import
Feb 6, 2023
a5988ab
fix: mypy
Feb 6, 2023
156a508
fix: ndarray import
Feb 6, 2023
a29f5c1
fix: tf import
Feb 6, 2023
eb6a53a
test: add tf markers
Feb 6, 2023
a2afa41
test: fix unit tests
Feb 6, 2023
9cc04dd
test: fix unit tests
Feb 6, 2023
bfffc2d
fix: tf in array stacked
Feb 6, 2023
6e715b3
test: tf
Feb 6, 2023
cc2e837
chore: pytest proto marker call with -m
Feb 6, 2023
125f66c
fix: instance check use instance shape
Feb 6, 2023
d6506d1
fix: tf tests
Feb 6, 2023
73f8b0a
fix: test
Feb 6, 2023
2d9162c
fix: add print statement to debug
Feb 6, 2023
ef335ad
fix: tf test
Feb 6, 2023
1e85f7a
test: only tf
Feb 6, 2023
ca8d1d1
test: remove tests for debugging
Feb 6, 2023
1dd9c6e
test: add all tests back to ci yml
Feb 6, 2023
f8d8426
test: fix import
Feb 6, 2023
269cf0f
test: ci debugging
Feb 6, 2023
9bcb816
test: change pytest marker for tf
Feb 6, 2023
dac79e2
test: change python version back
Feb 6, 2023
8046e99
test: revert
Feb 6, 2023
e325244
test: debugging
Feb 6, 2023
b7db1c8
fix: test
Feb 6, 2023
9d1ef56
fix: tests
Feb 6, 2023
0467aca
test: ignore paths
Feb 6, 2023
a25dceb
fix: tests
Feb 6, 2023
52064d2
fix: tests
Feb 6, 2023
591cea1
refactor: rename norm left and norm right
Feb 7, 2023
5fc2721
docs: tft docstring
Feb 7, 2023
7432123
docs: add comment to array stacked tf
Feb 7, 2023
1144d6f
fix: apply suggestion from code review
Feb 7, 2023
25b9f42
fix: apply suggestions from code review
Feb 7, 2023
20a2b3e
fix: merge
Feb 7, 2023
102e42a
test: fix black formatting
Feb 7, 2023
b4b7e43
fix: implement getitem setitem iter for tftensor
Feb 7, 2023
545438d
docs: readme
Feb 7, 2023
4905526
Merge remote-tracking branch 'origin/feat-rewrite-v2' into feat-tenso…
Feb 8, 2023
c2aa0b1
docs: update readme.md
Feb 8, 2023
81a2540
fix: remove n dim from abstract method instead use comp be
Feb 8, 2023
838955a
fix: remove proto mark, because only test for proto 3 here
Feb 8, 2023
09dd6a9
fix: tf set item and add tests
Feb 8, 2023
5daa511
Merge branch 'feat-rewrite-v2' into feat-tensorflow-support
Feb 8, 2023
6c36d25
Merge branch 'feat-rewrite-v2' into feat-tensorflow-support
Feb 8, 2023
85183ec
docs: update tf section in readme.md
Feb 8, 2023
40 changes: 35 additions & 5 deletions .github/workflows/ci.yml
@@ -54,11 +54,12 @@ jobs:
uses: actions/setup-python@v4
with:
python-version: 3.7
- name: Prepare enviroment
- name: Prepare environment
run: |
python -m pip install --upgrade pip
python -m pip install poetry
poetry install --without dev
poetry run pip install tensorflow==2.11.0
- name: Test basic import
run: poetry run python -c 'from docarray import DocumentArray, BaseDocument'

@@ -111,11 +112,12 @@ jobs:
python -m pip install --upgrade pip
python -m pip install poetry
poetry install --all-extras
poetry run pip install tensorflow==2.11.0

- name: Test
id: test
run: |
poetry run pytest ${{ matrix.test-path }}
poetry run pytest -m "not tensorflow" ${{ matrix.test-path }}
timeout-minutes: 30
# env:
# JINA_AUTH_TOKEN: "${{ secrets.JINA_AUTH_TOKEN }}"
@@ -159,7 +161,7 @@ jobs:
- name: Test
id: test
run: |
poetry run pytest ${{ matrix.test-path }}
poetry run pytest -m "not tensorflow" ${{ matrix.test-path }}
timeout-minutes: 30


@@ -181,12 +183,40 @@ jobs:
python -m pip install --upgrade pip
python -m pip install poetry
poetry install --all-extras
pip install protobuf==3.19.0 # we check that we support 3.19
poetry run pip install protobuf==3.19.0 # we check that we support 3.19

- name: Test
id: test
run: |
poetry run pytest -m 'proto' tests
timeout-minutes: 30


docarray-test-tensorflow:
needs: [lint-ruff, check-black, import-test]
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
python-version: [3.7]
steps:
- uses: actions/checkout@v2.5.0
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}
- name: Prepare environment
run: |
python -m pip install --upgrade pip
python -m pip install poetry
poetry install --all-extras
poetry run pip install protobuf==3.19.0
poetry run pip install tensorflow==2.11.0

- name: Test
id: test
run: |
poetry run pytest -k 'proto' tests
poetry run pytest -m 'tensorflow' tests
timeout-minutes: 30


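The `-m "not tensorflow"` and `-m 'tensorflow'` selections above rely on a custom pytest marker (introduced in the "test: introduce pytest tensorflow marker" commit). The marker registration is not shown in this diff, but a minimal sketch of how such a marker is typically registered and applied looks like the following; the file and test names here are illustrative, not the repository's actual ones:

```python
# conftest.py -- register the marker so pytest does not warn about unknown markers
def pytest_configure(config):
    config.addinivalue_line(
        "markers", "tensorflow: tests that require the tensorflow extra"
    )


# test_tensorflow_tensor.py -- mark TensorFlow-only tests so they can be selected with -m
import pytest

tf = pytest.importorskip("tensorflow")  # skip this module entirely if tf is not installed


@pytest.mark.tensorflow
def test_stack_tf_tensors():
    stacked = tf.stack([tf.zeros((2,)), tf.ones((2,))])
    assert stacked.shape == (2, 2)
```

With the marker in place, `pytest -m 'tensorflow'` runs only these tests in the dedicated job, while the other jobs exclude them via `-m "not tensorflow"`.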
71 changes: 57 additions & 14 deletions README.md
@@ -209,22 +209,22 @@ class MyMultiModalModel(nn.Module):
self.text_encoder = TextEncoder()

def forward(self, text_1, text_2, image_1, image_2, audio_1, audio_2):
emnedding_text_1 = self.text_encoder(text_1)
emnedding_text_2 = self.text_encoder(text_2)
embedding_text_1 = self.text_encoder(text_1)
embedding_text_2 = self.text_encoder(text_2)

emnedding_image_1 = self.image_encoder(image_1)
emnedding_image_2 = self.image_encoder(image_2)
embedding_image_1 = self.image_encoder(image_1)
embedding_image_2 = self.image_encoder(image_2)

emnedding_audio_1 = self.image_encoder(audio_1)
emnedding_audio_2 = self.image_encoder(audio_2)
embedding_audio_1 = self.image_encoder(audio_1)
embedding_audio_2 = self.image_encoder(audio_2)

return (
emnedding_text_1,
emnedding_text_2,
emnedding_image_1,
emnedding_image_2,
emnedding_audio_1,
emnedding_audio_2,
embedding_text_1,
embedding_text_2,
embedding_image_1,
embedding_image_2,
embedding_audio_1,
embedding_audio_2,
)
```

@@ -258,14 +258,14 @@ class MyPodcastModel(nn.Module):
self.image_encoder = ImageEncoder()
self.text_encoder = TextEncoder()

def forward_podcast(da: DocumentArray[Podcast]) -> DocumentArray[Podcast]:
def forward_podcast(self, da: DocumentArray[Podcast]) -> DocumentArray[Podcast]:
da.audio.embedding = self.audio_encoder(da.audio.tensor)
da.text.embedding = self.text_encoder(da.text.tensor)
da.image.embedding = self.image_encoder(da.image.tensor)

return da

def forward(da: DocumentArray[PairPodcast]) -> DocumentArray[PairPodcast]:
def forward(self, da: DocumentArray[PairPodcast]) -> DocumentArray[PairPodcast]:
da.left = self.forward_podcast(da.left)
da.right = self.forward_podcast(da.right)

@@ -277,6 +277,49 @@ You instantly win in code readability and maintainability. And for the same pric
schema definition (see below). Everything is handled in a Pythonic manner by relying on type hints.


## Coming from TensorFlow
Review comment (Member): this part is too big IMO. We just need to show that there is a (tiny?) difference and that you need to access tensor.tensor. No need to show the full example.

Similar to the PyTorch approach, you can also use DocArray with TensorFlow to handle and represent multi-modal data inside your ML model.

To use DocArray with TensorFlow, first install it as follows:
```
pip install tensorflow==2.11.0
pip install protobuf==3.19.0
```

Compared to using DocArray with PyTorch, there is one main difference when using it with TensorFlow:\
While DocArray's `TorchTensor` is a subclass of `torch.Tensor`, this is not the case for `TensorFlowTensor`: due to technical limitations of `tf.Tensor`, DocArray's `TensorFlowTensor` is not a subclass of `tf.Tensor` but instead stores a `tf.Tensor` in its `.tensor` attribute.

How does this affect you? Whenever you want to access the tensor data, for example to run operations on it or to pass it to your ML model, you need to hand over the `.tensor` attribute rather than the `TensorFlowTensor` instance itself.

This would look like the following:

```python
from typing import Optional

from docarray import DocumentArray, BaseDocument
from docarray.typing import AudioTensorFlowTensor

import tensorflow as tf


class Podcast(BaseDocument):
    audio_tensor: Optional[AudioTensorFlowTensor]
    embedding: Optional[AudioTensorFlowTensor]


class MyPodcastModel(tf.keras.Model):
    def __init__(self):
        super().__init__()
        self.audio_encoder = AudioEncoder()

    def call(self, inputs: DocumentArray[Podcast]) -> DocumentArray[Podcast]:
        inputs.embedding = self.audio_encoder(
            inputs.audio_tensor.tensor
        )  # access the tf.Tensor via audio_tensor's .tensor attribute
        return inputs
```
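
As the review comment above suggests, the core difference also fits in a much smaller snippet. Here is a minimal sketch (the document schema and field name are made up for illustration, and it assumes the TensorFlow extra is installed so that a raw `tf.Tensor` can be assigned to a `TensorFlowTensor` field):

```python
import tensorflow as tf

from docarray import BaseDocument
from docarray.typing import TensorFlowTensor


class MyDoc(BaseDocument):
    embedding: TensorFlowTensor


doc = MyDoc(embedding=tf.zeros((128,)))

print(type(doc.embedding))  # docarray's TensorFlowTensor, which wraps the tf.Tensor
print(tf.reduce_sum(doc.embedding.tensor))  # the raw tf.Tensor lives in .tensor
```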



## Coming from FastAPI

36 changes: 32 additions & 4 deletions docarray/array/array_stacked.py
@@ -27,14 +27,22 @@
from pydantic.fields import ModelField

from docarray.proto import DocumentArrayStackedProto
from docarray.typing import TorchTensor
from docarray.typing.tensor.abstract_tensor import AbstractTensor

try:
from docarray.typing import TorchTensor
except ImportError:
TorchTensor = None # type: ignore

try:
import tensorflow as tf # type: ignore

from docarray.typing import TensorFlowTensor
Comment on lines +36 to +39
Review comment (Member): for torch i moved this thing to a helper in utils, so this check only has to be done once globally. Can we do the same for tf?
Reply (Contributor, author): yes, saw that and started doing this in the TF embedding/video/audio PR, so i'll do this refactor there if that's fine with you @JohannesMessner

tf_available = True
except (ImportError, TypeError):
TensorFlowTensor = None # type: ignore
tf_available = False

T = TypeVar('T', bound='DocumentArrayStacked')
IndexIterType = Union[slice, Iterable[int], Iterable[bool], None]
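
The reviewer's suggestion above (do the TensorFlow availability check once in a shared helper instead of repeating the try/except in every module) could look roughly like this; the module path and flag name are hypothetical, not docarray's actual utility:

```python
# docarray/utils/misc.py (illustrative location)
# Perform the tensorflow availability check once; other modules just import the flag.
try:
    import tensorflow as tf  # noqa: F401

    is_tf_available = True
except (ImportError, TypeError):
    is_tf_available = False
```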

@@ -163,7 +171,26 @@ def _create_columns(
tensor_columns: Dict[str, AbstractTensor] = dict()

for field, type_ in column_schema.items():
if issubclass(type_, AbstractTensor):
if tf_available and isinstance(getattr(docs[0], field), TensorFlowTensor):
# tf.Tensor does not allow item assignment, therefore the optimized way
# of initializing an empty array and assigning values to it iteratively
# does not work here, therefore handle separately.
tf_stack = []
for i, doc in enumerate(docs):
val = getattr(doc, field)
if val is None:
val = tensor_type.get_comp_backend().none_value()
tf_stack.append(val.tensor)
del val.tensor

stacked: tf.Tensor = tf.stack(tf_stack)
tensor_columns[field] = TensorFlowTensor(stacked)
for i, doc in enumerate(docs):
val = getattr(doc, field)
x = tensor_columns[field][i].tensor
val.tensor = x

elif issubclass(type_, AbstractTensor):
tensor = getattr(docs[0], field)
column_shape = (
(len(docs), *tensor.shape) if tensor is not None else (len(docs),)
@@ -190,7 +217,8 @@
# We thus chose to convert the individual rank 0 tensors to rank 1
# This does mean that stacking rank 0 tensors will transform them
# to rank 1
if tensor_columns[field].ndim == 1:
tensor = tensor_columns[field]
if tensor.get_comp_backend().n_dim(tensor) == 1:
setattr(doc, field, tensor_columns[field][i : i + 1])
else:
setattr(doc, field, tensor_columns[field][i])
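The special-cased branch above exists because eager `tf.Tensor`s are immutable: the pre-allocate-and-assign strategy used for the NumPy and torch columns cannot work, so the TensorFlow column is built by collecting the per-document tensors in a list and calling `tf.stack` once. A minimal standalone illustration of that constraint, independent of docarray and assuming TensorFlow 2.x:

```python
import tensorflow as tf

rows = [tf.constant([1.0, 2.0]), tf.constant([3.0, 4.0])]

# Pre-allocating a column and filling it row by row fails for eager tensors:
column = tf.zeros((2, 2))
try:
    column[0] = rows[0]
except TypeError as e:
    print(f"item assignment not supported: {e}")

# Instead, collect the per-row tensors and stack them in a single call:
column = tf.stack(rows)
print(column.shape)  # (2, 2)
```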
54 changes: 28 additions & 26 deletions docarray/computation/abstract_comp_backend.py
@@ -19,76 +19,77 @@ class AbstractComputationalBackend(ABC, typing.Generic[TTensor]):
That way, DocArray can leverage native implementations from all frameworks.
"""

@staticmethod
@classmethod
@abstractmethod
def stack(
tensors: Union[List['TTensor'], Tuple['TTensor']], dim: int = 0
cls, tensors: Union[List['TTensor'], Tuple['TTensor']], dim: int = 0
) -> 'TTensor':
"""
Stack a list of tensors along a new axis.
"""
...

@staticmethod
@classmethod
@abstractmethod
def n_dim(array: 'TTensor') -> int:
def n_dim(cls, array: 'TTensor') -> int:
"""
Get the number of the array dimensions.
"""
...

@staticmethod
@classmethod
@abstractmethod
def squeeze(tensor: 'TTensor') -> 'TTensor':
def squeeze(cls, tensor: 'TTensor') -> 'TTensor':
"""
Returns a tensor with all the dimensions of tensor of size 1 removed.
"""
...

@staticmethod
@classmethod
@abstractmethod
def to_numpy(array: 'TTensor') -> 'np.ndarray':
def to_numpy(cls, array: 'TTensor') -> 'np.ndarray':
"""
Convert array to np.ndarray.
"""
...

@staticmethod
@classmethod
@abstractmethod
def empty(
cls,
shape: Tuple[int, ...],
dtype: Optional[Any] = None,
device: Optional[Any] = None,
) -> 'TTensor':
...

@staticmethod
@classmethod
@abstractmethod
def none_value() -> typing.Any:
def none_value(cls) -> typing.Any:
"""Provide a compatible value that represents None in the Tensor Backend."""
...

@staticmethod
@classmethod
@abstractmethod
def to_device(tensor: 'TTensor', device: str) -> 'TTensor':
def to_device(cls, tensor: 'TTensor', device: str) -> 'TTensor':
"""Move the tensor to the specified device."""
...

@staticmethod
@classmethod
@abstractmethod
def device(tensor: 'TTensor') -> Optional[str]:
def device(cls, tensor: 'TTensor') -> Optional[str]:
"""Return device on which the tensor is allocated."""
...

@staticmethod
@classmethod
@abstractmethod
def shape(tensor: 'TTensor') -> Tuple[int, ...]:
def shape(cls, tensor: 'TTensor') -> Tuple[int, ...]:
"""Get shape of tensor"""
...

@staticmethod
@classmethod
@abstractmethod
def reshape(tensor: 'TTensor', shape: Tuple[int, ...]) -> 'TTensor':
def reshape(cls, tensor: 'TTensor', shape: Tuple[int, ...]) -> 'TTensor':
"""
Gives a new shape to tensor without changing its data.

@@ -99,9 +100,9 @@ def reshape(tensor: 'TTensor', shape: Tuple[int, ...]) -> 'TTensor':
"""
...

@staticmethod
@classmethod
@abstractmethod
def detach(tensor: 'TTensor') -> 'TTensor':
def detach(cls, tensor: 'TTensor') -> 'TTensor':
"""
Returns the tensor detached from its current graph.

@@ -110,21 +111,22 @@ def detach(tensor: 'TTensor') -> 'TTensor':
"""
...

@staticmethod
@classmethod
@abstractmethod
def dtype(tensor: 'TTensor') -> Any:
def dtype(cls, tensor: 'TTensor') -> Any:
"""Get the data type of the tensor."""
...

@staticmethod
@classmethod
@abstractmethod
def isnan(tensor: 'TTensor') -> 'TTensor':
def isnan(cls, tensor: 'TTensor') -> 'TTensor':
"""Check element-wise for nan and return result as a boolean array"""
...

@staticmethod
@classmethod
@abstractmethod
def minmax_normalize(
cls,
tensor: 'TTensor',
t_range: Tuple = (0, 1),
x_range: Optional[Tuple] = None,
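The blanket change from `@staticmethod` to `@classmethod` in this file gives each abstract backend method access to the backend class itself while keeping the interface callable directly on concrete backend classes, with no instantiation needed. A minimal sketch of the pattern, using hypothetical class names rather than docarray's real hierarchy:

```python
from abc import ABC, abstractmethod
from typing import Generic, List, TypeVar

import numpy as np

TTensor = TypeVar('TTensor')


class MiniCompBackend(ABC, Generic[TTensor]):
    """Toy stand-in for the abstract computational backend."""

    @classmethod
    @abstractmethod
    def stack(cls, tensors: List[TTensor], dim: int = 0) -> TTensor:
        """Stack a list of tensors along a new axis."""
        ...

    @classmethod
    @abstractmethod
    def n_dim(cls, array: TTensor) -> int:
        """Get the number of array dimensions."""
        ...


class NumpyMiniBackend(MiniCompBackend[np.ndarray]):
    @classmethod
    def stack(cls, tensors: List[np.ndarray], dim: int = 0) -> np.ndarray:
        return np.stack(tensors, axis=dim)

    @classmethod
    def n_dim(cls, array: np.ndarray) -> int:
        return array.ndim


# Both methods are called on the class itself, no instance required:
print(NumpyMiniBackend.stack([np.zeros(3), np.ones(3)]).shape)  # (2, 3)
print(NumpyMiniBackend.n_dim(np.zeros((2, 3))))  # 2
```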