Releases: Unstructured-IO/unstructured
0.20.2
0.20.1
What's Changed
- Add Python 3.11 and 3.13 support by @PastelStorm in #4236
- Use bigger runner to publish images by @PastelStorm in #4237
Full Changelog: 0.19.3...0.20.1
0.19.3
What's Changed
- feat: add group_elements_by_parent_id utility function by @MkDev11 in #4207
- Preserve newlines in Table and TableChunk elements during PDF partitioning by @eureka928 in #4214
- Fix: make pdf image dpi consistent by @badGarnet in #4217
- chore: bump dependencies for 0.18.34 by @luke-kucing in #4221
- feat: increase PIL's max image pixel value for pdf partition by @badGarnet in #4220
- Migrate to uv by @PastelStorm in #4226
- Fix ARM64 paddlepaddle image builder bug by @PastelStorm in #4228
- Fix Docker ARM64 image failure, use 8-core github runners by @PastelStorm in #4232
- Fix ARM64 image issues by @PastelStorm in #4233
Full Changelog: 0.18.32...0.19.3
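The grouping idea behind the group_elements_by_parent_id entry above can be sketched without the library. This is a minimal illustration only: the helper name `group_by_parent_id`, its signature, and the dict-shaped elements are assumptions made for the example, not the signature of the function added in #4207.

```python
from collections import defaultdict


def group_by_parent_id(elements):
    """Hypothetical stand-in: bucket element dicts by metadata["parent_id"].

    Elements without a parent_id are collected under the key None.
    """
    groups = defaultdict(list)
    for element in elements:
        parent_id = element.get("metadata", {}).get("parent_id")
        groups[parent_id].append(element)
    return dict(groups)


# Illustrative input: one title and two paragraphs that point at it.
elements = [
    {"element_id": "t1", "type": "Title", "text": "Intro", "metadata": {}},
    {"element_id": "p1", "type": "NarrativeText", "text": "First.",
     "metadata": {"parent_id": "t1"}},
    {"element_id": "p2", "type": "NarrativeText", "text": "Second.",
     "metadata": {"parent_id": "t1"}},
]

grouped = group_by_parent_id(elements)
```

Here `grouped["t1"]` holds the two paragraphs in their original order and `grouped[None]` holds the title. Check the library's own documentation for the real function's arguments and return type.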
0.18.32
What's Changed
- feat: put pdfium call behind a threadlock by @badGarnet in #4211
Full Changelog: 0.18.31...0.18.32
0.18.31
What's Changed
- Feat: patch pdfminer and use rendermode to detect invisible text by @badGarnet in #4158
- fix: add EN DASH to UNICODE_BULLETS for clean_bullets by @MkDev11 in #4186
- fix: fix version number by @badGarnet in #4189
- enhancement: render pdfs with pdfium by @qued in #4185
- feat: consider rotated text as low fidelity by @badGarnet in #4190
- fix: address jaraco CVE by @qued in #4198
- fix: change default for languages parameter from ["auto"] to None by @eureka928 in #4194
- ⚡️ Speed up function _get_optimal_value_for_bbox by 2,883% by @aseembits93 in #4181
- ⚡️ Speed up method _DocxPartitioner._style_based_element_type by 593% by @aseembits93 in #4179
- Luke/update dockerfile by @luke-kucing in #4192
- fix: reduce default dpi to 350 by @qued in #4199
- fix(deps): switch from pip-compile to uv pip compile by @lawrence-u10d in #4202
- fix: remove sandbox=True from pypandoc to fix ODT conversion by @MkDev11 in #4193
- Token-Based Chunking Support by @eureka928 in #4203
- fix: filter coordinates kwargs to prevent TypeError in hi_res PDF processing by @MkDev11 in #4206
- fix(deps): Update docker.elastic.co/elasticsearch/elasticsearch Docker tag to v8.19.10 by @utic-renovate[bot] in #4133
- fix(deps): Update opensearchproject/opensearch Docker tag to v2.19.4 by @utic-renovate[bot] in #4134
- fix(deps): Update semitechnologies/weaviate Docker tag to v1.35.3 by @utic-renovate[bot] in #4135
- fix: Preserve Line Breaks in Code Blocks During Chunking by @eureka928 in #4196
- chore: dep bump to resolve open CVEs by @luke-kucing in #4205
New Contributors
- @MkDev11 made their first contribution in #4186
- @eureka928 made their first contribution in #4194
Full Changelog: 0.18.28...0.18.31
0.18.28
Enhancement
- Optimize clean_extra_whitespace_with_index_run (codeflash)
- Optimize recursive_xy_cut_swapped (codeflash)
- Optimize _DocxPartitioner._parse_category_depth_by_style_name (codeflash)
- Optimize VertexAIEmbeddingEncoder._add_embeddings_to_elements (codeflash)
- Optimize ngrams (codeflash)
- Optimize stage_for_datasaur (codeflash)
0.18.27
Fixes
- Comment no-ops in zoom_image (codeflash)
- Fix an issue where elements with partially filled extracted text are marked as extracted
Enhancement
- Optimize sentence_count (codeflash)
- Optimize _PartitionerLoader._load_partitioner (codeflash)
- Optimize detect_languages (codeflash)
- Optimize contains_verb (codeflash)
- Optimize get_bbox_thickness (codeflash)
- Upgrade pdfminer-six to 20260107 to fix ~15-18% performance regression from eager f-string evaluation
0.18.26
Fixes
- Pin deltalake<1.3.0 to fix ARM64 Docker builds (1.3.0 missing Linux ARM64 wheels)
0.18.25
Fixes
- Security update: Removed pdfminer.six version constraint and bumped pdfminer.six and urllib3 to address high severity CVEs
0.18.24
Enhancement
- Optimize OCRAgentTesseract.extract_word_from_hocr (codeflash)
Fixes
- Security update: Bumped dependencies to address security vulnerabilities