build(deps): bump unstructured_inference to 0.5.0 #618

Closed
wants to merge 45 commits into from
45 commits
da00c7f
update installation docs
MthwRobinson May 18, 2023
98bd864
bump versions
MthwRobinson May 18, 2023
4b6b142
version and changelog
MthwRobinson May 18, 2023
5360953
remove detectron from local inference and ci
MthwRobinson May 18, 2023
2a6850f
remove detectron2 from dockerfile
MthwRobinson May 18, 2023
d04499a
remove detectron2 from dockerfile
MthwRobinson May 18, 2023
4c1c39e
add tabulate
MthwRobinson May 18, 2023
526e159
fix auto tests
MthwRobinson May 18, 2023
ce498d5
fix pdf tests
MthwRobinson May 18, 2023
4409f92
linting, linting, linting
MthwRobinson May 18, 2023
deed8f0
inference instead of detectron2 for hi res check
MthwRobinson May 18, 2023
310ec2e
Revert "fix auto tests"
MthwRobinson May 18, 2023
391d224
Revert "Revert "fix auto tests""
MthwRobinson May 18, 2023
26516c1
no more initialize
MthwRobinson May 18, 2023
a359b36
no more detectron2 import
MthwRobinson May 18, 2023
63d303b
linting, linting, linting
MthwRobinson May 18, 2023
0f908b4
update test fixtures
MthwRobinson May 19, 2023
803a650
Merge branch 'main' into build/bump-versions-and-release
MthwRobinson May 19, 2023
02ffddd
update auto pdf tests
MthwRobinson May 19, 2023
677a9bb
Merge branch 'build/bump-versions-and-release' of github.com:Unstruct…
MthwRobinson May 19, 2023
caccda7
get_model in initialize
MthwRobinson May 19, 2023
99b3c47
remove recalibrating output
MthwRobinson May 19, 2023
bcd5d04
removing silent giant
MthwRobinson May 19, 2023
bdea419
change s3 ingest tests to fast
MthwRobinson May 19, 2023
1d9a8e0
change s3 tests to auto
MthwRobinson May 19, 2023
a6bf4d3
element switcheroo
MthwRobinson May 19, 2023
08e2403
more switcheroos
MthwRobinson May 19, 2023
77bcf90
test fixture updates
MthwRobinson May 19, 2023
e2dde54
ingest tests with detectron2
MthwRobinson May 19, 2023
d6ae3be
install detectron2
MthwRobinson May 19, 2023
232c7e2
revert dockerfile
MthwRobinson May 19, 2023
372882e
detectron2 to original sport
MthwRobinson May 19, 2023
f807149
comment out s3 to test
MthwRobinson May 19, 2023
5b89b7d
Merge branch 'main' into build/bump-versions-and-release
MthwRobinson May 19, 2023
a45df6b
update fixtures
MthwRobinson May 19, 2023
1931505
Merge branch 'build/bump-versions-and-release' of github.com:Unstruct…
MthwRobinson May 19, 2023
6493d65
Merge branch 'main' into build/bump-versions-and-release
MthwRobinson May 19, 2023
a47ad86
add s3 back in
MthwRobinson May 19, 2023
33306d5
Merge branch 'build/bump-versions-and-release' of github.com:Unstruct…
MthwRobinson May 19, 2023
69ef184
test number of files
MthwRobinson May 19, 2023
eab2726
no more detectron2 install
MthwRobinson May 19, 2023
779b897
only test good outputs
MthwRobinson May 19, 2023
03277d0
comment with link to inference issue
MthwRobinson May 19, 2023
f41d209
remove commented out import
MthwRobinson May 19, 2023
58423f5
docs update
MthwRobinson May 19, 2023
2 changes: 0 additions & 2 deletions .github/workflows/ci.yml
@@ -103,7 +103,6 @@ jobs:
 - name: Test
 run: |
 source .venv/bin/activate
-make install-detectron2
 sudo apt-get update
 sudo apt-get install -y libmagic-dev poppler-utils libreoffice pandoc
 sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
@@ -144,7 +143,6 @@ jobs:
 DISCORD_TOKEN: ${{ secrets.DISCORD_TOKEN }}
 run: |
 source .venv/bin/activate
-make install-detectron2
 sudo apt-get update
 sudo apt-get install -y libmagic-dev poppler-utils libreoffice pandoc
 sudo add-apt-repository -y ppa:alex-p/tesseract-ocr5
12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,15 @@
+## 0.7.0-dev0
+
+### Enhancements
+
+* Installing `detectron2` from source is no longer required when using the `local-inference` extra.
+
+### Features
+
+### Fixes
+
+* Better handling of the output order for multicolumn documents when using the `"hi_res"` strategy.
+
 ## 0.6.7

 ### Enhancements
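The changelog entry above says `detectron2` no longer has to be built from source for the `local-inference` extra, and one commit in this PR is titled "inference instead of detectron2 for hi res check". A minimal sketch of that kind of dependency check follows, assuming the check only needs to know whether a top-level module is importable. The helper names `module_available` and `hi_res_available` are hypothetical and are not taken from the `unstructured` codebase.

```python
import importlib.util


def module_available(module_name: str) -> bool:
    """Return True if a top-level module can be located, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def hi_res_available() -> bool:
    # Assumption (not confirmed by the diff): the "hi_res" strategy only
    # needs unstructured_inference to be importable, not detectron2.
    return module_available("unstructured_inference")


if __name__ == "__main__":
    print("unstructured_inference importable:", hi_res_available())
    print("detectron2 importable:", module_available("detectron2"))
```

Using `find_spec` rather than a bare `import` keeps the check cheap, since heavy model libraries are not loaded just to test for their presence.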
3 changes: 1 addition & 2 deletions Dockerfile
@@ -101,8 +101,7 @@ RUN python3.8 -m pip install pip==${PIP_VERSION} && \
 pip install --no-cache -r requirements/ingest-slack.txt && \
 pip install --no-cache -r requirements/ingest-wikipedia.txt && \
 pip install --no-cache -r requirements/local-inference.txt && \
-scl enable devtoolset-9 bash && \
-pip install --no-cache "detectron2@git+https://github.com/facebookresearch/detectron2.git@e2ce8dc#egg=detectron2"
+scl enable devtoolset-9 bash

 COPY example-docs example-docs
 COPY unstructured unstructured
2 changes: 1 addition & 1 deletion Makefile
@@ -103,7 +103,7 @@ install-detectron2: install-tensorboard

 ## install-local-inference: installs requirements for local inference
 .PHONY: install-local-inference
-install-local-inference: install install-unstructured-inference install-detectron2
+install-local-inference: install install-unstructured-inference

 ## pip-compile: compiles all base/dev/test requirements
 .PHONY: pip-compile
1 change: 0 additions & 1 deletion docker/ubuntu-22/Dockerfile
@@ -16,7 +16,6 @@ SHELL ["/bin/bash", "-c"]
 RUN source ~/.bashrc && pyenv virtualenv 3.8.15 unstructured && \
 source ~/.pyenv/versions/unstructured/bin/activate && \
 make install-ci && \
-make install-detectron2 && \
 make install-ingest-s3 && \
 make install-ingest-azure && \
 make install-ingest-github && \
3 changes: 1 addition & 2 deletions docs/source/installing.rst
@@ -17,8 +17,7 @@ installation.
 * ``libreoffice`` (MS Office docs)
 * ``pandocs`` (EPUBs, RTFs and Open Office docs)

-* If you are parsing PDFs, run the following to install the ``detectron2`` model, which ``unstructured`` uses for layout detection:
-* ``pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@e2ce8dc#egg=detectron2"``
+* Follow the instructions `here <https://github.com/Unstructured-IO/unstructured-inference#detectron2>`_ to install ``detectron2``. This is required if you would like to use custom models from the `LayoutParser Model Zoo <https://github.com/Unstructured-IO/unstructured-inference#using-models-from-the-layoutparser-model-zoo>`_.

 At this point, you should be able to run the following code:

13 changes: 10 additions & 3 deletions requirements/base.txt
@@ -6,11 +6,11 @@
 #
 anyio==3.6.2
 # via httpcore
-argilla==1.6.0
+argilla==1.7.0
 # via unstructured (setup.py)
 backoff==2.2.1
 # via argilla
-certifi==2022.12.7
+certifi==2023.5.7
 # via
 # httpcore
 # httpx
@@ -23,7 +23,9 @@ charset-normalizer==3.1.0
 # pdfminer-six
 # requests
 click==8.1.3
-# via nltk
+# via
+# nltk
+# typer
 commonmark==0.9.1
 # via rich
 cryptography==40.0.2
@@ -113,14 +115,19 @@ sniffio==1.3.0
 # anyio
 # httpcore
 # httpx
+tabulate==0.9.0
+# via unstructured (setup.py)
 tqdm==4.65.0
 # via
 # argilla
 # nltk
+typer==0.9.0
+# via argilla
 typing-extensions==4.5.0
 # via
 # pydantic
 # rich
+# typer
 urllib3==2.0.2
 # via requests
 wrapt==1.14.1
50 changes: 21 additions & 29 deletions requirements/local-inference.txt
@@ -7,14 +7,12 @@
 antlr4-python3-runtime==4.9.3
 # via omegaconf
 anyio==3.6.2
-# via
-# httpcore
-# starlette
-argilla==1.6.0
+# via httpcore
+argilla==1.7.0
 # via unstructured (setup.py)
 backoff==2.2.1
 # via argilla
-certifi==2022.12.7
+certifi==2023.5.7
 # via
 # httpcore
 # httpx
@@ -29,7 +27,7 @@ charset-normalizer==3.1.0
 click==8.1.3
 # via
 # nltk
-# uvicorn
+# typer
 coloredlogs==15.0.1
 # via onnxruntime
 commonmark==0.9.1
@@ -46,23 +44,19 @@ effdet==0.3.0
 # via layoutparser
 et-xmlfile==1.1.0
 # via openpyxl
-fastapi==0.95.1
-# via unstructured-inference
 filelock==3.12.0
 # via
 # huggingface-hub
 # torch
 # transformers
-flatbuffers==23.3.3
+flatbuffers==23.5.9
 # via onnxruntime
-fonttools==4.39.3
+fonttools==4.39.4
 # via matplotlib
-fsspec==2023.4.0
+fsspec==2023.5.0
 # via huggingface-hub
 h11==0.14.0
-# via
-# httpcore
-# uvicorn
+# via httpcore
 httpcore==0.16.3
 # via httpx
 httpx==0.23.3
@@ -172,16 +166,14 @@ pillow==9.5.0
 # unstructured (setup.py)
 portalocker==2.7.0
 # via iopath
-protobuf==4.22.4
+protobuf==4.23.1
 # via onnxruntime
 pycocotools==2.0.6
 # via effdet
 pycparser==2.21
 # via cffi
 pydantic==1.10.7
-# via
-# argilla
-# fastapi
+# via argilla
 pygments==2.15.1
 # via rich
 pypandoc==1.11
@@ -225,6 +217,8 @@ rfc3986[idna2008]==1.5.0
 # via httpx
 rich==13.0.1
 # via argilla
+safetensors==0.3.1
+# via timm
 scipy==1.10.1
 # via layoutparser
 six==1.16.0
@@ -234,23 +228,21 @@ sniffio==1.3.0
 # anyio
 # httpcore
 # httpx
-starlette==0.26.1
-# via fastapi
-sympy==1.11.1
+sympy==1.12
 # via
 # onnxruntime
 # torch
-timm==0.6.13
+timm==0.9.2
 # via effdet
 tokenizers==0.13.3
 # via transformers
-torch==2.0.0
+torch==2.0.1
 # via
 # effdet
 # layoutparser
 # timm
 # torchvision
-torchvision==0.15.1
+torchvision==0.15.2
 # via
 # effdet
 # layoutparser
@@ -262,22 +254,22 @@ tqdm==4.65.0
 # iopath
 # nltk
 # transformers
-transformers==4.28.1
+transformers==4.29.2
 # via unstructured-inference
+typer==0.9.0
+# via argilla
 typing-extensions==4.5.0
 # via
 # huggingface-hub
 # iopath
 # pydantic
 # rich
-# starlette
 # torch
-unstructured-inference==0.4.4
+# typer
+unstructured-inference==0.5.0
 # via unstructured (setup.py)
 urllib3==2.0.2
 # via requests
-uvicorn==0.22.0
-# via unstructured-inference
 wand==0.6.11
 # via pdfplumber
 wrapt==1.14.1
4 changes: 3 additions & 1 deletion setup.py
@@ -39,6 +39,7 @@
 "Programming Language :: Python :: 3",
 "Programming Language :: Python :: 3.8",
 "Programming Language :: Python :: 3.9",
+"Programming Language :: Python :: 3.10",
 "Topic :: Scientific/Engineering :: Artificial Intelligence",
 ],
 author="Unstructured Technologies",
@@ -63,6 +64,7 @@
 "python-pptx",
 "python-magic",
 "markdown",
+"tabulate",
 "requests",
 # NOTE(robinson) - The following dependencies are pinned
 # to address security scans
@@ -77,7 +79,7 @@
 "transformers",
 ],
 "local-inference": [
-"unstructured-inference==0.4.4",
+"unstructured-inference==0.5.0",
 ],
 "s3": ["s3fs", "fsspec"],
 "azure": ["adlfs", "fsspec"],
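The `local-inference` extra above now pins `unstructured-inference==0.5.0`. As a rough sketch, assuming you want to confirm that an environment actually picked up the new pin, the standard-library `importlib.metadata` module can report the installed version. The helper names `installed_version` and `matches_pin` are hypothetical and are not part of this PR.

```python
from importlib import metadata
from typing import Optional

# Pin taken from the "local-inference" extra in the setup.py diff above.
PINNED_VERSION = "0.5.0"


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(dist_name: str, pinned: str) -> bool:
    """Return True only if the distribution is installed at exactly `pinned`."""
    return installed_version(dist_name) == pinned


if __name__ == "__main__":
    name = "unstructured-inference"
    print(name, "installed:", installed_version(name))
    print("matches pin:", matches_pin(name, PINNED_VERSION))
```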
8 changes: 4 additions & 4 deletions test_unstructured/partition/test_auto.py
@@ -291,12 +291,12 @@ def test_auto_partition_pdf_from_filename(pass_file_filename, content_type):
 assert isinstance(elements[0], Title)
 assert elements[0].text.startswith("LayoutParser")

-assert isinstance(elements[1], NarrativeText)
-assert elements[1].text.startswith("Zejiang Shen")
-
 assert elements[0].metadata.filename == os.path.basename(filename)
 assert elements[0].metadata.file_directory == os.path.split(filename)[0]

+assert isinstance(elements[1], NarrativeText)
+assert elements[1].text.startswith("Shen")
+

 def test_auto_partition_pdf_uses_table_extraction():
 filename = os.path.join(EXAMPLE_DOCS_DIRECTORY, "layout-parser-paper-fast.pdf")
@@ -346,7 +346,7 @@ def test_auto_partition_pdf_from_file(pass_file_filename, content_type):
 assert elements[0].text.startswith("LayoutParser")

 assert isinstance(elements[1], NarrativeText)
-assert elements[1].text.startswith("Zejiang Shen")
+assert elements[1].text.startswith("Shen")


 def test_partition_pdf_doesnt_raise_warning():