Don't Install Tableau API on arm64 #218

Merged
merged 3 commits into from
Jan 11, 2024
2 changes: 1 addition & 1 deletion .env
@@ -20,4 +20,4 @@ PUBLIC_ARCHIVE_BUCKET=mbta-ctd-dataplatform-dev-archive
# Tableau
TABLEAU_USER=DOUPDATE
TABLEAU_PASSWORD=DOUPDATE
TABLEAU_SERVER=http://awtabDEV02.mbta.com
TABLEAU_SERVER=http://awtabDEV02.mbta.com
2 changes: 1 addition & 1 deletion .envrc
@@ -1,3 +1,3 @@
use asdf

dotenv
dotenv
2 changes: 1 addition & 1 deletion .tool-versions
@@ -1,3 +1,3 @@
poetry 1.4.2
poetry 1.7.1
python 3.10.13
direnv 2.32.2
12 changes: 10 additions & 2 deletions python_src/Dockerfile
@@ -16,13 +16,21 @@ RUN chmod a=r /usr/local/share/amazon-certs.pem

# Install poetry
RUN pip install -U pip
RUN pip install "poetry==1.4.2"
RUN pip install "poetry==1.7.1"

# copy poetry and pyproject files and install dependencies
WORKDIR /lamp/
COPY poetry.lock poetry.lock
COPY pyproject.toml pyproject.toml
RUN poetry install --no-dev --no-interaction --no-ansi -v

# Tableau dependencies for arm64 cannot be resolved (since salesforce doesn't
# support them yet). For that buildplatform build without those dependencies
ARG TARGETARCH BUILDPLATFORM TARGETPLATFORM
RUN echo "Installing python dependencies for build: ${BUILDPLATFORM} target: ${TARGETPLATFORM}"
RUN if [ "$TARGETARCH" = "arm64" ]; then \
      poetry install --without tableau --no-interaction --no-ansi -v ;\
    else poetry install --no-interaction --no-ansi -v ;\
    fi

# Copy src directory to run against and build lamp py
COPY src src
1,373 changes: 625 additions & 748 deletions python_src/poetry.lock

Large diffs are not rendered by default.

9 changes: 7 additions & 2 deletions python_src/pyproject.toml
@@ -27,10 +27,15 @@ psutil = "^5.9.1"
schedule = "^1.1.0"
alembic = "^1.10.2"
types-pytz = "^2023.3.0.1"

[tool.poetry.group.tableau]
Review comment (Contributor, Author):

Add a dependency group for the Tableau dependencies. It's not optional, meaning a user will have to change behavior to build without it.

optional = false

[tool.poetry.group.tableau.dependencies]
tableauhyperapi = "^0.0.17971"
tableauserverclient = "0.25"

[tool.poetry.dev-dependencies]
[tool.poetry.group.dev.dependencies]
black = "^23.1.0"
mypy = "^1.1.1"
pylint = "^2.17.0"
@@ -80,6 +85,6 @@ max-line-length = 80
min-similarity-lines = 10
# ignore session maker as it gives pylint fits
# https://github.com/PyCQA/pylint/issues/7090
ignored-classes = ['sqlalchemy.orm.session.sessionmaker','pyarrow.compute']
ignored-classes = ['sqlalchemy.orm.session.sessionmaker', 'pyarrow.compute']
# ignore the migrations directory. its going to have duplication and _that is ok_.
ignore-paths = ["^src/lamp_py/migrations/.*$"]
2 changes: 1 addition & 1 deletion python_src/src/lamp_py/performance_manager/pipeline.py
@@ -15,7 +15,7 @@
from lamp_py.runtime_utils.env_validation import validate_environment
from lamp_py.runtime_utils.process_logger import ProcessLogger

from lamp_py.tableau.pipeline import start_parquet_updates
from lamp_py.tableau import start_parquet_updates
Review comment (Contributor, Author):

This is the only part I don't like. If we import directly from the file, we're going to get the import error, and it felt like catching that error should be left to the subdir.


from .flat_file import write_flat_files
from .l0_gtfs_rt_events import process_gtfs_rt_files
21 changes: 21 additions & 0 deletions python_src/src/lamp_py/tableau/README.md
@@ -0,0 +1,21 @@
# Tableau Publisher

The Tableau Publisher is an application that takes data created by the Rail Performance Manager application as parquet files and publishes them to the ITD Managed Tableau Instance as hyper files.

## Application Operation

The application itself is run via a CloudWatch event that is set to trigger on a cron-like schedule.

On each run, it iterates through a list of jobs that generate hyper files and uploads them to the ITD Tableau server, where they can be used to generate dashboards and reports for external users. To generate a hyper file, each job reads a parquet file that has been created by upstream LAMP applications and converts it using the [Tableau Hyper API](https://www.tableau.com/developer/tools/hyper-api). The file is generated on local storage and then uploaded to the ITD Managed Tableau server using the [Tableau Server Client](https://tableau.github.io/server-client-python/), a Python library wrapping the [Tableau REST API](https://help.tableau.com/current/api/rest_api/en-us/REST/rest_api.htm).

### Upstream Applications

To simplify the conversion from parquet to hyper, the schemas for both are defined within this module. We also store the hardcoded S3 filepaths. Because of this, components of this library are used by other applications when writing the parquet files.

## Developer Note

The Tableau Hyper API is not currently supported on Apple Silicon. This means that local execution on macOS with arm64 processors will not work without emulation. As a result, imports from this directory will raise `ModuleNotFoundError` when the Tableau dependencies are not installed. To avoid that, the `__init__.py` file includes a wrapper around components that are consumed by other applications. These functions will log an error when run without the required dependencies.

### Installation without Tableau dependencies

In `pyproject.toml`, there is an additional dependency group that contains the Tableau dependencies. It is not marked optional, so these modules will be installed with `poetry install`. If you are on an arm64 architecture, you can avoid installing the Tableau dependencies with `poetry install --without tableau`. This behavior is encoded in the `.envrc`, `docker-compose.yml`, and `Dockerfile` files in this repository, so you should get the desired behavior without additional arguments.
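
The architecture check described above can be sketched in Python. This is only an illustration, not code from this repository: the helper name `poetry_install_args` and the `ARM64_NAMES` set are hypothetical, and the set of machine names is an assumption that may not cover every platform.

```python
import platform
from typing import List, Optional

# Machine names commonly reported for 64-bit ARM (assumed, not exhaustive):
# macOS on Apple Silicon reports "arm64", Linux on ARM reports "aarch64".
ARM64_NAMES = {"arm64", "aarch64"}


def poetry_install_args(machine: Optional[str] = None) -> List[str]:
    """Build a `poetry install` command, skipping the tableau group on arm64."""
    # Fall back to the current machine when no name is passed in.
    machine = (machine or platform.machine()).lower()
    args = ["poetry", "install"]
    if machine in ARM64_NAMES:
        args += ["--without", "tableau"]
    return args


if __name__ == "__main__":
    # Show the command that would be used on the current machine.
    print(" ".join(poetry_install_args()))
```

Passing the machine name explicitly, for example `poetry_install_args("x86_64")`, makes the choice easy to check without running on each architecture.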
33 changes: 32 additions & 1 deletion python_src/src/lamp_py/tableau/__init__.py
@@ -1 +1,32 @@
"""Utilites for Interacting with Tableau and Hyper files"""
"""Utilities for Interacting with Tableau and Hyper files"""
import logging
Review comment (Contributor, Author):

Update the tableau init file to catch exceptions when the tableau modules aren't found.

from types import ModuleType
from typing import Optional

from lamp_py.postgres.postgres_utils import DatabaseManager

# pylint: disable=C0103 (invalid-name)
# pylint wants pipeline to conform to an UPPER_CASE constant naming style. its
# a module though, so disabling to allow it to use normal import rules.
pipeline: Optional[ModuleType]

try:
Review comment (Member):

suggestion (non-blocking):

try:
  from . import pipeline
except ModuleNotFoundError:
  pipeline = None

def start_parquet_updates(db_manager: DatabaseManager) -> None:
  if pipeline is None:
    logging.exception(...)
  else:
    pipeline.start_parquet_updates(db_manager)

Review comment (Contributor, Author):

Oh, I like this much better.

    from . import pipeline
except ModuleNotFoundError:
    pipeline = None

# pylint: enable=C0103 (invalid-name)


def start_parquet_updates(db_manager: DatabaseManager) -> None:
    """
    wrapper around pipeline.start_parquet_updates function. if a module not
    found error occurs (which happens when using osx arm64 dependencies), log
    an error and do nothing. else, run the function.
    """
    if pipeline is None:
        logging.error(
            "Unable to run parquet files on this machine due to Module Not Found error"
        )
    else:
        pipeline.start_parquet_updates(db_manager=db_manager)
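
For readers who want to try the guarded-import pattern outside the package, here is a minimal standalone sketch. It is not the repository's code: `load_optional` and the module name `hypothetical_tableau_pipeline` are made up for illustration, and unlike the wrapper in the diff, this version returns a bool so the skipped case is easy to observe.

```python
import importlib
import logging
from types import ModuleType
from typing import Optional


def load_optional(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError:
        return None


# "hypothetical_tableau_pipeline" is a made-up name standing in for a module
# whose dependencies are missing on this machine.
pipeline = load_optional("hypothetical_tableau_pipeline")


def start_parquet_updates(db_manager: object) -> bool:
    """Run the real function when available; otherwise log and do nothing.

    Returns True when the underlying function ran, False when it was skipped.
    """
    if pipeline is None:
        logging.error("tableau pipeline unavailable; skipping parquet updates")
        return False
    pipeline.start_parquet_updates(db_manager=db_manager)
    return True
```

Callers import the wrapper from the package rather than from the submodule, so a missing dependency surfaces as a logged error at call time instead of an import failure at startup.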