Convert object storage connector docs to md
mosabua committed Aug 22, 2023
1 parent 0663f67 commit 79931e7
Showing 17 changed files with 2,374 additions and 2,388 deletions.


16 changes: 16 additions & 0 deletions docs/src/main/sphinx/connector/hive-alluxio.md
# Hive connector with Alluxio

The {doc}`hive` can read and write tables stored in the [Alluxio Data Orchestration
System](https://www.alluxio.io/),
leveraging Alluxio's distributed block-level read/write caching functionality.
The tables must be created in the Hive metastore with the `alluxio://`
location prefix (see [Running Apache Hive with Alluxio](https://docs.alluxio.io/os/user/stable/en/compute/Hive.html)
for details and examples).

Trino queries then transparently retrieve and cache files or objects from
disparate storage systems, including HDFS and S3.
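
As a hedged illustration of the `alluxio://` location requirement, the following
sketch maps existing ORC data through the Hive catalog; the catalog name `hive`,
the schema, table, and column names, and the Alluxio master address
`alluxio-master:19998` are assumptions for the example, not values from this page.

```
-- Hypothetical example: catalog, schema, columns, and the Alluxio master
-- address are assumptions; adjust them to your environment.
CREATE TABLE hive.default.events (
    event_id BIGINT,
    event_time TIMESTAMP,
    payload VARCHAR
)
WITH (
    external_location = 'alluxio://alluxio-master:19998/data/events',
    format = 'ORC'
);
```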

## Setting up Alluxio with Trino

For information on how to set up, configure, and use Alluxio, refer to [Alluxio's
documentation on using their platform with Trino](https://docs.alluxio.io/ee/user/stable/en/compute/Trino.html).
21 changes: 0 additions & 21 deletions docs/src/main/sphinx/connector/hive-alluxio.rst

This file was deleted.

# Hive connector with Azure Storage

The {doc}`hive` can be configured to use [Azure Data Lake Storage (Gen2)](https://azure.microsoft.com/products/storage/data-lake-storage/). Trino
supports Azure Blob File System (ABFS) to access data in ADLS Gen2.

Trino also supports [ADLS Gen1](https://learn.microsoft.com/azure/data-lake-store/data-lake-store-overview)
and Windows Azure Storage Blob driver (WASB), but we recommend [migrating to
ADLS Gen2](https://learn.microsoft.com/azure/storage/blobs/data-lake-storage-migrate-gen1-to-gen2-azure-portal),
as ADLS Gen1 and WASB are legacy options that will be removed in the future.
Learn more from [the official documentation](https://docs.microsoft.com/azure/data-lake-store/data-lake-store-overview).

## Hive connector configuration for Azure Storage credentials

To configure Trino to use the Azure Storage credentials, set the following
configuration properties in the catalog properties file. It is best to use this
type of configuration for the exact type of storage your catalog uses. The
specific configuration depends on the type of storage and uses the
properties from the following sections in the catalog properties file.

For more complex use cases, such as configuring multiple secondary storage
accounts using Hadoop's `core-site.xml`, see the
{ref}`hive-azure-advanced-config` options.

### ADLS Gen2 / ABFS storage

To connect to ABFS storage, you may use either the storage account's access
key or a service principal. Do not use both sets of properties at the
same time.

```{eval-rst}
.. list-table:: ABFS Access Key
  :widths: 30, 70
  :header-rows: 1

  * - Property name
    - Description
  * - ``hive.azure.abfs-storage-account``
    - The name of the ADLS Gen2 storage account
  * - ``hive.azure.abfs-access-key``
    - The decrypted access key for the ADLS Gen2 storage account
```

```{eval-rst}
.. list-table:: ABFS Service Principal OAuth
  :widths: 30, 70
  :header-rows: 1

  * - Property name
    - Description
  * - ``hive.azure.abfs.oauth.endpoint``
    - The service principal's OAuth 2.0 token endpoint.
  * - ``hive.azure.abfs.oauth.client-id``
    - The service principal's client/application ID.
  * - ``hive.azure.abfs.oauth.secret``
    - A client secret for the service principal.
```

When using a service principal, it must have the Storage Blob Data Owner,
Contributor, or Reader role on the storage account you are using, depending on
which operations you would like to use.
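
For example, a minimal catalog properties sketch for the access key approach
might look like the following; the `connector.name` and `hive.metastore.uri`
lines are assumed boilerplate for a Hive catalog, and the account name and key
are placeholder values.

```text
connector.name=hive
hive.metastore.uri=thrift://example-metastore:9083
hive.azure.abfs-storage-account=abfsexample
hive.azure.abfs-access-key=examplekey...
```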

### ADLS Gen1 (legacy)

While it is advised to migrate to ADLS Gen2 whenever possible, if you still
choose to use ADLS Gen1, you need to include the following properties in your
catalog configuration.

:::{note}
Credentials for the filesystem can be configured using `ClientCredential`
type. To authenticate with ADLS Gen1 you must create a new application
secret for your ADLS Gen1 account's App Registration, and save this value
because you won't be able to retrieve the key later. Refer to the Azure
[documentation](https://docs.microsoft.com/azure/data-lake-store/data-lake-store-service-to-service-authenticate-using-active-directory)
for details.
:::

```{eval-rst}
.. list-table:: ADLS properties
  :widths: 30, 70
  :header-rows: 1

  * - Property name
    - Description
  * - ``hive.azure.adl-proxy-host``
    - Proxy host and port in ``host:port`` format. Use this property to connect
      to an ADLS endpoint via a SOCKS proxy.
```
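
As a sketch, assuming the ADLS Gen1 credentials are configured separately,
routing requests through a SOCKS proxy only requires the proxy property; the
host and port below are placeholders.

```text
hive.azure.adl-proxy-host=socks-proxy.example.com:1080
```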

### WASB storage (legacy)

```{eval-rst}
.. list-table:: WASB properties
  :widths: 30, 70
  :header-rows: 1

  * - Property name
    - Description
  * - ``hive.azure.wasb-storage-account``
    - Storage account name of Azure Blob Storage
  * - ``hive.azure.wasb-access-key``
    - The decrypted access key for the Azure Blob Storage
```
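
As a sketch only, the WASB properties might be set in the catalog like this;
both values are placeholders.

```text
hive.azure.wasb-storage-account=wasbexample
hive.azure.wasb-access-key=examplekey...
```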

(hive-azure-advanced-config)=

### Advanced configuration

All of the configuration properties for the Azure storage driver are stored in
the Hadoop `core-site.xml` configuration file. When there are secondary
storage accounts involved, we recommend configuring Trino using a
`core-site.xml` containing the appropriate credentials for each account.

The path to the file must be configured in the catalog properties file:

```text
hive.config.resources=<path_to_hadoop_core-site.xml>
```

One way to find your account key is to ask for the connection string for the
storage account. The `abfsexample.dfs.core.windows.net` account refers to the
storage account. The connection string contains the account key:

```text
az storage account show-connection-string --name abfsexample
{
  "connectionString": "DefaultEndpointsProtocol=https;EndpointSuffix=core.windows.net;AccountName=abfsexample;AccountKey=examplekey..."
}
```

When you have the account access key, you can add it to your `core-site.xml`
or Java cryptography extension (JCEKS) file. Alternatively, you can have your
cluster management tool set the option
`fs.azure.account.key.STORAGE-ACCOUNT` to the account key value:

```text
<property>
  <name>fs.azure.account.key.abfsexample.dfs.core.windows.net</name>
  <value>examplekey...</value>
</property>
```
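
If you prefer the JCEKS option mentioned above, a hedged sketch using the Hadoop
credential CLI might look like the following; the keystore path is a placeholder
and the exact flags may vary by Hadoop version, so treat this as an assumption
rather than a verified command. The provider then needs to be referenced from
`core-site.xml` through the `hadoop.security.credential.provider.path` property.

```text
# Hypothetical sketch: store the ABFS account key in a JCEKS keystore
hadoop credential create fs.azure.account.key.abfsexample.dfs.core.windows.net \
    -provider jceks://file/etc/trino/azure.jceks \
    -value examplekey...
```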

For more information, see [Hadoop Azure Support: ABFS](https://hadoop.apache.org/docs/stable/hadoop-azure/abfs.html).

## Accessing Azure Storage data

### URI scheme to reference data

Consistent with other FileSystem implementations within Hadoop, the Azure
Standard Blob and Azure Data Lake Storage Gen2 (ABFS) drivers define their own
URI schemes for addressing resources. Following are example URIs for the
different systems.

ABFS URI:

```text
abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<path>/<file_name>
```
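
For instance, assuming a hypothetical `datasets` container in the `abfsexample`
account used elsewhere on this page, a resolved URI using the secure `abfss`
variant could look like:

```text
abfss://datasets@abfsexample.dfs.core.windows.net/warehouse/orders/part-00000.orc
```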

ADLS Gen1 URI:

```text
adl://<data_lake_storage_gen1_name>.azuredatalakestore.net/<path>/<file_name>
```

Azure Standard Blob URI:

```text
wasb[s]://<container>@<account_name>.blob.core.windows.net/<path>/<path>/<file_name>
```

### Querying Azure Storage

You can query tables already configured in the Hive metastore used by your Hive
catalog. To access Azure Storage data that is not yet mapped in the Hive
metastore, you need to provide the schema of the data, the file format, and the
data location.

For example, if you have ORC or Parquet files in an ABFS `file_system`, you
need to execute a query:

```
-- select schema in which the table is to be defined, must already exist
USE hive.default;

-- create table
CREATE TABLE orders (
    orderkey BIGINT,
    custkey BIGINT,
    orderstatus VARCHAR(1),
    totalprice DOUBLE,
    orderdate DATE,
    orderpriority VARCHAR(15),
    clerk VARCHAR(15),
    shippriority INTEGER,
    comment VARCHAR(79)
) WITH (
    external_location = 'abfs[s]://<file_system>@<account_name>.dfs.core.windows.net/<path>/<path>/',
    format = 'ORC' -- or 'PARQUET'
);
```

Now you can query the newly mapped table:

```
SELECT * FROM orders;
```

## Writing data

### Prerequisites

Before you attempt to write data to Azure Storage, make sure you have configured
everything necessary to read data from the storage.

### Create a write schema

If the Hive metastore contains schema(s) mapped to Azure storage filesystems,
you can use them to write data to Azure storage.

If you don't want to use existing schemas, or there are no appropriate schemas
in the Hive metastore, you need to create a new one:

```
CREATE SCHEMA hive.abfs_export
WITH (location = 'abfs[s]://file_system@account_name.dfs.core.windows.net/<path>');
```

### Write data to Azure Storage

Once you have a schema pointing to a location where you want to write the data,
you can issue a `CREATE TABLE AS` statement and select your desired file
format. The data will be written to one or more files within the
`abfs[s]://file_system@account_name.dfs.core.windows.net/<path>/my_table`
namespace. Example:

```
CREATE TABLE hive.abfs_export.orders_abfs
WITH (format = 'ORC')
AS SELECT * FROM tpch.sf1.orders;
```