feat!: V2 Qdrant Java client, named vectors #12

Merged 2 commits on Feb 28, 2024
6 changes: 3 additions & 3 deletions .github/workflows/release.yml
@@ -23,11 +23,11 @@ jobs:
echo "AUTHOR_EMAIL=$AUTHOR_EMAIL" >> $GITHUB_OUTPUT
id: author_info

- name: Set up Java 17
- name: Set up Java 8
uses: actions/setup-java@v3
with:
distribution: 'oracle'
java-version: '17'
java-version: "8"
distribution: temurin
server-id: ossrh
server-username: OSSRH_JIRA_USERNAME
server-password: OSSRH_JIRA_PASSWORD
4 changes: 2 additions & 2 deletions .github/workflows/test.yml
@@ -16,9 +16,9 @@ jobs:
- uses: actions/checkout@v4
- uses: actions/setup-java@v3
with:
java-version: "17"
java-version: "8"
distribution: temurin
- name: Run the Maven tests
run: mvn test
- name: Generate assembly fat JAR
run: mvn clean package -Passembly
run: mvn clean package
25 changes: 13 additions & 12 deletions README.md
@@ -5,19 +5,19 @@
## Installation 🚀

> [!IMPORTANT]
> Requires Java 17 or above.
> Requires Java 8 or above.

### GitHub Releases 📦

The packaged `jar` file releases can be found [here](https://github.com/qdrant/qdrant-spark/releases).
The packaged `jar` file can be found [here](https://github.com/qdrant/qdrant-spark/releases).

### Building from source 🛠️

To build the `jar` from source, you need [JDK@17](https://www.oracle.com/java/technologies/javase/jdk17-archive-downloads.html) and [Maven](https://maven.apache.org/) installed.
To build the `jar` from source, you need [JDK@8](https://www.azul.com/downloads/#zulu) and [Maven](https://maven.apache.org/) installed.
Once the requirements have been satisfied, run the following command in the project root. 🛠️

```bash
mvn package -P assembly
mvn package
```

This will build and store the fat JAR in the `target` directory by default.
@@ -30,7 +30,7 @@ For use with Java and Scala projects, the package can be found [here](https://ce
<dependency>
<groupId>io.qdrant</groupId>
<artifactId>spark</artifactId>
<version>1.12.1</version>
<version>2.0</version>
</dependency>
```

@@ -43,7 +43,7 @@ from pyspark.sql import SparkSession

spark = SparkSession.builder.config(
"spark.jars",
"spark-1.12.1-jar-with-dependencies.jar", # specify the downloaded JAR file
"spark-2.0.jar", # specify the downloaded JAR file
)
.master("local[*]")
.appName("qdrant")
@@ -58,7 +58,7 @@ To load data into Qdrant, a collection has to be created beforehand with the app
<pyspark.sql.DataFrame>
.write
.format("io.qdrant.spark.Qdrant")
.option("qdrant_url", <QDRANT_URL>)
.option("qdrant_url", <QDRANT_GRPC_URL>)
.option("collection_name", <QDRANT_COLLECTION_NAME>)
.option("embedding_field", <EMBEDDING_FIELD_NAME>) # Expected to be a field of type ArrayType(FloatType)
.option("schema", <pyspark.sql.DataFrame>.schema.json())
@@ -70,31 +70,32 @@ To load data into Qdrant, a collection has to be created beforehand with the app
- An API key can be set using the `api_key` option to make authenticated requests.
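The connector rejects a write when a required option is missing, or when the configured `id_field`/`embedding_field` is absent from the schema (see `Qdrant.validateOptions` in this PR's diff). As a minimal Python sketch of those same checks — the helper name and the schema-field list are illustrative, not part of the connector's API:

```python
REQUIRED_OPTIONS = ["qdrant_url", "collection_name", "embedding_field", "schema"]

def validate_options(options, schema_fields):
    """Mirror of the connector's option checks, for illustration only."""
    for key in REQUIRED_OPTIONS:
        if key not in options:
            raise ValueError(f"{key} option is required")
    # id_field is optional, but if given it must exist in the schema
    id_field = options.get("id_field")
    if id_field is not None and id_field not in schema_fields:
        raise ValueError("Specified 'id_field' is not present in the schema")
    # embedding_field is required and must exist in the schema
    if options["embedding_field"] not in schema_fields:
        raise ValueError("Specified 'embedding_field' is not present in the schema")
```

Running these checks up front, before any batch is sent, means a misconfigured job fails fast instead of partway through an upload.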

## Databricks

You can use the `qdrant-spark` connector as a library in Databricks to ingest data into Qdrant.

- Go to the `Libraries` section in your cluster dashboard.
- Select `Install New` to open the library installation modal.
- Search for `io.qdrant:spark:1.12.1` in the Maven packages and click `Install`.
- Search for `io.qdrant:spark:2.0` in the Maven packages and click `Install`.

<img width="1064" alt="Screenshot 2024-01-05 at 17 20 01 (1)" src="https://github.com/qdrant/qdrant-spark/assets/46051506/d95773e0-c5c6-4ff2-bf50-8055bb08fd1b">


## Datatype support 📋

Qdrant supports all the Spark data types, and the appropriate types are mapped based on the provided `schema`.
Qdrant supports all the Spark data types. The appropriate types are mapped based on the provided `schema`.

## Options and Spark types 🛠️

| Option | Description | DataType | Required |
| :---------------- | :------------------------------------------------------------------------ | :--------------------- | :------- |
| `qdrant_url` | REST URL of the Qdrant instance | `StringType` | ✅ |
| `qdrant_url` | gRPC URL of the Qdrant instance, e.g. <http://localhost:6334> | `StringType` | ✅ |
| `collection_name` | Name of the collection to write data into | `StringType` | ✅ |
| `embedding_field` | Name of the field holding the embeddings | `ArrayType(FloatType)` | ✅ |
| `schema` | JSON string of the dataframe schema | `StringType` | ✅ |
| `mode` | Write mode of the dataframe. Supports "append". | `StringType` | ✅ |
| `id_field` | Name of the field holding the point IDs. Default: Generates a random UUID | `StringType` | ❌ |
| `batch_size` | Max size of the upload batch. Default: 100 | `IntType` | ❌ |
| `retries` | Number of upload retries. Default: 3 | `IntType` | ❌ |
| `api_key` | Qdrant API key to be sent in the header. Default: null | `StringType` | ❌ |
| `vector_name` | Name of the vector in the collection. Default: null | `StringType` | ❌ |
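The `batch_size` and `retries` options above govern how points are pushed to Qdrant: the dataframe is uploaded in batches, and a failed batch is retried before the write is abandoned. A rough Python sketch of that behavior — the function and its structure are assumptions for illustration; the connector's actual upload path lives in the Java gRPC client:

```python
def upload_in_batches(points, upload_fn, batch_size=100, retries=3):
    """Split points into batches; retry each failed batch up to `retries` times."""
    for start in range(0, len(points), batch_size):
        batch = points[start:start + batch_size]
        for attempt in range(retries + 1):
            try:
                upload_fn(batch)
                break  # batch uploaded, move on
            except Exception:
                if attempt == retries:
                    raise  # out of retries: surface the failure
```

Batching keeps individual gRPC requests small, and per-batch retries mean a transient failure only re-sends one batch rather than the whole dataframe.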

## LICENSE 📜

129 changes: 76 additions & 53 deletions pom.xml
@@ -6,7 +6,7 @@
<modelVersion>4.0.0</modelVersion>
<groupId>io.qdrant</groupId>
<artifactId>spark</artifactId>
<version>1.12.1</version>
<version>2.0</version>
<name>qdrant-spark</name>
<url>https://github.com/qdrant/qdrant-spark</url>
<description>An Apache Spark connector for the Qdrant vector database</description>
@@ -31,31 +31,68 @@
</scm>
<properties>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<maven.compiler.source>17</maven.compiler.source>
<maven.compiler.target>17</maven.compiler.target>
<maven.compiler.source>1.8</maven.compiler.source>
<maven.compiler.target>1.8</maven.compiler.target>
</properties>
<dependencies>
<!-- QDRANT CLIENT DEPENDENCIES -->

<dependency>
<groupId>com.squareup.okhttp3</groupId>
<artifactId>okhttp</artifactId>
<version>4.9.1</version>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
<version>30.1-jre</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>com.google.code.gson</groupId>
<artifactId>gson</artifactId>
<version>2.10.1</version>
<groupId>io.grpc</groupId>
<artifactId>grpc-protobuf</artifactId>
<version>1.59.0</version>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
<groupId>io.qdrant</groupId>
<artifactId>client</artifactId>
<version>1.7.1</version>
<scope>compile</scope>
</dependency>

<!-- SPARK DEPENDENCIES -->
<dependency>
<groupId>org.slf4j</groupId>
<artifactId>slf4j-api</artifactId>
<version>2.0.7</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.13</artifactId>
<version>3.5.0</version>
<scope>provided</scope>
<exclusions>
<exclusion>
<groupId>com.google.guava</groupId>
<artifactId>guava</artifactId>
</exclusion>
</exclusions>
</dependency>

<!-- TEST DEPENDENCIES -->
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>testcontainers</artifactId>
<version>1.19.4</version>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>junit-jupiter</artifactId>
<version>1.19.4</version>
<scope>test</scope>
</dependency>
</dependencies>
<distributionManagement>
@@ -70,13 +107,6 @@
</distributionManagement>
<build>
<plugins>
<plugin>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.22.1</version>
<configuration>
<argLine>--add-exports java.base/sun.nio.ch=ALL-UNNAMED</argLine>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-source-plugin</artifactId>
@@ -135,40 +165,33 @@
</configuration>
</plugin>
<plugin>
<groupId>com.spotify.fmt</groupId>
<artifactId>fmt-maven-plugin</artifactId>
<version>2.21.1</version>
<goals>
<goal>format</goal>
</goals>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.5.2</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<minimizeJar>true</minimizeJar>
<transformers>
<transformer
implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer" />
</transformers>
<relocations>
<relocation>
<pattern>com.google</pattern>
<shadedPattern>com.shaded.google</shadedPattern>
</relocation>
</relocations>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
<profiles>
<!-- maven-assembly-plugin -->
<profile>
<id>assembly</id>
<build>
<plugins>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<version>3.3.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
<configuration>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</profile>
</profiles>


</project>
25 changes: 13 additions & 12 deletions src/main/java/io/qdrant/spark/Qdrant.java
@@ -11,6 +11,7 @@
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

/**
* A class that implements the TableProvider and DataSourceRegister interfaces. Provides methods to
* infer schema, get table, and check required options.
*/
@@ -37,9 +38,13 @@ public String shortName() {
*/
@Override
public StructType inferSchema(CaseInsensitiveStringMap options) {

for (String fieldName : requiredFields) {
if (!options.containsKey(fieldName)) {
throw new IllegalArgumentException(fieldName.concat(" option is required"));
}
}
StructType schema = (StructType) StructType.fromJson(options.get("schema"));
checkRequiredOptions(options, schema);
validateOptions(options, schema);

return schema;
}
@@ -61,33 +66,29 @@ public Table getTable(
}

/**
* Checks if the required options are present in the provided options and if the id_field and
* embedding_field options are present in the provided schema.
* Checks if the required options are present in the provided options and checks if the specified
* id_field and embedding_field are present in the provided schema.
*
* @param options The options to check.
* @param schema The schema to check.
*/
void checkRequiredOptions(CaseInsensitiveStringMap options, StructType schema) {
for (String fieldName : requiredFields) {
if (!options.containsKey(fieldName)) {
throw new IllegalArgumentException(fieldName + " option is required");
}
}
void validateOptions(CaseInsensitiveStringMap options, StructType schema) {

List<String> fieldNames = Arrays.asList(schema.fieldNames());

if (options.containsKey("id_field")) {
String idField = options.get("id_field").toString();

if (!fieldNames.contains(idField)) {
throw new IllegalArgumentException("id_field option is not present in the schema");
throw new IllegalArgumentException("Specified 'id_field' is not present in the schema");
}
}

String embeddingField = options.get("embedding_field").toString();

if (!fieldNames.contains(embeddingField)) {
throw new IllegalArgumentException("embedding_field option is not present in the schema");
throw new IllegalArgumentException(
"Specified 'embedding_field' is not present in the schema");
}
}
}
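The `schema` option that `inferSchema` parses above is the DataFrame schema serialized with Spark's `StructType.json()`, which nests top-level fields under a `fields` key. A small Python sketch of pulling field names out of that JSON — the helper is illustrative, not part of the connector:

```python
import json

def schema_field_names(schema_json):
    """Return the top-level field names from a Spark StructType JSON string."""
    schema = json.loads(schema_json)
    return [field["name"] for field in schema.get("fields", [])]
```

This is the same field list the connector checks `id_field` and `embedding_field` against before writing.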