Add Document Models (#1)
* Add base models

* Add unit tests

* Add setup.py and requirements-dev.txt

* Update README.md

* Add basic dev container configuration

---------

Co-authored-by: Alessio Vertemati <alessio.vertemati@gmail.com>
andreaponti5 and avvertix authored Sep 17, 2024
1 parent bdb2391 commit a2e4cd1
Showing 17 changed files with 850 additions and 2 deletions.
16 changes: 16 additions & 0 deletions .devcontainer/devcontainer.json
@@ -0,0 +1,16 @@
// For format details, see https://aka.ms/devcontainer.json. For config options, see the
// README at: https://github.com/devcontainers/templates/tree/main/src/python
{
    "name": "Python 3",
    // Or use a Dockerfile or Docker Compose file. More info: https://containers.dev/guide/dockerfile
    "image": "mcr.microsoft.com/devcontainers/python:1-3.12-bullseye",

    // Features to add to the dev container. More info: https://containers.dev/features.
    // "features": {},

    // Use 'postCreateCommand' to run commands after the container is created.
    "postCreateCommand": "pip3 install --user -r requirements.txt -r requirements-dev.txt"

    // Configure tool-specific properties.
    // "customizations": {}
}
54 changes: 54 additions & 0 deletions .github/CONTRIBUTING.md
@@ -0,0 +1,54 @@
# Contributing

Contributions are **welcome** and will be fully **credited**.

Please read and understand the contribution guide before creating an issue or pull request.

## Etiquette

This project is open source, and as such, the maintainers give their free time to build and maintain the source code held within. They make the code freely available in the hope that it will be of use to other developers. It would be extremely unfair for them to suffer abuse or anger for their hard work.

Please be considerate towards maintainers when raising issues or presenting pull requests. Let's show the
world that developers are civilized and selfless people.

It's the duty of the maintainer to ensure that all submissions to the project are of sufficient
quality to benefit the project. Many developers have different skillsets, strengths, and weaknesses. Respect the maintainer's decision, and do not be upset or abusive if your submission is not used.

## Viability

When requesting or submitting new features, first consider whether it might be useful to others. Open
source projects are used by many developers, who may have entirely different needs to your own. Think about
whether or not your feature is likely to be used by other users of the project.

## Procedure

> [!NOTE]
> Issue tracking is not currently enabled for this repository. We are organising it.

Before filing an issue:

- Attempt to replicate the problem, to ensure that it wasn't a coincidental incident.
- Check to make sure your feature suggestion isn't already present within the project.
- Check the pull requests tab to ensure that the bug doesn't have a fix in progress.
- Check the pull requests tab to ensure that the feature isn't already in progress.

Before submitting a pull request:

- Check the codebase to ensure that your feature doesn't already exist.
- Check the pull requests to ensure that another person hasn't already submitted the feature or fix.

## Requirements

If the project maintainer has any additional requirements, you will find them listed here.

- **Add tests!** - Your patch won't be accepted if it doesn't have tests.

- **Document any change in behaviour** - Make sure the `README.md` and any other relevant documentation are kept up-to-date.

- **Consider our release cycle** - We try to follow [SemVer v2.0.0](https://semver.org/). Randomly breaking public APIs is not an option.

- **One pull request per feature** - If you want to do more than one thing, send multiple pull requests.

- **Send coherent history** - Make sure each individual commit in your pull request is meaningful. If you had to make multiple intermediate commits while developing, please [squash them](https://www.git-scm.com/book/en/v2/Git-Tools-Rewriting-History#Changing-Multiple-Commit-Messages) before submitting.

**Happy coding**!
3 changes: 3 additions & 0 deletions .github/SECURITY.md
@@ -0,0 +1,3 @@
# Security Policy

If you discover any security-related issues, please email security@oneofftech.xyz instead of using the discussions or the issue tracker.
2 changes: 1 addition & 1 deletion LICENSE
@@ -1,6 +1,6 @@
MIT License

Copyright (c) 2024 Andrea Ponti
Copyright (c) OneOffTech <info@oneofftech.xyz>

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
210 changes: 209 additions & 1 deletion README.md
@@ -1 +1,209 @@
# :card_box: Document Model Python
![pypi](https://img.shields.io/pypi/v/parse-document-model-python.svg)
[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://docs.pydantic.dev/latest/contributing/#badges)
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)

# Parse Document Model (Python)

**Parse Document Model** (Python) provides Pydantic models for representing text documents using a hierarchical model.
This library allows you to define documents as a hierarchy of (specialised) nodes where each node can represent a document, page, text, heading, body, and more.

These models aim to preserve the underlying structure of text documents for further processing, such as creating a table of contents or transforming between formats, e.g. converting a parsed PDF to Markdown.

- **Hierarchical structure**: The document is modelled as a hierarchy of nodes. Each node represents a part of the document, such as the document itself, a page or a piece of text.
- **Rich text support**: Nodes can represent not only the content but also the formatting (e.g. bold, italic) applied to the text.
- **Attributes**: Each node can have attributes that provide additional information, such as the page number or a bounding box.
- **Built-in validation and types**: Built with [`Pydantic`](https://docs.pydantic.dev/latest/), ensuring type safety, validation and effortless creation of complex document structures.


**Requirements**

- Python 3.12 or above (Python 3.9, 3.10 and 3.11 are supported on a best-effort basis).


**Next steps**

- [Explore the document model](#document-model-overview)
- [Install the library and use the models](#getting-started)


## Document Model Overview

We want to represent the document structure as a hierarchy so that the inherent structure is preserved when chapters, sections and headings are used. Consider a generic document with two pages, each containing one heading and one paragraph of text. The resulting representation might be the following.

```
Document
├─Page
│ ├─Text (category: heading)
│ └─Text (category: body)
└─Page
  ├─Text (category: heading)
  └─Text (category: body)
```

At a glance you can see the structure: the document is composed of two pages, and each page has a heading. To make this possible, we defined a hierarchy around the concept of a Node, like a node in a graph.
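As an illustration only, the two-page tree above can be written as nested dictionaries and walked to list the headings by page. This is a sketch under assumptions: the dictionary shape (keys `category`, `attributes`, `content`) mirrors the fields described in this README, `table_of_contents` is a hypothetical helper that is not part of the library, and the heading texts are made up.

```python
from __future__ import annotations


def table_of_contents(document: dict) -> list[tuple[int, str]]:
    """Collect (page number, heading text) pairs from a nested document dict."""
    toc = []
    for index, page in enumerate(document["content"], start=1):
        # Fall back to the page position when no page number attribute is set.
        attributes = page.get("attributes") or {}
        page_number = attributes.get("page", index)
        for node in page["content"]:
            if node["category"] == "heading":
                toc.append((page_number, node["content"]))
    return toc


# Hand-written stand-in for the two-page document shown in the tree above.
example_document = {
    "category": "doc",
    "content": [
        {
            "category": "page",
            "attributes": {"page": 1},
            "content": [
                {"category": "heading", "content": "Introduction"},
                {"category": "body", "content": "First paragraph."},
            ],
        },
        {
            "category": "page",
            "attributes": {"page": 2},
            "content": [
                {"category": "heading", "content": "Conclusion"},
                {"category": "body", "content": "Second paragraph."},
            ],
        },
    ],
}

toc = table_of_contents(example_document)
```

Because every node carries its `category`, a walk like this needs no knowledge of the original file format.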

### Node types

```mermaid
classDiagram
class Node
Node <|-- StructuredNode
Node <|-- Text
StructuredNode <|-- Document
StructuredNode <|-- Page
```


#### 1. **Node** (Base Class)

This is the abstract class from which all other nodes inherit.

Each node has:

- `category`: The type of the node (e.g., `doc`, `page`, `heading`).
- `attributes`: Optional field to attach extra data to a node. See [Attributes](#attributes).

#### 2. **StructuredNode**

This extends the [`Node`](#1-node-base-class). It is used to represent the hierarchy as a node whose content is a list of other nodes, such as [`Document`](#3-document) and [`Page`](#4-page).

- `content`: List of `Node`.


#### 3. **Document**

This is the root node of a document.

- `category`: Always set to `"doc"`.
- `attributes`: Document-wide attributes can be set here.
- `content`: List of [`Page`](#4-page) nodes that form the document.

#### 4. **Page**

Represents a page in the document:

- `category`: Always set to `"page"`.
- `attributes`: Can contain metadata like page number.
- `content`: List of [`Text`](#5-text) nodes on the page.

#### 5. **Text**

This node represents a paragraph, a heading or any text within the document.

- `category`: The type of text, e.g. `"heading"` or `"body"`.
- `content`: A string representing the textual content.
- `marks`: List of [marks](#marks) applied to the text, such as bold, italic, etc.
- `attributes`: Can contain metadata like the bounding box representing where this portion of text is located in the page.
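The node types above can be sketched with standard-library dataclasses. This is not the library's code (the library uses Pydantic models); it only mirrors the fields listed in this section. The field defaults, the use of plain strings for marks, and the `count_nodes` helper are assumptions made for illustration. The sketch also types `content` as a generic list of nodes, whereas the text above restricts `Document` to pages and `Page` to text nodes.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Base node: a category plus optional attributes."""
    category: str
    attributes: Optional[dict] = None


@dataclass
class Text(Node):
    """Leaf node holding a string and the marks applied to it."""
    content: str = ""
    marks: list[str] = field(default_factory=list)


@dataclass
class StructuredNode(Node):
    """Node whose content is a list of child nodes."""
    content: list[Node] = field(default_factory=list)


@dataclass
class Page(StructuredNode):
    category: str = "page"


@dataclass
class Document(StructuredNode):
    category: str = "doc"


def count_nodes(node: Node) -> int:
    """Count a node and all of its descendants (hypothetical helper)."""
    children = node.content if isinstance(node, StructuredNode) else []
    return 1 + sum(count_nodes(child) for child in children)


doc = Document(content=[
    Page(content=[
        Text(category="heading", content="Title"),
        Text(category="body", content="Hello"),
    ]),
    Page(content=[
        Text(category="heading", content="More"),
        Text(category="body", content="World"),
    ]),
])
```

Splitting `StructuredNode` from `Text` keeps the recursion simple: only structured nodes have children, so a traversal needs a single `isinstance` check.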



### Marks

Marks are used to add style or functionality to the text within a [`Text`](#5-text) node.
For example, bold text, italic text, links and custom styles such as font or colour.

**Mark Types**

- `Bold`: Represents bold text.
- `Italic`: Represents italic text.
- `TextStyle`: Allows customization of font and color.
- `Link`: Represents a hyperlink.

Marks are validated and enforced with the help of `Pydantic` model validators.
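To illustrate how marks could drive a format conversion, here is a minimal sketch that wraps a piece of text in Markdown emphasis. It assumes marks are given as plain strings such as `"bold"` and `"italic"`, as in the quick example of this README; `apply_marks` is a hypothetical helper, and links and text styles are deliberately left out because their fields are not described here.

```python
from __future__ import annotations


def apply_marks(text: str, marks: list[str]) -> str:
    """Wrap text in Markdown emphasis for each supported mark name."""
    wrappers = {"bold": "**", "italic": "*"}
    for mark in marks:
        wrapper = wrappers.get(mark)
        if wrapper is None:
            continue  # Marks this sketch does not know about are skipped.
        text = f"{wrapper}{text}{wrapper}"
    return text


plain = apply_marks("hello", [])
bold = apply_marks("hello", ["bold"])
bold_italic = apply_marks("hello", ["bold", "italic"])
```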

### Attributes

Attributes are optional fields that can store additional information for each node. Some predefined attributes are:

- `DocumentAttributes`: General attributes for the document (currently reserved for the future).
- `PageAttributes`: Page-related attributes, such as the page number.
- `TextAttributes`: Text-related attributes, such as bounding boxes.
- `BoundingBox`: A box that specifies the position of a text in the page.
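Since a text node can carry a list of bounding boxes, one common operation is merging them into a single enclosing box. The sketch below uses a standard-library dataclass named `Box` as a stand-in with the same fields as `BoundingBox` (`min_x`, `min_y`, `max_x`, `max_y`, `page`). The helper `enclosing_box` is hypothetical, and the coordinate units and origin are not specified in this README, so the numbers are purely illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Stand-in with the same fields as BoundingBox."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    page: int


def enclosing_box(boxes: list[Box]) -> Box:
    """Return the smallest box that contains all given boxes on one page."""
    if not boxes:
        raise ValueError("at least one box is required")
    pages = {box.page for box in boxes}
    if len(pages) != 1:
        raise ValueError("boxes must all be on the same page")
    return Box(
        min_x=min(box.min_x for box in boxes),
        min_y=min(box.min_y for box in boxes),
        max_x=max(box.max_x for box in boxes),
        max_y=max(box.max_y for box in boxes),
        page=boxes[0].page,
    )


# Two lines of the same paragraph on page 1 (illustrative values).
merged = enclosing_box([
    Box(10.0, 20.0, 110.0, 35.0, page=1),
    Box(10.0, 36.0, 90.0, 50.0, page=1),
])
```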


## Getting started

### Installation

Parse Document Model is distributed via PyPI. You can install it with `pip`.

```bash
pip install parse-document-model-python
```

### Quick Example

Here’s how you can represent a simple document with one page and some text:

```python
from parse_document_model.document import Document, Page, Text

doc = Document(
    category="doc",
    content=[
        Page(
            category="page",
            content=[
                Text(
                    category="heading",
                    content="Welcome to parse-document-model-python",
                    marks=["bold"]
                ),
                Text(
                    category="body",
                    content="This is an example text using the document model."
                )
            ]
        )
    ]
)
```
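Once a document is in this shape, converting it to another format is a plain tree walk. The following sketch renders a Markdown string from a nested dictionary with the same structure as the quick example above. It is an assumption-laden illustration: the dictionary is hand-written rather than produced by the library, `to_markdown` is a hypothetical helper, and every heading is rendered at level one because the model described here does not record heading depth.

```python
def to_markdown(document: dict) -> str:
    """Render headings and body text of a nested document dict as Markdown."""
    lines = []
    for page in document["content"]:
        for text in page["content"]:
            if text["category"] == "heading":
                lines.append(f"# {text['content']}")
            else:
                lines.append(text["content"])
            lines.append("")  # Blank line between blocks.
    return "\n".join(lines).strip() + "\n"


# Same content as the quick example, written as plain dictionaries.
document = {
    "category": "doc",
    "content": [
        {
            "category": "page",
            "content": [
                {
                    "category": "heading",
                    "content": "Welcome to parse-document-model-python",
                },
                {
                    "category": "body",
                    "content": "This is an example text using the document model.",
                },
            ],
        }
    ],
}

markdown = to_markdown(document)
```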

## Testing

Parse Document Model is tested using [pytest](https://docs.pytest.org/en/stable/). Tests run for each commit and pull request.

Install the dependencies.

```bash
pip install -r requirements.txt -r requirements-dev.txt
```

Execute the test suite.

```bash
pytest
```


## Contributing

Thank you for considering contributing to the Parse Document Model! The contribution guide can be found in the [CONTRIBUTING.md](./.github/CONTRIBUTING.md) file.

> [!NOTE]
> Consider opening a [discussion](https://github.com/OneOffTech/parse-document-model-python/discussions) before submitting a pull request with changes to the model structures.

## Security Vulnerabilities

Please review [our security policy](./.github/SECURITY.md) on how to report security vulnerabilities.

## Credits

- [OneOffTech](https://github.com/OneOffTech)
- [All Contributors](../../contributors)

## Supporters

The project is provided and supported by [OneOff-Tech (UG)](https://oneofftech.de).

<p align="left"><a href="https://oneofftech.de" target="_blank"><img src="https://raw.githubusercontent.com/OneOffTech/.github/main/art/oneofftech-logo.svg" width="200"></a></p>

## Acknowledgements

The format and structure take inspiration from [ProseMirror](https://prosemirror.net/docs/ref/#model.Document_Schema).

## License

The MIT License (MIT). Please see [License File](LICENSE) for more information.
1 change: 1 addition & 0 deletions parse_document_model/__init__.py
@@ -0,0 +1 @@
from .document import Document, Page
27 changes: 27 additions & 0 deletions parse_document_model/attributes.py
@@ -0,0 +1,27 @@
from abc import ABC

from pydantic import BaseModel


class BoundingBox(BaseModel):
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    page: int


class Attributes(BaseModel, ABC):
    pass


class DocumentAttributes(Attributes):
    pass


class PageAttributes(Attributes):
    page: int


class TextAttributes(Attributes):
    bounding_box: list[BoundingBox] = []