* Add base models
* Add unit tests
* Add setup.py and requirements-dev.txt
* Update README.md
* Add basic dev container configuration

---------

Co-authored-by: Alessio Vertemati <alessio.vertemati@gmail.com>
1 parent bdb2391, commit a2e4cd1
Showing 17 changed files with 850 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
// For format details, see https://aka.ms/devcontainer.json. For config options, see the | ||
// README at: https://github.com/devcontainers/templates/tree/main/src/python | ||
{ | ||
"name": "Python 3", | ||
// Or use a Dockerfile or Docker Compose file. More info: https://containers.dev/guide/dockerfile | ||
"image": "mcr.microsoft.com/devcontainers/python:1-3.12-bullseye", | ||
|
||
// Features to add to the dev container. More info: https://containers.dev/features. | ||
// "features": {}, | ||
|
||
// Use 'postCreateCommand' to run commands after the container is created. | ||
"postCreateCommand": "pip3 install --user -r requirements.txt -r requirements-dev.txt" | ||
|
||
// Configure tool-specific properties. | ||
// "customizations": {} | ||
} |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,54 @@ | ||
# Contributing | ||
|
||
Contributions are **welcome** and will be fully **credited**. | ||
|
||
Please read and understand the contribution guide before creating an issue or pull request. | ||
|
||
## Etiquette | ||
|
||
This project is open source, and as such, the maintainers give their free time to build and maintain the source code held within. They make the code freely available in the hope that it will be of use to other developers. It would be extremely unfair for them to suffer abuse or anger for their hard work. | ||
|
||
Please be considerate towards maintainers when raising issues or presenting pull requests. Let's show the | ||
world that developers are civilized and selfless people. | ||
|
||
It's the duty of the maintainer to ensure that all submissions to the project are of sufficient | ||
quality to benefit the project. Many developers have different skillsets, strengths, and weaknesses. Respect the maintainer's decision, and do not be upset or abusive if your submission is not used. | ||
|
||
## Viability | ||
|
||
When requesting or submitting new features, first consider whether it might be useful to others. Open | ||
source projects are used by many developers, who may have entirely different needs to your own. Think about | ||
whether or not your feature is likely to be used by other users of the project. | ||
|
||
## Procedure | ||
|
||
> [!NOTE] | ||
> Issue tracking is not currently enabled for this repository. We are in the process of setting it up. | ||
|
||
Before filing an issue: | ||
|
||
- Attempt to replicate the problem, to ensure that it wasn't a coincidental incident. | ||
- Check to make sure your feature suggestion isn't already present within the project. | ||
- Check the pull requests tab to ensure that the bug doesn't have a fix in progress. | ||
- Check the pull requests tab to ensure that the feature isn't already in progress. | ||
|
||
Before submitting a pull request: | ||
|
||
- Check the codebase to ensure that your feature doesn't already exist. | ||
- Check the pull requests to ensure that another person hasn't already submitted the feature or fix. | ||
|
||
## Requirements | ||
|
||
If the project maintainer has any additional requirements, you will find them listed here. | ||
|
||
- **Add tests!** - Your patch won't be accepted if it doesn't have tests. | ||
|
||
- **Document any change in behaviour** - Make sure the `README.md` and any other relevant documentation are kept up-to-date. | ||
|
||
- **Consider our release cycle** - We try to follow [SemVer v2.0.0](https://semver.org/). Randomly breaking public APIs is not an option. | ||
|
||
- **One pull request per feature** - If you want to do more than one thing, send multiple pull requests. | ||
|
||
- **Send coherent history** - Make sure each individual commit in your pull request is meaningful. If you had to make multiple intermediate commits while developing, please [squash them](https://www.git-scm.com/book/en/v2/Git-Tools-Rewriting-History#Changing-Multiple-Commit-Messages) before submitting. | ||
|
||
**Happy coding**! |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,3 @@ | ||
# Security Policy | ||
|
||
If you discover any security-related issues, please email security@oneofftech.xyz instead of using the discussions or the issue tracker. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1 +1,209 @@ | ||
# :card_box: Document Model Python | ||
![pypi](https://img.shields.io/pypi/v/parse-document-model-python.svg) | ||
[![Pydantic v2](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/pydantic/pydantic/main/docs/badge/v2.json)](https://docs.pydantic.dev/latest/contributing/#badges) | ||
[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE) | ||
|
||
# Parse Document Model (Python) | ||
|
||
**Parse Document Model** (Python) provides Pydantic models for representing text documents using a hierarchical model. | ||
This library allows you to define documents as a hierarchy of (specialised) nodes where each node can represent a document, page, text, heading, body, and more. | ||
|
||
These models aim to preserve the underlying structure of text documents for further processing, such as creating a table of contents or transforming between formats, e.g. converting a parsed PDF to Markdown. | ||
|
||
- **Hierarchical structure**: The document is modelled as a hierarchy of nodes. Each node can represent a part of the | ||
document: the document itself, a page, or a piece of text. | ||
- **Rich text support**: Nodes can represent not only the content but also the formatting (e.g. bold, italic) applied to the text. | ||
- **Attributes**: Each node can have attributes that provide additional information such as page number, | ||
bounding box, etc. | ||
- **Built-in validation and types**: Built with [`Pydantic`](https://docs.pydantic.dev/latest/), ensuring type safety, validation and effortless creation of complex document structures. | ||
|
||
|
||
**Requirements** | ||
|
||
- Python 3.12 or above (Python 3.9, 3.10 and 3.11 are supported on a best-effort basis). | ||
|
||
|
||
**Next steps** | ||
|
||
- [Explore the document model](#document-model-overview) | ||
- [Install the library and use the models](#getting-started) | ||
|
||
|
||
## Document Model Overview | ||
|
||
We represent the document structure as a hierarchy so that the inherent structure is preserved when chapters, sections and headings are used. Consider a generic document with two pages, each with one heading and one paragraph of text. The resulting representation might be the following. | ||
|
||
``` | ||
Document | ||
├─Page | ||
│ ├─Text (category: heading) | ||
│ └─Text (category: body) | ||
└─Page | ||
├─Text (category: heading) | ||
└─Text (category: body) | ||
``` | ||
|
||
At a glance you can see the structure: the document is composed of two pages and contains two headings. To achieve this, we defined a hierarchy around the concept of a Node, like a node in a graph. | ||
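
The hierarchy can be sketched with plain dataclasses. This is a minimal illustration only: the classes below are stand-ins that borrow the node names from this README and are not the library's Pydantic models, and `outline` is a hypothetical helper.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Text:
    """Leaf node: a string plus a category such as "heading" or "body"."""
    category: str
    content: str


@dataclass
class Page:
    """Structured node whose content is a list of Text nodes."""
    content: list[Text] = field(default_factory=list)
    category: str = "page"


@dataclass
class Document:
    """Root node whose content is a list of Page nodes."""
    content: list[Page] = field(default_factory=list)
    category: str = "doc"


def outline(doc: Document) -> list[str]:
    """Return one indented line per node, visiting the tree depth-first."""
    lines = ["Document"]
    for page in doc.content:
        lines.append("  Page")
        for text in page.content:
            lines.append(f"    Text (category: {text.category})")
    return lines


# Two pages, each with one heading and one paragraph, as in the tree above.
doc = Document(content=[
    Page(content=[Text("heading", "Introduction"), Text("body", "First paragraph.")]),
    Page(content=[Text("heading", "Details"), Text("body", "Second paragraph.")]),
])
lines = outline(doc)
print("\n".join(lines))
```

Because every level exposes its children through `content`, the same nested loop works for any document that follows this shape.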
|
||
### Node types | ||
|
||
```mermaid | ||
classDiagram | ||
class Node | ||
Node <|-- StructuredNode | ||
Node <|-- Text | ||
StructuredNode <|-- Document | ||
StructuredNode <|-- Page | ||
``` | ||
|
||
|
||
#### 1. **Node** (Base Class) | ||
|
||
This is the abstract class from which all other nodes inherit. | ||
|
||
Each node has: | ||
|
||
- `category`: The type of the node (e.g., `doc`, `page`, `heading`). | ||
- `attributes`: Optional field to attach extra data to a node. See [Attributes](#attributes). | ||
|
||
#### 2. **StructuredNode** | ||
|
||
This extends the [`Node`](#1-node-base-class). It is used to represent the hierarchy as a node whose content is a list of other nodes, such as [`Document`](#3-document) and [`Page`](#4-page). | ||
|
||
- `content`: List of `Node`. | ||
|
||
|
||
#### 3. **Document** | ||
|
||
This is the root node of a document. | ||
|
||
- `category`: Always set to `"doc"`. | ||
- `attributes`: Document-wide attributes can be set here. | ||
- `content`: List of [`Page`](#4-page) nodes that form the document. | ||
|
||
#### 4. **Page** | ||
|
||
Represents a page in the document: | ||
|
||
- `category`: Always set to `"page"`. | ||
- `attributes`: Can contain metadata like page number. | ||
- `content`: List of [`Text`](#5-text) nodes on the page. | ||
|
||
#### 5. **Text** | ||
|
||
This node represents a paragraph, a heading or any text within the document. | ||
|
||
- `category`: The type of text, e.g. `"heading"` or `"body"`. | ||
- `content`: A string representing the textual content. | ||
- `marks`: List of [marks](#marks) applied to the text, such as bold, italic, etc. | ||
- `attributes`: Can contain metadata like the bounding box representing where this portion of text is located in the page. | ||
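
As an illustration of how the `category` and `marks` fields might drive a format conversion, here is a rough sketch that turns text nodes into Markdown. It makes several assumptions: `Text` is a stdlib dataclass stand-in rather than the library's model, marks are plain strings such as `"bold"`, headings are always rendered at level 2, and `render_text` and `to_markdown` are hypothetical helpers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Text:
    """Stand-in for a text node; marks are assumed to be plain strings."""
    category: str
    content: str
    marks: List[str] = field(default_factory=list)


def render_text(node: Text) -> str:
    """Render one text node as a Markdown block."""
    body = node.content
    if "italic" in node.marks:
        body = f"*{body}*"
    if "bold" in node.marks:
        body = f"**{body}**"
    if node.category == "heading":
        return f"## {body}"  # assumption: a single fixed heading level
    return body


def to_markdown(pages: List[List[Text]]) -> str:
    """Join the rendered nodes of every page, separated by blank lines."""
    blocks = [render_text(node) for page in pages for node in page]
    return "\n\n".join(blocks)


markdown = to_markdown([
    [Text("heading", "Title", ["bold"]), Text("body", "Hello")],
    [Text("body", "Second page", ["italic"])],
])
print(markdown)
```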
|
||
|
||
|
||
### Marks | ||
|
||
Marks are used to add style or functionality to the text within a [`Text`](#5-text) node. | ||
For example, bold text, italic text, links and custom styles such as font or colour. | ||
|
||
**Mark Types** | ||
|
||
- `Bold`: Represents bold text. | ||
- `Italic`: Represents italic text. | ||
- `TextStyle`: Allows customization of font and color. | ||
- `Link`: Represents a hyperlink. | ||
|
||
Marks are validated and enforced with the help of `Pydantic` model validators. | ||
|
||
### Attributes | ||
|
||
Attributes are optional fields that can store additional information for each node. Some predefined attributes are: | ||
|
||
- `DocumentAttributes`: General attributes for the document (currently reserved for the future). | ||
- `PageAttributes`: Specific page related attributes, such as the page number. | ||
- `TextAttributes`: Text related attributes, such as bounding boxes. | ||
- `BoundingBox`: A box that specifies the position of a piece of text on the page. | ||
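
A text node may carry several bounding boxes, for example one per line of a paragraph. The sketch below shows one way to combine them into a single enclosing rectangle. The field names follow the `BoundingBox` model added in this commit, but the class here is a stdlib dataclass stand-in, and `enclosing_box` is a hypothetical helper that is not part of the library.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BoundingBox:
    """Stand-in with the same field names as the BoundingBox model."""
    min_x: float
    min_y: float
    max_x: float
    max_y: float
    page: int


def enclosing_box(boxes: List[BoundingBox]) -> BoundingBox:
    """Return the smallest box that contains every box in the list."""
    if not boxes:
        raise ValueError("at least one bounding box is required")
    if len({box.page for box in boxes}) != 1:
        raise ValueError("all bounding boxes must be on the same page")
    return BoundingBox(
        min_x=min(box.min_x for box in boxes),
        min_y=min(box.min_y for box in boxes),
        max_x=max(box.max_x for box in boxes),
        max_y=max(box.max_y for box in boxes),
        page=boxes[0].page,
    )


# Two lines of the same paragraph on page 1.
line_boxes = [
    BoundingBox(10.0, 20.0, 110.0, 32.0, page=1),
    BoundingBox(10.0, 34.0, 80.0, 46.0, page=1),
]
merged = enclosing_box(line_boxes)
print(merged)
```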
|
||
|
||
## Getting started | ||
|
||
### Installation | ||
|
||
Parse Document Model is distributed via PyPI. You can install it with `pip`. | ||
|
||
```bash | ||
pip install parse-document-model-python | ||
``` | ||
|
||
### Quick Example | ||
|
||
Here’s how you can represent a simple document with one page and some text: | ||
|
||
```python | ||
from document_model_python.document import Document, Page, Text | ||
|
||
doc = Document( | ||
category="doc", | ||
content=[ | ||
Page( | ||
category="page", | ||
content=[ | ||
Text( | ||
category="heading", | ||
content="Welcome to parse-document-model-python", | ||
marks=["bold"] | ||
), | ||
Text( | ||
category="body", | ||
content="This is an example text using the document model." | ||
) | ||
] | ||
) | ||
] | ||
) | ||
``` | ||
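
Once a document is built, the hierarchy can be walked to derive other views, such as a table of contents. The following is a sketch under stated assumptions: it uses stdlib dataclass stand-ins instead of the library's models, stores the page number directly on the page rather than in `PageAttributes`, and `table_of_contents` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Text:
    """Stand-in for a text node."""
    category: str
    content: str


@dataclass
class Page:
    """Stand-in for a page node; `number` simplifies the page attribute."""
    number: int
    content: List[Text] = field(default_factory=list)


def table_of_contents(pages: List[Page]) -> List[Tuple[int, str]]:
    """Collect (page number, heading text) pairs in reading order."""
    return [
        (page.number, text.content)
        for page in pages
        for text in page.content
        if text.category == "heading"
    ]


pages = [
    Page(1, [
        Text("heading", "Welcome to parse-document-model-python"),
        Text("body", "This is an example text using the document model."),
    ]),
    Page(2, [Text("heading", "Next steps"), Text("body", "More text.")]),
]
toc = table_of_contents(pages)
for number, title in toc:
    print(f"{title} (page {number})")
```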
|
||
## Testing | ||
|
||
Parse Document Model is tested using [pytest](https://docs.pytest.org/en/stable/). Tests run for each commit and pull request. | ||
|
||
Install the dependencies. | ||
|
||
```bash | ||
pip install -r requirements.txt -r requirements-dev.txt | ||
``` | ||
|
||
Execute the test suite. | ||
|
||
```bash | ||
pytest | ||
``` | ||
|
||
|
||
## Contributing | ||
|
||
Thank you for considering contributing to the Parse Document Model! The contribution guide can be found in the [CONTRIBUTING.md](./.github/CONTRIBUTING.md) file. | ||
|
||
> [!NOTE] | ||
> Consider opening a [discussion](https://github.com/OneOffTech/parse-document-model-python/discussions) before submitting a pull request with changes to the model structures. | ||
|
||
## Security Vulnerabilities | ||
|
||
Please review [our security policy](./.github/SECURITY.md) on how to report security vulnerabilities. | ||
|
||
## Credits | ||
|
||
- [OneOffTech](https://github.com/OneOffTech) | ||
- [All Contributors](../../contributors) | ||
|
||
## Supporters | ||
|
||
The project is provided and supported by [OneOff-Tech (UG)](https://oneofftech.de). | ||
|
||
<p align="left"><a href="https://oneofftech.de" target="_blank"><img src="https://raw.githubusercontent.com/OneOffTech/.github/main/art/oneofftech-logo.svg" width="200"></a></p> | ||
|
||
## Acknowledgements | ||
|
||
The format and structure take inspiration from [ProseMirror](https://prosemirror.net/docs/ref/#model.Document_Schema). | ||
|
||
## License | ||
|
||
The MIT License (MIT). Please see [License File](LICENSE.md) for more information. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
from .document import Document, Page |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,27 @@ | ||
from abc import ABC | ||
|
||
from pydantic import BaseModel | ||
|
||
|
||
class BoundingBox(BaseModel): | ||
min_x: float | ||
min_y: float | ||
max_x: float | ||
max_y: float | ||
page: int | ||
|
||
|
||
class Attributes(BaseModel, ABC): | ||
pass | ||
|
||
|
||
class DocumentAttributes(Attributes): | ||
pass | ||
|
||
|
||
class PageAttributes(Attributes): | ||
page: int | ||
|
||
|
||
class TextAttributes(Attributes): | ||
bounding_box: list[BoundingBox] = [] |