Releases: ispras/dedoc

v2.3

19 Sep 13:24
724e2d2
  • Created a Dedoc Telegram chat.
  • Added the patterns parameter for configuring the default structure type.
  • Added notebooks with Dedoc usage (see issue 484).
  • Fixed the "OutOfMemoryError: Java heap space" bug in PdfTabbyReader (see issue 489).
  • Fixed a numbering bug in DocxReader (see issue 494).
  • Added GOST (Russian government standard) frame recognition in PdfImageReader and PdfTxtlayerReader (need_gost_frame_analysis parameter).
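
The need_gost_frame_analysis parameter above can be passed through the library's parameters dictionary. The following is a minimal sketch, not a definitive recipe. It assumes Dedoc is installed, that DedocManager.parse accepts a file path plus a dictionary of string-valued parameters, and that "true" is the enabling value. The helper names to_dedoc_parameters and parse_with_gost_frames are hypothetical.

```python
from typing import Dict, Union


def to_dedoc_parameters(**options: Union[bool, str]) -> Dict[str, str]:
    """Hypothetical helper: turn keyword options into string-valued parameters.

    Booleans become "true"/"false"; other values are passed through str().
    """
    parameters: Dict[str, str] = {}
    for name, value in options.items():
        if isinstance(value, bool):
            parameters[name] = "true" if value else "false"
        else:
            parameters[name] = str(value)
    return parameters


def parse_with_gost_frames(file_path: str):
    """Parse a document with GOST frame analysis enabled (requires dedoc)."""
    # Imported lazily so the helper above works without dedoc installed.
    from dedoc import DedocManager

    manager = DedocManager()
    # Assumption: "true" enables the parameter named in the release notes.
    parameters = to_dedoc_parameters(need_gost_frame_analysis=True)
    return manager.parse(file_path, parameters=parameters)
```

The same helper can carry other parameters named in these notes, such as document_type="diploma".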

v2.2.7

16 Aug 13:35
765aae2
  • Fixed bugs with the start and end of BBoxAnnotation in PdfTabbyReader.
  • Improved column classification and orientation detection for PDFs and images (is_one_column_document and document_orientation parameters).
  • Upgraded Docker: docker-compose is no longer supported; use docker compose instead.
  • Fixed a table parsing bug in DocxReader (see issue).
  • Added simple textual layer detection to PdfAutoReader (fast_textual_layer_detection parameter).
  • Improved paragraph extraction from PDF documents and images.
  • Retrained the classifier for diplomas (document_type="diploma") on a new dataset.

v2.2.6

22 Jul 14:20
d67e6ef
  • Upgraded dependencies: numpy<2.0 and dedoc-utils==0.3.7.

v2.2.5

15 Jul 13:14
8a2678c
  • Added internal functions and classes to support the integration of Dedoc into LangChain.
  • Upgraded some dependencies, in particular xgboost>=1.6.0, pandas, and pdfminer.six.

v2.2.4

20 Jun 11:47
5750d57
  • Page division and page numbers are now shown in the HTML output representation (API usage, return_format="html").
  • Made imports from the dedoc library faster.
  • Added a tutorial on how to add a new language to dedoc (not yet complete).
  • Added page_id metadata for multi-page nodes (structure_type="tree" in API, TreeConstructor in the library).
  • Updated OCR and orientation/columns classification benchmarks.
  • Made minor edits to README.md.
  • Fixed empty cell handling in CSVReader.
  • Fixed bounding box extraction for text in tables in PdfTabbyReader.
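
The return_format="html" and structure_type="tree" options above are sent as form fields when calling the Dedoc API. The following sketch assumes a Dedoc service is reachable at http://localhost:1231/upload (an assumed address and endpoint; adjust for your deployment), that the document is sent in a multipart field named file, and that the third-party requests package is installed. The function names are hypothetical.

```python
from typing import Dict

# Assumed address of a running Dedoc service; not taken from these notes.
DEDOC_UPLOAD_URL = "http://localhost:1231/upload"


def html_form_fields(structure_type: str = "tree") -> Dict[str, str]:
    """Build the form fields that request the HTML output representation."""
    return {"return_format": "html", "structure_type": structure_type}


def fetch_html(file_path: str, url: str = DEDOC_UPLOAD_URL) -> str:
    """Upload a document and return the HTML produced by the service."""
    # Imported lazily so html_form_fields works without requests installed.
    import requests

    with open(file_path, "rb") as document:
        response = requests.post(
            url,
            files={"file": document},  # assumed multipart field name
            data=html_form_fields(),
        )
    response.raise_for_status()
    return response.text
```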

v2.2.3

05 Jun 08:47
370f6ef
  • Attached images are now shown, and attached files can be downloaded, in the HTML output representation (API usage, return_format="html").
  • Added hierarchy level information and annotations to PptxReader.

v2.2.2

21 May 13:00
  • Added image extraction to ArticleReader.
  • Added attachments and references to them in the HTML output representation (return_format="html").
  • Fixed the need_content_analysis parameter.
  • Fixed CSVReader (the BOM character is now excluded from the output).
  • Added handling of files with a wrong or missing extension to DedocManager (the file type is detected from its content).
  • Updated README.md.
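
Detecting a file type by its content, as the DedocManager item above describes, generally means inspecting the first bytes of a file instead of trusting its name. The sketch below illustrates that general idea with a few well-known signatures. It is not Dedoc's implementation; sniff_file_type, sniff_path, and the returned labels are hypothetical.

```python
from typing import Optional

# Well-known leading byte signatures ("magic numbers").
SIGNATURES = (
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),  # also the container format of docx, xlsx, pptx
)


def sniff_file_type(header: bytes) -> Optional[str]:
    """Return a label for the first matching signature, or None if unknown."""
    for signature, label in SIGNATURES:
        if header.startswith(signature):
            return label
    return None


def sniff_path(path: str) -> Optional[str]:
    """Read the first bytes of a file and classify them, ignoring its extension."""
    with open(path, "rb") as file:
        return sniff_file_type(file.read(16))
```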

v2.2.1

03 May 13:23
0acfb3d
  • Added the fintoc structure type for parsing financial prospectuses according to the FinTOC 2022 Shared Task (FintocStructureExtractor).
  • Fixed small bugs in ArticleReader: colspan for tables, keywords, section numbering, etc.
  • Added references to nodes and fixed small bugs in the HTML output representation (return_format="html").
  • Removed other_fields from LineMetadata and DocumentMetadata.
  • Updated README.md.

v2.2

17 Apr 10:02
16747b0
  • Improved PdfTabbyReader: bug fixes and faster partial PDF extraction (with the pages parameter).
  • Added benchmarks for evaluating the performance of PDF readers.
  • Added the ReferenceAnnotation class.
  • Fixed a bug in the can_read method of all readers.
  • Added the article structure type for parsing scientific articles using GROBID (ArticleReader, ArticleStructureExtractor).

v2.1.1

22 Mar 08:17
56b44dd
  • Updated README.md.
  • Updated table and time benchmarks.
  • Re-labeled line-classifier datasets (law, diploma, paragraphs datasets).
  • Updated tasker creators (for the labeling system).
  • Fixed HTML table parsing.