Recognize_headlines

Documentation

This is an example workflow for extracting headlines from the National Socialist newspaper "Freiheitskampf" using the tool LayoutParser. The extracted headlines are meant to help locate articles in the daily newspaper that HAIT researchers have annotated manually more quickly. In addition, the size and wording of the headlines could be analyzed quantitatively, which could allow conclusions about the propagandistic effect of certain articles. The headline extraction is complemented by subsequent OCR processing.

Installation

First, create a virtual environment for Python. The instructions also work without a virtual environment, but that is not recommended.

python3.7 -m venv name_of_env	# create the environment; replace name_of_env with a name of your choice
source name_of_env/bin/activate	# activate the environment

# To deactivate the environment, type:
deactivate
# For the installation process, the environment must be active.
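
To confirm from within Python that the environment is active, a minimal check can compare `sys.prefix` with `sys.base_prefix`, which differ inside an environment created by `venv`. This is only a sketch; the helper name `in_virtualenv` is made up for illustration and is not part of this repository.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory, while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("virtual environment active:", in_virtualenv())
```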

Second, install the packages for LayoutParser:

pip install layoutparser	# installs the base LayoutParser library (layout data structures, visualization, loading and exporting layout data)
pip install torchvision && pip install "detectron2@git+https://github.com/facebookresearch/detectron2.git@v0.5#egg=detectron2"	# installs Detectron2, which is needed to create a detection model from self-trained data

Third, install an OCR engine. For this task I chose Tesserocr, which needs some prerequisites installed first (see the Tesserocr installation notes).

sudo apt-get install tesseract-ocr libtesseract-dev libleptonica-dev pkg-config
pip install tesserocr
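
After the three steps, a quick way to see whether everything landed in the active environment is to look the packages up without importing them. The following is a minimal sketch; the helper `missing_packages` and the `REQUIRED` tuple are illustrative names, not part of this repository.

```python
import importlib.util

# Import names of the packages installed in the steps above.
REQUIRED = ("layoutparser", "torchvision", "detectron2", "tesserocr")


def missing_packages(names=REQUIRED):
    """Return the top-level package names that cannot be found."""
    # find_spec returns None for a top-level module that is not installed.
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all packages found")
```

An empty result only means the packages can be located; it does not verify that Detectron2 or Tesseract work on your hardware.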

Usage

The folder "Example_notebooks" contains further explanation of how to use the tool, together with an example workflow. The workflow first extracts the headlines with a self-trained model and then performs OCR on them using Tesserocr.
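
As a rough orientation before opening the notebooks, the two steps could look like the sketch below. It is not the code from the notebooks: the function names, the label "headline", the label map `{0: "headline"}`, the score threshold and the Tesseract language "deu" are all assumptions and have to match your own trained model and installed language data.

```python
def select_headlines(blocks, label="headline", min_score=0.8):
    """Keep blocks with the given label and a confidence of at least min_score."""
    return [
        b for b in blocks
        if b.type == label and (b.score is None or b.score >= min_score)
    ]


def extract_headlines(image_path, config_path, model_path, lang="deu"):
    """Detect headline regions on one page image and return their OCR text."""
    # Imported here so select_headlines works without the heavy dependencies.
    import numpy as np
    from PIL import Image
    import layoutparser as lp
    import tesserocr

    # Load the page as an RGB array.
    image = np.array(Image.open(image_path).convert("RGB"))

    # Self-trained Detectron2 model; the label map is an assumption.
    model = lp.Detectron2LayoutModel(
        config_path=config_path,
        model_path=model_path,
        label_map={0: "headline"},
    )
    layout = model.detect(image)

    texts = []
    for block in select_headlines(layout):
        crop = block.crop_image(image)  # cut the region out of the page
        text = tesserocr.image_to_text(Image.fromarray(crop), lang=lang)
        texts.append(text.strip())
    return texts
```

Calling `extract_headlines("page.jpg", "config.yaml", "model_final.pth")` would return one string per detected headline; the three file names are placeholders.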

Training

The folder "Train_LayoutParser" contains explanations and instructions for annotating your own data. It also includes an example notebook in which a training workflow was carried out using Google Colab.
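
Before training, it can help to check how many regions were annotated per class. The sketch below assumes the annotations were exported in COCO format (a JSON file with "categories" and "annotations" lists); this README does not state the format, so treat that as an assumption. The function names are illustrative.

```python
import json
from collections import Counter


def load_annotations(path):
    """Read a COCO-style annotation file into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def count_annotations(coco):
    """Count annotated regions per category name in a COCO-style dict."""
    # Map category ids to their readable names.
    names = {c["id"]: c["name"] for c in coco["categories"]}
    return Counter(names[a["category_id"]] for a in coco["annotations"])


if __name__ == "__main__":
    # Tiny in-memory example instead of a real annotation file.
    example = {
        "categories": [{"id": 0, "name": "headline"}],
        "annotations": [
            {"id": 1, "image_id": 1, "category_id": 0},
            {"id": 2, "image_id": 1, "category_id": 0},
        ],
    }
    print(count_annotations(example))
```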

HAIT-TUDD/Headline_extraction
