FARO (Document Sensitivity Detector)


What is this

FARO is a tool for detecting sensitive information in the documents of an organization. It is aimed at small companies and individuals that want to track their sensitive documents inside their organization but who cannot spend much time and money configuring complex Data Protection tools.

FARO extracts sensitivity indicators from documents (e.g. document IDs, monetary quantities, personal emails) and gives each document a sensitivity score (from low to high) based on the frequency and type of the indicators found in it.

Currently all the functionality of this tool targets documents written in Spanish, although it can be easily extended to cover more languages.

This tool is developed by TEGRA R&D Cybersecurity Center.

If you are in a rush and just want to give it a try... go here.

What’s in here?

The project contains the following folders:

  • faro/: the FARO module with the main functionality and tests.
  • conf/: YAML configuration files go here. There is one YAML file per language (plus one nolanguage.yaml providing basic functionality for non-detected languages) and one YAML file with common configurations for all languages, conf/commons.yaml.
  • plugins/: Stores all the available plugins to detect sensitive information with the appropriate language support.
  • utils/: Utilities for FARO execution, for example text pre-processing and base classes that implement common plugin functionality.
  • docker/: Everything related to running FARO in a container setup.
  • test/: Unit tests for faro.
  • logs/ and logger/: Definition and storage of logging.
  • faro_detection.py: launcher of FARO for standalone operation over a single file.
  • faro_spider.sh: script for bulk processing.
  • nose.cfg: Configuration for testing faro
  • CHANGELOG: FARO changelog.

Run FARO with Docker

FARO can be executed inside Docker; in fact, this is our recommended configuration. We use docker-compose since FARO requires two containers working together: faro and tika.

Everything you need is provided inside the docker folder. For all the details on how to get that done, check out the documentation here.

Run FARO On Host Machine

NOTE: ONLY LINUX AND MAC OS X

Prerequisites

This mode requires a supported operating system and some libraries in order to work properly:

  • Linux or Mac OS
  • Java 1.7 or higher
  • virtualenv (not required but highly recommended)
  • GNU parallel to speed up our spider script. You can see more information about the tool here
  • Apache tika up and running -> Check tika tooling

FARO is decoupled from tika, so you need to make sure that you have a running tika server and that you point FARO to that server endpoint via TIKA_SERVER_ENDPOINT. Check this wiki page on resources to get you started.
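As an illustration (this snippet is not part of FARO), the tika Python client listed under Dependencies can be pointed at the server through that variable; the endpoint and file name below are placeholders:

# Hypothetical check: point the tika Python client at a running server and
# parse a document. The endpoint and file name are examples only.
import os

os.environ["TIKA_SERVER_ENDPOINT"] = "http://localhost:9998"

from tika import parser  # imported after setting the endpoint

parsed = parser.from_file("sample_invoice.pdf")
print(parsed["metadata"].get("Content-Type"))
print((parsed["content"] or "")[:200])  # first characters of the extracted text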

Virtual Environment

It is advisable to use a separate virtual environment. To create one with virtualenv:

virtualenv -p `which python3` <yourenvname>

To activate the virtual environment on your terminal just type:

source <yourenvname>/bin/activate

Dependencies

The easiest way of getting the system up and running is to install the dependencies as follows:

pip install -r requirements.txt

The list of dependencies is the following:

  • SpaCy
  • fuzzywuzzy
  • tika
  • pyyaml

These other dependencies are used for testing:

  • coverage
  • nose

Tika dependency

We provide some utilities to get a tika server up and running on your local machine. In case it is useful, download this zip file and uncompress it somewhere in your local filesystem.

To fire up tika run:

$ tika_start.sh

To stop tika server:

$ tika_stop.sh

NER models

FARO relies on several pretrained ML models in order to work.

ner_config:
    nlp_model: FARO uses two SpaCy models, es_core_news_sm (Spanish) and xx_ent_wiki_sm (multilanguage)

Those models are downloaded as part of the requirements.txt installation.
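For illustration only (this is not FARO code), the two models can be loaded and queried with SpaCy once the requirements are installed; the sample sentence is made up:

# Minimal sketch: load the two pretrained SpaCy models named above and run
# NER over a sample Spanish sentence.
import spacy

nlp_es = spacy.load("es_core_news_sm")   # Spanish pipeline
nlp_xx = spacy.load("xx_ent_wiki_sm")    # multilanguage NER fallback

doc = nlp_es("María Pérez transfirió 1.000 euros a Ejemplo S.L. en Madrid.")
for ent in doc.ents:
    print(ent.text, ent.label_)          # e.g. PER, ORG, LOC labels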

FARO spider

Our spider is a script to recursively analyse the documents inside a folder, storing the results of the analysis in a file.

./faro_spider.sh <your folder with files>

Results

FARO creates an "output" folder inside the parent folder of docker/ (normally the root folder of the FARO project if you have performed a git clone):

  • output/scan.$CURRENT_TIME.csv: a CSV file with the score given to each document and the frequency of indicators in each file.
filepath,score,money,signature,personal_email,mobile,financial_data,id_document,custom_word,meta:content-type,meta:encrypted,meta:author,meta:pages,meta:lang,meta:date,meta:filesize,meta:ocr
/Users/test/code/FARO_datasets/quick_test_data/Factura_NRU_0_1_001.pdf,high,0,0,0,0,0,1,4,application/pdf,Powered By Crystal,1,es,,85739,219,1185,False
/Users/test/code/FARO_datasets/quick_test_data/Factura_Plancha.pdf,high,6,0,0,0,0,2,8,application/pdf,Python PDF Library - http://pybrary.net/pyPdf/,1,es,,77171,259,1524,True
/Users/test/code/FARO_datasets/quick_test_data/20190912-FS2019.pdf,high,3,0,0,0,0,1,2,application/pdf,FPDF 1.6,1,es,2019-09-12T20:08:19Z,1545,62,648,False
  • output/scan.$CURRENT_TIME.entity: a JSON-lines file with the disaggregated list of indicators extracted from each file (a small helper for reading both output files back is sketched after the examples). For example:
{"filepath": "/Users/test/code/FARO_datasets/quick_test_data/Factura_NRU_0_1_001.pdf", "entities": {"custom_word": {"facturar": 3, "total": 1}, "probable_currency_amount": {"12,0021": 1, "12,00": 1, "9,92": 1, "3,9921": 1, "3,99": 1, "3,30": 1, "15,99": 1, "13,21": 1, "1.106.166": 1, "1,00": 1, "99,00": 1}, "id_document": {"89821284M": 1}}, "datetime": "2019-12-11 14:19:17"}
{"filepath": "/Users/test/code/FARO_datasets/quick_test_data/Factura_Plancha.pdf", "entities": {"id_document": {"H82547761": 1, "21809943D": 2}, "custom_word": {"factura": 2, "facturar": 2, "total": 2, "importe": 2}, "money": {"156,20": 4, "2,84": 2, "0,00": 2, "159,04": 2, "32,80": 4, "191,84": 2}, "probable_currency_amount": {"1,00": 6, "189,00": 2}}, "datetime": "2019-12-11 14:19:27"}
{"filepath": "/Users/test/code/FARO_datasets/quick_test_data/20190912-FS2019.pdf", "entities": {"document_id": {"C-01107564": 1}, "custom_word": {"factura": 1, "total": 1}, "money": {"3,06": 1, "0,64": 1, "3,70": 1}}, "datetime": "2019-12-11 14:19:33"}

Fine-tuning FARO Execution

After adding OCR support, there are some configurations that can be customized for FARO execution through environment variables:

  • FARO_DISABLE_OCR: if this variable is found (with any value) FARO will not execute OCR on the documents
  • FARO_REQUESTS_TIMEOUT: Number of seconds before FARO will timeout if the tika server does not respond (default: 300)
  • FARO_PDF_OCR_RATIO: Bytes per character used in PDF mixed documents (text and images) to force OCR (default: 500 bytes/char)

Logging can also be configured through environment variables (a combined usage example follows the list):

  • FARO_LOG_LEVEL: Faro Logging level (default: INFO)
  • FARO_LOG_FILE: Faro Logging file (default: None). When using Docker, make sure to set it inside the output folder so it persists on the host machine.
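Purely as an illustration, the variables above can also be set from a small wrapper before launching FARO; the values and the input file name are arbitrary examples, not recommendations:

# Illustrative wrapper (not part of FARO): export the tuning and logging
# variables described above, then run the single-file launcher.
import os
import subprocess

os.environ["FARO_DISABLE_OCR"] = "1"           # any value disables OCR
os.environ["FARO_REQUESTS_TIMEOUT"] = "120"    # seconds to wait for tika
os.environ["FARO_LOG_LEVEL"] = "DEBUG"
os.environ["FARO_LOG_FILE"] = "output/faro.log"

subprocess.run(["./faro_detection.py", "-i", "contract.pdf"], check=True)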

Local Testing

FARO has several tests to verify the functionality of the system. Tests can be executed with the following command:

nosetests -c nose.cfg

Check out the nosetests documentation for more options.

Single file detection

You can execute FARO detection on a single file by using our faro_detection.py script:

./faro_detection.py -i <your_file>

Two output files are generated with the paths <your_file>.entity and <your_file>.score.

a) <your_file>.entity: a JSON with the list of entities grouped by type and their number of appearances (output of the entity detector module):

{"MONEY": {"1.000 euros": 2}, "PER": {"Betty Corti\u00f1as": 1, "Eva Expósito": 1, "Belén Portela": 1, "Marta Rivadulla": 1, "Miguel Rivas": 1}, "PROF": {"el tutor": 1}, "ORG": {"Centro de Recursos Educativos": 1}}

b) <your_file>.score: a JSON with the types of entities and the number of times each type appears in the text. This JSON also contains the sensitivity score in the property "score" (it can be "low", "medium" or "high").

{"score": "high", "summary": {"money": 1, "mobile": 1, "personal_email": 1, "financial_data": 2}}

For information about additional arguments that can be passed to our detection script, take a look here.

Technical Details

FARO entity detector

The FARO entity detector performs two steps:

1. Extraction of sensitivity indicators: the indicators are entities and other text elements (e.g. monetary quantities) which are likely to appear in sensitive documents.

The list of indicators is the following:

  • money: money quantity (currently only euros and dollars are supported).

  • signature: it outputs the person who signs a document

  • personal_email: emails that are not corporate (e.g. not info@, rrhh@)

  • mobile: mobile phone numbers (filtering out non-mobile ones)

  • financial_data: credit cards and IBAN account numbers

  • id_document: Spanish NIF and CIF.

The unique counts of these indicators are gathered in a JSON object and relayed as input to the next step.

2. Scoring the sensitivity of a document: a classification rule is applied (using thresholds) to the indicators extracted in phase 1 to assign a sensitivity score to the document.

The following rules are applied (a minimal sketch in Python follows the list):

  • Every sensitivity level sets thresholds for the sensitivity indicators. A document must comply with at least one of the thresholds (min and max) in order to get that score.

  • If several thresholds of the same sensitivity level are met in the document (currently configured to three), the document levels up its sensitivity score even though it only complies with the thresholds of the lower level.

  • The "low" score is also assigned to documents where no sensitivity indicator was found.

Configuration

FARO employs a set of YAML files to configure its functionality (the YAML files are located inside the "conf" folder):

  • commons.yaml: has the common configuration for the tool.
  • config.py: sets up logging for FARO execution.

Configuration of the sensitivity score

The sensitivity score is selected by a collection of conditions specified in the configuration file. The levels are configured in sensitivity_list, sorted by intensity (from less to more sensitive). The sensitivity dict contains the conditions (min, max) organized by type of entity. The system only needs to fulfill one condition of a certain level in order to flag the document with that level of sensitivity. Furthermore, if multiple KPIs of a certain level are found in the document (as controlled by the sensitivity_multiple_kpis parameter), the system increases the sensitivity level (e.g. from medium to high).

sensitivity:
  sensitivity_list:
      - low
      - medium
      - high
  sensitivity_multiple_kpis: 3
  • sensitivity_list is the list of different sensitivity scores ordered by intensity.

  • sensitivity_multiple_kpis: this number indicates how many simultaneous threshold matches of a level are allowed before the sensitivity score is leveled up.

Each entity can also be configured in terms of the number of occurrences needed to score at each level (low, medium or high), by using a sensitivity dict with the conditions that must be satisfied in order to reach each sensitivity level.

entities:
  MONEY:
    description: money
    output: true
    sensitivity:
      low:
        min: 1
        max: 6
      medium:
        min: 6
        max: 65535
      high:
        min: 65535
        max: 65535
        ....
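As a hedged example, assuming the entity thresholds above live in one of the YAML files under conf/ (the exact file name below is an assumption), they can be inspected with PyYAML, which is already a FARO dependency:

# Illustrative only: read the per-entity thresholds with PyYAML. Adjust the
# path to whichever conf/ file holds the "entities" section.
import yaml

with open("conf/commons.yaml") as fh:
    cfg = yaml.safe_load(fh)

money = cfg["entities"]["MONEY"]["sensitivity"]
print(money["medium"])        # e.g. {'min': 6, 'max': 65535}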

Supported Input File Formats

The FARO application uses Tika for document processing. Therefore, all the formats that Tika can process can be used as input.

Techniques

FARO uses NER (built with CRFs) for extracting classic entities: Person, Organization and Location.

Other indicators are extracted with RegExp (document IDs, phone and credit card numbers, etc.), with further validation and context checks for reliability.

Emails are extracted with RegExp. An ML classifier and heuristics are used to distinguish between corporate and personal emails.
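As an illustration of the RegExp-plus-validation idea (these are not FARO's actual expressions), a simplified detector for Spanish NIFs could look like this:

# Simplified sketch of regex extraction with checksum validation for Spanish
# NIFs; FARO's real patterns and context checks are more elaborate.
import re

NIF_RE = re.compile(r"\b(\d{8})([A-Z])\b")
CONTROL_LETTERS = "TRWAGMYFPDXBNJZSQVHLCKE"   # official NIF control table

def find_nifs(text):
    """Return NIF-looking strings whose control letter validates."""
    return [
        digits + letter
        for digits, letter in NIF_RE.findall(text)
        if CONTROL_LETTERS[int(digits) % 23] == letter
    ]

print(find_nifs("El cliente con NIF 89821284M firmó el contrato."))  # ['89821284M']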

Faro Detection Additional Arguments

--dump: the system dumps the information of <your_file>.score to stdout in CSV format. An example of the output might be:

filepath,score,money,signature,personal_email,mobile,financial_data,id_document,custom_word,meta:content-type,meta:encrypted,meta:author,meta:pages,meta:lang,meta:date,meta:filesize,meta:ocr
/Users/test/code/FARO_datasets/quick_test_data/Factura_NRU_0_1_001.pdf,high,0,0,0,0,0,1,4,application/pdf,Powered By Crystal,1,es,,85739,219,1185,False

The paths of the output files can be explicitly set on the command line using --output_entity_file and --output_score_file:

python faro_detection.py --input_file <your_file> --output_entity_file <path to output> --output_score_file <path to output>

The default behaviour of our detection script is to show only the types of entities that directly affect the sensitivity score. In order to show all the detected entities, use the --verbose parameter at the command line.

There is an additional parameter (--split_lines) that must be used with documents in which every line of the document is a sentence (or paragraph). By default, FARO tries to join lines in the document because in many cases a different line does not imply a different sentence (e.g. in PDFs).

Known Issues

  • The full functionality only works with Spanish documents, although it is easily expandable to new languages (especially if they are supported by SpaCy, the NLP library used to process the documents).

  • The system uses SpaCy for parsing and PoS sentence preprocessing. FARO also uses SpaCy's trained NER system for classical entities.

Contributors

TEGRA is an R&D Cybersecurity Center based in Galicia (Spain). It is a joint effort from Telefónica, a leading international telecommunications company, through ElevenPaths, its global cybersecurity unit, and Gradiant, an ICT R&D center with more than 100 professionals working in areas like connectivity, security and intelligence, to create innovative products and services inside cybersecurity.

TEGRA's work is focused on two areas within the cybersecurity landscape: Data Security and Security Analytics. We are committed to creating state-of-the-art technologies that can nurture and thus provide differentiating value to our products.

See the CONTRIBUTORS file.
