
Corpus cleanup: progress tracker

Clemens Neudecker edited this page Nov 21, 2019 · 11 revisions

This page tracks the progress of the corpus cleanup, listed per file and line range.

  • enp_de.bio
    • L000001 - L002000
  • enp_fr.bio
    • L000001 - L082493
  • enp_de.bio
    • L000001 - L002000
  • enp_de.bio
    • L000001 - L002000
  • enp_de.bio
    • L000001 - L002000
  • enp_de.bio
    • L000001 - L002000
  • enp_fr.bio
  • enp_de.bio
    • L000001 - L005000

Other

  • fix use of B-/I- prefixes to comply with the CoNLL convention
  • replace B/I-LIEU with B/I-LOC and B/I-PERS with B/I-PER in the French corpus
  • assemble basic metadata (titles, issues, dates)
  • harmonize files and directories
  • remove unnecessary 'POS' tags (reduces file size considerably)
  • remove metadata noise and empty files (0 bytes)
  • consistently use spaces instead of tabs
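
Several of the tasks above (tag renaming, dropping the POS column, spaces instead of tabs, B-/I- repair) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the script actually used for the cleanup: it assumes one token per line with the token in the first column and the NER tag in the last, and it reads "CoNLL convention" as the IOB2 scheme in which every entity starts with B-. The names `TAG_MAP`, `normalize_line` and `fix_bio` are hypothetical.

```python
# Assumed renaming of French labels to the labels used in the other corpora.
TAG_MAP = {"LIEU": "LOC", "PERS": "PER"}


def normalize_line(line):
    """Keep only token and NER tag, rename tags, and separate with one space.

    Assumes the token is the first column and the NER tag the last one,
    so any middle column (e.g. POS) is dropped.
    """
    parts = line.split()  # splits on tabs and spaces alike
    if len(parts) < 2:
        # Blank lines (sentence boundaries) and malformed lines pass through.
        return " ".join(parts)
    token, tag = parts[0], parts[-1]
    if "-" in tag:
        prefix, label = tag.split("-", 1)
        tag = f"{prefix}-{TAG_MAP.get(label, label)}"
    return f"{token} {tag}"


def fix_bio(tags):
    """Rewrite a tag sequence so every entity starts with B- (IOB2 assumption)."""
    fixed = []
    prev = "O"
    for tag in tags:
        # An I- tag is only valid directly after a tag of the same entity type.
        if tag.startswith("I-") and (prev == "O" or prev[2:] != tag[2:]):
            tag = "B-" + tag[2:]
        fixed.append(tag)
        prev = tag
    return fixed
```

For example, `normalize_line("Paris\tNNP\tB-LIEU")` gives `"Paris B-LOC"`, and `fix_bio(["I-PER", "I-PER", "O", "I-LOC"])` gives `["B-PER", "I-PER", "O", "B-LOC"]`. If the corpus follows a different CoNLL variant (for instance IOB1, where B- is only used between adjacent entities of the same type), the condition in `fix_bio` would need to change.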