git, not svn.
Trondtr committed Feb 7, 2024
1 parent b06698d commit ce7f25a
Showing 1 changed file with 9 additions and 71 deletions.
80 changes: 9 additions & 71 deletions ling/corpus_repositories.md
@@ -5,52 +5,19 @@ The corpus is divided in a free part, where texts that we can
redistribute are gathered (essentially, texts in the public domain), and
a bound part, where we gather texts that we can't redistribute.

# The free part
For each language, there are two corpora: one with open content and one with closed content (closed for copyright reasons). Each corpus is split across two directories, one for original files (*.docx, .pdf, .html, ...*) and one for converted text files. Search for the corpora under **Repositories** in [github.com/giellalt/](https://github.com/giellalt/). In the repository search field, write

The free part is available in an svn repository that is world readable
available on the address <https://gtsvn.uit.no/freecorpus/>.
```
corpus-xxx
```

Check out a working copy either by adding the above address to your svn
program, or use the command line program like this:
`svn checkout https://gtsvn.uit.no/freecorpus`.
where **xxx** is the ISO code of your language
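
As a rough sketch of the naming pattern, the helper below prints the clone URLs for one language. The function name `corpus_repo_urls` is hypothetical, and the URLs are only an assumption built from the *corpus-xxx* and *corpus-xxx-orig* names under github.com/giellalt/; check the repository search to confirm that they exist for your language.

```shell
# Hypothetical helper: print the assumed clone URLs for one language.
# $1 is the ISO code of the language, e.g. sme.
corpus_repo_urls() {
    lang="$1"
    base="https://github.com/giellalt"
    # Converted text files, then original files.
    printf '%s\n' "${base}/corpus-${lang}.git" "${base}/corpus-${lang}-orig.git"
}

# List the assumed URLs for North Saami.
corpus_repo_urls sme
```

Each printed URL can then be passed to `git clone`.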

# The bound part

The bound part is available in an svn repository that is accessible only
from the machine gtsvn.uit.no. This repository is readable by people
that have signed [our time limited non disclosure
agreement](http://divvun.no/adm/legal/sd-contract-part-3.html), and is
writable by the corpus maintainers. The address to this repository is
svn://gtsvn.uit.no/boundcorpus.

To check out a working copy, first login to victorio. Then issue the
command `svn co svn://gtsvn.uit.no/boundcorpus `.

If you would like to have access to the bound corpus, contact us at
[feedback@divvun.no](mailto:feedback@divvun.no?subject=Corpus%20access)

# The directory structure

The main directory in the corpus repository is *orig*. *orig* contains
the original files with the original names (bar spaces replaced with \_)
and in the original format. Metadata about the original file is in an
xsl file in the same directory as the original file, and has the same
name as the original file plus an .xsl extension.
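
The naming rule above can be illustrated with a small sketch. The function name `corpus_names` and the sample file name are hypothetical; the only rules taken from the text are that spaces become underscores and that the metadata file is the stored name plus an .xsl extension.

```shell
# Hypothetical helper: print the stored file name and its metadata file name.
# $1 is the original file name, possibly containing spaces.
corpus_names() {
    original="$1"
    # Spaces in the original name are replaced with underscores.
    stored=$(printf '%s' "$original" | tr ' ' '_')
    # The metadata file lives next to the original and adds .xsl.
    printf '%s\n' "${stored}" "${stored}.xsl"
}

corpus_names "annual report 2010.pdf"
```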

Inside these directories are directories for various languages. They
contain parallel translations to the sami documents. Below is an outline
of the directory structure.

orig/
eng/
fin/
nob/
sma/
sme/
smj/
swe/

Inside each of the language directories the structure outlined below is

Inside each of the corpus directories (both *corpus-xxx* and *corpus-xxx-orig*) the structure outlined below is
used.

admin/
@@ -74,13 +41,6 @@ used.
science/

If you run the [conversion process](corpus_conversion.html) the
directory *converted* is created. It has the same structure as *orig*,
but contains the files converted from the original format to our
internal xml-format. The converted directory contains copies of all the
files in the corpus database.

## Goldstandard corpus files

These 7 overarching genres should be understood as follows:

@@ -100,7 +60,7 @@ These 7 overarching genres should be understood as follows:
- **science/:**
Scientific articles

Todo: Write more about this.


## Goldstandard corpus files

@@ -110,7 +70,7 @@ to add info about linguistic errors of different types and their
corrections. These files are located within a directory named
goldstandard, which has the same internal structure as shown above:

orig/ # same orig/ as above
orig/ # same orig/ as above
goldstandard/
orig/
sma/
Expand All @@ -131,25 +91,3 @@ To add all files found in a directory to a working copy of a corpus, you
can use the
[add\_files\_to\_corpus](CorpusTools.html#add_files_to_corpus) program.

If you have a file that you want to add to the corpus repository you
have to have a working copy of either the free or bound part of our
corpus.

1. If it is a freely distributable document, add it to the free part of
our corpus. If not, add it to the bound part.
2. Determine the main language:
`eng, fin, nno, nob, sma, sme, smj, swe`, and then genre:
`admin, bible, facta, ficti, laws, news`.
3. Copy the file to the directory $working\_copy/orig/$lang/$genre/.
For example, copy a North Saami document in the genre "facta" to
the directory $working\_copy/orig/sme/facta/ in your working copy.
The news genre has subdirectories for the different news sources,
and the bible genre has subdirectories for ot and nt.
4. Try to convert the document using the command
`convert2xml <filename>`. This command converts the original
document to xml, and makes a default metadata document with empty
values.
5. Open the metadata file, `<filename>.xsl`, and add metadata to it.
6. Add and commit these files to the repository by either using the svn
command line tools or your usual graphical interface to svn.
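
Step 3 of the list above can be sketched as a pair of shell functions. The names `corpus_target_dir` and `add_original` are hypothetical and not part of CorpusTools; the sketch only builds the $working\_copy/orig/$lang/$genre/ path and copies the file there, leaving conversion, metadata and committing to the remaining steps.

```shell
# Hypothetical helper: print the directory where a new original file belongs.
# $1 = working copy, $2 = language code, $3 = genre.
corpus_target_dir() {
    working_copy="$1"
    lang="$2"
    genre="$3"
    printf '%s\n' "${working_copy}/orig/${lang}/${genre}"
}

# Hypothetical helper: copy a file into place, creating the directory if needed.
# $1 = working copy, $2 = language code, $3 = genre, $4 = file to add.
add_original() {
    target=$(corpus_target_dir "$1" "$2" "$3")
    mkdir -p "$target" && cp "$4" "$target/"
}

# Example call (not run here):
#   add_original "$HOME/freecorpus" sme facta report.pdf
```

For the news and bible genres, the source subdirectory (or ot/nt) would have to be appended to the genre argument, for example `bible/nt`.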
