git, not svn.

giellalt · Feb 7, 2024 · ce7f25a
1 parent b06698d
commit ce7f25a
Showing 1 changed file with 9 additions and 71 deletions.
diff --git a/ling/corpus_repositories.md b/ling/corpus_repositories.md
@@ -5,52 +5,19 @@ The corpus is divided in a free part, where texts that we can
 redistribute are gathered (essentially, texts in the public domain), and
 a bound part, where we gather texts that we can't redistribute.
 
-# The free part
+For each language, there are two corpora: one with open content and one with closed content (closed for copyright reasons). Each corpus has two directories, one for original files (*.docx, .pdf, .html, ...*) and one for converted text files. Search for the corpora under **Repositories** in [github.com/giellalt/](https://github.com/giellalt/). In the repository search field, write
 
-The free part is available in an svn repository that is world readable
-available on the address <https://gtsvn.uit.no/freecorpus/>.
+```
+corpus-xxx
+```
 
-Check out a working copy either by adding the above address to your svn
-program, or use the command line program like this:
-`svn checkout https://gtsvn.uit.no/freecorpus`.
+where **xxx** is the ISO code of your language.
 
-# The bound part
-
-The bound part is available in an svn repository that is accessible only
-from the machine gtsvn.uit.no. This repository is readable by people
-that have signed [our time limited non disclosure
-agreement](http://divvun.no/adm/legal/sd-contract-part-3.html), and is
-writable by the corpus maintainers. The address to this repository is
-svn://gtsvn.uit.no/boundcorpus.
-
-To check out a working copy, first login to victorio. Then issue the
-command `svn co             svn://gtsvn.uit.no/boundcorpus `.
-
-If you would like to have access to the bound corpus, contact us at
-[feedback@divvun.no](mailto:feedback@divvun.no?subject=Corpus%20access)
 
 # The directory structure
 
-The main directory in the corpus repository is *orig*. *orig* contains
-the original files with the original names (bar spaces replaced with \_)
-and in the original format. Metadata about the original file is in an
-xsl file in the same directory as the original file, and has the same
-name as the original file plus an .xsl extension.
-
-Inside these directories are directories for various languages. They
-contain parallel translations to the sami documents. Below is an outline
-of the directory structure.
-
-    orig/
-        eng/
-        fin/
-        nob/
-        sma/
-        sme/
-        smj/
-        swe/
-
-Inside each of the language directories the structure outlined below is
+
+Inside each of the corpus directories (both *corpus-xxx* and *corpus-xxx-orig*) the structure outlined below is
 used.
 
     admin/
@@ -74,13 +41,6 @@ used.
     science/
          
 
-If you run the [conversion process](corpus_conversion.html) the
-directory *converted* is created. It has the same structure as *orig*,
-but contains the files converted from the original format to our
-internal xml-format. The converted directory contains copies of all the
-files in the corpus database.
-
-## Goldstandard corpus files
 
 These 7 overarching genres should be understood as follows:
 
@@ -100,7 +60,7 @@ These 7 overarching genres should be understood as follows:
 -   **science/:**  
     Scientific articles
 
-Todo: Write more about this.
+
 
 ## Goldstandard corpus files
 
@@ -110,7 +70,7 @@ to add info about linguistic errors of different types and their
 corrections. These files are located within a directory named
 goldstandard, which has the same internal structure as shown above:
 
-    orig/         # same orig/ as above
+    orig/                 # same orig/ as above
     goldstandard/
         orig/
             sma/
@@ -131,25 +91,3 @@ To add all files found in a directory to a working copy of a corpus, you
 can use the
 [add\_files\_to\_corpus](CorpusTools.html#add_files_to_corpus) program.
 
-If you have a file that you want to add to the corpus repository you
-have to have a working copy of either the free or bound part of our
-corpus.
-
-1.  If it is a freely distributable document, add it to the free part of
-    our corpus. If not, add it to the bound part.
-2.  Determine the main language:
-    `eng, fin, nno, nob, sma, sme, smj, swe`, and then genre:
-    `admin, bible, facta, ficti, laws, news`.
-3.  Copy the file to the directory $working\_copy/orig/$lang/$genre/.
-    For example, a North Saami document in genre "facta", copy the file
-    to directory $working\_copy/orig/sme/facta/ in your working copy.
-    For the news genre, there are subdirectories for different sources
-    of news, and for bible texts, there are subdirectories for ot and
-    nt.
-4.  Try to convert the document using the command
-    `convert2xml <filename>`. This command converts the original
-    document to xml, and makes a default metadata document with empty
-    values.
-5.  Open the metadata file, &lt;filename&gt;.xsl and add metadata to it.
-6.  Add and commit these files to the repository by either using the svn
-    command line tools or your usual graphical interface to svn.
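
The naming scheme this commit documents (search for `corpus-xxx`, with directories named *corpus-xxx* and *corpus-xxx-orig*) can be sketched in a few lines of Python. This is only an illustration: the function name `corpus_repo_names` and the dictionary keys are hypothetical, and reading *corpus-xxx-orig* as the home of the original files and *corpus-xxx* as the home of the converted files is an assumption drawn from the names, not something the diff states outright.

```python
def corpus_repo_names(lang: str) -> dict[str, str]:
    """Build the corpus directory names for a language code such as 'sme'.

    Hypothetical helper: the mapping of '-orig' to original files and the
    bare name to converted files is an assumption based on the names alone.
    """
    code = lang.strip().lower()
    # Reject empty strings and anything that is not purely alphabetic.
    if not code.isalpha():
        raise ValueError(f"not a language code: {lang!r}")
    return {
        "search_term": f"corpus-{code}",   # what to type in the repository search field
        "converted": f"corpus-{code}",     # assumed: converted text files
        "orig": f"corpus-{code}-orig",     # assumed: original .docx, .pdf, .html, ...
    }


if __name__ == "__main__":
    for key, name in corpus_repo_names("sme").items():
        print(f"{key}: {name}")
```

For North Saami (`sme`, one of the language codes listed in the removed text), the helper yields the search term `corpus-sme` and the directory names `corpus-sme` and `corpus-sme-orig`.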