sonnet-analysis.html

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
   <head>
      <title>Sex, death and sonnetsMusings of a software developer</title>
      <style type="text/css" media="all">
body {
    color: black;
    font-family: serif;
    font-weight: normal;
    font-size: 12pt;
    margin-top: 5%;
    margin-left: 5%;
    margin-right: 5%;
}

blockquote.abstract {
    font-style: normal;
    width: 80%;
    font-size: 90%;
    text-align: left;
    margin-left: +10%;
    background-color: white;
}

blockquote {
    font-style: italic;
    text-align: left;
    margin-left: +10%;
    background-color: white;
}

dl, p, li {
    text-align: left;
    margin-left:  +1%;
    margin-right: +1%;
    background-color: white;
}

p.navigation {
    text-align: right;
    background-color: white;
}

p.right {
    text-align: right;
    background-color: white;
}

div#table-of-contents {
    position: fixed;
    font-size:80%;
    left:0.1em;
    top:1.0em;
    width: 16em;
    height: 40em;
    overflow: scroll;
    background-color: white;
}

div#main-content {
    position: absolute;
    top: 0.5em;
    left: 17em;
    margin-right: +2%;
    background-color: white;
}


div {
    text-align: left;
    margin-left:  +2%;
    background-color: white;
}


pre {
    font-family: courier,fixed;
    margin-left:  +2%;
    width: 95%;
}

p.author {
    font-style: italic;
    text-align: center;
    background-color: white;
}

h1.title {
    font-size: 120%;
    font-family: sans-serif;
    font-weight: bold;
    text-align: center;
    background-color: white;
    margin-left:  +2%;
    margin-right: +2%;
}

.level1 {
    background-color: white;
}

.level2 {
    background-color: white;
}

.level3 {
    background-color: white;
}

.level4 {
    background-color: white;
}

h1 {
    font-weight: bold;
    font-family: sans-serif;
    background-color: white;
    margin-left:  +2%;
    margin-right: +2%;
    font-size: 110%;
}

h2 {
    font-family: sans-serif;
    font-weight: bold;
    background-color: white;
    border-top: none;
    border-bottom: solid thin;
    border-right: none;
    border-left: none;
    font-size: 110%;
}


h3 {
    font-family: sans-serif;
    font-weight: bold;
    background-color: white;
    border-top: none;
    border-bottom: dashed thin;
    border-right: none;
    border-left: none;
    font-size: 110%;
}

h4 {
    font-family: sans-serif;
    font-weight: bold;
    background-color: white;
    border-top: none;
    border-bottom: none;
    border-right: none;
    border-left: none;
    font-size: 110%;
}

table {
    text-align: left;
    width: 80%;
    color: black;
    margin-left: +5%;
    vertical-align: top;
}

caption {
    text-align: center;
    color: black;
    background-color: #eeeeee;
    font-weight: bold;
    vertical-align: top;
    margin-left: +5%;
}

td {
    text-align: left;
    color: black;
    background-color: #ffffff;
    vertical-align: top;
/*border-top: solid thin;
border-bottom: none;
border-right: none;
border-left: solid thin;*/
}

th {
    text-align: left;
    color: black;
    background-color: #ffffff;
    vertical-align: top;
}

table.navigation {
    text-align: center;
    color: black;
    background-color: white;
    vertical-align: top;
}

td.left {
    text-align: left;
    color: black;
    background-color: white;
    vertical-align: top;
    border: none;
}

td.center {
    text-align: center;
    color: black;
    background-color: white;
    vertical-align: top;
    border:none;
    background-color: white;
}

td.right {
    text-align: right;
    color: black;
    background-color: white;
    vertical-align: top;
    border:none;
    background-color: white;
}

span.biblAuthor {
    font-variant: small-caps;
}

span.red {
    color: red;
}

a:active {
    color: #660000;
}

a:link {
    color: #000066;
}

a:visited {
    color: #660000;
}

hr {
    border-top: solid thin;
    border-bottom: none;
    border-right: none;
    border-left: none;
} 

p.version {
    text-align: right;
    width: 30%;
    margin-left:  65%;
}
</style>
   </head>
   <body>
      <h1 class="title">Sex, death and sonnets<br/>Musings of a software developer</h1>
      <p class="author">Sigfrid Lundberg<br/>slu@kb.dk<br/>Digital Transformation<br/>Royal Danish Library<br/>Post box 2149<br/>1016 Copenhagen K<br/>Denmark<br/>
      </p>
      <blockquote class="abstract">
         <h3>Abstract</h3>
         <p>This note discusses how software can recognize sonnets, by
	analysis of text length, strophe structure and number of syllables
	per line. It also makes a simple content analysis based on
	word frequency analyses.</p>
         <p>The results clearly shows that simple Unix™ for Poets
        analyses combines seamlessly with TEI markup and XML technologies.</p>
      </blockquote>
      <h2>Introduction</h2>
      <p>If there are any sonnets, do they rhyme and what are they about?</p>
      <p>I have since many years been a great fan of the tutorial <em>Unix™ for Poets</em> by <a href="#kennethchurch">Kenneth Ward Church.</a>
        This note is an investigation of what can be done with a corpus of literary text with very simple tools similar to the ones described by Church in his tutorial.
        I do not claim that there is anything novel or even significant in
        this text. Being a scientist, I think like a scientist and don't
        expect any deep literary theory here.</p>
      <h2>Finding poems</h2>
      <p>The ADL text corpus contains <a href="#adlcorpus">literary texts.</a>
        Since the texts are encoded according to the <a href="#teiguidelines">TEI guidelines</a> it is easy to find poetry in those files.
        Typically a piece of poetry is encoded as <a href="#tei-ref-lg">lines within line groups</a>.
        More often than not the line groups are embedded in <kbd>&lt;div&gt; ... &lt;/div&gt;</kbd> elements.</p>
      <p>A poem may look like this in the source.
        The poem is by <a href="#sophus">Sophus Michaëlis (1883).</a>
      </p>
      <pre>
&lt;div decls="#biblid68251"&gt;
   &lt;head&gt;Jeg elsker —&lt;/head&gt;
   &lt;lg&gt;
      &lt;l&gt;Jeg elsker Himlens høje Harmoni,&lt;/l&gt;
      &lt;l&gt;dens Purpurblomst, som blaaner i det Fjærne,&lt;/l&gt;
      &lt;l&gt;den Fred, som risler ned fra Nattens Stjerne,&lt;/l&gt;
      &lt;l&gt;det Glimt af Gud, der glider mig forbi;&lt;/l&gt;
   &lt;/lg&gt;
    &lt;lg&gt;
      &lt;l&gt;og Evighedens tavse Melodi,&lt;/l&gt;
      &lt;l&gt;de svundne Slægters kaldende Orkester,&lt;/l&gt;
      &lt;l&gt;et Tonehav om en usynlig Mester,&lt;/l&gt;
      &lt;l&gt;en Klang af Gud, der bruser mig forbi;&lt;/l&gt;
   &lt;/lg&gt;
   &lt;lg&gt;
      &lt;l&gt;en magisk Magt fra Hjertets mørke Celle,&lt;/l&gt;
      &lt;l&gt;de stærke Længsler, som mod Lyset vælde,&lt;/l&gt;
      &lt;l&gt;Naturens evigunge Fantasi;&lt;/l&gt;
   &lt;/lg&gt;
   &lt;lg&gt;
      &lt;l&gt;det Liv, der spirer midt i selve Døden,&lt;/l&gt;
      &lt;l&gt;den Sol, der stiger midt i Aftenrøden,&lt;/l&gt;
      &lt;l&gt;— o Glimt af Gud, der glider mig forbi!&lt;/l&gt;
   &lt;/lg&gt;
   &lt;p&gt;
      &lt;date&gt;12. Septbr. 1893.&lt;/date&gt;
   &lt;/p&gt;
&lt;/div&gt;
</pre>
      <p>The default name space is declared as
        xmlns="http://www.tei-c.org/ns/1.0", which we in following refer to
        with the namespace prefix 't'.</p>
      <p>The poem comprises four line groups with four, four, three
        and three lines. That is a very common strophe structure
        (according to the <a href="#sonnets">Sonnets</a> article
        in Wikipedia), at least in Scandinavia. It is not always like
        that, but they all contain 14 lines.</p>
      <p>Shakespeare wrote often his 14 lines typographically in one
        strophe, whereas Francesco Petrarca wrote them in two strophes
        with eight and six lines, respectively (again see article
        <a href="#sonnets">Sonnets</a> in Wikipedia).</p>
      <p>To be more precise, a sonnet has one more characteristics
        than having 14 lines, the lines should be in <a href="#pentameter">iambic pentameter.</a>
      </p>
      <h2>Finding sonnets</h2>
      <p>You can easily find all poems in the corpus based on a
        XPATH query like:</p>
      <pre> 
        //t:div[t:lg and @decls]
        </pre>
      <p>We can use that query in XSLT like this:</p>
      <pre> 
        &lt;xsl:for-each select="//t:div[t:lg and @decls]"&gt;
           &lt;xsl:if test="count(.//t:lg/t:l)=14"&gt;
              &lt;!-- script's got to do what a script's got to do --&gt;
           &lt;/xsl:if&gt;
        &lt;/xsl:for-each&gt;
        </pre>
      <p>So we iterate over all <kbd>&lt;div&gt;...&lt;/div&gt;</kbd>s having
        line groups inside and have a `@decls` attribute containing a
        reference to metadata in the TEI header.
        The latter is not universal, but we use it in ADL and that attribute is only set on pieces that a cataloger has designated as a <em>work.</em>
        The decisions as to what is a work was based on the experience of what library patrons ask for at the information desk.
        I have implemented this using the shell script <a href="https://github.com/siglun/danish-sonnets/blob/main/find_sonnet_candidates.sh">find_sonnet_candidates.sh</a> and a transform <a href="https://github.com/siglun/danish-sonnets/blob/main/sonnet_candidate.xsl">sonnet_candidate.xsl</a>.
        Finally, we don't do anything unless there are 14 lines of poetry.</p>
      <p>This transformation creates a long, <a href="https://github.com/siglun/danish-sonnets/blob/main/sonnet_candidates.xml">sonnet_candidates.xml</a>, table with data about
        the sonnet candidates it finds.</p>
      <h2>Approximately pentametric</h2>
      <p>Finding <kbd> &lt;div&gt;...&lt;/div&gt;</kbd>s having 14 lines of poetry isn't good
        enough. We are expecting iambic pentameter, don't we? To actually analyse
        the texts for their rythmical properties is beyond me, but we could
        make an approximation.</p>
      <p>Iambic verse consists of feet with two syllables, i.e. if there are
        five feet per line we could say that iambic verse has approximately 10
        vowels per line. It is an approximation since a iamb should have the
        stress on the second syllable (due to ignorance I ignore the musical
        aspect of this; we will include false positives since lines of poetry
        with five feet must not be <strong>iambic.</strong>
      </p>
      <p>Any way, this script calculates the average number of
        vowels per line in poems with 14 lines:</p>
      <pre> 
        &lt;xsl:variable name="vowel_numbers" as="xs:integer *"&gt;
           &lt;xsl:for-each select=".//t:lg/t:l"&gt;
              &lt;xsl:variable name="vowels"&gt;
                 &lt;xsl:value-of select="replace(.,'[^iyeæøauoå]','')"/&gt;
              &lt;/xsl:variable&gt;
              &lt;xsl:value-of select="string-length($vowels)"/&gt;
           &lt;/xsl:for-each&gt;
        &lt;/xsl:variable&gt;
        &lt;xsl:value-of select="format-number(sum($vowel_numbers) div 14, '#.####')"/&gt;
        </pre>
      <p>We use the replace function and a regular expression to
        remove everything in each line except the vowels. Then we
        measure the string length which should equal the number of
        vowels per line and add them together for all lines in the
        poem. Finally we divide that sum with 14 and get the average
        number of vowels per line.</p>
      <p>For a sonnet it would be about 10,
        <a href="#hendecasyllable">or occasionally a little more.</a>
        Danish is a language rich in diftons,
        which could be another reason for lines deviating from the expected 10 vowels.
        In the Michaëlis poem quoted above it is 10.4.</p>
      <h2>Strophe structure</h2>
      <p>You can write a lot of nice poetry with 14 lines.
        Like Gustaf Munch-Petersen's <a href="https://tekster.kb.dk/text/adl-texts-munp1-shoot-workid62017">en borgers livshymne</a> with one strophe with one line,
        then three strophes with four lines and finally a single line.
        The number of syllables per line seem to decrease towards the end.
        Gustaf was a modernist. There are no fixed structures and very few rhymes i his poetry.</p>
      <p>You can easily find out the strophe structure for each poem:</p>
      <pre> 
        &lt;xsl:variable name="lines_per_strophe" as="xs:integer *"&gt;
           &lt;xsl:for-each select=".//t:lg[t:l]"&gt;
              &lt;xsl:value-of select="count(t:l)"/&gt;
           &lt;/xsl:for-each&gt;
        &lt;/xsl:variable&gt;
        &lt;xsl:value-of select="$lines_per_strophe"/&gt;
        </pre>
      <p>That is, iterate over the line groups in a poem, and count the lines
        in each of them.</p>
      <p>I have summarized these data about all poems in ADL with 14lines.
        There are 243 of them (there might be more, but then they have erroneous markup).</p>
      <p>You find these sonnet candidates in a table here <a href="https://github.com/siglun/danish-sonnets/blob/main/sonnet_candidates.xml">sonnet_candidates.xml.</a>
        Please, find an extract from it below.</p>
      <table>
         <h2>sonnet candidates</h2>
         <tr>
            <th>File name (link to source)</th>
            <th>Title (link to view)</th>
            <th>xml:id</th>
            <th>metadata reference</th>
            <th>Strophe structure</th>
            <th>average number of vowels per line</th>
         </tr>
         <tr>
            <td>
               <a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/aarestrup07val.xml">./aarestrup07val.xml</a>
            </td>
            <td>
               <a href="https://tekster.kb.dk/text/adl-texts-aarestrup07val-shoot-workid73888">Jeg havde faaet Brev fra dig, Nanette</a>
            </td>
            <td>workid73888</td>
            <td>#biblid73888</td>
            <td>4 4 3 3</td>
            <td>11.0</td>
         </tr>
         <tr>
            <td>
               <a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/aarestrup07val.xml">./aarestrup07val.xml</a>
            </td>
            <td>
               <a href="https://tekster.kb.dk/text/adl-texts-aarestrup07val-shoot-workid75376">Tag dette Kys, og tusind til, du Søde ...</a>
            </td>
            <td>workid75376</td>
            <td>#biblid75376</td>
            <td>4 4 3 3</td>
            <td>11.0714</td>
         </tr>
         <tr>
            <td>
               <a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/aarestrup07val.xml">./aarestrup07val.xml</a>
            </td>
            <td>
               <a href="https://tekster.kb.dk/text/adl-texts-aarestrup07val-shoot-workid76444">Sonet</a>
            </td>
            <td>workid76444</td>
            <td>#biblid76444</td>
            <td>4 4 3 3</td>
            <td>11.5</td>
         </tr>
         <tr>
            <td>
               <a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/./brorson03grval.xml">./brorson03grval.xml</a>
            </td>
            <td>
               <a href="https://tekster.kb.dk/text/adl-texts-brorson03grval-shoot-workid76607">1.</a>
            </td>
            <td>workid76607</td>
            <td>#biblid76607</td>
            <td>14</td>
            <td>8.7143</td>
         </tr>
         <tr>
            <td>
               <a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/claussen07val.xml">./claussen07val.xml</a>
            </td>
            <td>
               <a href="https://tekster.kb.dk/text/adl-texts-claussen07val-shoot-workid63580">SKUMRING</a>
            </td>
            <td>workid63580</td>
            <td>#biblid63580</td>
            <td>14</td>
            <td>10.8571</td>
         </tr>
         <tr>
            <td>
               <a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/claussen07val.xml">./claussen07val.xml</a>
            </td>
            <td>
               <a href="https://tekster.kb.dk/text/adl-texts-claussen07val-shoot-workid66036">TAAGE OG REGNDAGE</a>
            </td>
            <td>workid66036</td>
            <td>#biblid66036</td>
            <td>4 4 3 3</td>
            <td>13.9286</td>
         </tr>
         <tr>
            <td>
               <a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/claussen07val.xml">./claussen07val.xml</a>
            </td>
            <td>
               <a href="https://tekster.kb.dk/text/adl-texts-claussen07val-shoot-workid66131">MAANENS TUNGSIND</a>
            </td>
            <td>workid66131</td>
            <td>#biblid66131</td>
            <td>4 4 3 3</td>
            <td>13.8571</td>
         </tr>
         <tr>
            <td>
               <a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/jacobjp08val.xml">./jacobjp08val.xml</a>
            </td>
            <td>
               <a href="https://tekster.kb.dk/text/adl-texts-jacobjp08val-shoot-workid63094">I Seraillets Have</a>
            </td>
            <td>workid63094</td>
            <td>#biblid63094</td>
            <td>14</td>
            <td>6.7143</td>
         </tr>
      </table>
      <p>Sophus Claussen's first poem may or may not be a sonnet,
        Brorson's poem is not. All of those with strophe structure 4
        4 3 3 are definitely sonnets, as implied by strophe
        structure and the "approximately pentametric" number of
        vowels per line (and, by the way, Aarestrup often points out
        that he is actually writing sonnets in text or titles).</p>
      <h2>Then we have the rhymes</h2>
      <p>Beauty is in the eye of the beholder, says Shakespeare. I believe that
        he is right. Then, however, I would like to add that the rhymes and
        meters of poetry (like the pentameter) is in the ear of listener. It
        is time consuming to read houndreds of poems aloud and figure out the
        rhyme structure. So an approximate idea of the rhymes could be have
        comparing the verse line endings.</p>
      <p>This is error prone, though. Consider this <a href="https://tekster.kb.dk/text/adl-texts-moeller01val-shoot-workid62307">sonnet by P.M. Møller</a>.</p>
      <div id="">
         <p>
            <small>SONET</small>
         </p>
         <p>
            Den Svend, som Tabet af sin elskte frister,<br/>
            Vildfremmed vanker om blandt Jordens Hytter;<br/>
            Med Haab han efter Kirkeklokken lytter,<br/>
            Som lover ham igen, hvad her han mister.<br/>
         </p>
         <p>
            Men næppe han med en usalig bytter,<br/>
            Hvis Hjerte, stedse koldt for Elskov, brister,<br/>
            Som sig uelsket gennem Livet lister,<br/>
          Hans Armod kun mod Tabet ham beskytter.<br/>
         </p>
         <p>
            Til Livets Gaade rent han savner Nøglen,<br/>
            Hver Livets Blomst i Hjærtets Vinter fryser,<br/>
            Han gaar omkring med underlige Fagter.<br/>
         </p>
         <p>
            Ræd, Spøgelser han ser, naar Solen lyser,<br/>
            Modløs og syg, foragtet han foragter<br/>
            Det skønne Liv som tom og ussel Gøglen.<br/>
         </p>
      </div>
      <p>The the last syllable of the eight first lines are the same '-ter'. If
        you use some script to compare the endings you'll only find single
        syllable rhymes and miss double syllable ones rhymes. I.e., you can
        erroneously categorize feminine rhymes (with two syllables) as
        masculine ones (with one syllable). (Sorry, I don't know a
        politically correct vocabulary for these concepts.)</p>
      <p>In order to understand what we hear when reading, we have to consider
        '-ister' and '-ytter'. I.e., it starts with rhyme structure 'abbabaab'
        not 'aaaaaaaa'. Furthermore, it continues 'cdedec'.</p>
      <p>I have written a set of scripts that traverse the
        <a href="https://github.com/siglun/danish-sonnets/blob/main/sonnet_candidates.xml">sonnet_candidates.xml</a>
        table.
        Transform that file using <a href="https://github.com/siglun/danish-sonnets/blob/main/iterate_the_rhyming.xsl">iterate_the_rhyming.xsl</a>
        selects poems with 14 lines and strophe structure 4 4 3 3.
        It generates a shell script which when executed pipes the content through other scripts that retrieve content,
        remove punctuation and finally detags them.
        The actual text is then piped through a perl script that
        analyse the endings according to the silly and flawed method described
        above.</p>
      <p>It works, sort of, until it doesn't. For poems with 4
        4 3 3 strophe structure, you can find the result in <a href="https://github.com/siglun/danish-sonnets/blob/main/rhymes_3chars.text">rhymes_3chars.text</a> and <a href="https://github.com/siglun/danish-sonnets/blob/main/rhymes_2chars.text">rhymes_2chars.text</a> for three
        and two letter rhymes, respectively. Run </p>
      <pre> 
        grep -P '^[a-q]{14}' rhymes_3chars.text   | sort | uniq -c | sort -rn
        </pre>
      <p>to get a list of rhyme structure and their frequencies. The rhyme
        structures that occur more than twice are:</p>
      <pre>
        6 abbaabbacdecde
        5 abbaabbacdcdcd
        4 abcaadeafgghii
        4 abbaabbacdcede
        3 abcaadeafghgig
        </pre>
      <p>This silly algorithm does actually give two of the most common rhyme structure
        for sonnets, but misses a lot of order in the remaining chaos:</p>
      <pre>abbaabbacdcdcd</pre>
      <p>and</p>
      <pre>abbaabbacdecde</pre>
      <p>So while it may fail more often than it succeeds, the successes give
        results that are reasonable.</p>
      <p>The rhyme structure abbaabbacdecde is one is the most
        common ones found.  Also it is one of the socalled Petrarchan
        rhyme schemes (<a href="#everysonnet">Eberhart, 2018</a>).</p>
      <h2>What are the sonnets about?</h2>
      <p>Any piece of art is meant to be consumed by humans. Poems should
        ideally be understood when read aloud and listened to. By humans.</p>
      <p>The cliché says that art and literature is about what it means to be
        human. Could we therefore hypothesize that the sonnets address this
        from the point of view of dead Danish male poets who wrote sonnets
        some 100 – 200 years ago?</p>
      <p>Assume that, at least as a first approximation, the words chosen by
        poets mirror those subjects. For instance, if being human implies
        lethality, we could, on a statistical level hypothesize that words like
        "mourning", "grief", "death", "grave", etc appear in the sonnet corpus
        more than in a random sample of text. The opposites would also be
        expected: Concepts related to "love", "birth", "compassion" belong
        to the sphere of being human.</p>
      <p>I have detagged the poems with 14 lines and strophe structure 4 4 3 3,
        tokenized their texts and calculated the word frequencies. As a matter
        of fact, I've done that in two ways:</p>
      <p>(i) The first being doing a classical tokenization followed by 
        piping the stuff through</p>
      <pre> 
        sort | uniq -c | sort -n
        </pre>
      <p>such that I get a list of the 4781 Danish words that are used in our
        sonnet sample, sorted by their frequencies.</p>
      <p>(ii) The second way is the same, but I do it twice, once for each
        sonnet such that I get a list of words for each sonnet. Then I repeat
        that for the concatenated lists for all sonnets.</p>
      <p>This means that I get </p>
      <ul>
         <li>one list of word frequencies in the entire sample and </li>
         <li>a second list giving not of the number of occurences of each word, but the number of sonnets the word occurs in.</li>
      </ul>
      <p>There are 160 sonnets in the selection, and the most frequent word occurs in all of them.
        These are the fifteen most commont word measured by the <a href="https://github.com/siglun/danish-sonnets/blob/main/poem_frequencies.text">number of sonnets they occur in</a>.
        Number of poems in the left column.</p>
      <pre> 
        75 du
        76 sig
        82 er
        85 jeg
        86 det
        89 for
        94 den
        101 paa
        104 en
        105 af
        106 til
        119 som
        122 med
        150 i
        160 og
        </pre>
      <p>and this is the list of the same thing,
        but measured as the grand total <a href="https://github.com/siglun/danish-sonnets/blob/main/frequencies.text">occurrence of the words in the corpus</a>.
        Number of words in corpus in left column.</p>
      <pre> 
        109 min
        130 for
        144 du
        148 er
        155 paa
        164 til
        167 det
        169 den
        173 af
        206 en
        217 med
        229 som
        246 jeg
        382 i
        588 og
        </pre>
      <p>As you can see this corroborates the established observation that the
        most frequent words in a corpus hardly ever describes the subject
        matter of texts (the words are conjunctions, pronouns,
        prepositions and the like). The distribution of the number of sonnets
        the words appear in:</p>
      <div id="">
         <img src="https://github.com/siglun/danish-sonnets/raw/main/distro.png"/>
      </div>
      <p>The distribution shows number of words graphed against
        number of sonnets.  There are 3304 words occurring in just one
        sonnet. The leftmost, and highest, point on the graph has the
        coordinate (1,3304).</p>
      <p>There is just one word appearing in all 160 sonnets. It is
        'og' meaning 'and' corresponding to the rightmost point on the
        graph which has the coordinate (160,1). As a rule of thumb the
        most common words are all conjunctions, next to them comes
        prepositions and after those come pronomina.</p>
      <p>The <a href="https://github.com/siglun/danish-sonnets/blob/main/distribution.text">distribution.text</a>
        is generated from <a href="https://github.com/siglun/danish-sonnets/blob/main/poem_frequencies.text">poem_frequencies.text</a>
        using (the line has been folded)</p>
      <pre> 
        sed 's/\ [a-z]*$//' poem_frequencies.text | sort | uniq -c | 
        sort -n -k 2 &gt; distribution.text
        </pre>
      <p>See above. Column 1 is plotted against column 2.</p>
      <p>In this particular corpus, it seems that <strong>aboutishness</strong> start at words occuring in about 25% of the sonnets, or less.
        I.e., words occuring in 40 sonnets, or fewer.</p>
      <p>In what follows,
        I have simply used the utility <kbd>grep</kbd> find words and derivates in the file <a href="https://github.com/siglun/danish-sonnets/blob/main/poem_frequencies.text">poem_frequencies.text</a> mentioned above.</p>
      <p>As example we have death, dead and lethal etc (basically
        words containing <em>død</em>) in a number of
        sonnets. In the left column the number of sonnets containing
        the word. These appear in about 7% of the sonnets.</p>
      <pre> 
        1 dødehavet
        1 dødeklokker
        1 dødelige
        1 dødeliges
        1 dødningvuggeqvad
        1 dødsberedthed
        1 glemselsdøden
        1 udødeliges
        2 dødes
        5 dødens
        9 død
        9 døden
        11 døde
        </pre>
      <p>There are interesting derivatives and compound words on the list.
        Like <em>dødsberedthed</em> meaning preparedness for death.
        <em>Glemselsdøden</em> refers, I believe, to the death or disappearance due
        to the disappearance of traces or memories of someone who belonged to generations.</p>
      <p>Love (elskov) is not as popular as death (about 5% of the sonnets).</p>
      <pre> 
        1 elskoven
        1 elskovsbrev
        1 elskovsbrevet
        2 elskovsild
        6 elskovs
        7 elskov
        </pre>
      <p>
         <em>elskovsild</em> means the fire of
        love. <em>elskovsbrev</em> has to be love
        letter. <em>women (kvinde)</em> are not as
        popular as love</p>
      <pre> 
        1 dobbeltkvinde
        1 kvindens
        1 kvindetække
        4 kvinder
        </pre>
      <p>Men more than women, and in particular words implying bravery and male virtues</p>
      <pre> 
        1 baadsmandstrille
        1 dobbeltmand
        1 ejermand
        1 manddom
        1 manddomstrods
        1 manden
        2 mand
        2 manddoms
        5 mandens
        </pre>
      <p>Remember that these sonnets are by men.
        mandom implies a man's existence as a grownup man.
        Originally,
        in <a href="#oldnorse">old norse</a>,
        mand meant,
        just as in Old English,
        human.
        That, however, was when it was doubtful if women were actually human.
        Baadsmandstrille is a derivative of baadsmand (boatswain) which is another name for a sailor or petty officer.
        A baadsmandstrille is presumably a song sung by sailors.</p>
      <p>Graves occur, for some reason, less than deaths</p>
      <pre> 
        1 begravet
        1 graven
        1 gravene
        1 gravhøi
        1 indgraves
        3 grav
        3 grave
        4 gravens
        </pre>
      <p>indgraves is most likely a kind of <em>homonym</em>, if you look up that sonnet it is
        clear that it means engrave. There both the verb in past tense
        begravet (buried) from begrave (as in bury) and grav (as in
        grave) and gravhøi (tumulus).</p>
      <h2>Conclusions</h2>
      <p>I think I could go on studying this for quite some
        time. However, I have to conclude this here, before the actual
        conclusions. There are interesting things to find here,
        though.
        Some of them are possible to study using simple methods,
        such as those described by <a href="#kennethchurch">Kenneth Ward Church</a>
        in his
        <em>Unix™ for Poets</em>.</p>
      <p>The preliminary result from my armchair text processing exercise supports the
        notion that life was already in early modern Europe about sex, death
        and rock n'roll. Since rock wasn't there just yet, people had to be
        content with sonnets for the time being.</p>
      <h2>References</h2>
      <p id="kennethchurch">
         <span class="biblAuthor">Church, Kenneth Ward</span>,
      [date unknown].
      <em>Unix™ for Poets</em>. <a href="https://web.stanford.edu/class/cs124/kwc-unix-for-poets.pdf">https://web.stanford.edu/class/cs124/kwc-unix-for-poets.pdf</a>
      </p>
      <p id="adlcorpus">
         <span class="biblAuthor">Det Kgl. Bibliotek</span>,  and <span class="biblAuthor">Det Danske Sprog- og Litteraturselskab</span>,
      2000 - 2022.
      <em>The ADL text corpus</em>. <a href="https://github.com/kb-dk/public-adl-text-sources">https://github.com/kb-dk/public-adl-text-sources</a>
      </p>
      <p id="everysonnet">
         <span class="biblAuthor">Eberhart, Larry</span>,
      2018.
      Italian or Petrarchan Sonnet. In: <em>Every Sonnet: The sonnet forms database</em>. <a href="https://poetscollective.org/everysonnet/tag/abbaabbacdecde/#post-119">https://poetscollective.org/everysonnet/tag/abbaabbacdecde/#post-119</a>
      </p>
      <p id="hendecasyllable"> Hendecasyllable. In: <em>Wikipedia</em>. <a href="https://en.wikipedia.org/wiki/Hendecasyllable">https://en.wikipedia.org/wiki/Hendecasyllable</a>
      </p>
      <p id="pentameter"> Iambic pentameter. In: <em>Wikipedia</em>. <a href="https://en.wikipedia.org/wiki/Iambic_pentameter">https://en.wikipedia.org/wiki/Iambic_pentameter</a>
      </p>
      <p id="sophus">
         <span class="biblAuthor">Michaëlis, Sophus</span>,
      1883.
      Jeg elsker —. In: <em>Solblomster</em>. <a href="https://tekster.kb.dk/text/adl-texts-michs_03-shoot-workid68251">https://tekster.kb.dk/text/adl-texts-michs_03-shoot-workid68251</a>
      </p>
      <p id="oldnorse"> Old Norse. In: <em>Wikipedia</em>. <a href="https://en.wikipedia.org/wiki/Old_Norse">https://en.wikipedia.org/wiki/Old_Norse</a>
      </p>
      <p id="sonnets"> Sonnet. In: <em>Wikipedia</em>. <a href="https://en.wikipedia.org/wiki/Sonnet">https://en.wikipedia.org/wiki/Sonnet</a>
      </p>
      <p id="teiguidelines">
         <span class="biblAuthor">The TEI Consortium</span>,
      2022.
      <em>TEI P5: Guidelines for Electronic Text Encoding and Interchange</em>. <a href="https://tei-c.org/release/doc/tei-p5-doc/en/html/index.html">https://tei-c.org/release/doc/tei-p5-doc/en/html/index.html</a>
      </p>
      <p id="tei-ref-lg">
         <span class="biblAuthor">The TEI Consortium</span>,
      2022.
      Passages of Verse or Drama. In: <em>TEI P5: Guidelines for Electronic Text Encoding and Interchange</em>. <a href="https://tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#CODV">https://tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#CODV</a>
      </p>
      <pre/>
   </body>
</html>