<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Sex, death and sonnets: Musings of a software developer</title>
<style type="text/css" media="all">
body {
color: black;
font-family: serif;
font-weight: normal;
font-size: 12pt;
margin-top: 5%;
margin-left: 5%;
margin-right: 5%;
}
blockquote.abstract {
font-style: normal;
width: 80%;
font-size: 90%;
text-align: left;
margin-left: +10%;
background-color: white;
}
blockquote {
font-style: italic;
text-align: left;
margin-left: +10%;
background-color: white;
}
dl, p, li {
text-align: left;
margin-left: +1%;
margin-right: +1%;
background-color: white;
}
p.navigation {
text-align: right;
background-color: white;
}
p.right {
text-align: right;
background-color: white;
}
div#table-of-contents {
position: fixed;
font-size:80%;
left:0.1em;
top:1.0em;
width: 16em;
height: 40em;
overflow: scroll;
background-color: white;
}
div#main-content {
position: absolute;
top: 0.5em;
left: 17em;
margin-right: +2%;
background-color: white;
}
div {
text-align: left;
margin-left: +2%;
background-color: white;
}
pre {
font-family: courier,fixed;
margin-left: +2%;
width: 95%;
}
p.author {
font-style: italic;
text-align: center;
background-color: white;
}
h1.title {
font-size: 120%;
font-family: sans-serif;
font-weight: bold;
text-align: center;
background-color: white;
margin-left: +2%;
margin-right: +2%;
}
.level1 {
background-color: white;
}
.level2 {
background-color: white;
}
.level3 {
background-color: white;
}
.level4 {
background-color: white;
}
h1 {
font-weight: bold;
font-family: sans-serif;
background-color: white;
margin-left: +2%;
margin-right: +2%;
font-size: 110%;
}
h2 {
font-family: sans-serif;
font-weight: bold;
background-color: white;
border-top: none;
border-bottom: solid thin;
border-right: none;
border-left: none;
font-size: 110%;
}
h3 {
font-family: sans-serif;
font-weight: bold;
background-color: white;
border-top: none;
border-bottom: dashed thin;
border-right: none;
border-left: none;
font-size: 110%;
}
h4 {
font-family: sans-serif;
font-weight: bold;
background-color: white;
border-top: none;
border-bottom: none;
border-right: none;
border-left: none;
font-size: 110%;
}
table {
text-align: left;
width: 80%;
color: black;
margin-left: +5%;
vertical-align: top;
}
caption {
text-align: center;
color: black;
background-color: #eeeeee;
font-weight: bold;
vertical-align: top;
margin-left: +5%;
}
td {
text-align: left;
color: black;
background-color: #ffffff;
vertical-align: top;
/*border-top: solid thin;
border-bottom: none;
border-right: none;
border-left: solid thin;*/
}
th {
text-align: left;
color: black;
background-color: #ffffff;
vertical-align: top;
}
table.navigation {
text-align: center;
color: black;
background-color: white;
vertical-align: top;
}
td.left {
text-align: left;
color: black;
background-color: white;
vertical-align: top;
border: none;
}
td.center {
text-align: center;
color: black;
background-color: white;
vertical-align: top;
border:none;
background-color: white;
}
td.right {
text-align: right;
color: black;
background-color: white;
vertical-align: top;
border:none;
background-color: white;
}
span.biblAuthor {
font-variant: small-caps;
}
span.red {
color: red;
}
a:active {
color: #660000;
}
a:link {
color: #000066;
}
a:visited {
color: #660000;
}
hr {
border-top: solid thin;
border-bottom: none;
border-right: none;
border-left: none;
}
p.version {
text-align: right;
width: 30%;
margin-left: 65%;
}
</style>
</head>
<body>
<h1 class="title">Sex, death and sonnets<br/>Musings of a software developer</h1>
<p class="author">Sigfrid Lundberg<br/>slu@kb.dk<br/>Digital Transformation<br/>Royal Danish Library<br/>Post box 2149<br/>1016 Copenhagen K<br/>Denmark<br/>
</p>
<blockquote class="abstract">
<h3>Abstract</h3>
<p>This note discusses how software can recognize sonnets, by
analysis of text length, strophe structure and number of syllables
per line. It also makes a simple content analysis based on
word frequency analyses.</p>
<p>The results clearly show that simple Unix™ for Poets
analyses combine seamlessly with TEI markup and XML technologies.</p>
</blockquote>
<h2>Introduction</h2>
<p>If there are any sonnets, do they rhyme and what are they about?</p>
<p>I have for many years been a great fan of the tutorial <em>Unix™ for Poets</em> by <a href="#kennethchurch">Kenneth Ward Church.</a>
This note is an investigation of what can be done with a corpus of literary text with very simple tools similar to the ones described by Church in his tutorial.
I do not claim that there is anything novel or even significant in
this text. Being a scientist, I think like a scientist, so don't
expect any deep literary theory here.</p>
<h2>Finding poems</h2>
<p>The ADL text corpus contains <a href="#adlcorpus">literary texts.</a>
Since the texts are encoded according to the <a href="#teiguidelines">TEI guidelines</a> it is easy to find poetry in those files.
Typically a piece of poetry is encoded as <a href="#tei-ref-lg">lines within line groups</a>.
More often than not the line groups are embedded in <kbd><div> ... </div></kbd> elements.</p>
<p>A poem may look like this in the source.
The poem is by <a href="#sophus">Sophus Michaëlis (1883).</a>
</p>
<pre>
<div decls="#biblid68251">
<head>Jeg elsker —</head>
<lg>
<l>Jeg elsker Himlens høje Harmoni,</l>
<l>dens Purpurblomst, som blaaner i det Fjærne,</l>
<l>den Fred, som risler ned fra Nattens Stjerne,</l>
<l>det Glimt af Gud, der glider mig forbi;</l>
</lg>
<lg>
<l>og Evighedens tavse Melodi,</l>
<l>de svundne Slægters kaldende Orkester,</l>
<l>et Tonehav om en usynlig Mester,</l>
<l>en Klang af Gud, der bruser mig forbi;</l>
</lg>
<lg>
<l>en magisk Magt fra Hjertets mørke Celle,</l>
<l>de stærke Længsler, som mod Lyset vælde,</l>
<l>Naturens evigunge Fantasi;</l>
</lg>
<lg>
<l>det Liv, der spirer midt i selve Døden,</l>
<l>den Sol, der stiger midt i Aftenrøden,</l>
<l>— o Glimt af Gud, der glider mig forbi!</l>
</lg>
<p>
<date>12. Septbr. 1893.</date>
</p>
</div>
</pre>
<p>The default namespace is declared as
xmlns="http://www.tei-c.org/ns/1.0", which we refer to in the following
with the namespace prefix 't'.</p>
<p>The poem comprises four line groups with four, four, three
and three lines. That is a very common strophe structure
(according to the <a href="#sonnets">Sonnets</a> article
in Wikipedia), at least in Scandinavia. Sonnets are not always divided like
that, but they all contain 14 lines.</p>
<p>Shakespeare often wrote his 14 lines typographically as one
strophe, whereas Francesco Petrarca wrote them in two strophes
with eight and six lines, respectively (again see article
<a href="#sonnets">Sonnets</a> in Wikipedia).</p>
<p>To be more precise, a sonnet has one more characteristic
besides having 14 lines: the lines should be in <a href="#pentameter">iambic pentameter.</a>
</p>
<h2>Finding sonnets</h2>
<p>You can easily find all poems in the corpus based on an
XPath query like:</p>
<pre>
//t:div[t:lg and @decls]
</pre>
<p>We can use that query in XSLT like this:</p>
<pre>
<xsl:for-each select="//t:div[t:lg and @decls]">
<xsl:if test="count(.//t:lg/t:l)=14">
<!-- script's got to do what a script's got to do -->
</xsl:if>
</xsl:for-each>
</pre>
<p>So we iterate over all <kbd><div>...</div></kbd>s that have
line groups inside and a <kbd>@decls</kbd> attribute containing a
reference to metadata in the TEI header.
The latter is not universal, but we use it in ADL, and that attribute is only set on pieces that a cataloger has designated as a <em>work.</em>
The decisions as to what is a work were based on the experience of what library patrons ask for at the information desk.
Finally, we don't do anything unless there are 14 lines of poetry.
I have implemented this using the shell script <a href="https://github.com/siglun/danish-sonnets/blob/main/find_sonnet_candidates.sh">find_sonnet_candidates.sh</a> and a transform <a href="https://github.com/siglun/danish-sonnets/blob/main/sonnet_candidate.xsl">sonnet_candidate.xsl</a>.</p>
<p>This transformation creates a long table, <a href="https://github.com/siglun/danish-sonnets/blob/main/sonnet_candidates.xml">sonnet_candidates.xml</a>, with data about
the sonnet candidates it finds.</p>
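<p>For readers without an XSLT processor at hand, the same selection can be sketched in Python. This is only an illustration, not the transform linked above: the function name <kbd>sonnet_candidates</kbd> is made up for this note, and the sketch assumes that a TEI file has been parsed with the standard library module <kbd>xml.etree.ElementTree</kbd>.</p>

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
TEI = {"t": TEI_NS}  # prefix 't' as in the XPath examples


def sonnet_candidates(root):
    """Yield every t:div with a decls attribute, a child t:lg and 14 verse lines.

    Hypothetical helper mirroring //t:div[t:lg and @decls] plus the
    count(.//t:lg/t:l)=14 test; 'root' is an ElementTree element.
    """
    for div in root.iter("{" + TEI_NS + "}div"):
        if div.get("decls") is None:
            continue  # not designated as a work
        if div.find("t:lg", TEI) is None:
            continue  # no line group directly inside this div
        if len(div.findall(".//t:lg/t:l", TEI)) == 14:
            yield div
```

<p>Calling <kbd>list(sonnet_candidates(ET.parse(path).getroot()))</kbd>, where <kbd>path</kbd> names one of the TEI files, should select the same elements as the XPath query combined with the line count test.</p>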
<h2>Approximately pentametric</h2>
<p>Finding <kbd><div>...</div></kbd>s having 14 lines of poetry isn't good
enough. We are expecting iambic pentameter, aren't we? To actually analyse
the texts for their rhythmical properties is beyond me, but we could
make an approximation.</p>
<p>Iambic verse consists of feet with two syllables, i.e. if there are
five feet per line we could say that iambic verse has approximately 10
vowels per line. It is an approximation since an iamb should have the
stress on the second syllable. Due to ignorance I ignore the musical
aspect of this, so we will include false positives, since lines of poetry
with five feet need not be <strong>iambic.</strong>
</p>
<p>Anyway, this script calculates the average number of
vowels per line in poems with 14 lines:</p>
<pre>
<xsl:variable name="vowel_numbers" as="xs:integer *">
<xsl:for-each select=".//t:lg/t:l">
<xsl:variable name="vowels">
<xsl:value-of select="replace(.,'[^iyeæøauoå]','')"/>
</xsl:variable>
<xsl:value-of select="string-length($vowels)"/>
</xsl:for-each>
</xsl:variable>
<xsl:value-of select="format-number(sum($vowel_numbers) div 14, '#.####')"/>
</pre>
<p>We use the replace function and a regular expression to
remove everything in each line except the vowels. Then we
measure the string length which should equal the number of
vowels per line and add them together for all lines in the
poem. Finally we divide that sum by 14 and get the average
number of vowels per line.</p>
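<p>The same arithmetic can be sketched in Python. It rests on the same assumption as the XSLT: only the lowercase letters in the character class count as vowels, so a capital vowel at the start of a line is not counted. The function names are invented for this illustration, and the sum is divided by the actual number of lines rather than by a fixed 14.</p>

```python
import re

# Everything that is not one of the lowercase vowels used in the XSLT above.
NON_VOWELS = re.compile(r"[^iyeæøauoå]")


def vowel_count(line):
    """Number of lowercase vowels in one verse line."""
    return len(NON_VOWELS.sub("", line))


def average_vowels(lines):
    """Average number of vowels per line for a non-empty list of lines."""
    return sum(vowel_count(line) for line in lines) / len(lines)


poem_start = [
    "Jeg elsker Himlens høje Harmoni,",
    "dens Purpurblomst, som blaaner i det Fjærne,",
]
print([vowel_count(line) for line in poem_start])  # [10, 12]
```

<p>Note that the double vowel in <em>blaaner</em> counts as two, which is one reason why the average tends to land above 10.</p>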
<p>For a sonnet it would be about 10,
<a href="#hendecasyllable">or occasionally a little more.</a>
Danish is a language rich in diphthongs,
which could be another reason for lines deviating from the expected 10 vowels.
In the Michaëlis poem quoted above it is 10.4.</p>
<h2>Strophe structure</h2>
<p>You can write a lot of nice poetry with 14 lines.
Like Gustaf Munch-Petersen's <a href="https://tekster.kb.dk/text/adl-texts-munp1-shoot-workid62017">en borgers livshymne</a> with one strophe with one line,
then three strophes with four lines and finally a single line.
The number of syllables per line seems to decrease towards the end.
Gustaf was a modernist. There are no fixed structures and very few rhymes in his poetry.</p>
<p>You can easily find out the strophe structure for each poem:</p>
<pre>
<xsl:variable name="lines_per_strophe" as="xs:integer *">
<xsl:for-each select=".//t:lg[t:l]">
<xsl:value-of select="count(t:l)"/>
</xsl:for-each>
</xsl:variable>
<xsl:value-of select="$lines_per_strophe"/>
</pre>
<p>That is, iterate over the line groups in a poem, and count the lines
in each of them.</p>
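<p>As a hedged illustration, the counting can also be written in Python with <kbd>xml.etree.ElementTree</kbd>. The function name <kbd>strophe_structure</kbd> is an assumption of this sketch, and <kbd>poem</kbd> is expected to be the element of one poem.</p>

```python
import xml.etree.ElementTree as ET

TEI = {"t": "http://www.tei-c.org/ns/1.0"}  # prefix 't' as in the XPath examples


def strophe_structure(poem):
    """Return the number of lines in each line group, in document order."""
    counts = []
    for lg in poem.findall(".//t:lg", TEI):
        n = len(lg.findall("t:l", TEI))
        if n:  # like t:lg[t:l]: ignore line groups without direct t:l children
            counts.append(n)
    return counts
```

<p>For the Michaëlis poem quoted earlier this should return the list 4, 4, 3, 3, which can be joined with spaces to get the notation used in the table.</p>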
<p>I have summarized these data about all poems in ADL with 14 lines.
There are 243 of them (there might be more, but then they have erroneous markup).</p>
<p>You find these sonnet candidates in the table <a href="https://github.com/siglun/danish-sonnets/blob/main/sonnet_candidates.xml">sonnet_candidates.xml.</a>
An extract from it follows.</p>
<table>
<caption>Sonnet candidates</caption>
<tr>
<th>File name (link to source)</th>
<th>Title (link to view)</th>
<th>xml:id</th>
<th>metadata reference</th>
<th>Strophe structure</th>
<th>average number of vowels per line</th>
</tr>
<tr>
<td>
<a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/aarestrup07val.xml">./aarestrup07val.xml</a>
</td>
<td>
<a href="https://tekster.kb.dk/text/adl-texts-aarestrup07val-shoot-workid73888">Jeg havde faaet Brev fra dig, Nanette</a>
</td>
<td>workid73888</td>
<td>#biblid73888</td>
<td>4 4 3 3</td>
<td>11.0</td>
</tr>
<tr>
<td>
<a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/aarestrup07val.xml">./aarestrup07val.xml</a>
</td>
<td>
<a href="https://tekster.kb.dk/text/adl-texts-aarestrup07val-shoot-workid75376">Tag dette Kys, og tusind til, du Søde ...</a>
</td>
<td>workid75376</td>
<td>#biblid75376</td>
<td>4 4 3 3</td>
<td>11.0714</td>
</tr>
<tr>
<td>
<a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/aarestrup07val.xml">./aarestrup07val.xml</a>
</td>
<td>
<a href="https://tekster.kb.dk/text/adl-texts-aarestrup07val-shoot-workid76444">Sonet</a>
</td>
<td>workid76444</td>
<td>#biblid76444</td>
<td>4 4 3 3</td>
<td>11.5</td>
</tr>
<tr>
<td>
<a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/./brorson03grval.xml">./brorson03grval.xml</a>
</td>
<td>
<a href="https://tekster.kb.dk/text/adl-texts-brorson03grval-shoot-workid76607">1.</a>
</td>
<td>workid76607</td>
<td>#biblid76607</td>
<td>14</td>
<td>8.7143</td>
</tr>
<tr>
<td>
<a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/claussen07val.xml">./claussen07val.xml</a>
</td>
<td>
<a href="https://tekster.kb.dk/text/adl-texts-claussen07val-shoot-workid63580">SKUMRING</a>
</td>
<td>workid63580</td>
<td>#biblid63580</td>
<td>14</td>
<td>10.8571</td>
</tr>
<tr>
<td>
<a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/claussen07val.xml">./claussen07val.xml</a>
</td>
<td>
<a href="https://tekster.kb.dk/text/adl-texts-claussen07val-shoot-workid66036">TAAGE OG REGNDAGE</a>
</td>
<td>workid66036</td>
<td>#biblid66036</td>
<td>4 4 3 3</td>
<td>13.9286</td>
</tr>
<tr>
<td>
<a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/claussen07val.xml">./claussen07val.xml</a>
</td>
<td>
<a href="https://tekster.kb.dk/text/adl-texts-claussen07val-shoot-workid66131">MAANENS TUNGSIND</a>
</td>
<td>workid66131</td>
<td>#biblid66131</td>
<td>4 4 3 3</td>
<td>13.8571</td>
</tr>
<tr>
<td>
<a href="https://github.com/kb-dk/public-adl-text-sources/blob/master/texts/jacobjp08val.xml">./jacobjp08val.xml</a>
</td>
<td>
<a href="https://tekster.kb.dk/text/adl-texts-jacobjp08val-shoot-workid63094">I Seraillets Have</a>
</td>
<td>workid63094</td>
<td>#biblid63094</td>
<td>14</td>
<td>6.7143</td>
</tr>
</table>
<p>Sophus Claussen's first poem may or may not be a sonnet,
Brorson's poem is not. All of those with strophe structure 4
4 3 3 are definitely sonnets, as implied by strophe
structure and the "approximately pentametric" number of
vowels per line (and, by the way, Aarestrup often points out
that he is actually writing sonnets in text or titles).</p>
<h2>Then we have the rhymes</h2>
<p>Beauty is in the eye of the beholder, as the saying goes. I believe that
the saying is right. Then, however, I would like to add that the rhymes and
meters of poetry (like the pentameter) are in the ear of the listener. It
is time consuming to read hundreds of poems aloud and figure out the
rhyme structure. So an approximate idea of the rhymes could be had by
comparing the verse line endings.</p>
<p>This is error prone, though. Consider this <a href="https://tekster.kb.dk/text/adl-texts-moeller01val-shoot-workid62307">sonnet by P.M. Møller</a>.</p>
<div id="">
<p>
<small>SONET</small>
</p>
<p>
Den Svend, som Tabet af sin elskte frister,<br/>
Vildfremmed vanker om blandt Jordens Hytter;<br/>
Med Haab han efter Kirkeklokken lytter,<br/>
Som lover ham igen, hvad her han mister.<br/>
</p>
<p>
Men næppe han med en usalig bytter,<br/>
Hvis Hjerte, stedse koldt for Elskov, brister,<br/>
Som sig uelsket gennem Livet lister,<br/>
Hans Armod kun mod Tabet ham beskytter.<br/>
</p>
<p>
Til Livets Gaade rent han savner Nøglen,<br/>
Hver Livets Blomst i Hjærtets Vinter fryser,<br/>
Han gaar omkring med underlige Fagter.<br/>
</p>
<p>
Ræd, Spøgelser han ser, naar Solen lyser,<br/>
Modløs og syg, foragtet han foragter<br/>
Det skønne Liv som tom og ussel Gøglen.<br/>
</p>
</div>
<p>The last syllable of each of the first eight lines is the same, '-ter'. If
you use some script to compare the endings you'll only find single
syllable rhymes and miss the double syllable ones. I.e., you can
erroneously categorize feminine rhymes (with two syllables) as
masculine ones (with one syllable). (Sorry, I don't know a
politically correct vocabulary for these concepts.)</p>
<p>In order to understand what we hear when reading, we have to consider
'-ister' and '-ytter'. I.e., it starts with rhyme structure 'abbabaab'
not 'aaaaaaaa'. Furthermore, it continues 'cdedec'.</p>
<p>I have written a set of scripts that traverse the
<a href="https://github.com/siglun/danish-sonnets/blob/main/sonnet_candidates.xml">sonnet_candidates.xml</a>
table.
Transforming that file using <a href="https://github.com/siglun/danish-sonnets/blob/main/iterate_the_rhyming.xsl">iterate_the_rhyming.xsl</a>
selects poems with 14 lines and strophe structure 4 4 3 3.
It generates a shell script which, when executed, pipes the content through other scripts that retrieve the poems,
remove punctuation and finally detag them.
The actual text is then piped through a Perl script that
analyses the endings according to the silly and flawed method described
above.</p>
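<p>The Perl script itself is not reproduced here. As a rough illustration of the method, and of how it fails, here is a minimal Python sketch. The function names are hypothetical, and the test data are the line-final words of the Møller sonnet quoted above. Comparing the last three characters merges '-ister' and '-ytter', while four characters happen to separate them for this particular poem.</p>

```python
import string


def line_ending(line, n):
    """Lowercased last n characters of a line, ignoring trailing punctuation."""
    return line.rstrip(string.punctuation + string.whitespace).lower()[-n:]


def rhyme_scheme(lines, n):
    """Give each distinct n-character ending a letter, in order of appearance.

    Works for up to 26 distinct endings, which is plenty for 14 lines.
    """
    letters = {}
    scheme = []
    for line in lines:
        ending = line_ending(line, n)
        if ending not in letters:
            letters[ending] = string.ascii_lowercase[len(letters)]
        scheme.append(letters[ending])
    return "".join(scheme)


moeller = ["frister,", "Hytter;", "lytter,", "mister.",
           "bytter,", "brister,", "lister,", "beskytter.",
           "Nøglen,", "fryser,", "Fagter.",
           "lyser,", "foragter", "Gøglen."]
print(rhyme_scheme(moeller, 3))  # aaaaaaaabcacab
print(rhyme_scheme(moeller, 4))  # abbabaabcdedec
```

<p>Four characters are no general cure. The number of characters that matches what the ear hears depends on the words involved.</p>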
<p>It works, sort of, until it doesn't. For poems with 4
4 3 3 strophe structure, you can find the result in <a href="https://github.com/siglun/danish-sonnets/blob/main/rhymes_3chars.text">rhymes_3chars.text</a> and <a href="https://github.com/siglun/danish-sonnets/blob/main/rhymes_2chars.text">rhymes_2chars.text</a> for three
and two letter rhymes, respectively. Run </p>
<pre>
grep -P '^[a-q]{14}' rhymes_3chars.text | sort | uniq -c | sort -rn
</pre>
<p>to get a list of rhyme structures and their frequencies. The rhyme
structures that occur more than twice are:</p>
<pre>
6 abbaabbacdecde
5 abbaabbacdcdcd
4 abcaadeafgghii
4 abbaabbacdcede
3 abcaadeafghgig
</pre>
<p>This silly algorithm misses a lot of order in the remaining chaos, but it
does actually find two of the most common rhyme structures for sonnets:</p>
<pre>abbaabbacdcdcd</pre>
<p>and</p>
<pre>abbaabbacdecde</pre>
<p>So while it may fail more often than it succeeds, the successes give
results that are reasonable.</p>
<p>The rhyme structure abbaabbacdecde is one of the most
common ones found. It is also one of the so-called Petrarchan
rhyme schemes (<a href="#everysonnet">Eberhart, 2018</a>).</p>
<h2>What are the sonnets about?</h2>
<p>Any piece of art is meant to be consumed by humans. Poems should
ideally be understood when read aloud and listened to. By humans.</p>
<p>The cliché says that art and literature are about what it means to be
human. Could we therefore hypothesize that the sonnets address this
from the point of view of dead Danish male poets who wrote sonnets
some 100 – 200 years ago?</p>
<p>Assume that, at least as a first approximation, the words chosen by
poets mirror those subjects. For instance, if being human implies
mortality, we could, on a statistical level, hypothesize that words like
"mourning", "grief", "death", "grave", etc. appear in the sonnet corpus
more than in a random sample of text. The opposites would also be
expected: Concepts related to "love", "birth", "compassion" belong
to the sphere of being human.</p>
<p>I have detagged the poems with 14 lines and strophe structure 4 4 3 3,
tokenized their texts and calculated the word frequencies. As a matter
of fact, I've done that in two ways:</p>
<p>(i) The first is a classical tokenization followed by
piping the result through</p>
<pre>
sort | uniq -c | sort -n
</pre>
<p>such that I get a list of the 4781 Danish words that are used in our
sonnet sample, sorted by their frequencies.</p>
<p>(ii) The second way is the same, but done in two passes: first once for each
sonnet, so that I get a list of the distinct words in each sonnet, then once
more for the concatenation of those lists for all sonnets.</p>
<p>This means that I get </p>
<ul>
<li>one list of word frequencies in the entire sample and </li>
<li>a second list giving not the number of occurrences of each word, but the number of sonnets the word occurs in.</li>
</ul>
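<p>Both lists can be sketched in a few lines of Python. This is an illustration only: the tokenizer below (lowercase, runs of letters) is an assumption and may differ from the tokenization used for the linked files, so it would not necessarily reproduce the exact counts reported here. The function names are made up for the example.</p>

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into runs of letters (crude on purpose)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def frequencies(sonnets):
    """Return (occurrences per word, number of sonnets per word)."""
    total = Counter()       # how often each word occurs in the whole sample
    in_sonnets = Counter()  # in how many sonnets each word occurs
    for text in sonnets:
        words = tokenize(text)
        total.update(words)
        in_sonnets.update(set(words))  # each word counts once per sonnet
    return total, in_sonnets
```

<p>Here <kbd>sonnets</kbd> is a list with the detagged text of one sonnet per item. The second counter can never exceed the number of sonnets, which is why the most frequent word tops out at 160.</p>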
<p>There are 160 sonnets in the selection, and the most frequent word occurs in all of them.
These are the fifteen most common words measured by the <a href="https://github.com/siglun/danish-sonnets/blob/main/poem_frequencies.text">number of sonnets they occur in</a>.
The number of sonnets is in the left column.</p>
<pre>
75 du
76 sig
82 er
85 jeg
86 det
89 for
94 den
101 paa
104 en
105 af
106 til
119 som
122 med
150 i
160 og
</pre>
<p>and this is the list of the same thing,
but measured as the grand total <a href="https://github.com/siglun/danish-sonnets/blob/main/frequencies.text">occurrence of the words in the corpus</a>.
The number of occurrences in the corpus is in the left column.</p>
<pre>
109 min
130 for
144 du
148 er
155 paa
164 til
167 det
169 den
173 af
206 en
217 med
229 som
246 jeg
382 i
588 og
</pre>
<p>As you can see this corroborates the established observation that the
most frequent words in a corpus hardly ever describe the subject
matter of texts (the words are conjunctions, pronouns,
prepositions and the like). The distribution of the number of sonnets
the words appear in:</p>
<div id="">
<img src="https://github.com/siglun/danish-sonnets/raw/main/distro.png"/>
</div>
<p>The distribution shows number of words graphed against
number of sonnets. There are 3304 words occurring in just one
sonnet. The leftmost, and highest, point on the graph has the
coordinate (1,3304).</p>
<p>There is just one word appearing in all 160 sonnets. It is
'og' meaning 'and' corresponding to the rightmost point on the
graph which has the coordinate (160,1). As a rule of thumb the
most common words are all conjunctions, next to them come
prepositions and after those come pronouns.</p>
<p>The <a href="https://github.com/siglun/danish-sonnets/blob/main/distribution.text">distribution.text</a>
is generated from <a href="https://github.com/siglun/danish-sonnets/blob/main/poem_frequencies.text">poem_frequencies.text</a>
using (the line has been folded)</p>
<pre>
sed 's/\ [a-z]*$//' poem_frequencies.text | sort | uniq -c |
sort -n -k 2 > distribution.text
</pre>
<p>See above. Column 1 is plotted against column 2.</p>
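<p>The same tabulation can be sketched in Python, assuming a mapping from each word to the number of sonnets it occurs in. The function name is hypothetical. Note that the pairs come out as (number of sonnets, number of words), i.e. in the order of the graph coordinates rather than in the column order of the file.</p>

```python
from collections import Counter


def distribution(in_sonnets):
    """Count how many distinct words occur in exactly k sonnets, for each k.

    'in_sonnets' maps a word to the number of sonnets containing it.
    Returns (k, number of words) pairs sorted by k.
    """
    return sorted(Counter(in_sonnets.values()).items())


print(distribution({"og": 3, "i": 3, "død": 1}))  # [(1, 1), (3, 2)]
```

<p>Applied to the real data, the first pair should correspond to the leftmost point of the graph and the last pair to the rightmost one.</p>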
<p>In this particular corpus, it seems that <strong>aboutishness</strong> starts at words occurring in about 25% of the sonnets, or less.
I.e., words occurring in 40 sonnets, or fewer.</p>
<p>In what follows,
I have simply used the utility <kbd>grep</kbd> to find words and derivatives in the file <a href="https://github.com/siglun/danish-sonnets/blob/main/poem_frequencies.text">poem_frequencies.text</a> mentioned above.</p>
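<p>For completeness, here is a small Python stand-in for that kind of search. It assumes that each line of the frequency file consists of a count followed by a word, as in the extracts shown in this note. The function name <kbd>grep_words</kbd> is invented for the example.</p>

```python
def grep_words(lines, stem):
    """Return (count, word) pairs for 'count word' lines whose word contains stem."""
    hits = []
    for line in lines:
        fields = line.split()
        if len(fields) == 2 and stem in fields[1]:
            hits.append((int(fields[0]), fields[1]))
    return hits


sample = ["   1 udødeliges", "   9 død", "  11 døde", " 160 og"]
print(grep_words(sample, "død"))  # [(1, 'udødeliges'), (9, 'død'), (11, 'døde')]
```

<p>Like <kbd>grep</kbd>, this is a plain substring match, so it will also pick up words that merely happen to contain the stem.</p>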
<p>As an example we have death, dead, lethal etc. (basically
words containing <em>død</em>) in a number of
sonnets. The left column gives the number of sonnets containing
the word. The most common form, <em>døde</em>, appears in about 7% of the sonnets (11 of 160).</p>
<pre>
1 dødehavet
1 dødeklokker
1 dødelige
1 dødeliges
1 dødningvuggeqvad
1 dødsberedthed
1 glemselsdøden
1 udødeliges
2 dødes
5 dødens
9 død
9 døden
11 døde
</pre>
<p>There are interesting derivatives and compound words on the list,
like <em>dødsberedthed</em>, meaning preparedness for death.
<em>Glemselsdøden</em> refers, I believe, to the death or disappearance that
follows when the traces or memories of someone who belonged to past generations disappear.</p>
<p>Love (elskov) is not as popular as death (about 5% of the sonnets).</p>
<pre>
1 elskoven
1 elskovsbrev
1 elskovsbrevet
2 elskovsild
6 elskovs
7 elskov
</pre>
<p>
<em>Elskovsild</em> means the fire of
love. <em>Elskovsbrev</em> has to be love
letter. Women (<em>kvinde</em>) are not as
popular as love:</p>
<pre>
1 dobbeltkvinde
1 kvindens
1 kvindetække
4 kvinder
</pre>
<p>Men occur more than women, in particular in words implying bravery and male virtues:</p>
<pre>
1 baadsmandstrille
1 dobbeltmand
1 ejermand
1 manddom
1 manddomstrods
1 manden
2 mand
2 manddoms
5 mandens
</pre>
<p>Remember that these sonnets are by men.
<em>Manddom</em> implies a man's existence as a grown-up man.
Originally,
in <a href="#oldnorse">Old Norse</a>,
<em>mand</em> meant,
just as in Old English,
human.
That, however, was when it was doubtful if women were actually human.
<em>Baadsmandstrille</em> is a derivative of <em>baadsmand</em> (boatswain), which is another name for a sailor or petty officer.
A <em>baadsmandstrille</em> is presumably a song sung by sailors.</p>
<p>Graves occur, for some reason, less often than deaths:</p>
<pre>
1 begravet
1 graven
1 gravene
1 gravhøi
1 indgraves
3 grav
3 grave
4 gravens
</pre>
<p><em>Indgraves</em> is most likely a kind of <em>homonym</em>; if you look up that sonnet it is
clear that it means engrave. There are both the verb in the past tense,
<em>begravet</em> (buried) from <em>begrave</em> (as in bury), and <em>grav</em> (as in
grave) and <em>gravhøi</em> (tumulus).</p>
<h2>Conclusions</h2>
<p>I think I could go on studying this for quite some
time. However, I have to conclude this here, before the actual
conclusions. There are interesting things to find here,
though.
Some of them are possible to study using simple methods,
such as those described by <a href="#kennethchurch">Kenneth Ward Church</a>
in his
<em>Unix™ for Poets</em>.</p>
<p>The preliminary result from my armchair text processing exercise supports the
notion that life was already in early modern Europe about sex, death
and rock 'n' roll. Since rock wasn't there just yet, people had to be
content with sonnets for the time being.</p>
<h2>References</h2>
<p id="kennethchurch">
<span class="biblAuthor">Church, Kenneth Ward</span>,
[date unknown].
<em>Unix™ for Poets</em>. <a href="https://web.stanford.edu/class/cs124/kwc-unix-for-poets.pdf">https://web.stanford.edu/class/cs124/kwc-unix-for-poets.pdf</a>
</p>
<p id="adlcorpus">
<span class="biblAuthor">Det Kgl. Bibliotek</span>, and <span class="biblAuthor">Det Danske Sprog- og Litteraturselskab</span>,
2000 - 2022.
<em>The ADL text corpus</em>. <a href="https://github.com/kb-dk/public-adl-text-sources">https://github.com/kb-dk/public-adl-text-sources</a>
</p>
<p id="everysonnet">
<span class="biblAuthor">Eberhart, Larry</span>,
2018.
Italian or Petrarchan Sonnet. In: <em>Every Sonnet: The sonnet forms database</em>. <a href="https://poetscollective.org/everysonnet/tag/abbaabbacdecde/#post-119">https://poetscollective.org/everysonnet/tag/abbaabbacdecde/#post-119</a>
</p>
<p id="hendecasyllable"> Hendecasyllable. In: <em>Wikipedia</em>. <a href="https://en.wikipedia.org/wiki/Hendecasyllable">https://en.wikipedia.org/wiki/Hendecasyllable</a>
</p>
<p id="pentameter"> Iambic pentameter. In: <em>Wikipedia</em>. <a href="https://en.wikipedia.org/wiki/Iambic_pentameter">https://en.wikipedia.org/wiki/Iambic_pentameter</a>
</p>
<p id="sophus">
<span class="biblAuthor">Michaëlis, Sophus</span>,
1883.
Jeg elsker —. In: <em>Solblomster</em>. <a href="https://tekster.kb.dk/text/adl-texts-michs_03-shoot-workid68251">https://tekster.kb.dk/text/adl-texts-michs_03-shoot-workid68251</a>
</p>
<p id="oldnorse"> Old Norse. In: <em>Wikipedia</em>. <a href="https://en.wikipedia.org/wiki/Old_Norse">https://en.wikipedia.org/wiki/Old_Norse</a>
</p>
<p id="sonnets"> Sonnet. In: <em>Wikipedia</em>. <a href="https://en.wikipedia.org/wiki/Sonnet">https://en.wikipedia.org/wiki/Sonnet</a>
</p>
<p id="teiguidelines">
<span class="biblAuthor">The TEI Consortium</span>,
2022.
<em>TEI P5: Guidelines for Electronic Text Encoding and Interchange</em>. <a href="https://tei-c.org/release/doc/tei-p5-doc/en/html/index.html">https://tei-c.org/release/doc/tei-p5-doc/en/html/index.html</a>
</p>
<p id="tei-ref-lg">
<span class="biblAuthor">The TEI Consortium</span>,
2022.
Passages of Verse or Drama. In: <em>TEI P5: Guidelines for Electronic Text Encoding and Interchange</em>. <a href="https://tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#CODV">https://tei-c.org/release/doc/tei-p5-doc/en/html/CO.html#CODV</a>
</p>
</body>
</html>