carex-climate-morpho

Methods

We generated our morphological dataset from the Cyperaceae species treatments (including varieties and subspecies) in Flora of China and Flora of North America. These treatments are written by botanical experts, who construct composite morphological descriptions by examining many specimens (see http://floranorthamerica.org/Introduction for more information). Flora of China (FoC; eFloras, 2008; http://www.efloras.org/flora_page.aspx?flora_id=2) and Flora of North America (FNA; Flora of North America Editorial Committee, 1993+; http://floranorthamerica.org) treatments were encoded as XML (Extensible Markup Language) files with minimal document structure added.

The XML files containing the treatments were processed with CharaParser (Cui, 2012; FoC parsed in 2013; FNA parsed using CharaParser version 0.1.196 in 2020). CharaParser is a tool built specifically to annotate morphological descriptions using an unsupervised machine learning algorithm and a general-purpose syntactic parser (Cui, 2012). It turns morphological descriptions into fine-grained annotated XML documents by identifying organs, characters of organs, measurements, etc. For example, the sentence "Rhizomes 3–5 mm thick." is transformed into markup such as "<biological_entity id="o20841" name="rhizome" name_original="rhizomes" src="d0_s0" type="structure"> </biological_entity>". The XML files used in this project are available in our GitHub repo: https://github.com/jocelynpender/carex-climate-morpho/tree/master/data/external.
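As a rough illustration of how such annotated markup can be read, the sketch below lists the structures found in a snippet. It uses Python's standard-library xml.etree.ElementTree rather than lxml (which the project itself uses) so that it runs without extra dependencies. The `<description>` wrapper element and the `list_structures` function are hypothetical names introduced for this example; only the `<biological_entity>` element and its attributes come from the markup quoted above.

```python
import xml.etree.ElementTree as ET

# The <biological_entity> element is quoted from the text above.
# The <description> wrapper is a hypothetical root added for this example.
SAMPLE = (
    "<description>"
    '<biological_entity id="o20841" name="rhizome" name_original="rhizomes" '
    'src="d0_s0" type="structure"> </biological_entity>'
    "</description>"
)


def list_structures(xml_text):
    """Return one dict per <biological_entity> whose type is "structure"."""
    root = ET.fromstring(xml_text)
    rows = []
    for entity in root.iter("biological_entity"):
        if entity.get("type") != "structure":
            continue
        rows.append(
            {
                "id": entity.get("id"),
                "name": entity.get("name"),
                "name_original": entity.get("name_original"),
                "src": entity.get("src"),
            }
        )
    return rows


structures = list_structures(SAMPLE)
print(structures)
```

Each returned dict could serve as one row of a table keyed by the normalized structure name, with the original wording kept alongside it.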

We used custom-built Python scripts to extract the data we needed from the annotated XML files generated by CharaParser (https://github.com/jocelynpender/carex-climate-morpho). Using the lxml 4.5.2 and pandas 1.1.2 (The pandas development team, 2021) packages, we transformed data from 612 parsed FNA XML files and 721 parsed FoC files into two CSV morphological datasets (one per flora). We mapped CharaParser structure and character names to our own using a name mapping for each flora (https://github.com/jocelynpender/carex-climate-morpho/blob/master/data/interim/fna_recode_property_names.csv; https://github.com/jocelynpender/carex-climate-morpho/blob/master/data/interim/foc_recode_property_names.csv). We included atypical measurements in our dataset (e.g., 62 mm was extracted from the sentence "leaf 13–38 (–62) mm wide" and included in the dataset). We omitted relative measurements for convenience (e.g., "proximal nonbasal bracts usually equaling or shorter than inflorescences").
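The handling of atypical and relative measurements can be sketched on raw description text as follows. This is not the extraction code used in the repository, which works from CharaParser's annotated XML; the regular expression, the `parse_measurement` function, and the output field names are all assumptions made for this example.

```python
import re

NUM = r"\d+(?:\.\d+)?"
DASH = r"[–-]"  # en dash or plain hyphen

# Hypothetical pattern: optional atypical minimum "(N–)", typical minimum,
# optional typical maximum, optional atypical maximum "(–N)", then a unit.
MEASUREMENT = re.compile(
    rf"(?:\((?P<atypical_min>{NUM}){DASH}\)\s*)?"
    rf"(?P<min>{NUM})"
    rf"(?:\s*{DASH}\s*(?P<max>{NUM}))?"
    rf"(?:\s*\({DASH}(?P<atypical_max>{NUM})\))?"
    rf"\s*(?P<unit>mm|cm|dm|m)\b"
)


def parse_measurement(sentence):
    """Return the first absolute measurement in a sentence, or None.

    Relative measurements (e.g. "shorter than inflorescences") contain no
    number followed by a unit, so they yield None and are skipped.
    """
    match = MEASUREMENT.search(sentence)
    if match is None:
        return None
    result = {"unit": match.group("unit")}
    for key in ("atypical_min", "min", "max", "atypical_max"):
        value = match.group(key)
        result[key] = float(value) if value is not None else None
    return result


leaf = parse_measurement("leaf 13–38 (–62) mm wide")
relative = parse_measurement(
    "proximal nonbasal bracts usually equaling or shorter than inflorescences"
)
print(leaf)
print(relative)
```

Keeping the atypical extremes in separate fields, instead of folding them into the typical range, leaves it to later analysis to decide whether the 62 mm value should be treated as the maximum.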

References

Cui H. 2012. CharaParser for fine-grained semantic annotation of organism morphological descriptions. Journal of the American Society for Information Science and Technology 63: 738–754.

Flora of North America Editorial Committee, eds. 1993+. Flora of North America North of Mexico. 21+ vols. New York and Oxford.

eFloras. 2008. Published on the Internet http://www.efloras.org [accessed 2013]. Missouri Botanical Garden, St. Louis, MO & Harvard University Herbaria, Cambridge, MA.

The pandas development team. 2021. pandas-dev/pandas: Pandas 1.2.2. Zenodo. doi:10.5281/zenodo.4524629.