mmi-parser
is a rust command line tool (crate) for parsing out Fielded MetaMap Indexing (MMI) output from the National Library of Medicine's (NLM) MetaMap tool into jsonlines data.
The primary reference for the Fielded MMI output can be found here.
! Requires MetaMap 2016 (or newer) due to changes in MMI formatting !
Due to the (relatively) technical nature of running the MetaMap program (locally requires command line familiarity), it is assumed users will also be able to install and work with other command line tools (i.e. cargo).
This project uses Rust to parse the Fielded MMI output into jsonlines annotated data. While not entirely a different structure, MMI was chosen as the input and jsonlines was chosen as the output for a few reasons.
MMI is by far the most dense/compressed human-readable version of MetaMap output, so it makes logical sense to use as input to the parser.
MMI output is often put into separate .txt
files for each record being run through MetaMap. MMI output also contains one line per concept found. Jsonlines allows us to keep this 1:1 ratio. Each input .txt
file will have exactly one jsonlines output file with _parsed
suffixed to the file name to clarify it is parsed output. Jsonlines also has the added benefit of maintaining the 1:1 (concept:line) ratio that the original MMI output has. Thus each jsonline can be tracked to a line in the source (MMI output) text file. This helps with tracing results. Jsonlines, compared to traditional json, also allows file buffer reading which can be a benefit when scanning large files. Finally, while MMI already has fields associated with various parts of the text, jsonlines makes these implicit associations explicit in field names.
One drawback of outputting jsonlines is that the resulting data structure is quite nested (although that is unavoidable due to the highly nested nature of MMI which is used to keep the output dense). While this isn't a problem for data modeling, it may introduce some complications, for example, when trying to analyze the data in a tabular format like dataframes.
For example:
data/sample.txt
-->data/sample_parsed.jsonl
Where the first line indata/sample_parsed.jsonl
will represent the first (or last depending on MetaMap) construct found in the source text document but will always match the first line indata/sample.txt
.
It is worth noting that some MetaMap pipelines produce
.txt
files with a header line indicating when the file was written. Please remove these lines BEFORE running this tool.
If you need an alternative output, perhaps for a non-technical researcher, I recommend looking at jq.
Cargo is available as a part of the Rust toolchain and is readily available via curl + sh combo (see here).
To install the mmi-parser application, utilize cargo:
cargo install mmi-parser
If you also need MetaMap installed you can find instructions on how to do so here.
There is also an API available on crates.io here. The scope of this API is limited to reduce maintenance burden as the primary goal of this project was an executable parser.
The API can be installed to your local Rust project by simply adding the crate to your Cargo.toml
:
mmi-parser = "1.1.0"
Usage of MetaMap can be found extensively documented on the NLM's website or more directly in this document.
For our use case, we are going to assume you use a command similar in functionality to:
echo lung cancer | metamap -N > metamap_results.txt
The -N
flag here is important as it signals ot MetaMap to use the MMI output.
This should result in an output similar to the following:
/home/nanthony/public_mm/bin/SKRrun.20 /home/nanthony/public_mm/bin/metamap20.BINARY.Linux --lexicon db -Z 2020AA -N
USER|MMI|5.18|Carcinoma of lung|C0684249|[neop]|["LUNG CANCER"-tx-1-"lung cancer"-noun-0]|TX|0/11|
USER|MMI|5.18|Malignant neoplasm of lung|C0242379|[neop]|["Lung Cancer"-tx-1-"lung cancer"-noun-0]|TX|0/11|
USER|MMI|5.18|Primary malignant neoplasm of lung|C1306460|[neop]|["Lung cancer"-tx-1-"lung cancer"-noun-0]|TX|0/1
As you can see the output is prefaced with a log-line of my metamap installation. This line must be removed BEFORE running the mmi-parser.
In other words, we expect metamap_results.txt
to contain:
USER|MMI|5.18|Carcinoma of lung|C0684249|[neop]|["LUNG CANCER"-tx-1-"lung cancer"-noun-0]|TX|0/11|
USER|MMI|5.18|Malignant neoplasm of lung|C0242379|[neop]|["Lung Cancer"-tx-1-"lung cancer"-noun-0]|TX|0/11|
USER|MMI|5.18|Primary malignant neoplasm of lung|C1306460|[neop]|["Lung cancer"-tx-1-"lung cancer"-noun-0]|TX|0/11|
You could try to hack your way around piping the output of MetaMap to this tool but it is beyond the scope for our use case.
I would recommend sed
to remove these headers. While it is not the most performant option, its use is straightforward. Simply:
sed -i '1d' <target folder>/*.txt
The -i
removes the headers in place, and the '1d
simply means delete the first line.
Once you have some MetaMap output, you can parse it into jsonlines simply by specifying the folder in which your output lives. The mmi-parser
will go through each line in each .txt
file in the specified directory and parse it into jsonlines.
For example, in this repo there is a provided data/AA_sample.txt
which contains the sample MMI output from the explanatory document linked at the top of this file.
You can run mmi-parser
on this file by simply running:
cargo run --example parse_aa
This runs examples/parse.rs
which passes data
as the target directory to the mmi-parser
tool.
You can do the same for MmiOutput type lines by using the mmi example:
cargo run --example parse_mmi
which loads data/MMI_sample.txt
and outputs data/MMI_sample_parsed.jsonl
.
You can then see the jsonlines output in your data/sample_parsed.jsonl
which is also provided in this repo.
When running the full program (i.e. mmi-parser <FOLDER>
), the different result types will be auto-detected for you.
The tool will also show you any errors it detects and provide the file name and the line of the error in addition to the line itself. While this information is very helpful, it can sometimes be obscured by the progress bar depending on your terminal settings. Therefore it is recommended to run the program using a log-file to capture the logs while keeping the progress bar visible for sanity. For example:
mmi-parser data > errors.log
would redirect all of the messages/output to the log file where you can scan/read it for more information on the results.
It is important to note that there are two distinct output types even though three were described in the source file.
The obvious main MMI type and then we combined the remaining AA/UA types into one (AA).
In the jsonlines output you will see the first key presents the type associated with that MetaMap output line. This helps with building models/types to represent each of the possibilities and also makes for quick eye-examinations.
If you wish to use the mmi-parser crate in your application the easiest and most convenient method is to create an MmiOutput
or AaOutput
type by passing a string reference (most likely a single line of fielded MMI data from a file). The parse_record()
function will decide which of these types the record belongs to and assemble the type for you. 😃
Full API documentation can be found on docs.rs.
For analytical purposes, I would suggest combining all of these jsonlines files into one larger file and then you can process it with a tool like jq or Python - Pandas depending on your use case. 🙂
If you encounter any issues or need support please either contact @nanthony007 or open an issue.
Pull requests are welcome. For major changes, please open an issue first to discuss what you would like to change.
Please make sure to update tests as appropriate.
See CONTRIBUTING.md for more details. 😃