GitHub - kreeedit/trace: TRACE - Text Reuse Analysis and Comparison Engine

TRACE - Text Reuse Analysis and Comparison Engine

TRACE is a simple Python script that compares the similarities between different text files using two methods: shingling and sentence embeddings. It allows you to specify the directory containing the text (txt) files. It also creates a network graph of the text similarities to see the relations of the different texts. The result of the analytics is stored in a json file (similarity_results.json). TRACE can be useful for tasks such as plagiarism detection, document clustering, or identifying related documents in large text collections.

USAGE

TRACE takes the following command-line arguments:

--dir1 (required): Path to the first directory containing .txt files

--dir2 (required): Path to the second directory containing .txt files

--s (optional, default=5): Number of words or characters per shingle

--mode (required, default="word"): Shingling mode, either 'word' or 'character'

--t (optional, default=1): Threshold for minimum number of shared shingles

--m (optional): Use SentenceTransformer model. If specified without a value, it uses the default model "intfloat/multilingual-e5-large"

--similarity_threshold (optional, default=0.85): Similarity threshold for SentenceTransformer comparison

These arguments allow users to customize the text comparison process, including the directories to compare, the shingling method, and whether to use additional semantic similarity analysis with a SentenceTransformer model.

python trace.py --dir1 /path/to/first/directory --dir2 /path/to/second/directory --s 5 --mode word --t 3 --m "sentence-transformers/all-MiniLM-L6-v2" --similarity_threshold 0.8

What this command will do:

Process all .txt files in both specified directories, creating 5-word shingles for each file.
Perform an initial comparison using these shingles, identifying pairs of texts that share at least 3 shingles.
For the text pairs that pass the initial shingling comparison, it will use the specified SentenceTransformer model to calculate semantic similarity between the complete docuemnts.
Keep only the text pairs whose semantic similarity score is above 0.8.
Generate a network graph visualizing the similarities between texts.
Save the results to a JSON file.

result.json (sample)

[
  {
    "dir1_file": "2Thessalonians_chapter_3.txt",
    "dir2_file": "PannHOSB_1320_IV_charter.txt",
    "similarity": 0.9021626114845276,
    "shared_shingles": [
      "domini nostri iesu christi"
    ],
    "num_shared_shingles": 1
  },

Network graph

The generated GEXF file can be opened by a variety of graph visualization app, such as Gephi, Cytoscape, and Orange. These applications allow you to view the network data as a graph, and to explore the relationships between the nodes and edges. However, the script also generates a preliminary visualization picture (text_similarity_network.png)

Name		Name	Last commit message	Last commit date
Latest commit History 30 Commits
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md
daffy-trace.gif		daffy-trace.gif
requirements.txt		requirements.txt
trace.py		trace.py
visualization.png		visualization.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TRACE - Text Reuse Analysis and Comparison Engine

USAGE

result.json (sample)

Network graph

About

Releases 2

Packages

Languages

License

kreeedit/trace

Folders and files

Latest commit

History

Repository files navigation

TRACE - Text Reuse Analysis and Comparison Engine

USAGE

result.json (sample)

Network graph

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Languages

Packages