We will use tf-idf Vector Space Modelling (VSM) of documents to measure the similarity between the bug report and each source code file. For this hands-on exercise, we will skip the usual pre-processing stages and apply only English stop-word filtering.
We will use scikit-learn to implement the vectorization and the similarity measurement.
The provided `irfl.py` file has a skeleton to implement the IRFL heuristic. For the tf-idf vectorisation, we will use the `TfidfVectorizer` from the `sklearn` package (`sklearn.feature_extraction.text.TfidfVectorizer`). The API documentation is here. Note that you can pass a list of filenames to the vectorizer (by constructing it with `input="filename"`) instead of the file contents. This is why step 1 is to collect all filenames. Step 2 is to use `TfidfVectorizer` to get the vector representations.
- Collect all documents (i.e., the bug report and all source files):
- Compute tf-idf vectors of each document
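The two steps above could be sketched as follows. This is only a minimal illustration, not the content of the provided skeleton: the function name `vectorize_files` is hypothetical, and `decode_error="ignore"` is an assumption added so that source files with unusual encodings do not abort the run.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def vectorize_files(paths):
    """Return (vectorizer, matrix) with one tf-idf row per file in `paths`.

    `paths` is a list of file names; the vectorizer opens and reads
    each file itself because of input="filename".
    """
    vectorizer = TfidfVectorizer(
        input="filename",        # items passed to fit_transform are file names
        stop_words="english",    # built-in English stop-word list
        decode_error="ignore",   # assumption: skip undecodable bytes
    )
    matrix = vectorizer.fit_transform(paths)  # sparse matrix, rows follow `paths`
    return vectorizer, matrix
```

If the bug report path is placed first in `paths`, row 0 of the returned matrix is the bug report vector and the remaining rows are the source files, in the same order as the list.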
Given a matrix (i.e., a vector of vectors), you can use the pairwise `cosine_similarity` function from `sklearn` (`sklearn.metrics.pairwise.cosine_similarity`), whose documentation is here.
- Compute cosine similarity between each pair of vectors
- Rank source files using the similarity
- Report the top five files
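The ranking steps above might look like the sketch below. To stay self-contained it takes document contents directly rather than file names, and it compares only the bug report against the source files instead of computing the full pairwise matrix, which is sufficient for ranking. The function name `rank_files` and its parameters are hypothetical, not part of the provided skeleton.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def rank_files(bug_report_text, sources, top_n=5):
    """Rank source files by cosine similarity to the bug report.

    `sources` maps a file name to its text content.
    Returns up to `top_n` (file name, score) pairs, best match first.
    """
    names = list(sources)
    # Row 0 is the bug report; rows 1.. are the source files.
    documents = [bug_report_text] + [sources[name] for name in names]
    matrix = TfidfVectorizer(stop_words="english").fit_transform(documents)

    # Similarity of the bug report (1 x n_terms) to every source file.
    scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()

    # Sort by descending score; a stable sort keeps input order on ties.
    order = np.argsort(-scores, kind="stable")[:top_n]
    return [(names[i], float(scores[i])) for i in order]
```

Calling `rank_files(report, sources)` with the default `top_n=5` yields the five files to report; a score of 0 means the file shares no non-stop-word terms with the bug report.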