[Feature] Show similar mods on mod page #424

Open
HebaruSan opened this issue Nov 6, 2021 · 5 comments · May be fixed by #438
Assignees
Labels
  • Area: Backend (Related to the Python code that runs inside gunicorn)
  • Area: Frontend (Related to HTML, JS, CSS, or other browser things)
  • Area: Infrastructure (Related to server stuff outside gunicorn, especially ATS)
  • Area: Migration (Related to Alembic database migrations)
  • Priority: Low
  • Type: Feature

Comments

@HebaruSan
Contributor

HebaruSan commented Nov 6, 2021

Motivation

Currently users can only find mods based on the featured list, creation/update time, overall popularity, and a (currently rather poor) text search. These features are only available via the mod listing pages specifically made for them. If a user happens to open a mod page from off-site, there is no easy path to finding more mods they might like.

image

Suggestion

We could add a Similar Mods list at the bottom of the mod page that would show a few (6? 12? unlimited paginated?) mods ranked by how similar they are to the main mod. Visually, it should be a pretty simple matter to re-use the existing mod box styling and functionality, kind of like:

image

Data model

I imagine implementing this with a new ModsSimilarity table to store similarities:

| Column | Purpose |
|---|---|
| main_mod_id | Stores Mod.id of one of the mods being compared |
| other_mod_id | Stores Mod.id of the other mod being compared |
| similarity | A number that is larger for more similar mods and smaller for less similar mods |

An index of (main_mod_id, similarity DESC) would allow us to quickly get the mods most similar to a given mod from other_mod_id of the rows returned. We would have to create two rows per pair of mods under this model, with the id values swapped in the two *_mod_id columns, but I think that may be the least bad approach anyway.
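As a rough sketch of the table, index, and lookup described above: this uses the standard-library sqlite3 module only so that it runs on its own. The SQL types, the primary key, the index name, and the helper functions `store_similarity` and `most_similar` are assumptions for illustration; the real version would presumably be a SQLAlchemy model plus an Alembic migration.

```python
import sqlite3

# Hypothetical schema: column names follow the proposal above; the types,
# primary key, and index name are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mods_similarity (
        main_mod_id  INTEGER NOT NULL,
        other_mod_id INTEGER NOT NULL,
        similarity   REAL    NOT NULL,
        PRIMARY KEY (main_mod_id, other_mod_id)
    );
    CREATE INDEX ix_mods_similarity_main_sim
        ON mods_similarity (main_mod_id, similarity DESC);
""")


def store_similarity(mod_a: int, mod_b: int, similarity: float) -> None:
    """Store one comparison as two rows, with the ids swapped."""
    conn.executemany(
        "INSERT OR REPLACE INTO mods_similarity VALUES (?, ?, ?)",
        [(mod_a, mod_b, similarity), (mod_b, mod_a, similarity)],
    )


def most_similar(mod_id: int, limit: int = 6) -> list:
    """Return the other_mod_id values for a mod, most similar first."""
    rows = conn.execute(
        "SELECT other_mod_id FROM mods_similarity"
        " WHERE main_mod_id = ? ORDER BY similarity DESC LIMIT ?",
        (mod_id, limit),
    )
    return [row[0] for row in rows]


store_similarity(1, 2, 0.8)
store_similarity(1, 3, 0.3)
store_similarity(2, 3, 0.5)
print(most_similar(1))  # [2, 3]
print(most_similar(3))  # [2, 1]
```

Storing both directions means the lookup only ever filters on main_mod_id, which is what lets the single (main_mod_id, similarity DESC) index serve it.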

With 2913 mods currently in the db (counting deleted ones because I don't have an easy way to exclude them), there would be up to 2913² = 8,485,569 rows in the table.

Calculating similarity values

We would probably base the similarity on a weighted sum of comparisons of these columns:

  • Mod.game_id - 1 if same, 0 if different
  • [Mod.user_id, SharedAuthor.user_id] (the authors) - 1 if all authors are same, 0 if all authors are different, fractions for partial matches
  • Mod.name
  • Mod.short_description
  • Mod.description
  • Mod.default_version.changelog (most recent changelog, maybe)
  • Mod.background (image files, maybe)
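A minimal sketch of how the per-column comparisons might be combined, assuming each comparison already yields a score between 0 and 1. The weight values and the `combine_scores` helper are hypothetical, and dividing by the total weight (so the result also stays between 0 and 1) is a design choice of this sketch rather than part of the proposal.

```python
# Hypothetical weights; real values would need tuning against actual mods.
WEIGHTS = {
    "game": 1.0,
    "authors": 2.0,
    "name": 3.0,
    "short_description": 2.0,
    "description": 1.0,
}


def combine_scores(scores: dict) -> float:
    """Weighted average of per-column scores, each expected in [0, 1].

    Only the columns present in `scores` contribute, so optional
    comparisons (changelog, background image) can simply be left out.
    """
    total_weight = sum(WEIGHTS[column] for column in scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(WEIGHTS[column] * score for column, score in scores.items())
    return weighted / total_weight


# Same game, half of the authors shared: (1.0 * 1.0 + 2.0 * 0.5) / 3.0
print(combine_scores({"game": 1.0, "authors": 0.5}))
```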

Ideally we would delegate the comparison of the string columns to a machine learning library with an interface like:

def get_string_similarity(s1: str, s2: str) -> float:
    """Compare the strings with AI and return their similarity."""
    raise NotImplementedError  # interface sketch only

There are many such open source libraries, including for Python, but so far I have not found one that would make it that easy. They generally would require us to:

  • Maintain a lexicon of known words, which would probably have to be stored in its own new table to keep it consistent between runs
  • Tokenize the input strings into words and then into numbers using the lexicon
  • Provide training data, effectively a long list of pairs of strings and our interpretation of the "correct" similarity values
  • Store the trained neural network weights somewhere
  • Load the trained data when we want to compare strings

So rather than having "an AI" do the hard work for us, we would have to tell it that "probe" and "satellite" are similar but "future" and "SPH" are not, etc., and then micromanage its memory for it and fiddle with it until its comparisons looked acceptable. At that point we might be better off writing our own simpler ad hoc heuristic logic.

It would be nice if we could detect when the user clicks a similar mod link and use that to update the comparison of the mods, since in that case a human is confirming the similarity. I'm not sure how we would do that.

Batching the calculations

To get started, we would need to compare every mod with every other mod (O(N²) in the number of mods). Then as mods were created and edited and updated, we would have to re-compare the changed mod with all the other mods (O(N)). This probably isn't something we could run in the foreground on any page. Ideally we would add mods that need re-comparison to a queue and then have a background task perform the comparisons and update the db.
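The queueing idea could look roughly like the following. This is an in-memory sketch only: `RecompareQueue` is a hypothetical name, the `compare` callback stands in for whatever similarity function is eventually chosen, and a real deployment would need a persistent queue and a worker process outside the web request.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple


class RecompareQueue:
    """In-memory stand-in for a queue of mod ids awaiting re-comparison."""

    def __init__(self) -> None:
        self._pending = deque()
        self._queued = set()

    def add(self, mod_id: int) -> None:
        """Queue a mod once, even if it is edited several times."""
        if mod_id not in self._queued:
            self._queued.add(mod_id)
            self._pending.append(mod_id)

    def process(
        self,
        all_mod_ids: List[int],
        compare: Callable[[int, int], float],
    ) -> Dict[Tuple[int, int], float]:
        """Compare each queued mod with every other mod: O(N) per queued mod."""
        results = {}
        while self._pending:
            changed = self._pending.popleft()
            self._queued.discard(changed)
            for other in all_mod_ids:
                if other == changed:
                    continue
                similarity = compare(changed, other)
                # Both directions, matching the two-rows-per-pair model.
                results[(changed, other)] = similarity
                results[(other, changed)] = similarity
        return results
```

The initial O(N²) pass is then just the same loop with every mod added to the queue once.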

@HebaruSan HebaruSan added the Area: Backend, Priority: Low, Type: Feature, Area: Frontend, Area: Migration, and Area: Infrastructure labels Nov 6, 2021
@HebaruSan
Contributor Author

HebaruSan commented Dec 4, 2021

Specific idea for how to calculate the author comparison for author lists A and B:

image

This would have the desired properties that author lists with no intersection would be assigned a value of 0, and author lists that match completely would be assigned 1. Two 2-author mods with 1 in common would be assigned ⅓ (one in both divided by three total). If both of those mods add the same new author, it would become 0.5 (two in both divided by four total).
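The formula image is not reproduced here, but the values quoted above (0, 1, ⅓, 0.5) are all consistent with dividing the number of shared authors by the number of distinct authors across both lists. A sketch under that assumption, with `author_similarity` as a hypothetical helper name:

```python
def author_similarity(authors_a: set, authors_b: set) -> float:
    """Shared authors divided by total distinct authors (assumed formula)."""
    union = authors_a | authors_b
    if not union:
        return 0.0  # no authors at all; treat as no similarity
    return len(authors_a & authors_b) / len(union)


print(author_similarity({1, 2}, {3, 4}))        # 0.0, no intersection
print(author_similarity({1, 2}, {1, 2}))        # 1.0, complete match
print(author_similarity({1, 2}, {1, 3}))        # one in both / three total
print(author_similarity({1, 2, 4}, {1, 3, 4}))  # 0.5, two in both / four total
```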

This could also be adapted as a very simple algorithm to compare description strings, substituting words for authors and dropping a list of known meaning-free words like "the" and "a". This would not handle synonyms, but maybe mod authors use identical words often enough in practice for that to not matter. Might have to try it to find out.
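Adapting the same idea to descriptions might look like this. The stop-word list is a small illustrative sample rather than a vetted one, and splitting on runs of letters and digits is an assumption about how words would be extracted.

```python
import re

# Illustrative sample only; a real list would be longer.
STOP_WORDS = frozenset({"the", "a", "an", "and", "of", "to", "for", "in", "is"})


def word_set(text: str) -> set:
    """Lower-case the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {word for word in words if word not in STOP_WORDS}


def description_similarity(desc_a: str, desc_b: str) -> float:
    """Shared words divided by total distinct words."""
    words_a, words_b = word_set(desc_a), word_set(desc_b)
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


# {"probe", "mun"} vs {"satellite", "probe"}: one shared word of three
print(description_similarity("A probe for the Mun", "The satellite probe"))
```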

@HebaruSan
Contributor Author

HebaruSan commented Dec 4, 2021

Playing with this a bit, long descriptions tend to use a lot of words that don't convey meaning, and a lot of the "meaning" isn't describing the mod (e.g., installation instructions, maintainer history, etc.). Comparing a shorter description with a longer description looks close to hopeless due to the disparate number of "extra" words used to make similar points.

A variant would be to count matches as worth the word's length instead of 1, so longer words are worth more than shorter words, on the assumption that these are likely to be more meaningful. Unfortunately this seems to make the similarities of similar mods even lower (thanks to all those long non-matching words).

| Mod1 | Mod2 | Per-word weighting | Per-letter weighting |
|---|---|---|---|
| ReStock | ReStockPlus | 0.050 | 0.035 |
| NearFutureSolar | NearFutureElectrical | 0 | 0 |
| NearFuturePropulsion | CryoEngines | 0.123 | 0.091 |
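The two weightings in the table can be sketched as one function with a switch. This assumes the per-letter variant weights both the matching words and the total by word length; `weighted_overlap` is a hypothetical name, and the prototype that produced the numbers above may differ in its details.

```python
def weighted_overlap(words_a: set, words_b: set, per_letter: bool = False) -> float:
    """Weight of shared words divided by weight of all distinct words."""

    def weight(word: str) -> int:
        return len(word) if per_letter else 1

    total = sum(weight(word) for word in words_a | words_b)
    if total == 0:
        return 0.0
    return sum(weight(word) for word in words_a & words_b) / total


a = {"solar", "panel", "lightweight"}
b = {"solar", "panel", "radioactive"}
print(weighted_overlap(a, b))                   # 0.5    (2 of 4 words)
print(weighted_overlap(a, b, per_letter=True))  # 0.3125 (10 of 32 letters)
```

The example shows the effect described above: two long non-matching words pull the per-letter score below the per-word score.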

@HebaruSan
Contributor Author

Dialing back the ambition significantly, maybe we should settle for:

  • A "More by (Author)" section at the bottom of the mod page
  • Detecting when mods mention other mods (compare names to descriptions?)
  • Tokenizing words based on capitalization so "ModuleManager" matches "module manager"
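The last two ideas might be sketched together as below. The regular expression is one possible way to split on capitalization (it keeps runs of capitals such as "EVE" together), and both `tokenize` and `mentions` are hypothetical helpers, not existing code.

```python
import re

# A run of capitals not followed by a lower-case letter (an acronym),
# or an optionally capitalized lower-case word, or a run of digits.
_WORD_RE = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def tokenize(text: str) -> list:
    """Split text into lower-case words, breaking on capitalization."""
    return [match.lower() for match in _WORD_RE.findall(text)]


def mentions(mod_name: str, description: str) -> bool:
    """True if the tokenized mod name appears as a run of words in the description."""
    name = tokenize(mod_name)
    words = tokenize(description)
    size = len(name)
    if size == 0:
        return False
    return any(words[i:i + size] == name for i in range(len(words) - size + 1))


print(tokenize("ModuleManager"))   # ['module', 'manager']
print(tokenize("module manager"))  # ['module', 'manager']
print(mentions("ModuleManager", "Requires Module Manager to work"))  # True
```

Single-word mod names that are also common words would produce false positives with this approach, so it would likely need a minimum name length or similar guard.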

@HebaruSan
Contributor Author

HebaruSan commented Dec 4, 2021

The descriptions for Scatterer and EVE Redux have almost nothing in common. 😭

@HebaruSan
Contributor Author

My prototype is shaping up; this might work. A few more notes:

  • Mods for different games should have a similarity of 0
  • Mods with similarity of 0 shouldn't be stored in the table
  • Maybe for each mod we could store only the similarities of the 18 most similar mods in the table (3 rows of 6, 1 row visible by default)? The rest are pretty much useless and unlikely to ever be needed. This would reduce the number of new rows from 8,485,569 to 52,434. It would require having all of the comparisons for one mod in memory at one time so we could sort them, but my prototype essentially does that right now via the API, and it seems OK. Though it might be challenging to maintain that data model since it would require us to delete rows selectively.
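Keeping only the best matches per mod could be sketched with the standard-library heapq module. `rows_to_store` is a hypothetical helper, and the input is assumed to be a mapping from other mod id to similarity for one main mod, already computed in memory as the prototype does.

```python
import heapq

MAX_STORED = 18  # 3 rows of 6, as suggested above


def rows_to_store(main_mod_id: int, similarities: dict, limit: int = MAX_STORED) -> list:
    """Return (main_mod_id, other_mod_id, similarity) rows worth keeping.

    Zero similarities are dropped, then only the `limit` largest remain,
    ordered from most to least similar.
    """
    nonzero = ((other, sim) for other, sim in similarities.items() if sim > 0)
    best = heapq.nlargest(limit, nonzero, key=lambda pair: pair[1])
    return [(main_mod_id, other, sim) for other, sim in best]


print(rows_to_store(1, {2: 0.5, 3: 0.0, 4: 0.9, 5: 0.1}, limit=2))
# [(1, 4, 0.9), (1, 2, 0.5)]
print(MAX_STORED * 2913)  # 52434
```

One consequence of this pruning is that rows are no longer guaranteed to come in swapped pairs: mod B can be in mod A's top 18 without A being in B's, so each mod's rows would have to be replaced as a set when it is re-compared.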

@HebaruSan HebaruSan changed the title from "[Feature] Show related mods on mod page" to "[Feature] Show similar mods on mod page" Dec 16, 2021
@HebaruSan HebaruSan linked a pull request Dec 17, 2021 that will close this issue
@HebaruSan HebaruSan self-assigned this Dec 17, 2021