Dealing with overrepresented concepts / blacklisting #735
Thank you for the suggestion. Indeed this seems like a recurring problem, so a generic mechanism could be useful. This was one of the ideas discussed in issue #538, especially in #538 (comment). But there were maybe too many ideas thrown around, and so far nothing has been implemented. So let's keep this issue focused only on the problem of overrepresented concepts and on the possible solution of blocking problematic concepts, since it seems that both ZBW and ZPID have already decided to use such a mechanism implemented outside Annif. I think this configuration example from #538 (comment) is still valid:
and the meaning of this would be that the two concepts (USA and Theory) listed in
As noted in #538, it would make sense to avoid the term "blacklisting" due to its connotations. I think "exclude", "block" or "deny" are all valid alternatives.
I've thought about the best way to implement something like this in Annif code. I think this should be a general mechanism and ideally no changes to individual backend implementations should be necessary. This means that the setting should be handled on the level of AnnifProject. One possibility is that SubjectIndex would be made aware of the blocked/excluded concepts, similar to how it already handles deprecated concepts. For the configuration, this could be implemented as an extra option to the
When there's no need to set the language, this could work as well:
One minor syntax consideration here is that commas are already used to separate different parameters, so it's not possible to use commas as a separator between concept URIs. Above I've used spaces instead, but other symbols such as |
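To make the separator question concrete, here is a minimal sketch of parsing such a setting. It assumes a hypothetical parameter string in which commas separate `key=value` parameters and one key holds several space-separated concept URIs. The function name `parse_params`, the key names `language` and `exclude`, and the `example.org` URIs are all made up for illustration; this is not Annif's actual configuration syntax.

```python
def parse_params(spec: str) -> dict[str, list[str]]:
    """Parse 'key=value,key=value' where a value may contain several
    space-separated items. Hypothetical syntax, not Annif's own."""
    params = {}
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray or trailing commas
        key, _, value = part.partition("=")
        # Splitting on whitespace yields one entry per concept URI.
        params[key.strip()] = value.split()
    return params


# Hypothetical example: a language plus two placeholder URIs to exclude.
spec = "language=en,exclude=http://example.org/c1 http://example.org/c2"
params = parse_params(spec)
print(params["exclude"])
```

Spaces are a safe separator here because an unencoded space cannot appear inside a URI, whereas a comma can, so a URI containing a comma would be split incorrectly by the outer `split(",")`.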
Just throwing in the idea: could the denylisting (also) be "dynamic", in the sense that the suggest request could include a parameter listing the concepts that are not wanted at that particular time? I think some users of the Annif API could benefit from this. It could be useful for e.g. university repositories, as very many theses and dissertations get the "final projects (education)" concept as an unwanted and redundant suggestion. I assume they now exclude that concept in their own system(?) so that it is not shown to the student. Another use case would be to restrict the suggestions using the ontology hierarchy, e.g. to allow only physical objects or certain groups. There could even be a UI component where a user could select the allowed or denied concepts in the hierarchy tree. That would be cool, but maybe not so useful.
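A minimal sketch of how a per-request denylist might combine with a project-level one, applied after a backend has produced its suggestions. Everything here is hypothetical: the `Suggestion` tuple, the `filter_suggestions` function, its parameter names, and the `example.org` URIs are illustrative only and do not correspond to Annif's REST API or internal classes.

```python
from typing import Iterable, NamedTuple


class Suggestion(NamedTuple):
    uri: str
    score: float


def filter_suggestions(
    suggestions: Iterable[Suggestion],
    project_excluded: Iterable[str] = (),
    request_excluded: Iterable[str] = (),
    limit: int = 10,
) -> list[Suggestion]:
    """Drop blocked concepts, then keep the top `limit` by score."""
    # Union of the static (project) and dynamic (request) denylists.
    blocked = set(project_excluded) | set(request_excluded)
    kept = [s for s in suggestions if s.uri not in blocked]
    kept.sort(key=lambda s: s.score, reverse=True)
    return kept[:limit]


# Placeholder data standing in for raw backend output.
raw = [
    Suggestion("http://example.org/final-projects", 0.9),
    Suggestion("http://example.org/machine-learning", 0.7),
    Suggestion("http://example.org/usa", 0.6),
    Suggestion("http://example.org/indexing", 0.4),
]
result = filter_suggestions(
    raw,
    project_excluded=["http://example.org/usa"],
    request_excluded=["http://example.org/final-projects"],
    limit=2,
)
print([s.uri for s in result])
```

In this sketch the filter runs before the limit is applied, so a blocked concept does not take up one of the returned slots. Whether filtering should happen before or after the limit is a design choice the thread leaves open.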
Several institutions have observed that some models / ensembles struggle with concepts that are overrepresented in the training data, so that those concepts are suggested way too often. One fix for that is to identify rules that limit the contexts in which those concepts can be suggested. Could we implement something in Annif that allows specifying those rules?
CC @schlawiner @Lakshmi-bashyam
related: #538 ; #596