Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add a small example #302

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

add a small example #302

wants to merge 2 commits into from

Conversation

intstellar
Copy link

  1. add an indicator(bUpdate) at addPoint function, so it will call updatePoint only when we need.
  2. redefine CompareByFirst, let the smaller at the top, then we needn't use minus distance at std::priority_queue.
  3. revise VisitedList.
  4. use get_linklist_at_level to keep the code more tidy.
  5. add a simple example, it is tested under Linux.
    first, copy train-labels.idx1-ubyte, train-images.idx3-ubyte, t10k-labels.idx1-ubyte, t10k-images.idx3-ubyte to the same path of the executable file.
    then, run ./mnist

@intstellar intstellar closed this Apr 1, 2021
@intstellar intstellar reopened this Apr 1, 2021
Copy link
Member

@yurymalkov yurymalkov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR! I would like to highlight a few things

  1. I suspect the code will run significantly slower due to default comparator (see the comment).
  2. I do not understand the function/logic of bUpdate. It seems like it breaks the assumption we can get the element by its label.


tableint cur_c = 0;
{
// Checking if the element with the same label already exists
// if so, updating it *instead* of creating a new element.
std::unique_lock <std::mutex> templock_curr(cur_element_count_guard_);
auto search = label_lookup_.find(label);
if (search != label_lookup_.end()) {
if (bUpdate && search != label_lookup_.end()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not understand the bUpdate logic here. It add a new element with the same label? What happens to the first element? How label_lookup_ should operate?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From sfit_1b.cpp example, it seems that you sometimes use 'label' as the unique ID of elements (when Building index), sometimes use 'label' as the class of elements (see the usage of the return value of searchKnn). I think 'lable' should mean class. For example, in MNIST its value is one of (0,1,2,3,4,5,6,7,8,9). Then, the reason of use 'bUpdate' is obviously. We don't want to update an existing class every time we add a new point. We only let this happen when we need (e.g at online training). At that time, the data of the closest node will be replaced by the new coming data.
If this is not your original intention, omit this suggestion.

Copy link
Author

@intstellar intstellar Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if the meaning of label is not class, then the question is where and how to find the classes of the elements?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the class labels should be stored by the user (in a dict or an array)
From a design point of view, there should be a mechanism for the library user to extract the vectors back/delete/update by using a key. There are no other keys defined in the library so we ended up use labels as keys. To change that there should be a good alternative to support those operations.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand, it is hard to add an 'ID' at data_level0_memory_?

Copy link
Author

@intstellar intstellar Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the labels are stored in a dict or an array, it still need spend time to retrieve them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is possible, but I think it will break some stuff like compatibility with old saved indices as this label also needs to be stored somewhere.
It also would add complexity to the interface, there would be both id and and class in knn.
While labels for classifications can be done from the user side, both in C++ and python via simple lookups.

@@ -73,7 +73,7 @@ namespace hnswlib {
struct CompareByFirst {
constexpr bool operator()(std::pair<dist_t, tableint> const &a,
std::pair<dist_t, tableint> const &b) const noexcept {
return a.first < b.first;
return a.first > b.first; //let the smaller at the top
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That change seems to change the logic significantly...
As far as I understand the main goal is to omit the CompareByFirst or whatever in the template, but AFAIK that severely affect the performance of the queue operation (I think the default comparator is a bit more complicated, so slower).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see the "std::priority_queue" reference
"
template <class T, class Container = vector, class Compare = less > class priority_queue;

Template parameters

...

Compare
A binary predicate that takes two elements (of type T) as arguments and returns a bool.
The expression comp(a,b), where comp is an object of this type and a and b are elements in the container, shall return true if a is considered to go before b in the strict weak ordering the function defines.
The priority_queue uses this function to maintain the elements sorted in a way that preserves heap properties (i.e., that the element popped is the last according to this strict weak ordering).
This can be a function pointer or a function object, and defaults to less, which returns the same as applying the less-than operator (a<b).
"

I change "return a.first < b.first;" (this should be the default) to "return a.first > b.first;" to sort the smaller distance at the top, so we need not use minus distance. This will not "severely affect the performance".
By the way, at hnswalg.h you define many variables of "std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst>". If this will affect the performance, reduce the usage of such variables will only improve the performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changing the sign does not affect the performance, usage of the default comparator does (I can mark the code that does that). Slowness is an empirical fact that I've observed (I probably more lost more than few weeks before I realized that).

searchBaseLayer(tableint ep_id, const void *data_point, int layer) {
VisitedList *vl = visited_list_pool_->getFreeVisitedList();
vl_type *visited_array = vl->mass;
vl_type visited_array_tag = vl->curV;

std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst> top_candidates;
std::priority_queue<std::pair<dist_t, tableint>> top_candidates;
Copy link
Member

@yurymalkov yurymalkov Apr 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

E.g. here removing CompareByFirst made the code significantly slower.

@intstellar
Copy link
Author

intstellar commented Apr 6, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants