add a small example #302

intstellar · 2021-04-01T04:13:44Z

add an indicator(bUpdate) at addPoint function, so it will call updatePoint only when we need.
redefine CompareByFirst, let the smaller at the top, then we needn't use minus distance at std::priority_queue.
revise VisitedList.
use get_linklist_at_level to keep the code more tidy.
add a simple example, it is tested under Linux.
first, copy train-labels.idx1-ubyte, train-images.idx3-ubyte, t10k-labels.idx1-ubyte, t10k-images.idx3-ubyte to the same path of the executable file.
then, run ./mnist

yurymalkov

Thank you for the PR! I would like to highlight a few things

I suspect the code will run significantly slower due to default comparator (see the comment).
I do not understand the function/logic of bUpdate. It seems like it breaks the assumption we can get the element by its label.

yurymalkov · 2021-04-05T02:45:25Z

hnswlib/hnswalg.h


            tableint cur_c = 0;
            {
                // Checking if the element with the same label already exists
                // if so, updating it *instead* of creating a new element.
                std::unique_lock <std::mutex> templock_curr(cur_element_count_guard_);
                auto search = label_lookup_.find(label);
-                if (search != label_lookup_.end()) {
+                if (bUpdate && search != label_lookup_.end()) {


I do not understand the bUpdate logic here. It add a new element with the same label? What happens to the first element? How label_lookup_ should operate?

From sfit_1b.cpp example, it seems that you sometimes use 'label' as the unique ID of elements (when Building index), sometimes use 'label' as the class of elements (see the usage of the return value of searchKnn). I think 'lable' should mean class. For example, in MNIST its value is one of (0,1,2,3,4,5,6,7,8,9). Then, the reason of use 'bUpdate' is obviously. We don't want to update an existing class every time we add a new point. We only let this happen when we need (e.g at online training). At that time, the data of the closest node will be replaced by the new coming data.
If this is not your original intention, omit this suggestion.

But if the meaning of label is not class, then the question is where and how to find the classes of the elements?

I think the class labels should be stored by the user (in a dict or an array)
From a design point of view, there should be a mechanism for the library user to extract the vectors back/delete/update by using a key. There are no other keys defined in the library so we ended up use labels as keys. To change that there should be a good alternative to support those operations.

I don't understand, it is hard to add an 'ID' at data_level0_memory_?

If the labels are stored in a dict or an array, it still need spend time to retrieve them.

It is possible, but I think it will break some stuff like compatibility with old saved indices as this label also needs to be stored somewhere.
It also would add complexity to the interface, there would be both id and and class in knn.
While labels for classifications can be done from the user side, both in C++ and python via simple lookups.

yurymalkov · 2021-04-05T02:48:13Z

hnswlib/hnswalg.h

@@ -73,7 +73,7 @@ namespace hnswlib {
        struct CompareByFirst {
            constexpr bool operator()(std::pair<dist_t, tableint> const &a,
                                      std::pair<dist_t, tableint> const &b) const noexcept {
-                return a.first < b.first;
+                return a.first > b.first; //let the smaller at the top


That change seems to change the logic significantly...
As far as I understand the main goal is to omit the CompareByFirst or whatever in the template, but AFAIK that severely affect the performance of the queue operation (I think the default comparator is a bit more complicated, so slower).

see the "std::priority_queue" reference
"
template <class T, class Container = vector, class Compare = less > class priority_queue;

Template parameters

...

Compare
A binary predicate that takes two elements (of type T) as arguments and returns a bool.
The expression comp(a,b), where comp is an object of this type and a and b are elements in the container, shall return true if a is considered to go before b in the strict weak ordering the function defines.
The priority_queue uses this function to maintain the elements sorted in a way that preserves heap properties (i.e., that the element popped is the last according to this strict weak ordering).
This can be a function pointer or a function object, and defaults to less, which returns the same as applying the less-than operator (a<b).
"

I change "return a.first < b.first;" (this should be the default) to "return a.first > b.first;" to sort the smaller distance at the top, so we need not use minus distance. This will not "severely affect the performance".
By the way, at hnswalg.h you define many variables of "std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst>". If this will affect the performance, reduce the usage of such variables will only improve the performance.

Changing the sign does not affect the performance, usage of the default comparator does (I can mark the code that does that). Slowness is an empirical fact that I've observed (I probably more lost more than few weeks before I realized that).

yurymalkov · 2021-04-06T03:11:07Z

hnswlib/hnswalg.h

        searchBaseLayer(tableint ep_id, const void *data_point, int layer) {
            VisitedList *vl = visited_list_pool_->getFreeVisitedList();
            vl_type *visited_array = vl->mass;
            vl_type visited_array_tag = vl->curV;

-            std::priority_queue<std::pair<dist_t, tableint>, std::vector<std::pair<dist_t, tableint>>, CompareByFirst> top_candidates;
+            std::priority_queue<std::pair<dist_t, tableint>> top_candidates;


E.g. here removing CompareByFirst made the code significantly slower.

intstellar · 2021-04-06T03:28:37Z

Thank you for let me know that. May be we can implement a "priority_deque" so we need not copy data frequently.06.04.2021, 11:10, "Yury Malkov" ***@***.***>: @yurymalkov commented on this pull request. In hnswlib/hnswalg.h:

@@ -73,7 +73,7 @@ namespace hnswlib {

struct CompareByFirst { constexpr bool operator()(std::pair<dist_t, tableint> const &a, std::pair<dist_t, tableint> const &b) const noexcept { - return a.first < b.first; + return a.first > b.first; //let the smaller at the top Changing the sign does not affect the performance, usage of the default comparator does (I can mark the code that does that). Slowness is an empirical fact that I've observed (I probably more lost more than few weeks before I realized that). —You are receiving this because you modified the open/close state.Reply to this email directly, view it on GitHub, or unsubscribe.

intstellar and others added 2 commits March 16, 2021 15:59

see changelog0316

de883ea

Delete settings.json

72b6e8f

intstellar closed this Apr 1, 2021

intstellar reopened this Apr 1, 2021

yurymalkov requested changes Apr 5, 2021

View reviewed changes

yurymalkov reviewed Apr 6, 2021

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

add a small example #302

add a small example #302

intstellar commented Apr 1, 2021

yurymalkov left a comment

yurymalkov Apr 5, 2021

intstellar Apr 6, 2021

intstellar Apr 6, 2021 •

edited

Loading

yurymalkov Apr 6, 2021

intstellar Apr 6, 2021

intstellar Apr 6, 2021 •

edited

Loading

yurymalkov Apr 6, 2021

yurymalkov Apr 5, 2021

intstellar Apr 6, 2021

yurymalkov Apr 6, 2021

yurymalkov Apr 6, 2021 •

edited

Loading

intstellar commented Apr 6, 2021 via email

add a small example #302

Are you sure you want to change the base?

add a small example #302

Conversation

intstellar commented Apr 1, 2021

yurymalkov left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

intstellar Apr 6, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

intstellar Apr 6, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

yurymalkov Apr 6, 2021 • edited Loading

Choose a reason for hiding this comment

intstellar commented Apr 6, 2021 via email

intstellar Apr 6, 2021 •

edited

Loading

intstellar Apr 6, 2021 •

edited

Loading

yurymalkov Apr 6, 2021 •

edited

Loading