override original size in prepare for msturing-10M-clustered
harsha-simhadri committed Jul 12, 2023
1 parent 644409b commit d29ff1a
Showing 2 changed files with 8 additions and 2 deletions.
5 changes: 4 additions & 1 deletion benchmark/datasets.py
@@ -432,6 +432,9 @@ def __init__(self):

     def distance(self):
         return "euclidean"

+    def prepare(self, skip_data=False, original_size=10 ** 9):
+        return super().prepare(skip_data, original_size = self.nb)

 class MSSPACEV1B(DatasetCompetitionFormat):
     def __init__(self, nb_M=1000):
@@ -491,7 +494,7 @@ def default_count(self):
         return 10

     def prepare(self, skip_data=False, original_size=10 ** 9):
-        return super().prepare(skip_data, self.nb)
+        return super().prepare(skip_data, original_size = self.nb)

 class RandomRangeDS(DatasetCompetitionFormat):
     def __init__(self, nb, nq, d):
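
Both `prepare` overrides in this file follow the same pattern: the subclass accepts the usual `original_size` argument but forwards its own `self.nb` to the parent instead. The sketch below illustrates that pattern in isolation. `BaseDataset` and `Clustered10M` are hypothetical stand-ins, not the classes in `benchmark/datasets.py`, and what the real base class does with `original_size` is not shown in this diff, so the stand-in simply returns its arguments.

```python
class BaseDataset:
    """Hypothetical stand-in for a base dataset class (assumption)."""
    nb = 10 ** 9

    def prepare(self, skip_data=False, original_size=10 ** 9):
        # Return the arguments so the effect of the override is easy to inspect.
        return {"skip_data": skip_data, "original_size": original_size}


class Clustered10M(BaseDataset):
    """Hypothetical dataset whose stored file already has nb points."""
    nb = 10 ** 7

    def prepare(self, skip_data=False, original_size=10 ** 9):
        # Ignore the caller's original_size and forward this dataset's own size.
        return super().prepare(skip_data, original_size=self.nb)


# Calling the base method directly shows what happens without the override.
without_override = BaseDataset.prepare(Clustered10M())
with_override = Clustered10M().prepare()
```

Passing `original_size` by keyword, as the commit does, does not change which parameter receives the value in this sketch, but it keeps the call correct if parameters are ever reordered or added.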
5 changes: 4 additions & 1 deletion neurips23/README.md
@@ -111,12 +111,15 @@ python run.py --neurips23track streaming --algorithm diskann --dataset msspacev-

For streaming track, download the ground truth (needs azcopy in your binary path):
```
-python benchmark/streaming/download_gt.py --runbook_file neurips23/streaming/simple_runbook.yaml --dataset msspacev-10M
+python benchmark/streaming/download_gt.py --runbook_file neurips23/streaming/simple_runbook.yaml --dataset msspacev-10M
+python benchmark/streaming/download_gt.py --runbook_file neurips23/streaming/clustered_runbook.yaml --dataset msturing-10M-clustered
```
Alternatively, to compute ground truth for an arbitrary runbook, [clone and build the DiskANN repo](https://github.com/Microsoft/DiskANN) and use its command-line tool to compute ground truth at various search checkpoints. The `--gt_cmdline_tool` option takes the path to the DiskANN command-line tool.
```
python benchmark/streaming/compute_gt.py --dataset msspacev-10M --runbook neurips23/streaming/simple_runbook.yaml --gt_cmdline_tool ~/DiskANN/build/apps/utils/compute_groundtruth
```
+For the streaming track, also consider the examples in the [clustered runbook](neurips23/streaming/clustered_runbook.yaml). The datasets here are [generated](neurips23/streaming/clustered_data_gen.py) by clustering the original dataset with k-means and packing points in the same cluster into contiguous indices. Insertions are then performed one cluster at a time. This runbook tests whether an indexing algorithm can adapt to data drift.
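
The clustered layout described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the contents of `clustered_data_gen.py`: it uses a plain Lloyd's k-means written with the standard library, random toy data, and hypothetical helper names (`kmeans_labels`, `pack_by_cluster`).

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Update step: move each non-empty cluster's center to its mean.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def pack_by_cluster(points, labels):
    """Reorder points so each cluster occupies a contiguous index range."""
    order = sorted(range(len(points)), key=lambda i: labels[i])  # stable sort
    return [points[i] for i in order], [labels[i] for i in order]


rng = random.Random(1)
data = [tuple(rng.gauss(0.0, 1.0) for _ in range(4)) for _ in range(200)]
labels = kmeans_labels(data, k=4)
packed, packed_labels = pack_by_cluster(data, labels)

# One insertion batch per cluster: each batch is a contiguous slice of `packed`.
batches = []
start = 0
for c in sorted(set(packed_labels)):
    end = start + packed_labels.count(c)
    batches.append((start, end))
    start = end
```

Inserting `packed[start:end]` for each pair in `batches`, in order, feeds the index one cluster at a time, which is the kind of distribution shift the runbook is described as testing.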


To make the results available for post-processing, change permissions of the results folder
