Create new script for adding oral argument (and later other forms of) data to Elastic Search #2677

mlissner · 2023-04-24T23:43:29Z

Currently we have cl_update_index, but I think django-elasticsearch might have something it supplies out of the box, or perhaps we'll need to make our own to get the performance we need. In any case, we'll need something like this so we can ingest the old data in the DB.

albertisfu · 2023-05-30T15:27:16Z

According to #2676, the main purpose of cl_update_index is to populate the index for the first time.

django-elasticsearch-dsl has built-in commands for populating an index with existing data.
Its --parallel option is based on elasticsearch-py's parallel_bulk, which uses multiprocessing.pool.ThreadPool. By default it uses 4 threads and sends chunks of 500 objects to ES at a time.

docker exec -it cl-django python /opt/courtlistener/manage.py search_index --rebuild --models audio.Audio --parallel
(Creates the index and populates it.)

docker exec -it cl-django python /opt/courtlistener/manage.py search_index --populate --models audio.Audio --parallel
(Index exists, only populates it.)
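
To make the threading and chunking behavior described above concrete, here is a minimal stdlib-only sketch of the pattern. It is not the elasticsearch-py implementation: `send_chunk` is a hypothetical stand-in for one bulk request, and `chunked` and `index_in_parallel` are names made up for illustration. Only the defaults (4 threads, chunks of 500) come from the description above.

```python
from itertools import islice
from multiprocessing.pool import ThreadPool


def chunked(iterable, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def send_chunk(chunk):
    """Hypothetical stand-in for one bulk request to ES.

    A real implementation would send the chunk over the network;
    here we only report how many documents the chunk contained.
    """
    return len(chunk)


def index_in_parallel(docs, thread_count=4, chunk_size=500):
    """Split `docs` into chunks and hand each chunk to a thread pool."""
    with ThreadPool(thread_count) as pool:
        # imap preserves chunk order and consumes the generator lazily.
        return list(pool.imap(send_chunk, chunked(docs, chunk_size)))


# 20,000 fake documents, matching the size of the test run in this comment.
results = index_in_parallel(range(20_000))
```

With these defaults, 20,000 objects are split into 40 bulk requests spread across 4 worker threads. Because these are threads and not processes, the work of building each document still shares a single Python interpreter.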

I ran some tests: adding 20,000 Audio objects to ES took ~35 minutes (using 1 shard).
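
For a rough sense of scale, the figures from that test can be turned into a throughput number. This is back-of-the-envelope arithmetic only: it assumes the rate stays constant as the data grows, which may not hold, and the 1,000,000-object total is a hypothetical figure chosen for illustration, not a count from the DB.

```python
# Figures reported from the test run (the duration is approximate).
docs_indexed = 20_000
elapsed_minutes = 35
chunk_size = 500  # default chunk size for the parallel option

docs_per_second = docs_indexed / (elapsed_minutes * 60)
bulk_requests = docs_indexed // chunk_size


def estimate_minutes(total_docs, rate=docs_per_second):
    """Estimate indexing time in minutes, assuming a constant rate."""
    return total_docs / rate / 60


# Hypothetical total, for illustration only.
hypothetical_total = 1_000_000
print(f"{docs_per_second:.1f} docs/s over {bulk_requests} bulk requests")
print(f"~{estimate_minutes(hypothetical_total) / 60:.0f} hours for {hypothetical_total:,} docs")
```

At that rate, anything in the millions of objects would take a day or more on a single shard, so it is worth watching how the numbers change in a larger run.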

However, a couple of issues related to parallel_bulk report performance problems and memory leaks when indexing millions of items:

elastic/elasticsearch-py#1101
django-es/django-elasticsearch-dsl#433 (This one mentions that the issue occurs when using MySQL.)

So we could try this command and see how it works for OA, or we could go straight to writing a new ES version of cl_update_index based on Celery.

What do you think?

mlissner · 2023-05-30T16:48:13Z

Yeah, let's give it a try and see how far we can take it. It'll be great if it's good enough and we can use it without needing to get Celery involved (!)

mlissner · 2023-10-26T18:19:34Z

I think it's safe to say this is done. As we need more data types, we'll add them.

mlissner added the search label Apr 24, 2023

mlissner closed this as completed Oct 26, 2023
