
Add PostgreSQL backend for performance analysis #26

Open
nichtich opened this issue Aug 30, 2022 · 9 comments
Labels
backend Requires work on the backend

Comments

@nichtich
Member

nichtich commented Aug 30, 2022

Sounds good. I still have to investigate the performance issues though. Currently on our server it's pretty slow, although it mostly works fine on my local machine. Might be related to drive speeds. If we use a hosted PostgreSQL database, we might get much better performance. In theory, it wouldn't be a big deal to support both SQLite and PostgreSQL, and we could offer a tiny setup script that differentiates between the two.

Originally posted by @stefandesu in #17 (comment)

@nichtich
Member Author

To add some consistency checks to the database, vocabularies should also be put into the backend:

-- technically a PPN is a number with a check digit, so this check could be refined later
CREATE DOMAIN ppn AS TEXT CHECK (VALUE ~* '^[0-9]+[0-9X]$');

CREATE TABLE IF NOT EXISTS Vocabulary (
  key text NOT NULL,
  jskos json NOT NULL DEFAULT '{}'::json,
  PRIMARY KEY (key),
  CONSTRAINT valid_key CHECK (key ~* '^[a-z]+$')
);

CREATE TABLE IF NOT EXISTS Subject (
  ppn ppn NOT NULL,
  voc text NOT NULL,
  notation text NOT NULL,
  FOREIGN KEY (voc) REFERENCES Vocabulary (key)
);
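
To illustrate how these checks catch inconsistent data, here is a minimal, hedged set of sample statements (the vocabulary key and the PPN values are made up for illustration):

-- Register a vocabulary first (jskos falls back to the empty default)
INSERT INTO Vocabulary (key) VALUES ('bk');

-- Accepted: the PPN matches the ppn domain and 'bk' exists in Vocabulary
INSERT INTO Subject (ppn, voc, notation) VALUES ('83905730X', 'bk', '18.10');

-- Rejected by the ppn domain check: 'A' is neither a digit nor a trailing X
INSERT INTO Subject (ppn, voc, notation) VALUES ('8390573AB', 'bk', '18.10');

-- Rejected by the foreign key: 'rvk' has not been added to Vocabulary
INSERT INTO Subject (ppn, voc, notation) VALUES ('83905730X', 'rvk', 'NQ 2350');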

@stefandesu
Member

stefandesu commented Oct 20, 2022

  • The Vocabulary table is not yet added.
  • Batch import of all subjects takes a long time (about 48 minutes), even on my pretty fast machine. I think this could be optimized.
  • We should test it on our server and compare the performance between PostgreSQL and SQLite.

@stefandesu
Member

Batch import can be optimized by using COPY FROM instead of INSERT. It works well for me locally and reduces the raw import time by about an order of magnitude, I would say. However, recreating the indexes still takes a while with multiple tens of millions of records.

There are still some issues to be sorted out with this though, and I think we can do something similar for SQLite as well (although SQLite is, for some reason, much faster with inserting data).
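
On the PostgreSQL side, the bulk load roughly looks like the sketch below (the file path and index name are placeholders, not the actual ones used in occurrences-api):

-- Drop secondary indexes first so they don't slow down the bulk load
DROP INDEX IF EXISTS subject_notation_idx;

-- Feed the TSV dump directly to Postgres; COPY ... FROM reads a file on the
-- server, while \copy in psql (or COPY ... FROM STDIN over a client stream)
-- would send it from the client instead
COPY Subject (ppn, voc, notation) FROM '/path/to/subjects.tsv' WITH (FORMAT text);

-- Recreating the index afterwards is the part that still takes a while
-- with tens of millions of rows
CREATE INDEX subject_notation_idx ON Subject (voc, notation);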

@stefandesu
Member

I think I'll be able to push my changes with batch import soon. Then we can compare the performance of the two backends on our server. For some reason, on my local machine, SQLite is significantly faster than PostgreSQL, although in the past I had the opposite experience. I think much depends on disk performance, which is fairly bad on our server...

@stefandesu
Member

Changes are now pushed to Dev. I'll do the performance comparison soon.

@stefandesu
Member

Okay, while I didn't do any scientific performance tests, the results are fairly clear (all tests were performed on our server with the current Dev version of occurrences-api):

  • Importing all K10plus subjects takes about 1.5x as long with SQLite compared to PostgreSQL (11:26 vs 7:41).
    • Note that we're using proper batch importing in Postgres (i.e. giving Postgres the TSV file directly) which speeds up the import. This is possible with SQLite as well, but I haven't found an elegant solution in Node.js.
  • Requesting co-occurrences for BK 18.10 (which is one of the most-used notations in the catalog) is about 2x faster with SQLite compared to PostgreSQL (0:30-1:00 vs 1:30-2:00). Also, subsequent queries using the same parameters seemed to be cached by SQLite, but not by PostgreSQL (with SQLite, repeating the query takes about 3 seconds); a sketch of this kind of query is shown after this list.
  • Requesting co-occurrences for RVK NQ 2350 is also about 2x faster with SQLite compared to PostgreSQL (4 seconds vs 8 seconds). Subsequent queries using the same parameters seemed to be cached by both, resulting in query times less than 0.1 seconds.
    • My suspicion is that with BK 18.10, the resulting set is so big that it is above some cache limit for PostgreSQL and thus not kept in cache. I have not looked into caching with these database systems though.
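
For context, a co-occurrence request against the Subject table sketched earlier essentially boils down to a self-join like the one below (only an illustration of the query shape; the actual query in occurrences-api may differ):

-- All subjects assigned to the same records as BK 18.10, counted per concept
SELECT b.voc, b.notation, COUNT(*) AS freq
FROM Subject a
JOIN Subject b ON a.ppn = b.ppn
WHERE a.voc = 'bk' AND a.notation = '18.10'
  AND (b.voc, b.notation) <> (a.voc, a.notation)
GROUP BY b.voc, b.notation
ORDER BY freq DESC
LIMIT 10;

With a heavily used notation like BK 18.10 the intermediate join result is large, which would fit the suspicion above about exceeding a cache limit.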

Overall, SQLite seems to be about 2x faster for the usual queries and seems to have a higher cache limit. This is not what I expected, to be honest, especially since our dataset has over 80 million rows. It seems like staying with SQLite is the better choice in our case, even though I expect things could be optimized further for PostgreSQL.

Also, in both cases performance is severely limited by our server's slow disks. My laptop (which has a fast NVMe SSD) is about 4-5x faster.

@nichtich
Member Author

nichtich commented Mar 3, 2023

b5c3a41 added a backend method subjects that is not yet implemented in the PostgreSQL backend.
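
Assuming subjects is a plain per-record lookup, it would roughly correspond to a query like the following against the schema sketched above (the PPN value is a placeholder and the actual method signature may differ):

-- All subject entries assigned to one record, identified by its PPN
SELECT voc, notation
FROM Subject
WHERE ppn = '83905730X';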

@stefandesu
Member

The question is whether we really want to support both SQLite and PostgreSQL in the long run. In theory there isn't much left to add, so it might not be much work, but if we decide to stay with SQLite anyway, it might be better to remove PostgreSQL support again. What do you think?

@nichtich
Member Author

We may drop PostgreSQL later, but SPARQL and in particular SRU are needed, so we will have multiple backends with different capabilities anyway. Let's keep PostgreSQL support as experimental.

@nichtich added the backend label on Jul 18, 2023