Dataset dtype chaos in nemo-analytics #925

Open
3 tasks
jorvis opened this issue Oct 24, 2024 · 2 comments

jorvis commented Oct 24, 2024

The dataset types in the gEAR portal are nice and tidy:

mysql> select dtype, count(dtype) from dataset group by dtype;
+-----------------------+--------------+
| dtype                 | count(dtype) |
+-----------------------+--------------+
| single-cell-rnaseq    |         1165 |
| bulk-rnaseq           |          325 |
| bargraph-standard     |          138 |
| epiviz                |          125 |
| microarray            |           72 |
| svg-expression        |           25 |
| atac-seq              |            2 |
| violin-standard       |           27 |
| linegraph-standard    |           17 |
| image-static-standard |            4 |
| image-static          |            2 |
+-----------------------+--------------+

But in NeMO Analytics, chaos reigns:

mysql> select dtype, count(dtype) from dataset group by dtype;
+------------------------------------------------+--------------+
| dtype                                          | count(dtype) |
+------------------------------------------------+--------------+
| snRNA-seq                                      |          269 |
| microarray                                     |           71 |
| single-cell-rnaseq                             |          683 |
| scRNA-seq                                      |         1474 |
| scRNAseq                                       |           89 |
| bulk-rnaseq                                    |          215 |
| Visium                                         |           15 |
| MERFISH                                        |           76 |
| Stereo-Seq                                     |           98 |
| Sci-Space                                      |           14 |
| epiviz                                         |           53 |
| RNA-seq                                        |           17 |
| SMARTseq                                       |            5 |
| single cell rna-seq                            |           22 |
| Bulk RNAseq                                    |           22 |
| Bulk polyA RNAseq                              |            2 |
| snRNAseq                                       |            1 |
| slide-seq                                      |           11 |
| sc-rna-seq                                     |            8 |
| scRNA-seq (GRCm38/mm10)                        |            5 |
| scMultiome                                     |            3 |
| ribo-seq                                       |            1 |
| scRNA-seq  (GRCm38/mm10)                       |            2 |
| bulk-RNA                                       |            1 |
| scATAC-seq                                     |            3 |
| snsRNA-seq                                     |            3 |
| ISH                                            |            5 |
| single-cell rna seq                            |            4 |
| Bulk RiboZero RNAseq                           |            2 |
| Illumina HumanHT12v4 microarray                |            2 |
| Microarray expression data                     |            2 |
| chip-seq                                       |            2 |
| proteomics                                     |            4 |
| microarry                                      |            1 |
| bulk-rnaseq, single-cell-rnaseq, microarray    |            1 |
| Laser Capture Microdissection RNAseq           |            2 |
| sorted cell bulk-rnaseq                        |            1 |
| sci-RNA-seq                                    |            4 |
| split-Seq                                      |            2 |
| DNA methylation data from Illumina microarrays |            1 |
| Affymetrix                                     |            1 |
+------------------------------------------------+--------------+

Places all over our API expect one of a small set of dtypes, and datasets whose dtype doesn't match one of them don't appear (in the Dataset Explorer, for example). I think this is reasonable.

Three things need to happen here:

  • Verify the new uploader limits dtypes to an expected list
  • Correct these existing entries where possible
  • Update the utility scripts to also ensure dtypes aren't just accepted as free-form strings
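
As a rough sketch of how the second and third tasks might look in Python: the allowed list below is copied from the gEAR query output above, but the alias table, `normalize_dtype`, and `require_valid_dtype` are hypothetical names and mappings made up for illustration, not existing gEAR code. Which free-form values map to which canonical dtype (for example, whether `snRNA-seq` should become `single-cell-rnaseq`) is a curation decision, so anything not explicitly listed is returned as unknown for manual review.

```python
import re

# Canonical dtypes, copied from the gEAR portal query output in this issue.
ALLOWED_DTYPES = frozenset({
    "single-cell-rnaseq", "bulk-rnaseq", "bargraph-standard", "epiviz",
    "microarray", "svg-expression", "atac-seq", "violin-standard",
    "linegraph-standard", "image-static-standard", "image-static",
})

# ASSUMPTION: illustrative aliases only; the real mapping needs curator review.
# Keys are "collapsed" forms (lowercase, letters and digits only).
ALIASES = {
    "scrnaseq": "single-cell-rnaseq",
    "microarry": "microarray",  # misspelling seen in the NeMO Analytics table
}


def _collapse(raw):
    """Drop parentheticals, lowercase, and keep only letters and digits."""
    without_parens = re.sub(r"\(.*?\)", "", raw)  # e.g. "(GRCm38/mm10)"
    return re.sub(r"[^a-z0-9]", "", without_parens.lower())


# Collapsed canonical names plus the aliases, so "Bulk RNAseq" and
# "single cell rna-seq" resolve without needing their own alias entries.
_LOOKUP = {_collapse(d): d for d in ALLOWED_DTYPES}
_LOOKUP.update(ALIASES)


def normalize_dtype(raw):
    """Return the canonical dtype for a free-form string, or None if unknown."""
    return _LOOKUP.get(_collapse(raw))


def require_valid_dtype(raw):
    """Normalize raw, raising ValueError instead of storing an unknown dtype."""
    dtype = normalize_dtype(raw)
    if dtype is None:
        raise ValueError(f"Unrecognized dataset dtype: {raw!r}")
    return dtype
```

An uploader or utility script could call `require_valid_dtype()` before writing to the `dataset` table, while a one-off cleanup script could run `normalize_dtype()` over the existing rows and report the ones that come back as `None` (such as `Visium` or the comma-separated entry) for manual handling.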
jorvis added the bug label Oct 24, 2024
jorvis self-assigned this Oct 24, 2024

adkinsrs commented Oct 24, 2024

Following this thread, as it also relates to the NeMO Archive import system. I was chatting with Kemi about this, since I will have to convert their large list of dataset "tissue types" from the NeMO Archive assets into our limited set. I also requested that the NeMO Archive assets API provide an endpoint that returns all available tissue types (along with other controlled vocabularies).

@adkinsrs

def tissue_type_to_dataset_type(tissue_type):
