Code for DEXA: Deep Encoders with Auxiliary Parameters for Extreme Classification [1]
+-- <work_dir>
| +-- programs
| | +-- dexa
| | | +-- dexa
| +-- data
| | +-- <dataset>
| +-- models
| +-- results
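The layout can be set up in advance; a minimal sketch in Python, assuming it is run from inside <work_dir> and using the example dataset name as a placeholder:

```python
from pathlib import Path

# Assumption: run from inside <work_dir>; the dataset name is a placeholder.
for sub in ["programs", "data/LF-AmazonTitles-131K", "models", "results"]:
    Path(sub).mkdir(parents=True, exist_ok=True)
```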
* Download the raw data (zipped file) from The Extreme Classification Repository [6].
* Extract the zipped file into the data directory.
* The following files should be available in <work_dir>/data/<dataset> (create empty filter files if unavailable; see the snippet after this list):
- trn.json.gz
- tst.json.gz
- lbl.json.gz
- filter_labels_text.txt
- filter_labels_train.txt
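If a dataset ships without the filter files, empty placeholders can be created; a minimal sketch (the dataset path is an assumption):

```python
from pathlib import Path

# Assumption: adjust the dataset directory to your setup.
data_dir = Path("data/LF-AmazonTitles-131K")
for name in ["filter_labels_text.txt", "filter_labels_train.txt"]:
    (data_dir / name).touch(exist_ok=True)
```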
Extract and tokenize the data as follows:
./prepare_data.sh LF-AmazonTitles-131K 32
The algorithm can be run as follows. A JSON file (e.g., config/DEXA/LF-AmazonTitles-131K.json) specifies the architecture and other arguments. Please refer to the full documentation below for more details.
./run_main.sh 0 DEXA LF-AmazonTitles-131K 0 108
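Before training, it can help to sanity-check the raw data; a minimal sketch, assuming the XC-repository JSON-lines layout (one JSON object per line; field names vary across datasets):

```python
import gzip
import json

# Assumption: trn.json.gz stores one JSON record per line.
with gzip.open("data/LF-AmazonTitles-131K/trn.json.gz", "rt") as f:
    record = json.loads(next(f))
print(sorted(record.keys()))  # inspect which text/label fields are present
```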
./prepare_data.sh <dataset> <seq-len>
* dataset
- Name of the dataset.
- Tokenizer expects the following files in <work_dir>/data/<dataset>
- trn.json.gz
- tst.json.gz
- lbl.json.gz
- It will dump the following six tokenized files (see the conceptual sketch after this usage section):
- trn_doc_input_ids.npy
- trn_doc_attention_mask.npy
- tst_doc_input_ids.npy
- tst_doc_attention_mask.npy
- lbl_input_ids.npy
- lbl_attention_mask.npy
* seq-len
- Sequence length of the text to consider while tokenizing:
- 32 for title datasets
- 256 for Wikipedia datasets
- 128 for other full-text datasets
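Conceptually, the tokenization step amounts to the sketch below; the actual script may differ, and both the tokenizer checkpoint ('distilbert-base-uncased') and the "title" field name are assumptions:

```python
import gzip
import json

import numpy as np
from transformers import AutoTokenizer

def tokenize_split(json_gz, out_prefix, seq_len):
    # Assumption: one JSON object per line with a "title" text field.
    with gzip.open(json_gz, "rt") as f:
        texts = [json.loads(line)["title"] for line in f]
    tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")  # assumed checkpoint
    enc = tokenizer(texts, max_length=seq_len, padding="max_length",
                    truncation=True, return_tensors="np")
    np.save(f"{out_prefix}_input_ids.npy", enc["input_ids"])
    np.save(f"{out_prefix}_attention_mask.npy", enc["attention_mask"])

# Example: a titles dataset tokenized with seq-len 32
tokenize_split("data/LF-AmazonTitles-131K/trn.json.gz", "trn_doc", 32)
```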
./run_main.sh <gpu_id> <type> <dataset> <version> <seed>
* gpu_id: Run the program on this GPU.
* type
DEXA builds upon NGAME [2], SiameseXML [3] and DeepXML [4] for training. An encoder is trained in Module I (M-I) and the classifier is trained in Module IV (M-IV). A conceptual sketch of DEXA's auxiliary parameters appears after this list.
- DEXA: The intermediate representation is not fine-tuned while training the classifier (more scalable; suitable for large datasets).
* dataset
- Name of the dataset.
- DEXA expects the following files in <work_dir>/data/<dataset>
- trn_doc_input_ids.npy
- trn_doc_attention_mask.npy
- trn_X_Y.txt
- tst_doc_input_ids.npy
- tst_doc_attention_mask.npy
- tst_X_Y.txt
- lbl_input_ids.npy
- lbl_attention_mask.npy
- filter_labels_test.txt (put an empty file, or set it to null in the config, when unavailable)
* version
- Different runs can be managed via the version and seed arguments.
- Models and results are stored under this version.
* seed
- Seed value used by NumPy and PyTorch.
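For intuition, the auxiliary parameters at the heart of DEXA [1] augment encoder embeddings with learned vectors shared across clusters of labels; a conceptual PyTorch sketch, not the repository's implementation (all names are illustrative):

```python
import torch
from torch import nn

class AuxiliaryLabelEmbedding(nn.Module):
    """Conceptual sketch: a label's representation is its encoder embedding
    plus a learned auxiliary vector shared by all labels in its cluster."""

    def __init__(self, num_clusters: int, dim: int):
        super().__init__()
        self.aux = nn.Embedding(num_clusters, dim)

    def forward(self, lbl_emb: torch.Tensor, cluster_ids: torch.Tensor) -> torch.Tensor:
        # lbl_emb: (L, dim) encoder outputs; cluster_ids: (L,) cluster index per label
        return lbl_emb + self.aux(cluster_ids)
```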
@InProceedings{Dahiya23b,
author = "Dahiya, K. and Yadav, S. and Sondhi, S. and Saini, D. and Mehta, S. and Jiao, J. and Agarwal, S. and Kar, P. and Varma, M.",
title = "Deep encoders with auxiliary parameters for extreme classification",
booktitle = "KDD",
month = "August",
year = "2023"
}
- NGAME: Negative mining-aware mini-batching for extreme classification
- SiameseXML: Siamese networks meet extreme classifiers with 100M labels
- DeepXML: A Deep Extreme Multi-Label Learning Framework Applied to Short Text Documents
- DECAF: Deep Extreme Classification with Label Features
- ECLARE: Extreme Classification with Label Graph Correlations
- GalaXC: Graph Neural Networks with Labelwise Attention for Extreme Classification
[1] K. Dahiya, S. Yadav, S. Sondhi, D. Saini, S. Mehta, J. Jiao, S. Agarwal, P. Kar and M. Varma. Deep encoders with auxiliary parameters for extreme classification. In KDD, Long Beach (CA), August 2023.
[2] K. Dahiya, N. Gupta, D. Saini, A. Soni, Y. Wang, K. Dave, J. Jiao, K. Gururaj, P. Dey, A. Singh, D. Hada, V. Jain, B. Paliwal, A. Mittal, S. Mehta, R. Ramjee, S. Agarwal, P. Kar and M. Varma. NGAME: Negative mining-aware mini-batching for extreme classification. In WSDM, Singapore, March 2023.
[3] K. Dahiya, A. Agarwal, D. Saini, K. Gururaj, J. Jiao, A. Singh, S. Agarwal, P. Kar and M. Varma. SiameseXML: Siamese networks meet extreme classifiers with 100M labels. In ICML, July 2021.
[4] K. Dahiya, D. Saini, A. Mittal, A. Shaw, K. Dave, A. Soni, H. Jain, S. Agarwal and M. Varma. DeepXML: A deep extreme multi-label learning framework applied to short text documents. In WSDM, March 2021.
[5] pyxclib: https://github.com/kunaldahiya/pyxclib
[6] The Extreme Classification Repository: http://manikvarma.org/downloads/XC/XMLRepository.html