Official implementation for our paper "Software Entity Recognition with Noise-Robust Learning", ASE 2023.
The WikiSER corpus includes 1.7M sentences with named-entity labels extracted from 79k Wikipedia articles. Software-related named entities are labeled under 12 fine-grained categories:
| Type | Examples |
|---|---|
| Algorithm | Auction algorithm, Collaborative filtering |
| Application | Adobe Acrobat, Microsoft Excel |
| Architecture | Graphics processing unit, Wishbone |
| Data_Structure | Array, Hash table, XOR linked list |
| Device | Samsung Gear S2, iPad, Intel T5300 |
| Error_Name | Buffer overflow, Memory leak |
| General_Concept | Memory management, Nouvelle AI |
| Language | C++, Java, Python, Rust |
| Library | Beautiful Soup, FastAPI |
| License | Cryptix General License, MIT License |
| Operating_System | Linux, Ubuntu, Red Hat OS, MorphOS |
| Protocol | TLS, FTPS, HTTP 404 |
WikiSER is organized by the Wikipedia article from which the data was scraped:
```
|-- Adobe_Flash.txt
|-- Linux.txt
|-- Java_(programming_language).txt
|-- ...
```
Each sentence is wrapped in `<s>...</s>` tags and tokenized with `stokenizer`.
Download the full dataset from Hugging Face or this folder.
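For quick inspection, here is a minimal sketch for reading one article file. It assumes only the `<s>...</s>` delimiters described above; the helper name is illustrative (not part of the repo), and the label encoding inside each sentence should be checked against the actual files:

```python
import re

def read_article(path):
    """Return the sentences of one WikiSER article file.

    Assumes sentences are wrapped in <s>...</s>, as described above.
    Illustrative sketch only; not part of the repo.
    """
    with open(path, encoding="utf-8") as f:
        text = f.read()
    # Extract the content between each <s> and </s> pair.
    return re.findall(r"<s>(.*?)</s>", text, flags=re.DOTALL)

sentences = read_article("Linux.txt")
print(len(sentences))
```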
The finetuned checkpoints are available on Hugging Face: wikiser-bert-base and wikiser-bert-large.
You can load the model with the standard Transformers API:
```python
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("taidng/wikiser-bert-base")
model = AutoModelForTokenClassification.from_pretrained("taidng/wikiser-bert-base")
```
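As a usage example, the snippet below runs token-level prediction with the loaded model. It is a minimal sketch: the input sentence is illustrative, and label names are read from the checkpoint's config:

```python
import torch

sentence = "FastAPI is a web framework written in Python."  # illustrative input
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Map each subword token to its highest-scoring entity label.
pred_ids = logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, pred in zip(tokens, pred_ids):
    print(token, model.config.id2label[pred.item()])
```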
We suggest using conda to set up your environment. To begin, create a new environment from `environment.yml`; it is named "ser" by default.
```bash
conda env create -f environment.yml
```
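Then activate the environment:

```bash
conda activate ser
```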
To start the training script with BERT and self-regularization:
```bash
python3 train_nll.py \
  --model_name_or_path=bert-base-cased \
  --alpha=10 \
  --n_model=2 \
  --dropout_prob=0.1 \
  --data_dir=data/wikiser-small \
  --epochs=25
```
- `--alpha`: positive multiplier weighting the agreement loss (see the sketch after this list)
- `--n_model`: number `k` of forward passes used for regularization
- `--data_dir`: one of `wikiser-small`, `sner`, or the relabeled `softner-9`
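To make `--alpha` and `--n_model` concrete, here is a simplified sketch of a self-regularization term that penalizes disagreement across `k` stochastic (dropout) forward passes. It illustrates the idea only; the exact loss used in training lives in `train_nll.py`:

```python
import torch
import torch.nn.functional as F

def agreement_loss(logits_list):
    """Penalize disagreement across k dropout forward passes.

    A common formulation: KL divergence of each pass from the mean
    distribution. Illustrative sketch, not the repo's exact loss.
    """
    probs = [F.softmax(l, dim=-1) for l in logits_list]
    mean = torch.stack(probs).mean(dim=0)
    return sum(
        F.kl_div(mean.log(), p, reduction="batchmean") for p in probs
    ) / len(probs)

# The total objective would then look like:
#   loss = nll + alpha * agreement_loss(logits_list)
# where logits_list holds n_model forward passes over the same batch.
```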
By default, training loss and evaluation statistics are logged to wandb.
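If you prefer to run without logging, wandb can be disabled through its standard environment variable:

```bash
WANDB_MODE=disabled python3 train_nll.py ...
```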
If you find our work helpful, please cite:
```bibtex
@inproceedings{nguyen2023software,
  title={Software Entity Recognition with Noise-Robust Learning},
  author={Nguyen, Tai and Di, Yifeng and Lee, Joohan and Chen, Muhao and Zhang, Tianyi},
  booktitle={Proceedings of the 38th IEEE/ACM International Conference on Automated Software Engineering (ASE'23)},
  year={2023},
  organization={IEEE/ACM}
}
```