This is an UNOFFICIAL implementation of the model described in:
Uri Alon, Shaked Brody, Omer Levy and Eran Yahav, "code2seq: Generating Sequences from Structured Representations of Code" [PDF]
This is a TensorFlow 2.1 fork of the network implementation, with Java and C# extractors for preprocessing the input code. The official network implementation repository is https://github.com/tech-srl/code2seq
This implementation was further modified to investigate whether including comments in the training data improves the model's performance on the Code Captioning task. The work was done as part of the Research Project 2022, a final requirement of the Bachelor's degree at Delft University of Technology.
To that end, the following changes were made:
- LeavesCollectorVisitor.java was modified to include inline comments in AST paths.
- FunctionVisitor.java now predicts JavaDoc comments instead of method names.
- The model was changed to use the FunCom dataset. Instructions on how to do that are included in funcom_dataset.
- Scripts used for training and score calculation were added in code_captioning_scripts.
- Smaller changes were made to accommodate the training pipeline.
python3 -c 'import tensorflow as tf; print(tf.__version__)'
- For creating a new Java dataset or manually examining a trained model (any operation that requires parsing of a new code example): JDK
- For creating a C# dataset: dotnet-core version 2.2 or newer.
- pip install rouge, for computing ROUGE scores.
git clone https://github.com/Kolkir/code2seq/
cd code2seq
To obtain a preprocessed dataset to train a network on, you can either download a preprocessed dataset or create a new dataset from Java source files.
Download our preprocessed Java-large dataset (~16M examples; compressed: 11 GB, extracted: 125 GB)
mkdir data
cd data
wget https://s3.amazonaws.com/code2seq/datasets/java-large-preprocessed.tar.gz
tar -xvzf java-large-preprocessed.tar.gz
This will create a data/java-large/ sub-directory containing the files that hold the training, test and validation sets, and a dict file for various dataset properties.
To create and preprocess a new dataset (for example, to compare code2seq to another model on another dataset):
- Edit the file preprocess.sh using the instructions there, pointing it to the correct training, validation and test directories.
- Run the preprocess.sh file:
bash preprocess.sh
To train a model from scratch:
- Edit the file train.sh to point it to the right preprocessed data. By default, it points to our "java-large" dataset that was preprocessed in the previous step.
- Before training, you can edit the configuration hyper-parameters in the file config.py, as explained in Configuration.
- Run the train.sh script:
bash train.sh
After config.PATIENCE iterations of no improvement on the validation set, training stops by itself.
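The stopping rule above can be illustrated with a minimal sketch. This is not the repository's training loop: `train_with_early_stopping` and `evaluate_epoch` are hypothetical names, and the sketch assumes one validation evaluation per epoch with a higher-is-better score.

```python
def train_with_early_stopping(evaluate_epoch, max_epochs, patience):
    """Stop once `patience` consecutive evaluations bring no improvement.

    evaluate_epoch is a hypothetical callable that trains for one epoch
    and returns the validation score (assumed: higher is better).
    """
    best_score = float("-inf")
    best_epoch = 0
    stopped_at = 0
    evaluations_without_improvement = 0
    for epoch in range(1, max_epochs + 1):
        stopped_at = epoch
        score = evaluate_epoch(epoch)
        if score > best_score:
            # New best model: remember it and reset the patience counter.
            best_score, best_epoch = score, epoch
            evaluations_without_improvement = 0
        else:
            evaluations_without_improvement += 1
            if evaluations_without_improvement >= patience:
                break
    return best_epoch, best_score, stopped_at


# Made-up validation scores, one per epoch, purely for illustration.
scores = [0.30, 0.35, 0.34, 0.33, 0.36, 0.35, 0.35, 0.35, 0.35, 0.35]
best_epoch, best_score, stopped_at = train_with_early_stopping(
    lambda epoch: scores[epoch - 1], max_epochs=len(scores), patience=3
)
```

With these made-up scores, the best model is the one from epoch 5, and training stops three evaluations later, before reaching the maximum number of epochs.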
Suppose that iteration #52 is our chosen model. To evaluate it on the test set, run:
python3 code2seq.py --load_path models/java-large-model/model_iter52.release --test data/java-large/java-large.test.c2s
While evaluating, a file named "log.txt" is written to the same directory as the saved models, listing each test example's name and the model's prediction.
To manually examine a trained model, run:
python3 code2seq.py --load_path models/java-large-model/model_iter52.release --predict
After the model loads, follow the instructions: edit the file Input.java, enter a Java method or code snippet, and examine the model's predictions and attention scores.
Due to TensorFlow's limitations, if using beam search (config.BEAM_WIDTH > 0), then BEAM_WIDTH hypotheses will be printed, but without attention weights. If not using beam search (config.BEAM_WIDTH == 0), then a single hypothesis will be printed with the attention weights in every decoding timestep.
Changing hyper-parameters is possible by editing the file config.py.
Here are some of the parameters and their descriptions:
The max number of epochs to train the model.
The frequency, in epochs, of saving a model and evaluating on the validation set during training.
Controlling early stopping: how many epochs of no improvement should training continue before stopping.
Batch size during training and inference.
The buffer size that the reader uses for shuffling the training data. Controls the randomness of the data. Increasing this value might hurt training throughput.
The buffer size (in bytes) of the CSV dataset reader.
The number of contexts to sample in each example during training (resampling a different subset of this size every training iteration).
The max size of the subtoken vocabulary.
The max size of the target words vocabulary.
Embedding size for subtokens, AST nodes and target symbols.
The total size of the two LSTMs that are used to embed the paths if config.BIRNN is True, or the size of the single LSTM if config.BIRNN is False.
Size of each LSTM layer in the decoder.
Number of decoder LSTM layers. Can be increased to support long target sequences.
The max number of nodes in a path.
The max number of subtokens in an input token. If the token is longer, only the first subtokens will be read.
The max number of symbols in the target sequence. Set to 6 by default for method names, but can be increased for learning datasets with longer sequences.
If True, use a bidirectional LSTM to encode each path. If False, use a unidirectional LSTM only.
When True, sample MAX_CONTEXTS contexts from every example in every training iteration. When False, take only the first MAX_CONTEXTS contexts.
Beam width in beam search. Inactive when 0.
If True, use the Momentum optimizer with Nesterov. If False, use Adam (Adam converges in fewer epochs; Momentum leads to slightly better results).
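The context-sampling behaviour described above (a fresh random subset of MAX_CONTEXTS contexts per training iteration, or simply the first MAX_CONTEXTS) can be sketched as follows. This is an illustrative stand-in, not the repository's reader code: `select_contexts` and its parameter names are hypothetical.

```python
import random


def select_contexts(contexts, max_contexts, random_contexts, rng=random):
    """Pick at most max_contexts contexts from one example.

    contexts: list of context strings for a single example.
    random_contexts: hypothetical flag mirroring the sampling switch;
        True  -> draw a random subset (different on every call),
        False -> always keep the first max_contexts.
    rng: a random.Random instance (or the random module) for sampling.
    """
    if len(contexts) <= max_contexts:
        # Nothing to drop: the example already fits.
        return list(contexts)
    if random_contexts:
        # Sampling without replacement, so no context is repeated.
        return rng.sample(contexts, max_contexts)
    return list(contexts[:max_contexts])


# Made-up contexts in the token,path,token form, for illustration only.
example_contexts = [f"tok{i},Node{i}|Name,tok{i + 1}" for i in range(10)]
first_three = select_contexts(example_contexts, 3, random_contexts=False)
sampled_three = select_contexts(
    example_contexts, 3, random_contexts=True, rng=random.Random(0)
)
```

Calling the function again with `random_contexts=True` on the same example generally yields a different subset, which is what exposes the model to more of each example's contexts over the course of training.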
This project currently supports Java and C# as the input languages.
March 2020 - a code2seq extractor for C++. See: https://github.com/Kolkir/cppminer.
January 2020 - a code2seq extractor for Python (specifically targeting the Python150k dataset) was contributed by @stasbel. See: https://github.com/tech-srl/code2seq/tree/master/Python150kExtractor.
January 2020 - an extractor for predicting TypeScript type annotations for JavaScript input using code2vec was developed by @izosak and Noa Cohen, and is available here: https://github.com/tech-srl/id2vec.
June 2019 - an extractor for C that is compatible with our model was developed by the CMU SEI team (since removed by the CMU SEI team).
June 2019 - a code2vec extractor for Python, Java, C, C++ by JetBrains Research is available here: PathMiner.
To extend code2seq to languages other than Java and C#, a new extractor (similar to the JavaExtractor) should be implemented and called by preprocess.sh. For each directory containing source files, an extractor should output:
- A single text file, where each row is an example.
- Each example is a space-delimited list of fields, where:
  - The first field is the target label, internally delimited by the "|" character (for example: compare|ignore|case).
  - Each of the following fields is a context, where each context has three components separated by commas (","). None of these components can include spaces or commas.
We refer to these three components as a token, a path, and another token, but in general other types of ternary contexts can be considered.
Each "token" component is a token in the code, split into subtokens using the "|" character.
Each path is a path between two tokens, split into path nodes (or other kinds of building blocks) using the "|" character. Example for a context:
my|key,StringExpression|MethodCall|Name,get|value
Here my|key and get|value are tokens, and StringExpression|MethodCall|Name is the syntactic path that connects them.
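As a rough illustration of the row format described above, the following sketch builds and parses one example line. It is not part of any existing extractor: `format_example` and `parse_example` are hypothetical helpers, and the sample values mirror the target label and context shown in the text.

```python
def format_example(target_subtokens, contexts):
    """Build one row: a target label followed by space-delimited contexts.

    contexts: list of (left_subtokens, path_nodes, right_subtokens) tuples,
    each element being a list of strings.
    """
    fields = ["|".join(target_subtokens)]
    for left, path, right in contexts:
        components = ["|".join(left), "|".join(path), "|".join(right)]
        for component in components:
            # Spaces and commas are the field and component delimiters.
            if " " in component or "," in component:
                raise ValueError(f"illegal character in component: {component!r}")
        fields.append(",".join(components))
    return " ".join(fields)


def parse_example(line):
    """Split one row back into target subtokens and contexts."""
    target, *context_fields = line.split()
    contexts = []
    for field in context_fields:
        left, path, right = field.split(",")
        contexts.append((left.split("|"), path.split("|"), right.split("|")))
    return target.split("|"), contexts


line = format_example(
    ["compare", "ignore", "case"],
    [(["my", "key"], ["StringExpression", "MethodCall", "Name"], ["get", "value"])],
)
target, contexts = parse_example(line)
```

Here `line` is a single row of the output file, and parsing it recovers the target subtokens and the (token, path, token) triple. Rejecting spaces and commas inside components at write time keeps the rows unambiguous for any reader that splits on those delimiters.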
To download the Java-small, Java-med and Java-large datasets used in the Code Summarization task as raw *.java files, use:
To download the preprocessed datasets, use:
The C# dataset used in the Code Captioning task can be downloaded from the CodeNN repository.
An experimental C++ dataset was created with the cppminer tool.