Add new Tutorial - Using Ludwig Experiment to build Image Classifier from MNIST dataset #5333

Open
wants to merge 28 commits into
base: main

Conversation

paulocilasjr

  • It adds a full tutorial at statistics/tutorials/galaxy-ludwig

  • The images for the tutorial are in statistics/images/galaxy-ludwig

  • Added 2 new contributors and edited 1 contributor (CONTRIBUTORS.yaml)

  • Added 1 organization (ORGANISATIONS.yaml)

HELP WITH:

Member

@shiltemann shiltemann left a comment

Thanks a lot @paulocilasjr, looks great!

> <comment-title>Galaxy-Ludwig Tool</comment-title>
>
> The Ludwig tool described in this tutorial is only available at:
> ```https://cancer.usegalaxy.org/```
Member

Any reason for this? I don't see the tool in the tool shed either, are there plans to add this? Adding the tool to IUC would be ideal, so that other Galaxies can also support this tutorial.

Member

secondly, you probably want that url to be a proper hyperlink, not a comment

Author

Tomorrow, I'm going to have a more precise answer regarding the point you raised.

Author

I added the tool to the Tool Shed (https://toolshed.g2.bx.psu.edu/view/paulo_lyra_jr/ludwig_applications/3e565bbe8b71)

Not sure we can get this tool (+ original Galaxy-ML tools) into the IUC yet.

> The digits have been size normalized and centered in a fixed-size image.
{: .comment}

# FILES FORMAT
Member

please don't use all caps for the section headings

Author

No problem! I will change them all.


After your model is trained and tested, you should see three new files in your history list:

> Ludwig Experiment Report: An HTML file containing the evaluation report of the trained model.
Member

I think this could look better as a markdown list rather than comment block

Author

Changed it to a list.

To accomplish this, three steps are needed: (i) upload Ludwig files and image files to Galaxy (ii) Set up and running the Ludwig experiment function on Galaxy, and (iii) Evaluate the image classification model. As a bonus step, we'll also explore (iv) improving the model's classification performance (Figure 1).

![schema of the whole process of training model and test.](../../images/galaxy-ludwig/explain_model_schema.png "Overview of the steps process to obtain the handwritten classification model and testing it.")
<!-- You may want to cite some publications; this can be done by adding citations to the
Member

please feel free to remove these autogenerated hints

Author

Thank you. Just removed them.


Briefly, the image_path column provides the file paths to the images that will be fed into the deep learning algorithm. The label column contains the correct classifications, ranging from 0 to 9, for the handwritten digits in the images. The split column indicates whether the data should be used for training (0) or testing (2) the model.

![Dataset.csv file format snapshot](../../images/galaxy-ludwig/explain_dataset_format.png "Dataset.csv file format snapshot. features in order: file_path, label, split.")
Member

Just as a tip, since this image is rendering perhaps larger than you want, you can add some inline styling to the end like so: (have not tested if 50% is a good value, but just as an example)

Suggested change
![Dataset.csv file format snapshot](../../images/galaxy-ludwig/explain_dataset_format.png "Dataset.csv file format snapshot. features in order: file_path, label, split.")
![Dataset.csv file format snapshot](../../images/galaxy-ludwig/explain_dataset_format.png "Dataset.csv file format snapshot. features in order: file_path, label, split."){: width="50%"}

Author

Thank you! This is great to know. I will test it and submit it with the best configuration.

> - {% icon param-file %} *"Input dataset"*: `mnist_dataset.csv`
> - {% icon param-file %} *"Raw data"*: `mnist_images.zip`
>
> > <comment-title> short description </comment-title>
Member

you probably want to update or remove this comment title

Member

furthermore, I think we can assume people have enough familiarity with Galaxy to know to push the "RUN TOOL" button, so the box can also just be removed completely

Author

I agree. I removed it all.

@paulocilasjr
Author

Is there anything else I need to do in order to have it merged? Thank you.

@bgruening
Member

@paulocilasjr I think it's in good shape! Thanks for your contribution. We are just busy preparing all the tutorials for the big GTA event https://training.galaxyproject.org/training-material/events/galaxy-academy-2024.html

You could maybe work on the linting issues. The workflow is missing tests, etc. Let us know if you need help with that.

@paulocilasjr
Author

Wishing you all the best for the event — it already looks great!

By the way, I tried to add what was missing.

Member

@anuprulez anuprulez left a comment

@paulocilasjr thank you for the tutorial. I have added a few comments.

# Files Format
Before starting our hands-on, here is a brief explanation of the three files generated for the Ludwig Experiment tool.

## Image_Files.zip
Member

Do we need to have .zip in the header of this section? Can we write Images or Training images or something similar? Does .zip have some significance? The same comment applies to MNIST_dataset.csv.

Author

Your point is valid.
I removed the files extension from the header and changed the titles.


The rationale on how this file was constructed for this dataset is the following:
i) The model takes images as input and uses a stacked convolutional neural network (CNN) to extract features.
ii) It consists of two convolutional layers followed by a fully connected layer, with dropout applied to both the second convolutional layer and the fully connected layer to reduce overfitting.
Member

Do we have other layers such as maxpooling and normalisation used in the architecture of stacked CNN? If yes, please mention them.

Author

I added the text for max pooling, set by the pool_size and pool_stride attributes. No normalization was configured for this run.
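
For readers following this thread, the architecture discussed above (two convolutional layers with max pooling, then one fully connected layer, with dropout on the second convolutional layer and on the fully connected layer) could be written in a Ludwig config.yaml roughly as follows. This is only a sketch: the filter counts, kernel and pool sizes, layer width, and dropout rates are illustrative assumptions rather than values taken from this PR, and key names such as `filter_size` and `pool_size` may differ between Ludwig versions.

```yaml
input_features:
  - name: image_path          # column holding the image file paths
    type: image
    encoder:
      type: stacked_cnn
      conv_layers:
        - num_filters: 32     # assumed value
          filter_size: 3      # assumed value
          pool_size: 2        # max pooling window
          pool_stride: 2      # max pooling stride
        - num_filters: 64     # assumed value
          filter_size: 3      # assumed value
          pool_size: 2
          pool_stride: 2
          dropout: 0.4        # dropout on the second convolutional layer (assumed rate)
      fc_layers:
        - output_size: 128    # assumed value
          dropout: 0.4        # dropout on the fully connected layer (assumed rate)

output_features:
  - name: label               # digit class, 0 to 9
    type: category

trainer:
  epochs: 5                   # matches the five epochs discussed in the review
```

No normalization key appears in the sketch, in line with the author's note that none was configured for this run.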

Based on the training curve (blue line), we can draw the following conclusions:

- CONSISTENT IMPROVEMENT:
Training loss consistently decrease over the five epochs. This indicates that the model is effectively learning and improving its performance on the dataset.
Member

decreases

Author

Thank you. Fixed.

- CONSISTENT IMPROVEMENT:
Training loss consistently decrease over the five epochs. This indicates that the model is effectively learning and improving its performance on the dataset.
- APPROACHING CONVERGENCE:
By Epoch 5, the training loss has reduced to less than 0.5. The gradual reduction in losses suggests that the model is approaching convergence, where further training may yield diminishing returns in terms of performance improvement.
Member

Do we know something about the test loss? Observing test/validation loss is also important to decide whether a model is generalising well or not

Author

I added an additional information point to highlight this result and its conclusion.

Based on the training curve (blue line), we can draw the following conclusions:

- THE DATASET IS RELATIVELY EASY FOR THE MODEL TO LEARN:
The model starts with a relatively high training accuracy of >0.8.
Member

Accuracy on test data, or cross-validation accuracy, should also be mentioned to decide whether a model really learns well.

Author

I added an additional information point to highlight this result and its conclusion.


# Galaxy-Ludwig Tool

Ludwig simplifies the complexities of machine learning by automating essential steps such as data preprocessing, model architecture selection, hyperparameter tuning, and device management. This streamlined approach is particularly beneficial for Galaxy users who are more interested in addressing their scientific questions than in navigating the intricacies of machine learning workflows.
Member

To automatically do data preprocessing, model architecture selection, hyperparameter tuning, and device management, what parameters should a user consider changing/updating based on a different dataset?

Author

I added some lines explaining and providing examples of modifications users can make to the configuration to better fit their dataset and produce a specific model.

Member

@anuprulez anuprulez left a comment

I have added a few minor comments. Please look for similar issues throughout the tutorial. I am trying to run this tutorial on cancer.usegalaxy.org. I may have a few more comments after I have finished running it.
Thanks!

Member

@anuprulez anuprulez left a comment

I executed the tutorial and it works fine.

I have two questions:

  • How to create input datasets? For example, if I am working on sequences, how can I use the Ludwig Experiment tool to train a model? Or is the tool only for images?

  • How can I change the model from CNN to say RNN? Do I need to change the type parameter in the config.yaml file?

Thank you!


> <hands-on-title> Task description </hands-on-title>
>
> 1. {% tool [Ludwig Experiment](ludwig_experiment) %} with the following parameters:
Member

Adding the tool's version would be nice! You can look at other ML tutorials, such as the deep learning or regression ones, for examples.

Author

Thank you. I just added it.

paulocilasjr and others added 2 commits October 17, 2024 14:18
Co-authored-by: Anup Kumar <kumara@informatik.uni-freiburg.de>
Co-authored-by: Anup Kumar <kumara@informatik.uni-freiburg.de>
paulocilasjr and others added 7 commits October 17, 2024 14:20
Co-authored-by: Anup Kumar <kumara@informatik.uni-freiburg.de>
Co-authored-by: Anup Kumar <kumara@informatik.uni-freiburg.de>
Co-authored-by: Anup Kumar <kumara@informatik.uni-freiburg.de>
Co-authored-by: Anup Kumar <kumara@informatik.uni-freiburg.de>
@paulocilasjr
Author

Thank you for such great feedback.

Answering your questions:

  • How to create input datasets? For example, if I am working on sequences, how can I use the Ludwig Experiment tool to train a model? Or is the tool only for images?

A: Ludwig supports other data types.
Input formats specifically are established when setting the input_features in the config.yaml file.

Taking sequence data as an example, the dataset could be a CSV with the following columns: Reference_allele, Alternative_allele, mutation_type, and Clinical_Impact. This means you have three input features and one target label (Clinical_Impact).
Your config.yaml file should look something like this (omitting other parameters to keep it short):

input_features:
  - name: Reference_allele
    type: category
    encoder:
  - name: Alternative_allele
    type: category
    encoder:
  - name: mutation_type
    type: category
    encoder:

output_features:
  - name: Clinical_Impact
    type: category

  • How can I change the model from CNN to say RNN? Do I need to change the type parameter in the config.yaml file?

A: The type corresponds to one of the supported data types. The encoder is what you’re looking for—each input feature can be configured with a specific encoder, such as an RNN.
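
As a rough illustration of this answer, switching architectures means changing the `encoder` of an input feature, not its `type`. The sketch below assumes a hypothetical column named `dna_sequence` that holds raw sequence text. The column name and the `cell_type` value are assumptions for illustration and do not come from the tutorial.

```yaml
input_features:
  - name: dna_sequence      # hypothetical column name, not from the tutorial
    type: sequence          # data type of the column
    encoder:
      type: rnn             # architecture used to encode this feature
      cell_type: lstm       # assumed value; other recurrent cell types may be available

output_features:
  - name: Clinical_Impact   # target label from the example above
    type: category
```

Keeping `type` tied to the data and `encoder` tied to the architecture lets each input feature use a different encoder within the same config.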

Let me know if anything could be explained more clearly.

@anuprulez
Member

Ok, looks good to me. @paulocilasjr can you resolve the conflicts in this branch, probably by rebasing?

@anuprulez
Member

I think we have some failing tests. Can you see if they are related? thanks!

@paulocilasjr
Author

I addressed the issues.

Is it possible to run the tests locally? Could you tell me how to do it or where I can read about it?

4 participants