Add new Tutorial - Using Ludwig Experiment to build Image Classifier from MNIST dataset #5333
Thanks a lot @paulocilasjr, looks great!
> <comment-title>Galaxy-Ludwig Tool</comment-title>
>
> The Ludwig tool described in this tutorial is only available at:
> ```https://cancer.usegalaxy.org/```
Any reason for this? I don't see the tool in the tool shed either, are there plans to add this? Adding the tool to IUC would be ideal, so that other Galaxies can also support this tutorial.
secondly, you probably want that url to be a proper hyperlink, not a comment
Tomorrow, I'm going to have a more precise answer regarding the point you raised.
I added the tool to the Tool Shed (https://toolshed.g2.bx.psu.edu/view/paulo_lyra_jr/ludwig_applications/3e565bbe8b71).
I am not sure we can get this tool (+ original Galaxy-ML tools) into the IUC yet.
> The digits have been size normalized and centered in a fixed-size image.
{: .comment}

# FILES FORMAT
please don't use all caps for the section headings
No problem! I will change them all.

After your model is trained and tested, you should see three new files in your history list:

> Ludwig Experiment Report: An HTML file containing the evaluation report of the trained model.
I think this could look better as a markdown list rather than comment block
Changed it to list.
To accomplish this, three steps are needed: (i) upload Ludwig files and image files to Galaxy (ii) Set up and running the Ludwig experiment function on Galaxy, and (iii) Evaluate the image classification model. As a bonus step, we'll also explore (iv) improving the model's classification performance (Figure 1).

![schema of the whole process of training model and test.](../../images/galaxy-ludwig/explain_model_schema.png "Overview of the steps process to obtain the handwritten classification model and testing it.")
<!-- You may want to cite some publications; this can be done by adding citations to the
please feel free to remove these autogenerated hints
Thank you. Just removed them.

Briefly, the image_path column provides the file paths to the images that will be fed into the deep learning algorithm. The label column contains the correct classifications, ranging from 0 to 9, for the handwritten digits in the images. The split column indicates whether the data should be used for training (0) or testing (2) the model.

![Dataset.csv file format snapshot](../../images/galaxy-ludwig/explain_dataset_format.png "Dataset.csv file format snapshot. features in order: file_path, label, split.")
Just as a tip, since this image is rendering perhaps larger than you want, you can add some inline styling to the end like so: (have not tested if 50% is a good value, but just as an example)
![Dataset.csv file format snapshot](../../images/galaxy-ludwig/explain_dataset_format.png "Dataset.csv file format snapshot. features in order: file_path, label, split.")
![Dataset.csv file format snapshot](../../images/galaxy-ludwig/explain_dataset_format.png "Dataset.csv file format snapshot. features in order: file_path, label, split."){: width="50%"}
Thank you! This is great to know. I will test it and submit it with the best configuration.
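For readers following the dataset description in the hunk above, here is a minimal sketch of how a CSV with the `image_path`, `label`, and `split` columns could be produced. The image paths and the helper name `write_dataset_csv` are hypothetical and only illustrate the layout; the split convention (0 for training, 2 for testing) is the one stated in the tutorial text.

```python
import csv
import io

# Hypothetical rows: the paths are placeholders, not files from the real archive.
rows = [
    {"image_path": "training/0/example_a.png", "label": 0, "split": 0},
    {"image_path": "training/7/example_b.png", "label": 7, "split": 0},
    {"image_path": "testing/3/example_c.png", "label": 3, "split": 2},
]


def write_dataset_csv(rows, handle):
    """Write rows using the three columns described in the tutorial."""
    writer = csv.DictWriter(handle, fieldnames=["image_path", "label", "split"])
    writer.writeheader()
    for row in rows:
        # split: 0 = training, 2 = testing (as stated in the tutorial text)
        if row["split"] not in (0, 2):
            raise ValueError(f"unexpected split value: {row['split']}")
        # label: one of the ten digit classes
        if not 0 <= row["label"] <= 9:
            raise ValueError(f"label out of range: {row['label']}")
        writer.writerow(row)


buffer = io.StringIO()
write_dataset_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Validating `split` and `label` while writing keeps malformed rows out of the file before it is uploaded to Galaxy.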
> - {% icon param-file %} *"Input dataset"*: `mnist_dataset.csv`
> - {% icon param-file %} *"Raw data"*: `mnist_images.zip`
>
> > <comment-title> short description </comment-title>
you probably want to update or remove this comment title
furthermore, I think we can assume people have enough familiarity with Galaxy to know to push the "RUN TOOL" button, so the box can also just be removed completely
I agree. I removed it all.
Is there anything else I need to do in order to have it merged? Thank you.
@paulocilasjr I think it's in good shape! Thanks for your contribution. We are just busy preparing all the tutorials for the big GTA event https://training.galaxyproject.org/training-material/events/galaxy-academy-2024.html You could maybe work on the linting issues. The workflow is missing tests, etc. Let us know if you need help with that.
Wishing you all the best for the event; it already looks great! By the way, I tried to add what was missing.
@paulocilasjr thank you for the tutorial. I have added a few comments.
# Files Format
Before starting our hands-on, here is a brief explanation of the three files generated for the Ludwig Experiment tool.

## Image_Files.zip
Do we need to have `.zip` in the header of this section? Can we write `Images` or `Training images` or something similar? Does `.zip` have some significance? Similar comment for `MNIST_dataset.csv`.
Your point is valid.
I removed the file extensions from the headers and changed the titles.

The rationale on how this file was constructed for this dataset is the following:
i) The model takes images as input and uses a stacked convolutional neural network (CNN) to extract features.
ii) It consists of two convolutional layers followed by a fully connected layer, with dropout applied to both the second convolutional layer and the fully connected layer to reduce overfitting.
Do we have other layers such as `maxpooling` and `normalisation` used in the architecture of the stacked CNN? If yes, please mention them.
I added the text for max pooling, set by the pool_size and pool_stride attributes. No normalization was configured for this run.
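As a rough illustration of the architecture discussed here (two convolutional layers with max pooling, a fully connected layer, dropout on the second convolutional layer and on the fully connected layer), the configuration could look something like the sketch below. It is written as a Python dictionary and printed as JSON so that it runs with the standard library alone. The key names follow Ludwig's declarative configuration as I understand it and may differ between Ludwig versions; `num_filters`, `filter_size`, `output_size`, the pooling values, and the 0.4 dropout rate are illustrative placeholders, not the values submitted in this pull request.

```python
import json

# Illustrative values only: layer sizes, pooling values, and dropout rates are
# assumptions, not the configuration shipped with the tutorial.
config = {
    "input_features": [
        {
            "name": "image_path",
            "type": "image",
            "encoder": {
                "type": "stacked_cnn",
                "conv_layers": [
                    # First convolutional layer: max pooling, no dropout.
                    {"num_filters": 32, "filter_size": 3,
                     "pool_size": 2, "pool_stride": 2},
                    # Second convolutional layer: max pooling plus dropout.
                    {"num_filters": 64, "filter_size": 3,
                     "pool_size": 2, "pool_stride": 2, "dropout": 0.4},
                ],
                # Fully connected layer with dropout.
                "fc_layers": [{"output_size": 128, "dropout": 0.4}],
            },
        }
    ],
    "output_features": [{"name": "label", "type": "category"}],
    # Five epochs, matching the training curves discussed in this review.
    "trainer": {"epochs": 5},
}

print(json.dumps(config, indent=2))
```

No normalization key appears in the sketch, in line with the reply above that none was configured for this run.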
Based on the training curve (blue line), we can draw the following conclusions:

- CONSISTENT IMPROVEMENT:
Training loss consistently decrease over the five epochs. This indicates that the model is effectively learning and improving its performance on the dataset.
decreases
Thank you. Fixed.
- CONSISTENT IMPROVEMENT:
Training loss consistently decrease over the five epochs. This indicates that the model is effectively learning and improving its performance on the dataset.
- APPROACHING CONVERGENCE:
By Epoch 5, the training loss has reduced to less than 0.5. The gradual reduction in losses suggests that the model is approaching convergence, where further training may yield diminishing returns in terms of performance improvement.
Do we know something about the `test loss`? Observing test/validation loss is also important to decide whether a model is generalising well or not.
I added an additional information point to highlight this result and its conclusion.
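To make the point about comparing training and validation loss concrete, here is a small sketch that checks whether each curve keeps decreasing and how far apart the two are per epoch. The loss values are hypothetical, chosen only so that the training loss ends below 0.5 after five epochs as described in the hunk; they are not read from the tutorial's report, and the helper names are my own.

```python
def is_strictly_decreasing(values):
    """True when every value is lower than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


def generalization_gap(train_loss, validation_loss):
    """Per-epoch difference: validation loss minus training loss."""
    return [val - train for train, val in zip(train_loss, validation_loss)]


# Hypothetical per-epoch losses for five epochs.
train_loss = [1.20, 0.80, 0.62, 0.51, 0.45]
validation_loss = [1.00, 0.70, 0.58, 0.50, 0.47]

training_improves = is_strictly_decreasing(train_loss)
validation_improves = is_strictly_decreasing(validation_loss)
gaps = generalization_gap(train_loss, validation_loss)

print("training loss decreasing:", training_improves)
print("validation loss decreasing:", validation_improves)
print("final gap (validation - training):", round(gaps[-1], 3))
```

A validation loss that stops decreasing, or a gap that keeps widening while training loss still falls, would be the usual hint that the model is starting to overfit.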
Based on the training curve (blue line), we can draw the following conclusions:

- THE DATASET IS RELATIVELY EASY FOR THE MODEL TO LEARN:
The model starts with a relatively high training accuracy of >0.8.
Accuracy on test data, or cross-validation accuracy, should also be mentioned to judge whether the model really learns well.
I added an additional information point to highlight this result and its conclusion.
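One simple way to report accuracy separately for the training and test portions is sketched below, reusing the split codes from the dataset file (0 for training, 2 for testing). The `(label, prediction, split)` records are made up for illustration, and `accuracy_by_split` is a hypothetical helper, not part of Ludwig or the Galaxy tool.

```python
from collections import defaultdict


def accuracy_by_split(records):
    """Return {split: fraction of records whose prediction equals the label}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for label, prediction, split in records:
        total[split] += 1
        correct[split] += int(label == prediction)
    return {split: correct[split] / total[split] for split in total}


# Hypothetical (label, prediction, split) records.
records = [
    (0, 0, 0), (7, 7, 0), (3, 3, 0), (5, 6, 0),  # training rows
    (1, 1, 2), (9, 4, 2),                        # test rows
]

scores = accuracy_by_split(records)
print("training accuracy:", scores[0])
print("test accuracy:", scores[2])
```

Reporting both numbers side by side makes it easier to see whether a high training accuracy carries over to unseen images.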

# Galaxy-Ludwig Tool

Ludwig simplifies the complexities of machine learning by automating essential steps such as data preprocessing, model architecture selection, hyperparameter tuning, and device management. This streamlined approach is particularly beneficial for Galaxy users who are more interested in addressing their scientific questions than in navigating the intricacies of machine learning workflows.
To automatically do `data preprocessing, model architecture selection, hyperparameter tuning, and device management`, what parameters should a user consider changing/updating based on a different dataset?
I added some lines explaining and providing examples of modifications users can make to the configuration to better fit their dataset and produce a specific model.
I have added a few minor comments. Please look for similar issues throughout the tutorial. I am trying to run this tutorial on cancer.usegalaxy.org. I may have a few more comments after I have finished running it.
Thanks!
I executed the tutorial and it works fine.
I have two questions:
- How do I create input datasets? For example, if I am working on sequences, how can I use the `Ludwig experiment` tool to train a model? Or is the tool only for images?
- How can I change the model from a CNN to, say, an RNN? Do I need to change the `type` parameter in the `config.yaml` file?
Thank you!

> <hands-on-title> Task description </hands-on-title>
>
> 1. {% tool [Ludwig Experiment](ludwig_experiment) %} with the following parameters:
Adding the tool's version would be nice! You can look at other ML tutorials, such as the deep learning or regression ones, for examples.
Thank you. I just added it.
Co-authored-by: Anup Kumar <kumara@informatik.uni-freiburg.de>
Thank you for such great feedback. Answering your questions:
A: Ludwig supports other data types. Given sequence data, for example, the dataset could be a CSV with the following data:
A: The
Let me know if anything could be explained more clearly.
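As a hypothetical illustration of the two questions above (this is not the author's original example), a sequence dataset could use space-separated tokens in one column, and switching architectures could amount to changing the encoder `type` of the input feature. The column names, labels, and the helper `with_encoder_type` are assumptions; the config layout follows Ludwig's declarative configuration as I understand it and may differ in the version wrapped by the Galaxy tool.

```python
import copy

# Hypothetical sequence rows: space-separated tokens, a class label, and the
# same split convention as the MNIST example (0 = training, 2 = testing).
sequence_rows = [
    {"sequence": "A C G T T G", "label": "promoter", "split": 0},
    {"sequence": "G G C A T A", "label": "other", "split": 2},
]

# Assumed config layout: a sequence input feature with a named encoder type.
base_config = {
    "input_features": [
        {"name": "sequence", "type": "sequence",
         "encoder": {"type": "stacked_cnn"}}
    ],
    "output_features": [{"name": "label", "type": "category"}],
}


def with_encoder_type(config, feature_name, encoder_type):
    """Return a copy of config with one input feature's encoder type replaced."""
    updated = copy.deepcopy(config)
    for feature in updated["input_features"]:
        if feature["name"] == feature_name:
            feature["encoder"]["type"] = encoder_type
            return updated
    raise KeyError(f"no input feature named {feature_name!r}")


# Swap the CNN encoder for an RNN encoder without touching the original config.
rnn_config = with_encoder_type(base_config, "sequence", "rnn")
print(rnn_config["input_features"][0]["encoder"])
```

Working on a deep copy keeps the original configuration intact, which is convenient when comparing runs with different encoders.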
Ok, looks good to me. @paulocilasjr can you resolve the conflicts in this branch, probably by rebasing?
I think we have some failing tests. Can you see if they are related? Thanks!
I addressed the issues. Is it possible to run the tests locally? Could you tell me how to do it or where I can read about it?
It adds a full tutorial on statistics/tutorials/galaxy-ludwig
The images for the tutorial were at statistics/images/galaxy-ludwig
Added 2 new contributors and edited 1 contributor (CONTRIBUTORS.yaml)
Added 1 organization (ORGANISATIONS.yaml)
HELP WITH: