Add new Tutorial - Using Ludwig Experiment to build Image Classifier from MNIST dataset #5333

paulocilasjr · 2024-09-18T21:02:03Z

This PR adds a full tutorial at statistics/tutorials/galaxy-ludwig
The images for the tutorial are at statistics/images/galaxy-ludwig
Added 2 new contributors and edited 1 contributor (CONTRIBUTORS.yaml)
Added 1 organization (ORGANISATIONS.yaml)

HELP WITH:

add the tutorial as an option on https://training.galaxyproject.org/training-material/topics/statistics/ page

shiltemann

Thanks a lot @paulocilasjr, looks great!

shiltemann · 2024-09-19T07:44:45Z

topics/statistics/tutorials/galaxy-ludwig/tutorial.md

+> <comment-title>Galaxy-Ludwig Tool</comment-title>
+>
+> The Ludwig tool described in this tutorial is only available at: 
+> ```https://cancer.usegalaxy.org/```


Any reason for this? I don't see the tool in the tool shed either, are there plans to add this? Adding the tool to IUC would be ideal, so that other Galaxies can also support this tutorial.

secondly, you probably want that url to be a proper hyperlink, not a comment

Tomorrow, I'm going to have a more precise answer regarding the point you raised.

I added the tool to the Tool Shed (https://toolshed.g2.bx.psu.edu/view/paulo_lyra_jr/ludwig_applications/3e565bbe8b71)

Not sure we can get this tool (+ original Galaxy-ML tools) into the IUC yet.

shiltemann · 2024-09-19T07:45:15Z

topics/statistics/tutorials/galaxy-ludwig/tutorial.md

+> The digits have been size normalized and centered in a fixed-size image.
+{:  .comment}
+
+# FILES FORMAT


please don't use all caps for the section headings

No problem! I will change them all.

shiltemann · 2024-09-19T07:46:02Z

topics/statistics/tutorials/galaxy-ludwig/tutorial.md

+
+After your model is trained and tested, you should see three new files in your history list:
+
+    > Ludwig Experiment Report: An HTML file containing the evaluation report of the trained model.


I think this could look better as a markdown list rather than a comment block

Changed it to a list.

shiltemann · 2024-09-19T07:46:37Z

topics/statistics/tutorials/galaxy-ludwig/tutorial.md

+To accomplish this, three steps are needed: (i) upload Ludwig files and image files to Galaxy (ii) Set up and running the Ludwig experiment function on Galaxy, and (iii) Evaluate the image classification model. As a bonus step, we'll also explore (iv) improving the model's classification performance (Figure 1).
+
+![schema of the whole process of training model and test.](../../images/galaxy-ludwig/explain_model_schema.png "Overview of the steps process to obtain the handwritten classification model and testing it.")
+<!-- You may want to cite some publications; this can be done by adding citations to the


please feel free to remove these autogenerated hints

Thank you. Just removed them.

shiltemann · 2024-09-19T07:49:09Z

topics/statistics/tutorials/galaxy-ludwig/tutorial.md

+
+Briefly, the image_path column provides the file paths to the images that will be fed into the deep learning algorithm. The label column contains the correct classifications, ranging from 0 to 9, for the handwritten digits in the images. The split column indicates whether the data should be used for training (0) or testing (2) the model.
+
+![Dataset.csv file format snapshot](../../images/galaxy-ludwig/explain_dataset_format.png "Dataset.csv file format snapshot. features in order: file_path, label, split.")


Just as a tip, since this image is rendering perhaps larger than you want, you can add some inline styling to the end like so: (have not tested if 50% is a good value, but just as an example)

Suggested change

-![Dataset.csv file format snapshot](../../images/galaxy-ludwig/explain_dataset_format.png "Dataset.csv file format snapshot. features in order: file_path, label, split.")
+![Dataset.csv file format snapshot](../../images/galaxy-ludwig/explain_dataset_format.png "Dataset.csv file format snapshot. features in order: file_path, label, split."){: width="50%"}

Thank you! This is great to know. I will test it and submit it with the best configuration.
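
For anyone adapting the dataset file discussed in this hunk, here is a minimal sketch of how a CSV with the three columns described above (image_path, label, split) could be assembled. The directory layout, the example file names, and the helper names `build_rows` and `write_dataset_csv` are assumptions made for illustration and are not taken from the tutorial; only the column names and the split codes (0 for training, 2 for testing) come from the quoted text.

```python
import csv
import io

# Split codes as described in the tutorial text: 0 = training, 2 = testing.
TRAIN, TEST = 0, 2


def build_rows(image_paths_by_label, split):
    """Return one {image_path, label, split} row per image (hypothetical helper)."""
    rows = []
    for label in sorted(image_paths_by_label):
        for path in image_paths_by_label[label]:
            rows.append({"image_path": path, "label": label, "split": split})
    return rows


def write_dataset_csv(rows, handle):
    """Write the rows with the header order used in the tutorial's CSV."""
    writer = csv.DictWriter(handle, fieldnames=["image_path", "label", "split"])
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical paths; the real archive layout may differ.
train = {0: ["training/0/img_1.png"], 7: ["training/7/img_2.png"]}
test = {3: ["testing/3/img_9.png"]}
rows = build_rows(train, TRAIN) + build_rows(test, TEST)

buffer = io.StringIO()
write_dataset_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch side-effect free; passing an open file handle instead would produce the actual CSV.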

shiltemann · 2024-09-19T07:49:53Z

topics/statistics/tutorials/galaxy-ludwig/tutorial.md

+>    - {% icon param-file %} *"Input dataset"*: `mnist_dataset.csv`
+>    - {% icon param-file %} *"Raw data"*: `mnist_images.zip`
+>
+>    > <comment-title> short description </comment-title>


you probably want to update or remove this comment title

furthermore, I think we can assume people have enough familiarity with Galaxy to know to push the "RUN TOOL" button, so the box can also just be removed completely

I agree. I removed it all.

paulocilasjr · 2024-10-01T15:18:34Z

Is there anything else I need to do in order to have it merged? Thank you.

bgruening · 2024-10-01T20:40:32Z

@paulocilasjr I think it's in good shape! Thanks for your contribution. We are just busy preparing all the tutorials for the big GTA event https://training.galaxyproject.org/training-material/events/galaxy-academy-2024.html

You could maybe work on the linting issues. The workflow is missing tests, etc. Let us know if you need help with that.

paulocilasjr · 2024-10-02T20:08:17Z

Wishing you all the best for the event — it already looks great!

By the way, I tried to add what was missing.

anuprulez

@paulocilasjr thank you for the tutorial. I have added a few comments.

anuprulez · 2024-10-10T07:37:55Z

topics/statistics/tutorials/galaxy-ludwig/tutorial.md

+# Files Format
+Before starting our hands-on, here is a brief explanation of the three files generated for the Ludwig Experiment tool.
+
+## Image_Files.zip 


Do we need to have .zip in the header of this section? Can we write Images or Training images or something similar? Does .zip have some significance? Similar comment for MNIST_dataset.csv

Your point is valid.
I removed the file extensions from the headers and changed the titles.

anuprulez · 2024-10-10T07:40:23Z

topics/statistics/tutorials/galaxy-ludwig/tutorial.md

+
+The rationale on how this file was constructed for this dataset is the following:
+i) The model takes images as input and uses a stacked convolutional neural network (CNN) to extract features.
+ii) It consists of two convolutional layers followed by a fully connected layer, with dropout applied to both the second convolutional layer and the fully connected layer to reduce overfitting. 


Do we have other layers such as maxpooling and normalisation used in the architecture of stacked CNN? If yes, please mention them.

I added the text for max pooling, set by the pool_size and pool_stride attributes. No normalization was configured for this run.

anuprulez · 2024-10-10T07:42:20Z

topics/statistics/tutorials/galaxy-ludwig/tutorial.md

+Based on the training curve (blue line), we can draw the following conclusions:
+
+- CONSISTENT IMPROVEMENT:
+    Training loss consistently decrease over the five epochs. This indicates that the model is effectively learning and improving its performance on the dataset.


Thank you. Fixed.

anuprulez · 2024-10-10T07:43:50Z

topics/statistics/tutorials/galaxy-ludwig/tutorial.md

+- CONSISTENT IMPROVEMENT:
+    Training loss consistently decrease over the five epochs. This indicates that the model is effectively learning and improving its performance on the dataset.
+- APPROACHING CONVERGENCE:
+    By Epoch 5, the training loss has reduced to less than 0.5. The gradual reduction in losses suggests that the model is approaching convergence, where further training may yield diminishing returns in terms of performance improvement.


Do we know something about the test loss? Observing test/validation loss is also important to decide whether a model is generalising well or not

I added an additional information point to highlight this result and its conclusion.
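
To illustrate the point about watching validation loss alongside training loss, here is a minimal sketch of one simple check: flag the first epoch where validation loss rises while training loss is still falling. The function names and the loss values are made up for illustration; they are not results from the tutorial's MNIST run.

```python
def generalisation_gap(train_loss, val_loss):
    """Per-epoch difference between validation and training loss."""
    if len(train_loss) != len(val_loss):
        raise ValueError("loss curves must cover the same epochs")
    return [v - t for t, v in zip(train_loss, val_loss)]


def first_overfitting_epoch(train_loss, val_loss):
    """Return the 1-based epoch where validation loss first rises while
    training loss still falls, or None if that never happens."""
    for epoch in range(1, len(train_loss)):
        train_falling = train_loss[epoch] < train_loss[epoch - 1]
        val_rising = val_loss[epoch] > val_loss[epoch - 1]
        if train_falling and val_rising:
            return epoch + 1
    return None


# Made-up curves for five epochs, purely illustrative.
train_curve = [1.2, 0.8, 0.6, 0.5, 0.4]
val_curve = [1.3, 0.9, 0.7, 0.75, 0.8]

gap = generalisation_gap(train_curve, val_curve)
print(first_overfitting_epoch(train_curve, val_curve))
```

A widening gap is only a heuristic signal of overfitting; the report's own loss and accuracy curves remain the reference for the actual run.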

anuprulez · 2024-10-10T07:44:51Z

topics/statistics/tutorials/galaxy-ludwig/tutorial.md

+Based on the training curve (blue line), we can draw the following conclusions:
+
+- THE DATASET IS RELATIVELY EASY FOR THE MODEL TO LEARN:
+    The model starts with a relatively high training accuracy of >0.8.


Accuracy on test data, or cross-validation accuracy, should also be mentioned to decide whether the model really learns well.

I added an additional information point to highlight this result and its conclusion.

anuprulez · 2024-10-10T07:46:28Z

topics/statistics/tutorials/galaxy-ludwig/tutorial.md

+
+# Galaxy-Ludwig Tool
+
+Ludwig simplifies the complexities of machine learning by automating essential steps such as data preprocessing, model architecture selection, hyperparameter tuning, and device management. This streamlined approach is particularly beneficial for Galaxy users who are more interested in addressing their scientific questions than in navigating the intricacies of machine learning workflows.


To automatically do data preprocessing, model architecture selection, hyperparameter tuning, and device management, what parameters should a user consider changing/updating based on a different dataset?

I added some lines explaining and providing examples of modifications users can make to the configuration to better fit their dataset and produce a specific model.

anuprulez

I have added a few minor comments. Please look for similar issues throughout the tutorial. I am trying to run this tutorial on cancer.usegalaxy.org. I may have a few more comments after I have finished running it.
Thanks!

anuprulez

I executed the tutorial and it works fine.

I have two questions:

1. How do I create input datasets? For example, if I am working on sequences, how can I use the Ludwig Experiment tool to train a model? Or is the tool only for images?
2. How can I change the model from a CNN to, say, an RNN? Do I need to change the type parameter in the config.yaml file?

Thank you!

anuprulez · 2024-10-17T08:17:37Z

topics/statistics/tutorials/galaxy-ludwig/tutorial.md

+
+> <hands-on-title> Task description </hands-on-title>
+>
+> 1. {% tool [Ludwig Experiment](ludwig_experiment) %} with the following parameters:


Adding the tool's version would be nice! You can look at other ML tutorials such as the deep learning or regression ones for example.

Thank you. I just added it.

Co-authored-by: Anup Kumar <kumara@informatik.uni-freiburg.de>

paulocilasjr · 2024-10-17T19:57:13Z

Thank you for such great feedback.

Answering your questions:

How to create input datasets? For example, if I am working on sequences, how can I use Ludwig experiment tool to train a model? Or the tool is only for images?

A: Ludwig supports other data types.
Input formats are established when setting the input_features in the config.yaml file.

Taking sequence data as an example, the dataset could be a CSV with the following columns: Reference_allele, Alternative_allele, mutation_type, and Clinical_Impact. This means you have three input features and one target label (Clinical_Impact).
Your config.yaml file should look something like this (omitting other parameters to keep it short):

    input_features:
      - name: Reference_allele
        type: category
        encoder:
      - name: Alternative_allele
        type: category
        encoder:
      - name: mutation_type
        type: category
        encoder:

    output_features:
      - name: Clinical_Impact
        type: category

How can I change the model from CNN to say RNN? Do I need to change the type parameter in the config.yaml file?

A: The type corresponds to one of the supported data types. The encoder is what you’re looking for—each input feature can be configured with a specific encoder, such as an RNN.

Let me know if anything could be explained more clearly.
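
To make the encoder answer more concrete, here is a minimal sketch that builds such a configuration as a Python dictionary and prints it for inspection. The feature name `dna_sequence` and the helper `input_feature` are hypothetical. The nested `encoder: {type: rnn}` layout and the use of an RNN encoder on a `sequence` feature (rather than a `category` feature) reflect my understanding of recent Ludwig releases and should be checked against the Ludwig version bundled with the Galaxy tool.

```python
import json


def input_feature(name, feature_type, encoder_type=None):
    """Build one input feature entry; the encoder is optional (hypothetical helper)."""
    feature = {"name": name, "type": feature_type}
    if encoder_type is not None:
        # Assumed layout: the encoder is a nested mapping with a "type" key.
        feature["encoder"] = {"type": encoder_type}
    return feature


config = {
    "input_features": [
        # Hypothetical raw-sequence column encoded with an RNN.
        input_feature("dna_sequence", "sequence", encoder_type="rnn"),
        # Category feature left on its default encoder.
        input_feature("mutation_type", "category"),
    ],
    "output_features": [{"name": "Clinical_Impact", "type": "category"}],
}

# Printed as JSON only to show the structure; the tool itself expects config.yaml.
print(json.dumps(config, indent=2))
```

Switching architectures then amounts to changing the `encoder_type` argument for the relevant feature, while the feature `type` continues to describe the data itself.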

anuprulez · 2024-10-18T08:45:26Z

Ok, looks good to me. @paulocilasjr can you resolve the conflicts in this branch, probably by rebasing?

anuprulez · 2024-10-18T14:59:56Z

I think we have some failing tests. Can you see if they are related? thanks!

paulocilasjr · 2024-10-18T16:33:54Z

I addressed the issues.

Is it possible to run the tests locally? Could you tell me how to do it or where I can read about it?

paulocilasjr and others added 2 commits September 18, 2024 16:26

Ludwig MNIST tutorial

17bd901

Merge branch 'galaxyproject:main' into main

5f1966b

paulocilasjr requested a review from a team as a code owner September 18, 2024 21:02

github-actions bot added template-and-tools statistics labels Sep 18, 2024

shiltemann reviewed Sep 19, 2024


shiltemann and others added 7 commits September 19, 2024 11:00

Update topics/statistics/tutorials/galaxy-ludwig/tutorial.md

370563f

Merge branch 'galaxyproject:main' into main

991e144

Comments_resolved_1

2e2b15d

Comments_resolved_1

e34b5a8

Merge branch 'galaxyproject:main' into main

d033608

Merge branch 'galaxyproject:main' into main

180f483

Merge branch 'galaxyproject:main' into main

c92e257

paulocilasjr and others added 3 commits October 2, 2024 11:14

Merge branch 'galaxyproject:main' into main

77cdcd4

added: workflow-test, license, identifier and tool_id

d19c70c

Merge branch 'main' into main

260061e

Merge branch 'main' into main

468efc2

anuprulez reviewed Oct 10, 2024

paulocilasjr and others added 4 commits October 11, 2024 15:06

Merge branch 'galaxyproject:main' into main

b42d201

comments_resolved_V2

c9d7a7b

Merge branch 'main' into main

ae433b8

Merge branch 'main' into main

100f287

anuprulez reviewed Oct 17, 2024

paulocilasjr and others added 2 commits October 17, 2024 14:18

Update topics/statistics/tutorials/galaxy-ludwig/tutorial.md

bc38dab

Co-authored-by: Anup Kumar <kumara@informatik.uni-freiburg.de>

Update topics/statistics/tutorials/galaxy-ludwig/tutorial.md

768e4fc

Co-authored-by: Anup Kumar <kumara@informatik.uni-freiburg.de>

paulocilasjr and others added 7 commits October 17, 2024 14:20

Update topics/statistics/tutorials/galaxy-ludwig/tutorial.md

06c845e

Co-authored-by: Anup Kumar <kumara@informatik.uni-freiburg.de>

Update topics/statistics/tutorials/galaxy-ludwig/tutorial.md

5a212fa

Co-authored-by: Anup Kumar <kumara@informatik.uni-freiburg.de>

Update topics/statistics/tutorials/galaxy-ludwig/tutorial.md

b0606c5

Co-authored-by: Anup Kumar <kumara@informatik.uni-freiburg.de>

Update topics/statistics/tutorials/galaxy-ludwig/tutorial.md

b1a7f89

Co-authored-by: Anup Kumar <kumara@informatik.uni-freiburg.de>

tool version added

ebc72ab

organisations fix

76aa805

order of organisation

4f49672

conflict solved

e3e1c95

too_id fix and test_output

4cc9efe


