Containers in Research Workflows: Reproducibility and Granularity
-Last updated on 2024-08-01
+Last updated on 2024-08-16
@@ -423,24 +423,33 @@ Work in progress…
By reproducibility here we mean the ability of someone else
(or your future self) being able to reproduce what you did
computationally at a particular time (be this in research, analysis or
-something else) as closely as possible even if they do not have access
+something else) as closely as possible, even if they do not have access
to exactly the same hardware resources that you had when you did the
original work.
+What makes this especially important? With research being
+increasingly digital in nature, more and more of our research outputs
+are a result of the use of software and data processing or analysis.
+With complex software stacks or groups of dependencies often being
+required to run research software, we need approaches to ensure that we
+can make it as easy as possible to recreate an environment in which a
+given research process was undertaken. There are many reasons why this
+matters, one example being someone wanting to reproduce the results of a
+publication in order to verify them and then build on that research.
Some examples of why containers are an attractive technology to help
with reproducibility include:
-- The same computational work can be run across multiple different
-technologies seamlessly (e.g. Windows, macOS, Linux).
+- The same computational work can be run seamlessly on different
+operating systems (e.g. Windows, macOS, Linux).
- You can save the exact process that you used for your computational
work (rather than relying on potentially incomplete notes).
- You can save the exact versions of software and their dependencies
in the container image.
-- You can access legacy versions of software and underlying
+- You can provide access to legacy versions of software and underlying
dependencies which may not be generally available any more.
- Depending on their size, you can also potentially store a copy of
key data within the container image.
-- You can archive and share the container image as well as associating
-a persistent identifier with a container image to allow other
-researchers to reproduce and build on your work.
+- You can archive and share a container image as well as associating a
+persistent identifier with it, to allow other researchers to reproduce
+and build on your work.
Sharing images
As we have already seen, the Docker Hub provides a platform for
@@ -448,8 +457,8 @@
Work in progress…BASH
- When you publish work (in whatever way) use an archiving and DOI
service such as Zenodo to make sure your container image is captured as
-it was used for the work and that is obtains a persistent DOI to allow
-it to be cited and referenced properly.
+it was used for the work and that it is assigned a persistent DOI to
+allow it to be cited and referenced properly.
+- Make use of tags when naming your container images. This ensures
+that if you update the image in future, previous versions can be
+retained within a container repository and easily accessed if
+required.
+- A built and archived container image can ensure a persistently
+bundled set of software and dependencies. However, a Dockerfile
+provides a lightweight means of storing a container definition that can
+be used to re-create a container image at a later time. If you’re taking
+this approach, ensure that you specify software package and dependency
+versions within your Dockerfile rather than just specifying package
+names, which will generally install the most up-to-date version of a
+package. This may be incompatible with other elements of your software
+stack. Also note that storing only a Dockerfile presents
+reproducibility challenges because required versions of packages may not
+be available indefinitely, potentially meaning that you’re unable to
+reproduce the required environment and, hence, the research results.
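The tagging and version-pinning advice above could be sketched roughly as follows. This is a minimal illustration rather than the lesson's own example: the image name `alice/analysis`, the version tag `1.0.0`, the base image `python:3.11-slim` and the pinned `numpy` release are all placeholder assumptions, and the commented-out `docker` commands assume Docker is installed and that you are logged in to a container registry.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical image name and version tag, chosen only for illustration.
IMAGE="alice/analysis"
VERSION="1.0.0"

# Write a Dockerfile that pins both the base image tag and the package
# version, rather than installing whatever is newest at build time.
workdir="$(mktemp -d)"
cat > "$workdir/Dockerfile" <<'EOF'
FROM python:3.11-slim
RUN pip install --no-cache-dir numpy==1.26.4
EOF

echo "Image reference: $IMAGE:$VERSION"

# The steps below need Docker, so they are left commented out here.
# Build and publish under an explicit version tag (not just "latest"):
# docker build -t "$IMAGE:$VERSION" "$workdir"
# docker push "$IMAGE:$VERSION"
# Export the built image to a tar file that can be archived alongside a publication:
# docker save -o "analysis-$VERSION.tar" "$IMAGE:$VERSION"
```

Keeping the exported image file as well as the Dockerfile covers both cases discussed above: the Dockerfile documents how the environment was defined, while the archived image still works if a pinned package version later disappears from its repository.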
Container Granularity
As mentioned above, one of the decisions you may need to make when
@@ -548,7 +575,7 @@
Positives and negatives
Show me the solution
-
+
This is not an exhaustive list but some of the advantages and
disadvantages could be:
@@ -689,7 +716,7 @@ Key Points