Merge pull request #3 from vincent-laurent/main
[DEV] major update
vincent-laurent authored Jul 30, 2024
2 parents 76c409d + cf7822a commit 6c5cc3a
Showing 27 changed files with 896 additions and 466 deletions.
1 change: 1 addition & 0 deletions .github/workflows/pytest.yml
@@ -33,6 +33,7 @@ jobs:
run: |
python -m pip install --upgrade pip
pip install pytest pytest-cov
pip install git+https://github.com/modAL-python/modAL.git
pip install .
Binary file modified .public/active_vs_passive.png
Binary file modified .public/example_krg.png
67 changes: 32 additions & 35 deletions README.md
@@ -1,31 +1,26 @@

# Active Strategy for surface response estimation

[![License](https://img.shields.io/badge/license-apache_2.0-blue.svg)]( https://github.com/eurobios-mews-labs/active-bagging-learning/blob/master/LICENSE)
![cov](https://github.com/eurobios-mews-labs/active-bagging-learning/blob/coverage-badge/coverage.svg)
[![Maintenance](https://img.shields.io/badge/maintained%3F-yes-green.svg)](https://GitHub.com/eurobios-mews-labs/active-bagging-learning/graphs/commit-activity)
## Installation
# Active Strategy for surface response estimation
This library provides a plug-in approach to active learning based on bagging techniques.
Bagging, or bootstrap aggregating, is an ensemble learning method designed to improve
the stability and accuracy of machine learning algorithms. By leveraging bagging,
we aim to enhance the efficiency of active learning strategies in approximating the target function $`f`$.
* The objective is to approximate a function $`f \in \mathcal{X} \rightarrow \mathbb{R}^n`$.
* **Objective:** find an estimate of $`f`$, $`\hat{f}`$, in a family of measurable functions $`\mathcal{F}`$ such that $` f^* = \underset{\hat{f} \in \mathcal{F}}{\text{argmin}} \|f - \hat{f} \| `$
* At time $`t`$ we have at our disposal a set of $`n`$ evaluations $`(x_i, f(x_i))_{i\leqslant n}`$
* All feasible points can be sampled in the domain $`\mathcal{X}`$
* This tool enables users to query new points based on an uncertainty measure.

```shell
python -m pip install git+https://gitlab.eurobios.com/vlaurent/surrogate-models.git
```

## Literature
* **Review** [Simpson2001](https://ntrs.nasa.gov/api/citations/19990087092/downloads/19990087092.pdf)

<img height="300" src="https://i.imgur.com/w571mZ7.png" width="400"/>

* **Reliability** in [[Marelli2018]](https://arxiv.org/pdf/1709.01589) using polynomial chaos expansion. The problem is to find a region $\{x ; \, g(x) \leqslant 0\}$, where $g$ is called the limit state function. *Bootstrap approach to estimate the variance.*
* **Properties of multilayer perceptron networks** [[Fukumizu2000]](https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.51.1885&rep=rep1&type=pdf) for regression problems. Active learning: can resampling get trapped in local minima? Redundancy of hidden units in active learning.
* Gaussian processes using mutual information
* **Surface response methodology** [[Bezerra2008]](https://d1wqtxts1xzle7.cloudfront.net/45518928/Response_Surface_Methodology_RSM_as_a_20160510-11788-z5s7f4-with-cover-page-v2.pdf?Expires=1647600354&Signature=FWuGdH4xQIPYbo6gjfofYOvSiNCZknuwktVpgOuRU0wbBAjHhrN2a2cYCoLaqFmhLzuJNl~TeX2iXFh7rYFlAfgBwqQh6-lV29XxuU6AJTqj6lkP2MaIMHke4RMcJ6mJN39lXcfg6Ohf5D9TnD7v-Eze4fHCHbklEk9REPok6O0V3MIvx7A4XriV5Tffe5yu1HZ1fCuHBULS5PiRyuRBzKavclvPFQBPDWx5-J~y9a85oB6JGcey3VId7fvtfRUGXXn49WqHm3fJfqpLbYj62drFGjE6XcmBWm1CzBn0Guaf~ig8k6JfI9wOrErxofAkR8tjnd51VUAelB0XCY4v1A__&Key-Pair-Id=APKAJLOHF5GGSLRBV4ZA) based on linear models
## Context
## Installation

Plug-in approach to active learning for surface response estimation
```shell
python -m pip install git+https://github.com/eurobios-mews-labs/active-bagging-learning.git
```

* The objective is to approximate a function $`f \in \mathcal{X} \rightarrow \mathbb{R}^n`$.
* **Objective:** find an estimate of $`f`$, $`\hat{f}`$, in a family of measurable functions $`\mathcal{F}`$ such that $` f^* = \underset{\hat{f} \in \mathcal{F}}{\text{argmin}} \|f - \hat{f} \| `$
* At time $`t`$ we have at our disposal a set of $`n`$ evaluations $`(x_i, f(x_i))_{i\leqslant n}`$
* All feasible points can be sampled in the domain $`\mathcal{X}`$

## Basic usage

@@ -35,35 +30,37 @@ import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

from active_learning import ActiveSRLearner
from active_learning.components.active_criterion import ServiceVarianceEnsembleMethod
from active_learning import ActiveSurfaceLearner
from active_learning.components.active_criterion import VarianceEnsembleMethod
from active_learning.components.query_strategies import ServiceQueryVariancePDF
from active_learning.benchmark import functions

fun = functions.grammacy_lee_2009 # The function we want to learn
bounds = np.array(functions.bounds[fun]) # [x1 bounds, x2 bounds]
fun = functions.grammacy_lee_2009 # The function we want to learn
bounds = np.array(functions.bounds[fun]) # [x1 bounds, x2 bounds]
n = 50
X_train = pd.DataFrame(
{'x1': (bounds[0, 0] - bounds[0, 1]) * np.random.rand(n) + bounds[0, 1],
'x2': (bounds[1, 0] - bounds[1, 1]) * np.random.rand(n) + bounds[1, 1],
}) # Initiate distribution
}) # Initiate distribution
y_train = -fun(X_train)

active_criterion = ServiceVarianceEnsembleMethod( # Parameters to be used to estimate the surface response
estimator=ExtraTreesRegressor( # Base estimator for the surface
max_features=0.8, bootstrap=True)
active_criterion = VarianceEnsembleMethod( # Parameters to be used to estimate the surface response
estimator=ExtraTreesRegressor( # Base estimator for the surface
max_features=0.8, bootstrap=True)
)
query_strategy = ServiceQueryVariancePDF(bounds, num_eval=int(20000))

# QUERY NEW POINTS
active_learner = ActiveSRLearner(
active_criterion, # Active criterion yields a surface
query_strategy, # Given active criterion surface, execute query
X_train, # Input data X
y_train, # Input data y (target)
active_learner = ActiveSurfaceLearner(
active_criterion, # Active criterion yields a surface
query_strategy, # Given active criterion surface, execute query
bounds=bounds)

X_new = active_learner.query(3) # Request 3 points
active_learner.fit(
X_train, # Input data X
y_train) # Input data y (target))

X_new = active_learner.query(3) # Request 3 points
```
To use the approach, one must have at hand

@@ -79,7 +76,7 @@ To use the approach, one has to dispose of

* 1D example:

<img alt="benchmark" height="500" src=".public/example_krg.png" width="800"/>
<img alt="benchmark" height="800" src=".public/example_krg.png"/>

## Benchmark

2 changes: 1 addition & 1 deletion active_learning/__init__.py
@@ -9,7 +9,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

from active_learning.base import ActiveSRLearner
from active_learning.base import ActiveSurfaceLearner
from active_learning.components import query_strategies
from active_learning.components import active_criterion
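
Since this commit renames `ActiveSRLearner` to `ActiveSurfaceLearner` (and drops the `Service` prefix on `VarianceEnsembleMethod`), downstream imports written against the old names will break. A minimal, purely illustrative alias sketch for the import side only; note that the constructor and `fit`/`query` workflow also change in `base.py` below, so an alias alone does not restore the old call signature:

```python
# Hypothetical compatibility aliases, not part of this commit.
# They only cover the renamed imports; the ActiveSurfaceLearner
# constructor and fit/query workflow changed as well (see base.py).
try:
    from active_learning import ActiveSurfaceLearner as ActiveSRLearner
    from active_learning.components.active_criterion import (
        VarianceEnsembleMethod as ServiceVarianceEnsembleMethod,
    )
except ImportError:  # pre-commit versions still expose the old names
    from active_learning import ActiveSRLearner
    from active_learning.components.active_criterion import (
        ServiceVarianceEnsembleMethod,
    )
```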

66 changes: 27 additions & 39 deletions active_learning/base.py
@@ -9,62 +9,50 @@
# See the License for the specific language governing permissions and
# limitations under the License.

import numpy as np
import pandas as pd
from copy import deepcopy

from active_learning.components.active_criterion import IActiveCriterion
from active_learning.components.query_strategies import IQueryStrategy


class ActiveSRLearner:
class ActiveSurfaceLearner:

def __init__(
self,
active_criterion: IActiveCriterion,
query_strategy: IQueryStrategy,
X_train: pd.DataFrame,
y_train: pd.DataFrame,
bounds=None,
):
self.active_criterion = active_criterion
self.query_strategy = query_strategy
self.x_input = X_train.copy()
self.y_input = y_train.copy()
self.bounds = bounds
self.result = {}
self.iter = 0
self.budget = len(X_train)
self.x_input.index = 0 * np.ones(len(self.x_input))
self.x_new = pd.DataFrame()
self.__active_criterion = active_criterion
self.__query_strategy = query_strategy
self.__bounds = bounds

def learn(self):
self.active_criterion.fit(
self.x_input,
self.y_input)
def fit(self, X: pd.DataFrame, y):
self.active_criterion.fit(X, y)
self.__columns = X.columns

def query(self, *args):
self.learn()
self.query_strategy.set_bounds(self.bounds)
def query(self, *args) -> pd.DataFrame:
self.query_strategy.set_bounds(self.__bounds)
self.query_strategy.set_active_function(self.active_criterion.__call__)
self.x_new = pd.DataFrame(self.query_strategy.query(*args), columns=self.x_input.columns)
self.save()
x_new = pd.DataFrame(self.query_strategy.query(*args), columns=self.__columns)
return x_new

return self.x_new
@property
def active_criterion(self) -> IActiveCriterion:
return self.__active_criterion

def add_labels(self, x: pd.DataFrame, y: pd.DataFrame):
self.iter += 1
x.index = self.iter * np.ones(len(x))
y.index = self.iter * np.ones(len(x))
self.x_input = pd.concat((x, self.x_input), axis=0)
self.y_input = pd.concat((y, self.y_input), axis=0)
self.budget = len(self.x_input)
@property
def query_strategy(self) -> IQueryStrategy:
return self.__query_strategy

def save(self):
@property
def surface(self) -> callable:
return self.__active_criterion.function

self.result[self.iter] = dict(
surface=deepcopy(self.active_criterion.function),
active_criterion=deepcopy(self.active_criterion),
budget=int(self.budget),
data=self.x_input
)
@property
def predict(self) -> callable:
return self.__active_criterion.function

@property
def bounds(self) -> iter:
return self.__bounds
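
For orientation, a minimal sketch of how the refactored `ActiveSurfaceLearner` is driven after this change: the learner no longer stores the training data (the old `add_labels`/`save` bookkeeping is gone), so the caller refits and accumulates labels itself. The setup reuses the README's `grammacy_lee_2009` example; the loop, the round and query counts, and calling `learner.surface` on a DataFrame are assumptions for illustration, not part of the commit.

```python
# Illustrative sketch only (not part of this commit).
import numpy as np
import pandas as pd
from sklearn.ensemble import ExtraTreesRegressor

from active_learning import ActiveSurfaceLearner
from active_learning.components.active_criterion import VarianceEnsembleMethod
from active_learning.components.query_strategies import ServiceQueryVariancePDF
from active_learning.benchmark import functions

fun = functions.grammacy_lee_2009          # target function from the README example
bounds = np.array(functions.bounds[fun])   # [x1 bounds, x2 bounds]

n = 50
X = pd.DataFrame(
    {"x1": (bounds[0, 0] - bounds[0, 1]) * np.random.rand(n) + bounds[0, 1],
     "x2": (bounds[1, 0] - bounds[1, 1]) * np.random.rand(n) + bounds[1, 1]})
y = -fun(X)                                # assumed to return a 1-D array/Series, as in the README

learner = ActiveSurfaceLearner(
    VarianceEnsembleMethod(
        estimator=ExtraTreesRegressor(max_features=0.8, bootstrap=True)),
    ServiceQueryVariancePDF(bounds, num_eval=20000),
    bounds=bounds)

for _ in range(5):                         # assumed number of active-learning rounds
    learner.fit(X, y)                      # refit the active criterion on all labelled data
    X_new = learner.query(3)               # ask the query strategy for 3 new points
    y_new = -fun(X_new)                    # label the new points with the true function
    X = pd.concat((X, X_new), ignore_index=True)          # caller accumulates the data now
    y = pd.concat((pd.Series(y), pd.Series(y_new)), ignore_index=True)

y_hat = learner.surface(X)                 # `surface`/`predict` expose the fitted estimate
                                           # (assumed callable on a DataFrame)
```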
16 changes: 8 additions & 8 deletions active_learning/benchmark/analyse.py
@@ -29,7 +29,7 @@
from active_learning.components import query_strategies
from active_learning.components.active_criterion import VarianceBis
from active_learning.components.sampling import latin_square
from active_learning.benchmark.test import TestingClass
from active_learning.benchmark.base import TestingClass

name = "grammacy_lee_2009_rand"
fun = functions.__dict__[name]
@@ -63,37 +63,37 @@ def get_method_for_benchmark(name):
crit = query_strategies.ServiceReject(num_eval=100)

elif name == "branin":
est = active_criterion.ServiceVarianceEnsembleMethod(
est = active_criterion.VarianceEnsembleMethod(
estimator=ensemble.ExtraTreesRegressor(bootstrap=True))
crit = query_strategies.ServiceQueryVariancePDF(num_eval=1000)

elif name == "branin_rand":
est = active_criterion.ServiceVarianceEnsembleMethod(
est = active_criterion.VarianceEnsembleMethod(
estimator=ensemble.ExtraTreesRegressor(bootstrap=True))
crit = query_strategies.ServiceQueryVariancePDF(num_eval=1000)
elif name == "himmelblau":
est = active_criterion.ServiceVarianceEnsembleMethod(
est = active_criterion.VarianceEnsembleMethod(
estimator=ensemble.ExtraTreesRegressor(bootstrap=True))
crit = query_strategies.ServiceQueryVariancePDF(num_eval=1000)

elif name == "himmelblau_rand":
est = active_criterion.ServiceVarianceEnsembleMethod(
est = active_criterion.VarianceEnsembleMethod(
estimator=ensemble.ExtraTreesRegressor(bootstrap=True))
crit = query_strategies.ServiceQueryVariancePDF(num_eval=1000)

elif name == "synthetic_2d_1":
est = active_criterion.ServiceVarianceEnsembleMethod(
est = active_criterion.VarianceEnsembleMethod(
estimator=ensemble.ExtraTreesRegressor(bootstrap=True))
crit = query_strategies.ServiceQueryVariancePDF(num_eval=1000)

elif name == "synthetic_2d_2":
est = active_criterion.ServiceVarianceEnsembleMethod(
est = active_criterion.VarianceEnsembleMethod(
estimator=ensemble.ExtraTreesRegressor(bootstrap=True,
max_samples=0.9))
crit = query_strategies.ServiceQueryVariancePDF(num_eval=1000)

else:
est = active_criterion.ServiceVarianceEnsembleMethod(
est = active_criterion.VarianceEnsembleMethod(
estimator=ensemble.ServiceExtraTreesRegressor(bootstrap=True,
max_samples=0.9,
max_features=1))