PETML implements secure gradient boosting decision tree model training and inference using secure multi-party computation. It currently supports training for regression and binary classification problems between two parties.
Gradient Boosting Decision Tree (GBDT) is a widely used statistical model for classification and regression problems. We sketch the key steps of XGBoost below and refer readers to the paper [1] for details.
According to [1], the steps of building one tree are:

- Step 1: Based on the loss function, the first derivative $g_i$ and the second derivative $h_i$ of each sample are computed from the ground-truth label and the current predicted value (see the plaintext sketch after this list). Taking logistic regression as an example, $g_i = \mathrm{sigmoid}(y_{pred_i}) - y_{truth_i}$ and $h_i = \mathrm{sigmoid}(y_{pred_i}) \cdot (1 - \mathrm{sigmoid}(y_{pred_i}))$.
- Step 2: The split point that yields the largest gain is selected for the current node. This is achieved by enumerating all samples under each feature. (Please refer to Eq. (7) in [1].)
- Step 3: Upon identifying the split point, the data is split into two parts, which are allocated to the left and right subtrees respectively. Training continues to the next level, repeating Step 2 until the conditions for terminating tree growth are fulfilled.
- Step 4: The weight of each leaf node is determined by the samples that fall into it and is computed with the leaf weight formula, Eq. (5) in [1].
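The following is a minimal plaintext sketch of Step 1 for the logistic loss, with no MPC involved; the function names (`sigmoid`, `logistic_grad_hess`) and the use of NumPy are illustrative and not part of PETML's API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def logistic_grad_hess(y_pred_raw, y_truth):
    """Per-sample first/second derivatives of the logistic loss w.r.t. the raw score."""
    p = sigmoid(y_pred_raw)
    g = p - y_truth       # g_i = sigmoid(y_pred_i) - y_truth_i
    h = p * (1.0 - p)     # h_i = sigmoid(y_pred_i) * (1 - sigmoid(y_pred_i))
    return g, h

y_truth = np.array([0.0, 1.0, 1.0])
y_pred_raw = np.zeros(3)  # a raw score of 0 corresponds to a predicted probability of 0.5
g, h = logistic_grad_hess(y_pred_raw, y_truth)
```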
To ensure that no sample information is leaked, we use secure two-party computation protocols to implement the addition, multiplication, comparison, and other operations required by the training and inference algorithms of GBDT. The split features, split points, and leaf weights of the tree models are saved as secret shares so that no information is leaked. We implement Secure-XGBoost using PETAce, which provides implementations of the underlying MPC protocols.
In short, our Secure-XGBoost algorithm replaces the plaintext operations of XGBoost sketched above with their corresponding secure two-party versions, except for Step 3. In Step 3, splitting the data could lead to uneven data sizes in the left and right subtrees, depending on the input data distribution, and the sizes themselves would leak information. To remove this leakage, we use all samples to calculate the gain at every split node in each layer of the tree.
To achieve this, we use an additional secret-shared vector with the same length as the number of samples, which can be regarded as an index array over the training samples, marking which samples fall into the current node.
The growth of the tree stops when it reaches the maximum depth set by the user. Note that different datasets may result in different tree structures; to prevent potential attacks based on the structure of the tree, we construct a full binary tree. If the optimal split cannot be found at the current node, the node reuses the split information of its parent to ensure accurate results.
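As a mock-up of this oblivious split, the plaintext sketch below keeps an index vector of full length for both children; in PETML the indicator, the comparison result, and the multiplication would all be secret-shared and evaluated with 2PC, and the variable names here are illustrative.

```python
import numpy as np

feature = np.array([2.0, 5.0, 1.0, 7.0])  # one feature column over all n samples
indicator = np.ones(4)                    # 1.0 if the sample falls into this node
threshold = 4.0

goes_left = (feature <= threshold).astype(float)  # a secure comparison in the real protocol
left_indicator = indicator * goes_left            # a secure multiplication
right_indicator = indicator * (1.0 - goes_left)

# Both children keep length-n vectors, so the sizes of the left and right
# subtrees reveal nothing about how the data was split.
```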
The key steps of building one secure tree are:
- Step 1: The first derivative $g_i$ and the second derivative $h_i$ of each sample are computed using the secret-sharing-based MPC protocol. Using logistic regression as an example, secure two-party operations such as sigmoid, multiplication, and subtraction are used.
- Step 2: The formula for computing the gain incorporates secure two-party addition, multiplication, and division at each iteration (plaintext versions of the gain and leaf weight formulas are sketched after this list).
- Step 3: As discussed above.
- Step 4: The formula for computing the leaf weight incorporates secure two-party addition and division.
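For concreteness, here are plaintext versions of the gain (Eq. (7) in [1]) and leaf weight (Eq. (5) in [1]) used in Steps 2 and 4. Treating `min_split_loss` as the γ penalty term is our reading of the parameter table below; in Secure-XGBoost every addition, multiplication, and division here is replaced by its secure two-party counterpart.

```python
def split_gain(g_left, h_left, g_right, h_right, reg_lambda, min_split_loss):
    """Gain of splitting a node into (left, right); Eq. (7) in [1]."""
    def score(g, h):
        return g * g / (h + reg_lambda)
    return 0.5 * (score(g_left, h_left) + score(g_right, h_right)
                  - score(g_left + g_right, h_left + h_right)) - min_split_loss

def leaf_weight(g_sum, h_sum, reg_lambda):
    """Optimal leaf weight from the g/h sums of its samples; Eq. (5) in [1]."""
    return -g_sum / (h_sum + reg_lambda)
```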
Consider a dataset with only one feature: if an inference sample were compared only against the decision nodes on its own root-to-leaf path, the pattern of comparisons would reveal which path the sample takes and thus leak information about both the sample and the model. To mitigate this leakage, every inference sample performs secret comparisons with all decision nodes in each tree. Then, by summing the comparison results along the edges from the root to each leaf node with a secure two-party addition protocol, we can generate a secret-shared vector that indicates which leaf each sample falls into.
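The sketch below illustrates this oblivious inference in plaintext for a full binary tree of depth 2 (three decision nodes, four leaves). In PETML the comparisons, the path sums, the equality test, and the final inner product all run under secure two-party protocols; the names here are illustrative.

```python
import numpy as np

def infer_one(x, feat_idx, thresholds, leaf_weights):
    """Predict one sample with a depth-2 full binary tree (nodes 0=root, 1=left, 2=right)."""
    # Compare the sample against every decision node (secure comparisons in 2PC).
    b = np.array([1.0 if x[feat_idx[j]] <= thresholds[j] else 0.0 for j in range(3)])
    # Sum the edge indicators along each root-to-leaf path (secure additions).
    path_sums = np.array([
        b[0] + b[1],              # leaf 0: left, left
        b[0] + (1 - b[1]),        # leaf 1: left, right
        (1 - b[0]) + b[2],        # leaf 2: right, left
        (1 - b[0]) + (1 - b[2]),  # leaf 3: right, right
    ])
    # A leaf is reached iff its path sum equals the path length (2 here);
    # this equality test is itself a secure comparison in the real protocol.
    leaf_select = (path_sums == 2).astype(float)
    # Oblivious selection of the leaf weight (a secure inner product).
    return float(leaf_select @ leaf_weights)

pred = infer_one(np.array([0.3, 1.7]), feat_idx=[0, 1, 1],
                 thresholds=[0.5, 1.0, 2.0],
                 leaf_weights=np.array([-0.4, 0.1, 0.2, 0.5]))
```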
petml.operators.boosting.XGBoostClassifierFit
Name | Type | Description | Default |
---|---|---|---|
min_split_loss | float | The minimum gain required to split an internal node | 1e-5 |
learning_rate | float | Learning rate; shrinks the contribution of each tree by `learning_rate` | 0.1 |
n_estimators | int | The number of boosting stages to perform | 100 |
base_score | float | The initial prediction score of all instances | 0.5 |
max_depth | int | Maximum depth of the tree | 3 |
reg_alpha | float | L1 regularization term on weights | 0.0 |
reg_lambda | float | L2 regularization term on weights | 1.0 |
min_child_samples | int | The minimum number of samples required at a leaf node | 1 |
min_child_weight | float | The minimum sum of instance weights (Hessian) needed in a child | 0.5 |
test_size | float | Fraction of the input data held out as the evaluation dataset | 0.3 |
eval_epochs | int | Compute the evaluation metric every `eval_epochs` epochs | 10 |
eval_threshold | float | Instances whose eval prediction value is larger than this threshold are regarded as positive; the others as negative | 0.5 |
objective | str | The loss function to be optimized | logitraw |
Name | File Type | Description |
---|---|---|
train_data | csv | The training dataset |

Name | File Type | Description |
---|---|---|
model_path | json | The trained model |
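An example training configuration is shown below; the network addresses and file names are illustrative.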
```python
config = {
"common": {
"objective": "logitraw",
"n_estimators": 10,
"max_depth": 3,
"reg_lambda": 1,
"reg_alpha": 0.0,
"min_child_weight": 0.1,
"base_score": 0.5,
"learning_rate": 0.1,
"network_mode": "petnet",
"network_scheme": "socket",
"label_name": "label",
"test_size": 0.3,
"parties": {
"party_a": {
"address": ["127.0.0.1:50011"]
},
"party_b": {
"address": ["127.0.0.1:50012"]
}
}
},
"party_a": {
"inputs": {
"train_data": "data0.csv",
},
"outputs": {
"model_path": "model_name0.json"
}
},
"party_b": {
"inputs": {
"train_data": "data1.csv",
},
"outputs": {
"model_path": "model_name1.json"
}
}
}
# If this code runs in party a, `party` should be "party_a", and vice versa.
operator = petml.operators.boosting.XGBoostClassifierFit(party)
operator.run(config)
```
petml.operators.boosting.XGBoostClassifierPredict
Name | File Type | Description |
---|---|---|
predict_data | csv | The inference dataset |
model_path | json | The trained model |

Name | File Type | Description |
---|---|---|
inference_res_path | csv | The result of inference |
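An example inference configuration is shown below; as in the training example, the addresses and file names are illustrative.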
```python
config = {
"common": {
"network_mode": "petnet",
"network_scheme": "socket",
"parties": {
"party_a": {
"address": ["127.0.0.1:50011"]
},
"party_b": {
"address": ["127.0.0.1:50012"]
}
}
},
"party_a": {
"inputs": {
"predict_data": "data0.csv",
"model_path": "model_name0.json"
},
"outputs": {
"inference_res_path": "predict0.csv"
}
},
"party_b": {
"inputs": {
"predict_data": "data1.csv",
"model_path": "model_name1.json"
},
"outputs": {
"inference_res_path": "predict1.csv"
}
}
}
# If this code runs in party a, `party` should be "party_a", and vice versa.
operator = petml.operators.boosting.XGBoostClassifierPredict(party)
operator.run(config)
```
petml.operators.boosting.XGBoostRegressorFit
Name | Type | Description | Default |
---|---|---|---|
min_split_loss | float | The minimum gain required to split an internal node | 1e-5 |
learning_rate | float | Learning rate; shrinks the contribution of each tree by `learning_rate` | 0.1 |
n_estimators | int | The number of boosting stages to perform | 100 |
base_score | float | The initial prediction score of all instances | 0.5 |
max_depth | int | Maximum depth of the tree | 3 |
reg_alpha | float | L1 regularization term on weights | 0.0 |
reg_lambda | float | L2 regularization term on weights | 1.0 |
min_child_samples | int | The minimum number of samples required at a leaf node | 1 |
min_child_weight | float | The minimum sum of instance weights (Hessian) needed in a child | 1 |
test_size | float | Fraction of the input data held out as the evaluation dataset | 0.3 |
eval_epochs | int | Compute the evaluation metric every `eval_epochs` epochs | 10 |
eval_threshold | float | Instances whose eval prediction value is larger than this threshold are regarded as positive; the others as negative | 0.5 |
objective | str | The loss function to be optimized | squarederror |
Name | File Type | Description |
---|---|---|
train_data | csv | The training dataset |

Name | File Type | Description |
---|---|---|
model_path | json | The trained model |
Refer to the examples in the classifier training config above.
petml.operators.boosting.XGBoostRegressorPredict
Name | File Type | Description |
---|---|---|
predict_data | csv | The inference dataset |
model_path | json | The trained model |

Name | File Type | Description |
---|---|---|
inference_res_path | csv | The result of inference |
Refer to the examples in the classifier inference config above.