Auto Feature Engineering Workflow

Auto feature engineering aims to simplify the feature engineering process and improve performance through parallel data processing frameworks, an automated data processing pipeline, and built-in domain-specific feature engineering primitives. This repository provides an end-to-end workflow that automatically analyzes the data based on data type, profiles feature distributions, generates customizable feature engineering pipelines for data preparation, and executes those pipelines in parallel with different backend engines on Intel platforms.

[Figure: auto feature engineering explained]

Steps explained:

  1. Feature profile: Analyze the raw tabular dataset to infer the original features from their data types and generate a FeatureList.
  2. Feature engineering: Use the inferred FeatureList to generate a Data Pipeline in JSON/YAML file format.
  3. Feature transformation: Convert the Data Pipeline to executable operations and transform the original features into candidate features with the selected engine. Pandas and Spark are currently supported.
  4. Feature Importance Estimator: Perform feature importance analysis on the candidate features to remove unimportant ones, then generate the transformed dataset containing the final features that will be used for training.
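
The four steps above can be sketched in plain Python to show how the pieces fit together. This is a minimal, dependency-free illustration and not the pyrecdp implementation: the function names, the JSON layout, the two operations (`standardize`, `count_encode`), and the use of Pearson correlation as a stand-in for feature importance are all assumptions made for the example.

```python
import json
from math import sqrt
from statistics import mean, pstdev


def profile_features(rows):
    """Step 1: infer a type per column; the result plays the role of a FeatureList."""
    feature_list = {}
    for name in rows[0]:
        values = [row[name] for row in rows]
        is_numeric = all(isinstance(v, (int, float)) for v in values)
        feature_list[name] = "numeric" if is_numeric else "categorical"
    return feature_list


def generate_pipeline(feature_list, label):
    """Step 2: describe one operation per non-label feature as a JSON document."""
    steps = []
    for name, ftype in feature_list.items():
        if name == label:
            continue
        op = "standardize" if ftype == "numeric" else "count_encode"
        steps.append({"op": op, "input": name, "output": f"{name}__{op}"})
    return json.dumps(steps, indent=2)


def execute_pipeline(rows, pipeline_json):
    """Step 3: run each described operation, adding candidate features to row copies."""
    out = [dict(row) for row in rows]
    for step in json.loads(pipeline_json):
        column = [row[step["input"]] for row in rows]
        if step["op"] == "standardize":
            mu = mean(column)
            sigma = pstdev(column) or 1.0  # avoid dividing by zero on constant columns
            values = [(v - mu) / sigma for v in column]
        else:  # count_encode: replace each category with its frequency
            counts = {v: column.count(v) for v in set(column)}
            values = [counts[v] for v in column]
        for row, value in zip(out, values):
            row[step["output"]] = value
    return out


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    norm = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return cov / norm if norm else 0.0


def select_features(rows, candidates, label, threshold=0.1):
    """Step 4: keep candidates whose absolute correlation with the label meets the threshold."""
    target = [row[label] for row in rows]
    return [c for c in candidates
            if abs(pearson([row[c] for row in rows], target)) >= threshold]


# Hypothetical toy data: "noise" is constant and carries no signal.
raw = [
    {"distance": 1.0, "vendor": "a", "noise": 5, "fare": 3.0},
    {"distance": 2.0, "vendor": "b", "noise": 5, "fare": 5.0},
    {"distance": 3.0, "vendor": "a", "noise": 5, "fare": 7.0},
    {"distance": 4.0, "vendor": "a", "noise": 5, "fare": 9.0},
]
feature_list = profile_features(raw)
pipeline_json = generate_pipeline(feature_list, label="fare")
transformed = execute_pipeline(raw, pipeline_json)
candidates = [step["output"] for step in json.loads(pipeline_json)]
kept = select_features(transformed, candidates, label="fare")
print(kept)  # prints ['distance__standardize', 'vendor__count_encode']
```

Keeping the pipeline as a serialized description (step 2) separate from its execution (step 3) is what lets the same pipeline be run by different engines; here a single pure-Python executor stands in for that role.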

Getting Started

DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-8-jre graphviz
pip install pyrecdp[autofe] --pre

Only 3 lines of code are needed to generate new features for your tabular data. Usually, 5x new features can be found, with up to a 1.2x accuracy boost.

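from pyrecdp.autofe import AutoFE

# train_data and target_label are placeholders: your training dataset and the name of its label column.
pipeline = AutoFE(dataset=train_data, label=target_label, time_series='Day')
transformed_train_df = pipeline.fit_transform()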
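
To give a feel for the kind of candidate features a date column such as 'Day' can yield, here is a small standalone sketch. It assumes the column holds ISO-formatted date strings; the helper `expand_date_features` and the generated column names are hypothetical and are not part of the pyrecdp API, which may derive different features.

```python
from datetime import date


def expand_date_features(rows, column):
    """Return copies of rows with calendar features derived from an ISO date column."""
    expanded = []
    for row in rows:
        d = date.fromisoformat(row[column])
        new_row = dict(row)
        new_row[f"{column}__year"] = d.year
        new_row[f"{column}__month"] = d.month
        new_row[f"{column}__day"] = d.day
        new_row[f"{column}__weekday"] = d.weekday()  # Monday == 0, Sunday == 6
        new_row[f"{column}__is_weekend"] = d.weekday() >= 5
        expanded.append(new_row)
    return expanded


# Hypothetical sample: 2023-01-07 is a Saturday, 2023-01-09 is a Monday.
sample = [
    {"Day": "2023-01-07", "sales": 10},
    {"Day": "2023-01-09", "sales": 12},
]
expanded = expand_date_features(sample, "Day")
print(expanded[0]["Day__weekday"], expanded[0]["Day__is_weekend"])  # prints 5 True
```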

Built-In Use Cases

| Workflow Name | Description |
| --- | --- |
| NYC taxi fare | Fare prediction based on NYC taxi dataset |
| Amazon Product Review | Product recommendation based on reviews from Amazon |
| IBM Card Transaction Fraud Detect | Recognize fraudulent credit card transactions |
| Recsys 2023 | Real-world task of tweet engagement prediction |
| Outbrain | Click prediction for recommendation system |
| Covid19 TabUtils | Integration example with Tabular Utils |
| PredictiveAssetsMaintenance | Integration example with predictive assets maintenance use case |

High Performance on Terabyte Tabular data processing

[Figure: Performance]