This project uses a synthetic dataset based on UCI's Madelon dataset (https://archive.ics.uci.edu/ml/datasets/Madelon). The dataset contains 5,000 features and 220,000 rows (1.1 billion data points), but only 50 of the features carry useful information.
The goal is to apply machine learning techniques to identify the important features and then use them to make predictions on the full dataset.
Due to the size of the dataset, I built models on randomly selected subsets of the data (see the sampling sketch after this list):
- Subset 1: 0.1% of the data
- Subset 2: 0.5% of the data
- Subset 3: 1% of the data
- Subset 4: 5% of the data
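
A minimal sampling sketch for building these subsets. The file name `madelon_large.csv` and the exact sampling mechanics are my assumptions; the fractions match the list above.

```python
import numpy as np
import pandas as pd

N_ROWS = 220_000
FRACTIONS = [0.001, 0.005, 0.01, 0.05]  # subsets 1-4

def load_subset(path, fraction, seed=42):
    """Randomly keep `fraction` of rows while reading, so the full
    5000-column file never has to fit in memory at once."""
    rng = np.random.default_rng(seed)
    keep = rng.random(N_ROWS) < fraction
    # skiprows callable: row 0 is the header, data rows start at index 1
    return pd.read_csv(path, skiprows=lambda i: i > 0 and not keep[i - 1])

# hypothetical path; swap in the actual data file
subsets = {f: load_subset("madelon_large.csv", f) for f in FRACTIONS}
```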
- Use ANOVA for feature selection to select the top 1,000 features (first sketch below)
- Build naive K-Nearest Neighbors and Logistic Regression benchmark models (second sketch below)
- Repeat on all four data subset sizes
- Create feature selection (SelectKBest / PCA) pipelines (third sketch below)
- Use feature selection / dimensionality reduction to shrink the dataset to a manageable size, then apply conventional methods
- Present the best model and apply it to a larger (or the full) data subset
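
A sketch of the ANOVA step, continuing from the sampling sketch above. The label column name `target` and the choice of the 1% subset are my assumptions.

```python
from sklearn.feature_selection import SelectKBest, f_classif

X = subsets[0.01].drop(columns="target")
y = subsets[0.01]["target"]

# f_classif computes the ANOVA F-statistic between each feature and the
# label; SelectKBest keeps the 1000 highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=1000)
X_top = selector.fit_transform(X, y)
top_columns = X.columns[selector.get_support()]
```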
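
A minimal benchmark sketch using near-default KNN and Logistic Regression on the ANOVA-selected features; the train/test split and accuracy scoring are my assumptions, not the project's exact setup.

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X_train, X_test, y_train, y_test = train_test_split(
    X_top, y, test_size=0.25, random_state=42)

# Naive baselines: no tuning, mostly default hyperparameters
for name, model in [("KNN", KNeighborsClassifier()),
                    ("LogReg", LogisticRegression(max_iter=1000))]:
    model.fit(X_train, y_train)
    print(name, "test accuracy:", model.score(X_test, y_test))
```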
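
One way the SelectKBest / PCA pipelines could be wired, continuing from the sketches above (raw `X`, `y` from the subset). The parameter grids and the Logistic Regression final estimator are illustrative assumptions.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

pipelines = {
    "skb": Pipeline([
        ("select", SelectKBest(score_func=f_classif)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]),
    "pca": Pipeline([
        ("scale", StandardScaler()),   # PCA is scale-sensitive
        ("reduce", PCA()),
        ("clf", LogisticRegression(max_iter=1000)),
    ]),
}
grids = {
    "skb": {"select__k": [50, 100, 500, 1000]},
    "pca": {"reduce__n_components": [50, 100, 500]},
}

# 5-fold grid search over how aggressively each pipeline reduces the
# feature space; doing the reduction inside the pipeline keeps it
# leakage-free under cross-validation.
for name, pipe in pipelines.items():
    search = GridSearchCV(pipe, grids[name], cv=5, n_jobs=-1)
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 3))
```

Keeping the reduction step inside the pipeline also means the winning configuration can be refit as-is on a larger (or the full) data subset in the final step.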