This CLI tool provides a streamlined way to preprocess structured data files (CSV only) by offering various data cleaning and transformation functionalities. Users can execute individual preprocessing steps or chain multiple steps in a single command.
- **Load Data**: load a dataset from a specified CSV file.
- **Handle Missing Values** (`mv`)
  - Remove Missing Values (`rm`): removes rows containing any missing values.
  - Fill with Default (`fl_<value>`): fills missing values with a specified default value (e.g., `fl_0` fills missing values with 0).
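The two missing-value strategies can be sketched in plain Python. This is a hypothetical illustration of the documented behavior, not the tool's actual implementation; rows are dicts as `csv.DictReader` would yield them, and an empty string is treated as missing.

```python
# Hypothetical sketch of the `mv` sub-services; the tool's internals may differ.
# A "row" is a dict as produced by csv.DictReader; "" counts as missing.

def remove_missing(rows):
    """`mv,rm`: drop any row that has at least one missing value."""
    return [row for row in rows if all(v != "" for v in row.values())]

def fill_missing(rows, default):
    """`mv,fl_<value>`: replace every missing value with `default`."""
    return [{k: (default if v == "" else v) for k, v in row.items()}
            for row in rows]

rows = [{"Age": "33", "Glucose": ""},
        {"Age": "41", "Glucose": "90"}]
print(remove_missing(rows))     # only the complete second row survives
print(fill_missing(rows, "0"))  # the gap in the first row is filled with "0"
```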
- **Remove Duplicates** (`dp`): removes duplicate rows from the dataset.
- **Normalization & Standardization** (`fs`)
  - Normalize (`nm`)
  - Standardize (`sd`)
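Assuming `nm` means min-max scaling to [0, 1] and `sd` means z-score scaling (the README does not spell out the formulas), a stdlib-only sketch of the two transforms might look like this:

```python
from statistics import mean, pstdev

def normalize(values):
    """One reading of `fs,nm`: min-max scaling into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """One reading of `fs,sd`: z-score scaling to zero mean,
    unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

print(normalize([10, 20, 30]))    # [0.0, 0.5, 1.0]
print(standardize([10, 20, 30]))  # mean 0, symmetric around it
```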
- **Export Processed File**: saves the processed dataset to a specified CSV file.
- **CLI Supports Chaining**: multiple processing steps can be applied in a single command.
- **Handle Outliers by Z-score** (`ol`)
  - Remove outliers (`rm`)
  - Replace outliers (`rp`)

  You can choose which features to check for outliers (e.g., `ol,rm_Age_Glucose`). If you do not specify any features, the tool checks for outliers in every feature in the CSV.
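A z-score outlier check can be sketched as follows. This is an assumption-laden illustration: the README does not state the z-score cutoff (3 is a common convention, used here), nor what `rp` replaces outliers with (the column mean is one plausible choice, used here).

```python
from statistics import mean, pstdev

def zscore_outliers(values, threshold=3.0):
    """Indices whose |z-score| exceeds `threshold` (assumed cutoff of 3)."""
    mu, sigma = mean(values), pstdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]

def remove_outliers(values, threshold=3.0):
    """One reading of `ol,rm`: drop outlying values."""
    bad = set(zscore_outliers(values, threshold))
    return [v for i, v in enumerate(values) if i not in bad]

def replace_outliers(values, threshold=3.0):
    """One reading of `ol,rp`: replace outliers with the column mean."""
    bad = set(zscore_outliers(values, threshold))
    mu = mean(values)
    return [mu if i in bad else v for i, v in enumerate(values)]

data = [1, 2] * 10 + [100]   # 100 is far outside the rest
print(remove_outliers(data)) # the 100 is dropped
```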
- **Encode Categorical Data** (`ec`)
  - One-Hot Encoding (`oh`)
  - Ordinal Encoding (`od`)

  Note that you must provide feature names as a parameter to start the encoding process, or the tool will raise an error. For example:

  ```
  /usr/local/bin/python3 data_tools.py --pipe="mv,fl_0-fs,nm-ec,oh" ../../input.csv ../../output_directory
  ```

  will ask you to add feature names as parameters for `oh`, while

  ```
  /usr/local/bin/python3 data_tools.py --pipe="mv,fl_0-fs,nm-ec,oh_Age_Glucose" ../../input.csv ../../output_directory
  ```

  will work.
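The two encodings can be illustrated with a small stdlib-only sketch. This is hypothetical, not the tool's code; in particular, the category-to-integer ordering for `od` is assumed to be alphabetical, which the README does not specify.

```python
def one_hot_encode(rows, feature):
    """`ec,oh_<feature>`: replace `feature` with one 0/1 column per category."""
    categories = sorted({row[feature] for row in rows})
    out = []
    for row in rows:
        new = {k: v for k, v in row.items() if k != feature}
        for cat in categories:
            new[f"{feature}_{cat}"] = 1 if row[feature] == cat else 0
        out.append(new)
    return out

def ordinal_encode(rows, feature):
    """`ec,od_<feature>`: map each category to an integer
    (alphabetical order assumed here)."""
    mapping = {c: i for i, c in enumerate(sorted({r[feature] for r in rows}))}
    return [{**row, feature: mapping[row[feature]]} for row in rows]

rows = [{"City": "Oslo"}, {"City": "Bergen"}, {"City": "Oslo"}]
print(one_hot_encode(rows, "City"))  # City_Bergen / City_Oslo columns
print(ordinal_encode(rows, "City"))  # Bergen -> 0, Oslo -> 1
```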
None; all planned features are finished. (More will be added on request.)
```
/usr/local/bin/python3 data_tools.py --pipe="<steps>" <inputFilePath> <outputPath>
```

- `,` separates a main service from its sub-service.
- `-` separates different main service lines (e.g., handling missing values, feature scaling).
- `_` separates a service from its parameter (e.g., `fl_0` means fill missing values with 0).
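The separator rules above can be sketched as a small parser. This is a hypothetical illustration of how a `--pipe` string decomposes into steps, not the tool's actual parsing code; a step with no sub-service (like `dp`) yields an empty sub-service name here.

```python
def parse_pipe(pipe):
    """Split a --pipe string into (service, sub-service, params) triples,
    following the documented separators: `-`, then `,`, then `_`."""
    steps = []
    for segment in pipe.split("-"):               # `-` splits main services
        service, _, sub = segment.partition(",")  # `,` splits off sub-service
        name, *params = sub.split("_")            # `_` splits off parameters
        steps.append((service, name, params))
    return steps

print(parse_pipe("mv,fl_0-fs,nm-dp"))
# [('mv', 'fl', ['0']), ('fs', 'nm', []), ('dp', '', [])]
```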
Load -> fill missing values with 0 -> normalize all features -> remove duplicates -> output to path:

```
/usr/local/bin/python3 data_tools.py --pipe="mv,fl_0-fs,nm-dp" ../../input.csv ../../output_directory
```
Fill missing values with 100, then remove duplicates:

```
/usr/local/bin/python3 data_tools.py --pipe="mv,fl_100-dp" ../../input.csv ../../output_directory
```
Standardize only the `Age` and `Glucose` features:

```
/usr/local/bin/python3 data_tools.py --pipe="fs,sd_Age_Glucose" ../../input.csv ../../output_directory
```
You only need to ensure Python is installed; there are no other dependencies.
Contributions are welcome! Feel free to submit issues or pull requests.
MIT License