DepressionProject

Codebase for Prediction of adolescent depression from prenatal and childhood data from ALSPAC using machine learning

Install Python 3.10

Install MSAP

git clone https://github.com/asmyoo/MSAP.git
pip install -e ./MSAP

Install kneed

git clone https://github.com/asmyoo/kneed.git
pip install -e ./kneed

Install kneebow

git clone https://github.com/asmyoo/kneebow.git
pip install -e ./kneebow

Install requirements

pip install -r requirements.txt

For additional info, you might need some files in the old_files folder
hpc is what was used to run all the results except for the ones requiring a notebook (ipynb)
local is what was used to run the ipynb results

Get “preprocessed” data - run reformat_ml.py within src/preprocess

python reformat_ml.py

Make 12to18 data (within src/preprocess) if you want to change the label's meaning so that it is 1 when it is 1 at any age, even if data is missing at other ages

python make_12to18.py

Change model_selecting.py config to use the new dataset preprocessed_data_without_temporal_12to18.csv
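
The combined 12to18 label can be pictured with the following minimal sketch. It shows only one plausible reading of this step (the label is 1 if it is 1 at any age, regardless of missing values at other ages). The helper name `any_depression_label`, the use of `None` for missing values, and the handling of the all-zero and all-missing cases are assumptions for illustration, not the logic of make_12to18.py.

```python
from typing import Optional, Sequence


def any_depression_label(labels: Sequence[Optional[int]]) -> Optional[int]:
    """Combine per-age 0/1 labels into one label (hypothetical helper).

    None stands for a missing value at that age (assumption).
    """
    observed = [v for v in labels if v is not None]
    if any(v == 1 for v in observed):
        return 1  # 1 at any age wins, even if other ages are missing
    if not observed:
        return None  # assumption: nothing observed, so no label
    return 0  # assumption: every observed age is 0


# Ages 12, 16, 17, 18 with two missing values
print(any_depression_label([0, None, 1, None]))
```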

Make 12to18 average depression score within src/preprocess

python make_12to18ave.py

Change model_selecting.py config to use the new dataset preprocessed_data_without_temporal_12to18ave.csv
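
As a rough illustration of an averaged score, the sketch below averages whatever per-age values are available for one child and skips the missing ones. The function name `average_depression_score` and the choice to ignore missing ages are assumptions; make_12to18ave.py may define the average differently.

```python
from typing import Optional, Sequence


def average_depression_score(scores: Sequence[Optional[float]]) -> Optional[float]:
    """Mean of the observed per-age scores (hypothetical helper).

    None stands for a missing value and is skipped (assumption).
    """
    observed = [s for s in scores if s is not None]
    if not observed:
        return None  # no observed ages, so no average
    return sum(observed) / len(observed)


print(average_depression_score([1, None, 0, 1]))
```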

Change configs

cleaning.py: set the % missing value imputation and make sure columns_ignored contains the child id variable name
model_selecting.py: set age_cutoff and column_dependent

Run get_config_info.py within src/preprocess

python get_config_info.py

Make sure the default preprocessed data filename within get_config_info.py is correct
The prediction label is 0/1, so it does not need to be marked as categorical unless a mistake was made
Change the categorical variables in the preprocessing.py config if needed (probably not)
Add mental health variables to columns_ignored in cleaning.py (don't do this for now because our predictions seem to rely heavily on these variables)

Run depression-predictor run_eda.py and feature_analysis_correlations_iterativeimpute.ipynb

Make sure to create a new conda environment for the requirements for depression-predictor

git clone https://github.com/asmyoo/depression-predictor.git
cd depression-predictor
pip install -r requirements.txt
cd ..
python -u -m depression-predictor.depp.run_eda

Copy the Variables excel file and preprocessed data into the depression-predictor data folder
Check filename for data in depression-predictor utils/dataset.py
Takes approx 1 hr
Copy vars_sorted.csv to DepressionProject/output
Then run python notebook feature_analysis_correlations_iterativeimpute.ipynb

Run run_cleaner.py

python -u -m DepressionProject.run_cleaner

Make sure to not overwrite png's from feature_analysis_correlations_iterativeimpute.ipynb, missing_value png’s, and data_cleaned.csv's

Run run_encode.py

python -m DepressionProject.run_encode DepressionProject/output/data_cleaned.csv DepressionProject/output/data_cleaned_encoded.csv

Move output files into the output folder (separated by age; include the png's, etc.)

Run run_model_selection.py

Use script

Run run_analysis.py

Use script

Run univariate comparison

python -u -m DepressionProject.run_univariate \
    ./DepressionProject/output/pval_filter_60_MVI/output_12to18_yesmental/results.pkl \
    ./DepressionProject/output/pval_filter_60_MVI/output_12to18_yesmental/preprocessed \
    ./DepressionProject/output/pval_filter_60_MVI/output_12to18_yesmental/data_cleaned_encoded.csv \
    ./DepressionProject/output/pval_filter_60_MVI/output_12to18_yesmental/balanced_accuracy \
    y12to18_Dep_YN_216m \
    --use-balanced-accuracy
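
For reference, balanced accuracy for a binary 0/1 label is the mean of sensitivity and specificity. The sketch below computes it in plain Python; the function name is illustrative, and how run_univariate uses the metric internally is not shown here.

```python
from typing import Sequence


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity for 0/1 labels."""
    pos = sum(1 for t in y_true if t == 1)
    neg = len(y_true) - pos
    if pos == 0 or neg == 0:
        raise ValueError("both classes must be present in y_true")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return 0.5 * (tp / pos + tn / neg)


# Predicting "not depressed" for everyone scores only 0.5
print(balanced_accuracy([1, 1, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0]))
```

Unlike plain accuracy, this does not reward a model for always predicting the majority class, which matters when depressed cases are the minority.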

Run fix_embed_colors for age 12, or whenever the colors for depressed/not depressed are switched

python -u -m DepressionProject.fix_embed_colors \
    ./DepressionProject/output/pval_filter_60_MVI/output_12_yesmental/results.pkl \
    ./DepressionProject/output/pval_filter_60_MVI/output_12_yesmental/preprocessed \
    ./DepressionProject/output/pval_filter_60_MVI/output_12_yesmental/data_cleaned_encoded.csv \
    ./DepressionProject/output/pval_filter_60_MVI/output_12_yesmental/ \
    y12CH_Dep_YN_144m

Run make_readable_all_var_sorted.py to change the description column of all vars_sorted_dir_ranked_rounded.csv to be more readable

python -u -m DepressionProject.make_readable_all_var_sorted ./DepressionProject/output/pval_filter_60_MVI

Run make_readable_heatmapcsv.py if you have a pearson.csv of x and y variables that are highly correlated or anticorrelated after looking at the pearson heatmap

python -u -m DepressionProject.make_readable_heatmapcsv ./DepressionProject/output/rfe_pearson_spearman/output_12_yesmental

Run get_unique_fts for getting list of unique features for each model

python -u -m DepressionProject.get_unique_fts ./DepressionProject/output/pval_filter_60_MVI
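
One way to read "unique features" is: features selected by one model and by no other. The sketch below follows that reading with plain sets. The function name, the model names, and the feature names are made up for illustration, and get_unique_fts may instead produce a de-duplicated list per model.

```python
from typing import Dict, Set


def unique_features(selected: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each model, keep the features that no other model selected."""
    result = {}
    for name, feats in selected.items():
        # Union of the features chosen by every other model
        others = set().union(*(f for n, f in selected.items() if n != name))
        result[name] = set(feats) - others
    return result


# Hypothetical model and feature names
example = {"age12": {"a", "b"}, "age16": {"b", "c"}}
print(unique_features(example))
```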

Run rank_pearson_rfe for getting table of pearson correlations

python -u -m DepressionProject.rank_pearson_rfe ./DepressionProject/output/pval_filter_60_MVI
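
To make the idea of a ranked correlation table concrete, the sketch below computes the Pearson correlation of each feature with the label and sorts by absolute value. The function names and the toy data are assumptions, and rank_pearson_rfe may rank or filter differently.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> Optional[float]:
    """Pearson correlation; None if either input has zero variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return None
    return cov / (sx * sy)


def rank_by_pearson(features: Dict[str, Sequence[float]],
                    target: Sequence[float]) -> List[Tuple[str, float]]:
    """Sort features by |r| with the target, strongest first."""
    scored = [(name, pearson(vals, target)) for name, vals in features.items()]
    scored = [(name, r) for name, r in scored if r is not None]  # drop constants
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)


# Hypothetical feature names and values
toy = {"f1": [1, 2, 3, 4], "f2": [4, 3, 2, 1], "f3": [1, 3, 2, 4], "f4": [5, 5, 5, 5]}
for name, r in rank_by_pearson(toy, [1, 2, 3, 4]):
    print(name, round(r, 3))
```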

Run run_tsne_cluster.py for age 12to18 to understand one cluster

python -u -m DepressionProject.run_tsne_cluster \
./DepressionProject/output/pval_filter_60_MVI/output_12to18_yesmental/results.pkl \
./DepressionProject/output/pval_filter_60_MVI/output_12to18_yesmental/preprocessed \
./DepressionProject/output/pval_filter_60_MVI/output_12to18_yesmental/data_cleaned_encoded.csv \
./DepressionProject/output/pval_filter_60_MVI/output_12to18_yesmental/f1 \
y12to18_Dep_YN_216m

Run plot_rfe_jaccard.py to compare overlap of RFE selected features

python -u -m DepressionProject.plot_rfe_jaccard \
./DepressionProject/output/pval_filter_60_MVI/Supplementary\ Spreadsheet\ 3.xlsx \
./DepressionProject/output/pval_filter_60_MVI/rfe_jaccard.svg
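
The overlap being compared is the Jaccard index: the size of the intersection of two feature sets divided by the size of their union. The sketch below computes it for every pair of sets. The function names and the example sets are hypothetical, and reading the spreadsheet and plotting are left out.

```python
from itertools import combinations
from typing import Dict, Iterable, Tuple


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """|A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(union)


def pairwise_jaccard(sets: Dict[str, Iterable[str]]) -> Dict[Tuple[str, str], float]:
    """Jaccard index for every unordered pair of named sets."""
    return {(m, n): jaccard(sets[m], sets[n])
            for m, n in combinations(sorted(sets), 2)}


# Hypothetical RFE selections per age
rfe = {"age12": {"a", "b", "c"}, "age16": {"b", "c", "d"}, "age17": {"x"}}
print(pairwise_jaccard(rfe))
```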

Run get_top_10_rfe.py to get top 10 features from RFE for all ages into a csv

python -u -m DepressionProject.get_top_10_rfe \
./DepressionProject/output/pval_filter_60_MVI/Supplementary\ Spreadsheet\ 3.xlsx \
./DepressionProject/output/pval_filter_60_MVI/rfe_jaccard.svg

Run print_num_fts_missvalratio.py to get number of features and missing value ratio before cleaning

python -u -m DepressionProject.print_num_fts_missvalratio
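
As a rough picture of what such a summary involves, the sketch below counts the columns of a CSV and the fraction of cells that are missing. The function name, the set of tokens treated as missing, and counting every column (including any id or label column) as a feature are assumptions, not the behavior of print_num_fts_missvalratio.py.

```python
import csv
import io
from typing import Sequence, Tuple


def missing_value_summary(csv_text: str,
                          missing_tokens: Sequence[str] = ("", "NA", "NaN")
                          ) -> Tuple[int, float]:
    """Return (number of columns, fraction of missing cells)."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, data = rows[0], rows[1:]
    total_cells = sum(len(row) for row in data)
    missing = sum(1 for row in data for cell in row
                  if cell.strip() in missing_tokens)
    ratio = missing / total_cells if total_cells else 0.0
    return len(header), ratio


# Tiny made-up table with two missing cells out of six
print(missing_value_summary("id,a,b\n1,,3\n2,NA,4\n"))
```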

Check duplicate samples that were created by accident prior to analysis

Get “preprocessed” data with more info - run reformat_ml_checkdups.py within src/preprocess

python reformat_ml_checkdups.py

Run clean_dups.py within src/preprocess

python clean_dups.py

Run check_dups.py within src/preprocess to see if the duplicates affect the analysis

python check_dups.py
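
A duplicate check of this kind can be sketched as follows: two samples count as duplicates when they agree on every column except the ones explicitly ignored (for example an id column). The function name, the column names, and this definition of "duplicate" are assumptions; clean_dups.py and check_dups.py may use other criteria.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def find_duplicate_rows(rows: Sequence[Dict[str, Hashable]],
                        ignore: Sequence[str] = ()) -> List[int]:
    """Indices of rows that share all non-ignored values with another row."""
    keys = [tuple(sorted((k, v) for k, v in row.items() if k not in ignore))
            for row in rows]
    counts = Counter(keys)
    return [i for i, key in enumerate(keys) if counts[key] > 1]


# Made-up samples: the first two differ only in their id
samples = [
    {"id": 1, "x": 0, "y": 1},
    {"id": 2, "x": 0, "y": 1},
    {"id": 3, "x": 1, "y": 1},
]
print(find_duplicate_rows(samples, ignore=("id",)))
```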

Check missing value ratio before analysis again

python -u -m DepressionProject.print_num_fts_missvalratio --path_data ./DepressionProject/output/preprocessed_data_without_temporal_checkdup_cleaned_no_info.csv

Run run_tsne.py (don't need)

Run make_readable_pcc_sc_kendall.py and make_readable_list.py after pasting in the best rfe list and lists from run_univariate's output from src/preprocess

Plot tsne using only best results from RFE/Elbow method (don't need)

Make sure to input the hardcoded variables for the rfe results

python -u -m DepressionProject.run_tsne_use_rfe_results_all \
    ./DepressionProject/output/10MVIout/output_12_yesmental \
    ./DepressionProject/output/10MVIout/output_16_yesmental \
    ./DepressionProject/output/10MVIout/output_17_yesmental \
    ./DepressionProject/output/10MVIout/output_18_yesmental \
    y12CH_Dep_YN_144m \
    y16CH_Dep_YN_192m \
    y17CH_Dep_YN_204m \
    y18CH_Dep_YN_216m

Calculate F1 baselines and plot into confusion matrix (don't need)

python -u -m DepressionProject.run_f1_calcs_baseline_all \
    ./DepressionProject/output/10MVIout/output_12_yesmental \
    ./DepressionProject/output/10MVIout/output_16_yesmental \
    ./DepressionProject/output/10MVIout/output_17_yesmental \
    ./DepressionProject/output/10MVIout/output_18_yesmental \
    y12CH_Dep_YN_144m \
    y16CH_Dep_YN_192m \
    y17CH_Dep_YN_204m \
    y18CH_Dep_YN_216m
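
One common F1 baseline is the score of a classifier that predicts "depressed" for everyone. The sketch below computes F1 from counts and evaluates that always-positive baseline. The function names are illustrative, and whether run_f1_calcs_baseline_all uses this particular baseline is an assumption.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 for the positive class (1): 2TP / (2TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def always_positive_baseline_f1(y_true: Sequence[int]) -> float:
    """F1 obtained by predicting 1 for every sample (assumed baseline)."""
    return f1_score(y_true, [1] * len(y_true))


# With 25% positives the always-positive baseline reaches 0.4
print(always_positive_baseline_f1([1, 0, 0, 0]))
```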

Plot F1's with their baseline (don't need)

python -u -m DepressionProject.plot_f1_overall \
    ./DepressionProject/output/10MVIout/f1s.png
