GitHub - joyce-lin/Project_Wikipedia_NLP_ArticleClassifier: Predicting Wikipedia Article Categories using NLP and Machine Learning

Predicting Wikipedia Article Categories

This project demonstrates how Natural Language Processing and Machine Learning techniques can be used to classify articles or paragraphs. It is built with Python, and my code can be found in the "ipynb" folder.

Wiki_insert_to_MongoDB.py:

Script that uses the Wikipedia API to download about 5,000 articles from 8 categories into a MongoDB database
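
As a rough illustration of the download step, the sketch below builds a MediaWiki API request that lists the pages in one category and shapes a result into a record for MongoDB. The helper names (`category_members_url`, `to_document`) and the record fields are assumptions made for this example, not taken from the script; the actual HTTP request and the database insert are left out.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def category_members_url(category, limit=500):
    """Build a MediaWiki API URL that lists the pages in one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "page",   # skip subcategories and files
        "cmlimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def to_document(page, category):
    """Shape one API result into a record (hypothetical schema) for MongoDB."""
    return {"pageid": page["pageid"], "title": page["title"], "category": category}


url = category_members_url("Machine learning")
doc = to_document({"pageid": 1, "ns": 0, "title": "Example"}, "Machine learning")
```

Each record produced this way could then be passed to a MongoDB collection insert once the article text has been fetched.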

Part2_1_PreProcessing_DataClearning.ipynb:

Clean the article text using RegEx
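
A minimal sketch of what RegEx-based cleaning can look like. The specific patterns (HTML tags, citation markers, non-letter characters) are assumptions chosen for illustration; the notebook may apply different rules.

```python
import re


def clean_text(text):
    """Normalize raw article text into lowercase words separated by single spaces."""
    text = re.sub(r"<[^>]+>", " ", text)      # drop HTML tags
    text = re.sub(r"\[\d+\]", " ", text)      # drop citation markers such as [12]
    text = re.sub(r"[^A-Za-z\s]", " ", text)  # keep letters and whitespace only
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    return text.strip().lower()


cleaned = clean_text("<p>Hello, World![3]</p>\n\nSecond  line.")
```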

Part2_2_FeatureEngineering_LSA.ipynb:

Perform feature engineering using Natural Language Processing techniques: Latent Semantic Analysis (Tfidf and SVD). The fitted models are stored in a Redis database for the later article prediction steps.
Plot 2D LSA components to examine how article vectors cluster in the LSA space.
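
The LSA step can be sketched with scikit-learn as a Tfidf vectorizer followed by a truncated SVD. The toy documents, the choice of 2 components, and the use of `pickle` to produce bytes suitable for a key-value store such as Redis are assumptions for this example, not the notebook's actual settings.

```python
import pickle

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the cleaned article text.
docs = [
    "neural networks learn representations from data",
    "gradient descent trains neural networks",
    "the football team won the league match",
    "the striker scored twice in the football match",
]

# LSA = Tfidf weighting followed by a low-rank SVD projection.
lsa = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    TruncatedSVD(n_components=2, random_state=0),
)
vectors = lsa.fit_transform(docs)  # one 2D vector per document

# Serialize the fitted pipeline to bytes; these bytes could be saved
# under a key in Redis and loaded again at prediction time.
blob = pickle.dumps(lsa)
restored = pickle.loads(blob)
new_vector = restored.transform(["a new article about football"])
```

With two components, the rows of `vectors` can be drawn directly as a scatter plot to inspect how the documents cluster.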

Part3_Predictor_Supervised.ipynb:

Compare accuracy scores across different supervised machine learning models. Run GridSearch over Decision Tree and KNeighbors models to get the best possible accuracy for a model that predicts the category of a new article.
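
A hedged sketch of the model comparison using scikit-learn's `GridSearchCV`. The synthetic data from `make_blobs` stands in for the LSA article vectors, and the parameter grids are illustrative assumptions rather than the grids used in the notebook.

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic features and labels standing in for LSA vectors and categories.
X, y = make_blobs(n_samples=120, centers=3, n_features=5, random_state=0)

candidates = {
    "decision_tree": (DecisionTreeClassifier(random_state=0), {"max_depth": [2, 4, 8]}),
    "knn": (KNeighborsClassifier(), {"n_neighbors": [3, 5, 7]}),
}

results = {}
for name, (model, grid) in candidates.items():
    # Cross-validated search over each model's grid, scored by accuracy.
    search = GridSearchCV(model, grid, cv=3, scoring="accuracy")
    search.fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

best_model = max(results, key=lambda name: results[name][0])
```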

Part3_Predictor_Unsupervised_CosineSimilarity.ipynb:

Use Cosine Similarity to determine article category
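
One way to use cosine similarity for classification is to compare an article vector against one representative vector per category and pick the closest. The sketch below assumes that approach; the category names, the centroid values, and the helper `predict_category` are made up for illustration, and the notebook may compare against individual articles instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def predict_category(vector, centroids):
    """Return the category whose centroid is most similar to the vector."""
    return max(centroids, key=lambda name: cosine_similarity(vector, centroids[name]))


# Hypothetical 2D category centroids in LSA space.
centroids = {"sports": [1.0, 0.1], "physics": [0.1, 1.0]}
predicted = predict_category([0.9, 0.2], centroids)
```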

Part3_Predictor_UnSupervised_NearestNeighbor.ipynb:

Create a semantic search engine, where you can input a search term and get the set of articles closest to that term, based on the NearestNeighbors algorithm
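
A minimal sketch of such a search engine, assuming the search term is projected into the same LSA space as the articles and matched with scikit-learn's `NearestNeighbors` under cosine distance. The titles, texts, and the `search` helper are hypothetical examples.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline

# Toy articles standing in for the stored Wikipedia text.
titles = ["Neural network", "Gradient descent", "Football", "Striker"]
texts = [
    "neural networks learn representations from data",
    "gradient descent trains neural networks",
    "the football team won the league match",
    "the striker scored twice in the football match",
]

# Project the articles into LSA space and index them for neighbor lookup.
lsa = make_pipeline(TfidfVectorizer(), TruncatedSVD(n_components=2, random_state=0))
index = NearestNeighbors(metric="cosine").fit(lsa.fit_transform(texts))


def search(term, k=2):
    """Return the titles of the k articles closest to the search term."""
    _, neighbors = index.kneighbors(lsa.transform([term]), n_neighbors=k)
    return [titles[i] for i in neighbors[0]]


hits = search("football match")
```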

More details can be found in /doc/Wiki_NLP_ArticleClassifier.pdf
