This project demonstrates how Natural Language Processing and Machine Learning techniques can be used to classify articles or paragraphs. It is built with Python, and my code can be found in the "ipynb" folder. The main steps are:
- Download about 5,000 articles from 8 categories into a MongoDB database with a script that uses the Wikipedia API
- Clean the article text using RegEx
- Perform feature engineering using Natural Language Processing techniques: Latent Semantic Analysis (TF-IDF and SVD). The fitted models are stored in a Redis database for the later article prediction steps
- Plot 2D LSA components to examine how article vectors cluster in the LSA space
- Compare accuracy scores across different supervised machine learning models, and use GridSearch to tune Decision Tree and KNeighbors models to get the best possible accuracy when predicting the category of a new article
- Use Cosine Similarity to determine article category
- Create a semantic search engine, where you can input a search term and get the set of articles closest to that term, based on the NearestNeighbors algorithm
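
The cleaning, cosine-similarity, and nearest-neighbor search steps above can be illustrated with a minimal sketch. This is not the code from the notebooks: it uses only the Python standard library, works on raw TF-IDF vectors (it skips the SVD step of LSA), and leaves out MongoDB and Redis entirely. The toy corpus, the labels, and all function names (`clean_text`, `vectorize`, `nearest`, `predict_category`) are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def clean_text(text):
    """Lowercase, replace anything that is not a letter or whitespace, collapse spaces."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def fit_idf(docs):
    """Compute a smoothed inverse document frequency for every term in the corpus."""
    tokenized = [set(clean_text(d).split()) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in tokens)
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def vectorize(text, idf):
    """Turn text into a sparse {term: tf * idf} dict, ignoring unseen terms."""
    counts = Counter(t for t in clean_text(text).split() if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest(query, vectors, idf, k=2):
    """Return the k (similarity, doc_index) pairs closest to the query."""
    q = vectorize(query, idf)
    scored = [(cosine(q, v), i) for i, v in enumerate(vectors)]
    # Highest similarity first; ties broken by lower document index.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]


def predict_category(query, vectors, labels, idf):
    """Assign the label of the single most similar document."""
    _, best_index = nearest(query, vectors, idf, k=1)[0]
    return labels[best_index]


# Hypothetical toy corpus standing in for the downloaded Wikipedia articles.
docs = [
    "The striker scored a goal in the football match.",
    "The team won the basketball game in overtime.",
    "The neural network was trained on labeled data.",
    "Machine learning models learn patterns from data.",
]
labels = ["sports", "sports", "machine_learning", "machine_learning"]

idf = fit_idf(docs)
vectors = [vectorize(d, idf) for d in docs]

print(predict_category("a neural network trained on data", vectors, labels, idf))
print(nearest("basketball game", vectors, idf, k=1))
```

The same `nearest` function serves both purposes: with `k=1` it gives a cosine-similarity category prediction, and with a larger `k` it behaves like a small search engine that returns the closest articles to a search term. On a real corpus, reducing the TF-IDF vectors with SVD before comparing them is what lets articles match on related terms rather than only on exact shared words.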