This repository contains various datasets for data analysis, machine learning, and educational purposes. Below is a brief description of each dataset available in this repository.
- Contains Body Mass Index (BMI) data.
- Useful for health and fitness analysis.
- Contains department-related information.
- Useful for organizational data processing.
- Contains employee details.
- Can be used for HR analytics and workforce management.
- Classic Iris dataset for machine learning.
- Contains different species of iris flowers with their measurements.
- Contains item similarity data.
- Useful for recommendation system development.
- Dataset containing information about movies.
- Useful for movie recommendation models.
- Contains music genre classification data.
- Can be used for genre prediction models.
- Not a dataset; this file is used for an AVR custom marker.
- Sample dataset for practicing pandas library operations.
- Useful for learning data manipulation.
- Another dataset for pandas tutorials.
- Contains structured data for training purposes.
- Contains user ratings for various items.
- Useful for collaborative filtering and recommendation systems.
- A sample dataset.
- Can be used for testing and learning purposes.
- A test dataset.
- Used for validation and experimentation.
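
The user-ratings and item-similarity datasets described above are typically combined in item-based collaborative filtering. The sketch below shows one common way to do that with cosine similarity, using only the standard library. The `RATINGS` dictionary and its user and item names are hypothetical stand-ins; the actual column layout of the files in this repository is not assumed.

```python
from math import sqrt

# Hypothetical ratings in the form {user: {item: rating}}.
# These values are illustrative and are not taken from the repository files.
RATINGS = {
    "u1": {"A": 5, "B": 3},
    "u2": {"A": 4, "B": 2, "C": 5},
    "u3": {"B": 4, "C": 4},
}


def item_vectors(ratings):
    """Invert {user: {item: rating}} into {item: {user: rating}}."""
    vectors = {}
    for user, items in ratings.items():
        for item, rating in items.items():
            vectors.setdefault(item, {})[user] = rating
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts (missing keys count as 0)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(item, ratings):
    """Return (other_item, similarity) pairs, most similar first."""
    vectors = item_vectors(ratings)
    scores = {
        other: cosine_similarity(vectors[item], vec)
        for other, vec in vectors.items()
        if other != item
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for other, score in most_similar("A", RATINGS):
        print(f"A ~ {other}: {score:.3f}")
```

Treating unrated items as zeros keeps the example short; a real recommender would usually mean-center ratings or use a dedicated library.
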
These datasets can be used for:
- Machine learning projects
- Data analysis and visualization
- Educational and tutorial purposes
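
As a small illustration of the data analysis use case, the sketch below computes a per-species average from Iris-style rows using only the standard library. The column names (`petal_length`, `species`) and the inline sample rows are assumptions made for the example; check the header of the actual file before adapting it.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Illustrative sample only. The header names are assumptions and may not
# match the Iris file shipped in this repository.
SAMPLE_CSV = """sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
7.0,3.2,4.7,1.4,versicolor
6.4,3.2,4.5,1.5,versicolor
"""


def mean_by_species(csv_file, column):
    """Return {species: mean of `column`} for rows read from an open CSV file."""
    values = defaultdict(list)
    for row in csv.DictReader(csv_file):
        values[row["species"]].append(float(row[column]))
    return {species: mean(vals) for species, vals in values.items()}


if __name__ == "__main__":
    averages = mean_by_species(io.StringIO(SAMPLE_CSV), "petal_length")
    for species, avg in averages.items():
        print(f"{species}: {avg:.2f}")
```

To run it against a real file, pass an open file handle (for example `open(path, newline="")`) instead of the `io.StringIO` wrapper.
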
If you have additional datasets to contribute, feel free to upload them and update this README with the necessary descriptions.
These datasets are provided for educational and research purposes. Please check individual datasets for any specific license information.
For any questions or suggestions, feel free to raise an issue or contact Lovnish Verma.
A list of public datasets for machine learning, AI, data science, and analytics projects.
- UCI Machine Learning Repository: Classic datasets used in academic ML research.
- Kaggle Datasets: User-contributed datasets with competitions and notebooks.
- Google Dataset Search: Dataset-specific search engine.
- AWS Open Data Registry: Public datasets hosted on AWS.
- Microsoft Azure Open Datasets: Curated datasets for training on Azure.
- OpenML: Collaborative platform for sharing datasets and experiments.
- Papers with Code (Datasets): ML benchmarks tied to research papers.
- Hugging Face Datasets: NLP, vision, and multimodal datasets.
- Zenodo: Scientific datasets with citation support.
- Figshare: Open-access research datasets.
- Data World: Community platform for data sharing.
- Awesome Public Datasets (GitHub): Curated list across domains.
- FiveThirtyEight Data: Datasets used in data journalism.
- Quandl (now Nasdaq Data Link): Financial and economic data.
- India AI Dataset Repository: Indian AI project datasets.
- Data.gov.in: Indian government open data.
- Data.gov (USA): US federal open datasets.
- EU Open Data Portal: Data from European institutions.
- UK Data Service: Economic and social research datasets (UK).
- Canada Open Government: Datasets from Canada.
- Australia Data Portal: Australian government datasets.
- ImageNet: Large-scale image classification dataset.
- COCO Dataset: Object detection, segmentation, and captioning.
- Open Images Dataset: Annotated image data.
- Stanford Dogs Dataset: Fine-grained image classification.
- Common Crawl: Large-scale web crawl data.
- Wikipedia Dumps: Raw Wikipedia text.
- Project Gutenberg: Public domain books for NLP.
- TREC Question Classification: NLP benchmark dataset.
- PhysioNet: Physiological and clinical data.
- MIMIC-III: ICU medical data (de-identified).
- NIH Biomedical Data: NIH open data portal.
- Cancer Imaging Archive: Medical imaging data for cancer research.
- OpenSLR: Speech recognition datasets.
- LibriSpeech ASR: Audiobook dataset for speech recognition.
- OpenStreetMap (Geofabrik): Extracts of OSM data.
- Google Open Buildings: Global building footprints.
| Name | Domain | Link |
|---|---|---|
| UCI ML Repo | General | Link |
| Kaggle | General | Link |
| IndiaAI | Govt (India) | Link |
| Data.gov.in | Govt (India) | Link |
| Data.gov | Govt (USA) | Link |
| Data World | General | Link |
| Hugging Face | NLP/ML | Link |
| Papers with Code | Benchmarks | Link |
| Zenodo | Research | Link |
For code integration and automatic downloads, you can often use Python libraries such as:
```python
from datasets import load_dataset

dataset = load_dataset("imdb")  # Hugging Face example
```
You can also automate downloads from Kaggle via API:
```bash
kaggle datasets download -d username/dataset-name
```
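
The same Kaggle download can be scripted from Python by calling the CLI through `subprocess`. This is a minimal sketch, assuming the `kaggle` command-line tool is installed and API credentials are already configured; the helper names `build_download_command` and `download_dataset` are hypothetical, and `username/dataset-name` remains a placeholder slug.

```python
import shutil
import subprocess


def build_download_command(dataset_slug, dest_dir=".", unzip=True):
    """Build the argument list for `kaggle datasets download` (helper name is illustrative)."""
    cmd = ["kaggle", "datasets", "download", "-d", dataset_slug, "-p", dest_dir]
    if unzip:
        cmd.append("--unzip")  # extract the archive after downloading
    return cmd


def download_dataset(dataset_slug, dest_dir="."):
    """Run the Kaggle CLI; raises if the CLI is missing or the download fails."""
    if shutil.which("kaggle") is None:
        raise RuntimeError("kaggle CLI not found; install it with `pip install kaggle`")
    subprocess.run(build_download_command(dataset_slug, dest_dir), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, since the slug is a placeholder.
    print(" ".join(build_download_command("username/dataset-name", "data")))
```

Building the command as a list, rather than a single shell string, avoids quoting problems when paths contain spaces.
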
Feel free to contribute more sources via pull request!