- Perception
  - What it does: Extract features from images/videos.
  - Industries: Healthcare (medical imaging), Automotive (autonomous driving), Security (surveillance).
  - Tech: CNNs, Vision Transformers (ViT), Image Segmentation (U-Net); Deep Learning (Computer Vision).
- Speech Recognition
  - What it does: Convert spoken audio to text.
  - Industries: Customer Support, Consumer Electronics (voice assistants), Automotive (voice commands).
  - Tech: RNNs, LSTMs, Transformers, CTC loss; Deep Learning (Speech Processing/NLP).
- Text Understanding
  - What it does: Comprehend text intent, entities, sentiment.
  - Industries: Finance (document analysis), Legal (contract review), Customer Service.
  - Tech: Transformers (BERT, RoBERTa), Named Entity Recognition (NER); Deep Learning (NLP).
- Text Generation
  - What it does: Produce coherent language output.
  - Industries: Marketing (content creation), Media (summarization), Education (tutoring).
  - Tech: Autoregressive Transformers (GPT family), Seq2Seq models; Deep Learning (NLP).
- Knowledge Retrieval
  - What it does: Retrieve relevant external info for tasks.
  - Industries: Tech Support, Research, Healthcare.
  - Tech: Dense vector retrieval with k-NN, embedding models (BERT embeddings), combined with LLMs (RAG); ML + DL (Information Retrieval + NLP).
- Multimodal Fusion
  - What it does: Align and integrate multiple data types.
  - Industries: Retail (visual search), Entertainment (video captioning), Autonomous Systems.
  - Tech: Multimodal Transformers, Cross-Attention; Deep Learning (Multimodal AI).
- Prediction
  - What it does: Forecast or detect anomalies from data.
  - Industries: Finance (fraud detection), Manufacturing (predictive maintenance), Energy (demand forecasting).
  - Tech: Regression, Random Forest, Gradient Boosting, LSTM; Machine Learning + Deep Learning (Time-Series Analysis).
- Decision Making
  - What it does: Optimize actions/plans based on goals.
  - Industries: Logistics (route planning), Robotics, Gaming.
  - Tech: Reinforcement Learning (Q-learning, Policy Gradients), Heuristic Search; Machine Learning (Reinforcement Learning).
- Generative Content Creation
  - What it does: Create new images, audio, code, etc.
  - Industries: Advertising, Software Dev, Arts & Music.
  - Tech: GANs, Diffusion Models, Autoregressive models (Codex); Deep Learning (Generative Models).
- Autonomous Agents
  - What it does: Autonomous perception, reasoning, and action.
  - Industries: Autonomous Vehicles, Virtual Assistants, Industrial Automation.
  - Tech: Integration of CNNs, Transformers, RL, Planning Algorithms; AI Systems + ML + DL (Agent-based AI).
Artificial Intelligence (AI) is the broad field of building systems that perform tasks normally associated with human intelligence. Classical, non-learning approaches include:
- Rule-Based Systems
- Knowledge Graphs
- Expert Systems
Machine Learning (ML) is a subset of AI focused on systems that learn from data rather than from hand-written rules.
- Supervised Learning
  - Tasks: Regression, Classification
  - Algorithms:
    - Linear Regression
    - Logistic Regression
    - Decision Trees
    - Random Forest
    - Support Vector Machine (SVM)
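
As a minimal sketch of the regression task, the following fits a one-feature linear regression by ordinary least squares in plain Python. The names `fit_line` and `predict` and the toy data are assumptions made for illustration, not part of any library.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (one feature)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both unnormalized).
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Toy labeled data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
model = fit_line(xs, ys)
prediction = predict(model, 4.0)
```

The labels `ys` are what make this supervised: the model parameters are chosen to reproduce known outputs, then reused on an unseen input.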
- Unsupervised Learning
  - Tasks: Clustering, Dimensionality Reduction
  - Algorithms:
    - K-Means
    - DBSCAN
    - PCA
    - t-SNE
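
The clustering task can be illustrated with a small version of K-Means (Lloyd's algorithm). This is a sketch under simplifying assumptions: 2-D points, a fixed number of iterations, and centroids initialized from the first `k` points rather than a randomized scheme. The `kmeans` function and the sample points are hypothetical.

```python
def kmeans(points, k, iterations=10):
    """Lloyd's algorithm on 2-D points; centroids start at the first k points."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        for i, (px, py) in enumerate(points):
            distances = [(px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centroids]
            labels[i] = distances.index(min(distances))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centroids, labels


# Two visually separated groups, interleaved; no labels are provided.
points = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.8, 8.2), (0.8, 1.2), (8.2, 7.8)]
centroids, labels = kmeans(points, k=2)
```

Unlike the supervised case, the algorithm receives only the points themselves and discovers the grouping from their geometry.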
- Semi-Supervised Learning
- Self-Supervised Learning
- Reinforcement Learning (RL)
  - Algorithms:
    - Q-Learning
    - SARSA
    - Deep Q-Network (DQN)
    - Proximal Policy Optimization (PPO)
    - A3C, DDPG, etc.
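
A tabular Q-Learning loop can be sketched as follows. The five-state corridor environment, the `step` function, and the hyperparameters are made up for illustration; exploration here is purely random, which Q-Learning tolerates because it is an off-policy method.

```python
import random

N_STATES = 5      # states 0..4 on a line; state 4 is the goal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Toy deterministic environment: reward 1.0 only on reaching the goal."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state = 0
        for _ in range(100):  # cap on episode length
            action = rng.choice(ACTIONS)  # random behavior policy
            next_state, reward, done = step(state, action)
            # Target: reward plus the discounted best value of the next state
            # (no bootstrapping from a terminal state).
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = q_learning()
# Greedy action per non-terminal state, read off the learned table.
greedy_policy = [row.index(max(row)) for row in q_table[:-1]]
```

After training, the greedy policy read from the table should prefer moving right in every non-terminal state, since that is the shortest path to the only reward.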
- Algorithms:
  - Classical ML – Uses above algorithms
  - Deep Learning (DL) – Uses neural networks:
    - Feedforward Neural Networks (FNN / DNN)
    - Convolutional Neural Networks (CNN)
    - Recurrent Neural Networks (RNN)
      - LSTM, GRU
    - Autoencoders (AE, VAE)
    - Generative Adversarial Networks (GANs)
    - Transformers
      - Used in NLP, LLMs, Vision, Speech
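
The core operation inside Transformers is scaled dot-product attention. The sketch below implements it for small Python lists, assuming a single head with no masking and no learned projections; the function names and example vectors are illustrative only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
uniform = attention([[0.0, 0.0]], keys, values)   # query matches neither key
focused = attention([[10.0, 0.0]], keys, values)  # query aligned with key 0
```

A query that matches no key yields a plain average of the values, while a query aligned with one key pulls the output toward that key's value.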
Natural Language Processing (NLP) processes and understands human language.
- Tasks:
  - Text Classification
  - Named Entity Recognition (NER)
  - Machine Translation
  - Summarization, QA
- Models:
  - RNN, LSTM
  - BERT, RoBERTa
  - GPT Series
  - T5, XLNet
- Includes: Large Language Models (LLMs)
  - GPT-3, GPT-4, GPT-4o
  - LLaMA, Claude, PaLM
  - Used in chatbots, agents, RAG systems
Computer Vision (CV) processes and understands visual data (images, videos).
- Tasks:
  - Image Classification
  - Object Detection
  - Image Segmentation
  - Image Generation
- Models:
  - CNNs: VGG, ResNet, EfficientNet
  - Vision Transformers (ViT, DINO)
  - GANs: StyleGAN, CycleGAN
- Applications:
  - Facial Recognition, OCR, Medical Imaging
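
CNNs are built from convolution layers. As a rough illustration, the following computes a single-channel "valid" 2-D convolution (strictly a cross-correlation, which is what deep learning frameworks typically compute) in plain Python. The `conv2d` helper, the toy image, and the hand-picked kernel are assumptions for the example; a trained network learns its kernels from data.

```python
def conv2d(image, kernel):
    """Valid (no padding, stride 1) 2-D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            # Multiply the kernel with the patch at (r, c) and sum the result.
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# 4x4 toy image with a vertical edge between columns 1 and 2.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Horizontal-difference kernel: responds where brightness increases left to right.
kernel = [[-1, 1]]
edges = conv2d(image, kernel)
```

The output is nonzero only at the column where the edge sits, which is the kind of local feature that stacked convolution layers combine into higher-level detections.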
Speech and audio AI processes spoken language and other audio signals.
- Tasks:
  - ASR (Automatic Speech Recognition)
  - TTS (Text-to-Speech)
  - Speaker Identification
- Models:
  - RNNs, CNNs
  - Transformers (e.g., Whisper)
  - WaveNet, Tacotron
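
One common way to turn the per-frame outputs of a CTC-trained ASR model into text is greedy CTC decoding: take the most likely symbol in each frame, merge consecutive repeats, then drop the blank symbol. The sketch below assumes such a model has already produced per-frame probabilities; the probabilities, the alphabet, and the `ctc_greedy_decode` name are invented for illustration. Encoder-decoder models such as Whisper generate text differently.

```python
BLANK = "_"  # CTC blank symbol (assumed to be part of the alphabet)


def ctc_greedy_decode(frame_probs, alphabet):
    """Greedy CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    decoded = []
    previous = None
    for symbol in best:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank.
        if symbol != previous and symbol != BLANK:
            decoded.append(symbol)
        previous = symbol
    return "".join(decoded)


alphabet = [BLANK, "h", "i"]
# One probability row per audio frame, one column per alphabet symbol.
frame_probs = [
    [0.1, 0.8, 0.1],    # h
    [0.2, 0.7, 0.1],    # h (repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # i
    [0.1, 0.2, 0.7],    # i (repeat, merged)
    [0.8, 0.1, 0.1],    # blank
    [0.2, 0.1, 0.7],    # i (new symbol because a blank separates it)
]
text = ctc_greedy_decode(frame_probs, alphabet)
```

The blank is what lets the decoder distinguish a held sound from a genuinely repeated character.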
Multimodal AI combines multiple input types: text + image + audio + video.
- Examples:
  - CLIP (text + image)
  - Whisper (speech + text)
  - Flamingo, GPT-4o, Gemini, Sora
Retrieval-Augmented Generation (RAG) combines LLMs with external data sources.
- Components:
  - Embedding Models
  - Vector Databases (e.g., FAISS, Pinecone)
  - LLMs for answer generation
- Use Cases:
  - Chat over documents
  - Internal knowledge bots
  - QA over web, PDFs, databases
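
The retrieval side of such a pipeline can be sketched in plain Python. Here a bag-of-words counter stands in for a real embedding model, a linear scan stands in for a vector database, and the LLM call is left out: the code stops at building the prompt. All function names and documents are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=2):
    """Return the k documents most similar to the query (linear scan)."""
    query_vec = embed(query)
    ranked = sorted(documents, key=lambda d: cosine(query_vec, embed(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, passages):
    """Assemble the text that would be sent to an LLM for answer generation."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


documents = [
    "refunds are processed within five business days",
    "the office is closed on public holidays",
    "password resets require a verified email address",
]
question = "how long do refunds take"
top = retrieve(question, documents, k=1)
prompt = build_prompt(question, top)
```

In a real system the embedding step would use a learned model and the scan would be replaced by an approximate nearest-neighbor index, but the flow (embed, rank, stuff the context into the prompt) is the same.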
- NLP is the field concerned with understanding and working with human language.
- LLMs are large neural language models (such as the GPT models behind ChatGPT) used within NLP to understand and generate text.
- Generative AI (Gen AI) is the broader umbrella that includes LLMs and also tools that make: