This project explores the use of Deep Reinforcement Learning (DRL) for intelligent packet scheduling in a simulated network router environment. The goal is to build agents that can make real-time queue-serving decisions while balancing latency, QoS (Quality of Service), and fairness across multiple traffic types.
- Prioritize delay-sensitive traffic (Video, Voice) over BestEffort
- Enforce queue-specific QoS constraints:
  - Video: Delay ≤ 6 ms
  - Voice: Delay ≤ 4 ms
- Prevent starvation of BestEffort
- Minimize packet drop rate and mean delay
- Penalize excessive queue switching
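As a rough illustration, these objectives could be folded into a single shaped reward. The sketch below is for exposition only: the constants, helper name, and `backlogs` argument are assumptions, and only the 6 ms and 4 ms deadlines come from the list above.

```python
# Illustrative QoS-aware reward shaping; constants and argument names are
# assumptions. Only the 6 ms / 4 ms deadlines come from the goals above.
QOS_DEADLINE_MS = {"Video": 6.0, "Voice": 4.0}

def shaped_reward(served_queue, packet_delay_ms, dropped, switched, backlogs):
    reward = 0.0
    deadline = QOS_DEADLINE_MS.get(served_queue)
    if deadline is not None:
        # Reward meeting the per-queue delay constraint, penalize violations.
        reward += 1.0 if packet_delay_ms <= deadline else -1.0
    reward -= 2.0 * dropped                     # discourage packet drops
    reward -= 0.01 * packet_delay_ms            # keep mean delay low
    if switched:
        reward -= 0.1                           # penalize excessive queue switching
    reward -= 0.005 * backlogs.get("BestEffort", 0)  # guard against BE starvation
    return reward
```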
This project investigates the application of Deep Reinforcement Learning (DRL) to the challenge of real-time packet scheduling in a multi-queue router system. The objective is to dynamically prioritize network traffic (Video, Voice, and BestEffort) by:
- Minimizing average delay
- Enforcing queue-specific Quality of Service (QoS) constraints
- Ensuring fairness across competing traffic classes
Two key RL architectures are explored:
- DQN (Deep Q-Network): Operates on a discretized action/state space
- PPO (Proximal Policy Optimization): Trained in a continuous state space using policy gradients
- `RouterEnv`: A fine-grained continuous-state environment optimized for PPO
- `TabularStyleRouterEnv`: A discretized router model tailored for DQN-based learning
Agents are trained and tested across multiple scenarios that vary:
- Traffic arrival rates
- Queue switching penalties
- Overall network dynamics
This allows for evaluation under both normal and stress-tested network conditions.
Each episode tracks key performance metrics:
- Total reward
- Mean delay
- QoS success rate
- Queue switching behavior
Results are exported as:
- CSV logs for reproducibility
- Visual plots showing model performance trends over time
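A minimal sketch of how the per-episode CSV export might look; the column names mirror the metrics table further down, but the actual `plotManager.py` interface may differ.

```python
# Illustrative per-episode CSV export; the real plotManager.py API may differ.
import csv

def export_episode_log(rows, path="results/plots/dqn/csv/episode_log.csv"):
    """rows: list of dicts keyed by the metric names below."""
    fields = ["Total_Reward", "Avg_Delay", "QoS_Rate", "Switch_Count", "Dropped"]
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(rows)
```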
This project highlights how deep reinforcement learning can be adapted for QoS-constrained network scheduling, showcasing critical trade-offs between:
- Policy performance
- Fairness across queues
- Generalization across dynamic traffic scenarios
| Module | Description |
|---|---|
| `router_scheduler_env.py` | Primary Gym-style environment with continuous (13D) state |
| `tabular_style_router_env.py` | Lightweight environment with discretized (6D) tabular-style state |
| `dqn_agent.py` | DQN agent with replay buffer, target network, and Q-value logging |
| `ppo_agent.py` | PPO agent using Stable-Baselines3's `MlpPolicy` |
| `main_train.py` | Trains and evaluates DQN on Scenarios 1 and 2 |
| `main_train_ppo.py` | Trains and evaluates PPO independently |
| `plotManager.py` | Generates all evaluation plots, CSV exports, and summaries |
- 13D continuous observation space
- Tracks packet urgency, backlog, delays, and QoS violations
- Includes switch penalty and backlog difference shaping
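In Gym terms, a 13-dimensional continuous observation space might be declared as below; the bounds and feature ordering are assumptions, not copied from `router_scheduler_env.py`.

```python
# Hedged sketch of RouterEnv's observation space declaration; the bounds and
# exact feature layout (urgency, backlog, delays, QoS violations) are assumed.
import numpy as np
from gym import spaces

observation_space = spaces.Box(low=0.0, high=np.inf, shape=(13,), dtype=np.float32)
```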
- 6D MultiDiscrete state space: [length_bin, deadline_flag, delay_bin] × 2
- Suitable for tabular approximations but powered by a neural DQN agent
- Emulates Q-table-like behavior with DQN generalization
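The tabular-style counterpart can be expressed as a `MultiDiscrete` space; the bin counts below are illustrative assumptions rather than the values used in `tabular_style_router_env.py`.

```python
# Hedged sketch of the 6D MultiDiscrete state: [length_bin, deadline_flag,
# delay_bin] for each of two tracked queues; bin counts are assumptions.
from gym import spaces

N_LENGTH_BINS, N_DELAY_BINS = 4, 4   # illustrative discretization granularity
observation_space = spaces.MultiDiscrete([N_LENGTH_BINS, 2, N_DELAY_BINS] * 2)
```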
- 3-layer MLP
- Uses SmoothL1Loss, gradient clipping, and epsilon-greedy with decay
- Experience replay with target network sync every 10 steps
- Custom logging for Q-values every 500 steps
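A condensed sketch of the update loop these bullets describe (SmoothL1Loss, gradient clipping, periodic target sync); the network width, clip norm, and other hyperparameters are assumptions rather than the values used in `dqn_agent.py`.

```python
# Hedged sketch of the DQN update: 3-layer MLP, SmoothL1Loss, gradient
# clipping, and a target network synced every `sync_every` steps.
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99, step=0, sync_every=10):
    states, actions, rewards, next_states, dones = batch
    # Q-values of the actions actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    loss = nn.SmoothL1Loss()(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(q_net.parameters(), max_norm=1.0)
    optimizer.step()
    if step % sync_every == 0:                  # target network sync every 10 steps
        target_net.load_state_dict(q_net.state_dict())
    return loss.item()
```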
- Policy-gradient method using SB3's `MlpPolicy`
- Trained with `DummyVecEnv` and `TransformObservation` for tabular compatibility
- CPU-optimized (no CNN policy used)
- Arrival rates: Video 0.3, Voice 0.25, BestEffort 0.4
- No switch penalty
- Baseline scenario for fair prioritization and reward calibration
- Increased arrival pressure and delayed switches
- Tests agent stability under stress
- Highlights how well the model generalizes or collapses under congestion
- Transfer learning between scenarios is intentionally not used in DQN v3.1.4
- Each scenario is evaluated independently
- Ensures Scenario 2 learning is not biased by Scenario 1 policies
- Future models may include transfer + fine-tuning comparisons
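One way the two scenarios might be parameterized is a small config dict like the one below; Scenario 1's arrival rates match the values listed above, while the Scenario 2 numbers and both switch penalties are placeholders, not the repository's actual settings.

```python
# Scenario 1 rates come from this README; Scenario 2 values are placeholders.
SCENARIOS = {
    "scenario_1": {
        "arrival_rates": {"Video": 0.30, "Voice": 0.25, "BestEffort": 0.40},
        "switch_penalty": 0.0,   # baseline: no switching cost
    },
    "scenario_2": {
        "arrival_rates": {"Video": 0.40, "Voice": 0.35, "BestEffort": 0.50},  # placeholder stress load
        "switch_penalty": 0.2,   # placeholder penalty on queue switches
    },
}
```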
| Metric | Description |
|---|---|
| `QoS_Success` | Packets delivered within the delay constraint |
| `Avg_Delay` | Per-queue average delay per episode |
| `Switch_Count` | Number of queue switches (action != 0) |
| `Total_Reward` | Smoothed reward (time-based + state-based components) |
| `Dropped` | Packets lost due to queue overflow |
| `QoS_Rate` | Success percentage out of served packets |
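For clarity, the two derived columns reduce to simple ratios; the helpers below just restate those definitions (function names are illustrative, not from the codebase).

```python
# Restating the derived metrics above; function names are illustrative.
def qos_rate(qos_success: int, served: int) -> float:
    """QoS_Rate: success percentage out of served packets."""
    return 100.0 * qos_success / served if served else 0.0

def avg_delay(total_delay_ms: float, served: int) -> float:
    """Avg_Delay: mean per-packet delay over an episode, in ms."""
    return total_delay_ms / served if served else 0.0
```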
| Scenario | Video QoS | Voice QoS | BE Drop | Max Reward | Notes |
|---|---|---|---|---|---|
| Scenario 1 | ~99.9% | ~97.1% | Low | ~1723 | Stable, fair |
| Scenario 2 | ~98.6% | – | Medium | ~585 | Voice underperforming |
- Saved models in `results/models/`
- CSV logs in `results/plots/dqn/csv/`
- Evaluation plots in `results/plots/dqn/` and `results/plots/ppo/`
- Reward tuning for Voice queue violations (`v3.1.5+`)
- Hybrid agent using a shared DQN+PPO architecture
- Fairness metric integration and age-based prioritization
- Comparative baselines (FIFO, EDF, Strict Priority)
- Curriculum learning for sequential scenario training
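Of the planned baselines, Strict Priority is simple enough to sketch here; this is an illustrative reference point, not code from the repository.

```python
# Illustrative Strict Priority baseline: always serve the highest-priority
# non-empty queue. The priority order and tie-breaking are assumptions.
PRIORITY_ORDER = ["Voice", "Video", "BestEffort"]

def strict_priority_action(queue_lengths: dict) -> int:
    """Return the index (into PRIORITY_ORDER) of the queue to serve next."""
    for idx, name in enumerate(PRIORITY_ORDER):
        if queue_lengths.get(name, 0) > 0:
            return idx
    return 0  # nothing queued: default to the first queue
```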
- Stable Baselines3 Documentation
- Reinforcement Learning Course by David Silver
- OpenAI Gym
- Gymnasium Project
```bash
python main_train.py
```