Commit 35aa7e2

Habana Gaudi AWS DL1 Base Example (#30)
* Initial commit
* Update cloudinit flow
* Update cloud-init flow
* Updated cloud-init
* Update Gaudi example
* Updated readme
* Remove older files
* Delete examples/gen-ai-demo/main.tf
* Add Habana links to readme
1 parent b66473c commit 35aa7e2

7 files changed: +316 -2 lines changed

examples/gen-ai-gaudi-base/README.md

Lines changed: 125 additions & 0 deletions
@@ -0,0 +1,125 @@
<p align="center">
  <img src="https://github.com/intel/terraform-intel-aws-vm/blob/main/images/logo-classicblue-800px.png?raw=true" alt="Intel Logo" width="250"/>
</p>

# Intel® Optimized Cloud Modules for Terraform

© Copyright 2024, Intel Corporation

## AWS DL1 EC2 Instance with Intel Gaudi Accelerators

This demo showcases Large Language Model (LLM) inference using Intel Gaudi AI accelerators. This module installs the base software required to run other examples.

## Usage

### variables.tf

Modify the region to target a specific AWS Region.

```hcl
variable "region" {
  description = "Target AWS region to deploy EC2 in."
  type        = string
  default     = "us-east-1"
}
```
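
If you would rather not edit `variables.tf`, the default can also be overridden from a variable definitions file. The sketch below assumes a file named `terraform.tfvars` in the example folder (Terraform loads a file with that name automatically) and uses `us-west-2` purely as an illustration; confirm that DL1 instances are offered in the region you choose. Note that the `main.tf` in this example pins `availability_zone` to a zone in `us-east-1`, so that value would need to change along with the region.

```hcl
# terraform.tfvars (hypothetical file, not part of this commit)
# Overrides the default value of var.region declared in variables.tf.
region = "us-west-2" # illustrative value only
```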

### main.tf

Modify settings in this file to choose your AMI as well as other details around the instance that will be created. This demo was tested on Ubuntu 22.04.

```hcl
## Get latest Ubuntu 22.04 AMI in AWS for x86
data "aws_ami" "ubuntu-linux-2204" {
  most_recent = true
  owners      = ["099720109477"] # Canonical
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

module "ec2-vm" {
  source            = "intel/aws-vm/intel"
  key_name          = aws_key_pair.TF_key.key_name
  instance_type     = "dl1.24xlarge"
  availability_zone = "us-east-1a"
  ami               = data.aws_ami.ubuntu-linux-2204.id
  user_data         = data.cloudinit_config.ansible.rendered

  root_block_device = [{
    volume_size = "100"
  }]

  tags = {
    Name     = "my-test-vm-${random_id.rid.dec}"
    Owner    = "OwnerName-${random_id.rid.dec}"
    Duration = "2"
  }
}
```

Run the Terraform commands below to deploy the demo.

```Shell
terraform init
terraform plan
terraform apply
```

## Running the Demo using AWS CloudShell

Sign in to your AWS account and open CloudShell.
At the command prompt, enter the following commands to install Terraform in CloudShell:

```Shell
git clone https://github.com/tfutils/tfenv.git ~/.tfenv
mkdir -p ~/bin
ln -s ~/.tfenv/bin/* ~/bin/
tfenv install 1.3.0
tfenv use 1.3.0
```

Download the [Gen-AI-Gaudi-Demo](https://github.com/intel/terraform-intel-aws-vm/tree/main/examples/gen-ai-gaudi-base) Terraform module by running this command:

```Shell
git clone https://github.com/intel/terraform-intel-aws-vm.git
```

Change into the `examples/gen-ai-gaudi-base` example folder:

```Shell
cd terraform-intel-aws-vm/examples/gen-ai-gaudi-base
```

Run the Terraform commands below to deploy the demo.

```Shell
terraform init
terraform plan
terraform apply
```

After the Terraform module successfully creates the EC2 instance, **wait about 15 minutes** for the recipe to download and install the Intel Gaudi driver and software. Once the installation finishes, you can launch the Habana Gaudi PyTorch container with the following command:

```bash
sudo docker run -it --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none --cap-add=sys_nice --net=host --ipc=host vault.habana.ai/gaudi-docker/1.15.1/ubuntu22.04/habanalabs/pytorch-installer-2.2.0:latest
```

## Deleting the Demo

To delete the demo, run `terraform destroy`, which removes all resources created by this example.

## Considerations

- The AWS region where this example is run should have a default VPC.

## Links

[Intel® Gaudi® AI Accelerator](https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi-overview.html)

[Intel® Gaudi® AI Accelerator - Developer Website](https://developer.habana.ai/)
Lines changed: 9 additions & 0 deletions
@@ -0,0 +1,9 @@
#cloud-config
package_update: true
package_upgrade: true

runcmd:
  - apt install git ansible docker.io -y
  - git clone https://github.com/intel/optimized-cloud-recipes.git /opt/optimized-cloud-recipes
  - echo "@reboot ansible-playbook /opt/optimized-cloud-recipes/recipes/ai-gaudi-ubuntu/recipe.yml" | crontab -
  - reboot
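
`main.tf` passes `data.cloudinit_config.ansible.rendered` as `user_data`, but the data source itself falls outside the hunks shown in this diff. As a rough sketch of how such a data source could wrap the cloud-config above using the `hashicorp/cloudinit` provider: the file name `cloud_init.yml` is an assumption (the diff view does not show this file's name), and the `gzip` and `base64_encode` settings are illustrative rather than taken from the repository.

```hcl
# Sketch only: the file name and encoding settings below are assumptions.
data "cloudinit_config" "ansible" {
  gzip          = false # keep the payload as plain text for user_data
  base64_encode = false

  part {
    filename     = "cloud_init.yml" # hypothetical file name
    content_type = "text/cloud-config"
    content      = file("${path.module}/cloud_init.yml")
  }
}
```

Keeping the payload uncompressed makes it easy to inspect on the instance when debugging the first boot; a compressed, base64-encoded payload would be an equally plausible choice.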

examples/gen-ai-demo/main.tf renamed to examples/gen-ai-gaudi-base/main.tf

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-# Provision EC2 Instance on Icelake on Amazon Linux OS in default vpc. It is configured to create the EC2 in
+# Provision EC2 DL1 Instance on Ubuntu Linux OS in default vpc. It is configured to create the EC2 in
 # US-East-1 region. The region is provided in variables.tf in this example folder.
 
 # This example also create an EC2 key pair. Associate the public key with the EC2 instance. Create the private key
@@ -80,7 +80,7 @@ module "ec2-vm" {
   count             = var.vm_count
   source            = "intel/aws-vm/intel"
   key_name          = aws_key_pair.TF_key.key_name
-  instance_type     = "m7i.4xlarge"
+  instance_type     = "dl1.24xlarge"
   availability_zone = "us-east-1c"
   ami               = data.aws_ami.ubuntu-linux-2204.id
   user_data         = data.cloudinit_config.ansible.rendered

examples/gen-ai-gaudi-base/outputs.tf

Lines changed: 113 additions & 0 deletions
@@ -0,0 +1,113 @@
output "id" {
  description = "The ID of the instance"
  value       = try(module.ec2-vm.*.id, "")
}

output "arn" {
  description = "The ARN of the instance"
  value       = try(module.ec2-vm.*.arn, "")
}

output "capacity_reservation_specification" {
  description = "Capacity reservation specification of the instance"
  value       = try(module.ec2-vm.*.capacity_reservation_specification, "")
}

output "instance_state" {
  description = "The state of the instance. One of: `pending`, `running`, `shutting-down`, `terminated`, `stopping`, `stopped`"
  value       = try(module.ec2-vm.*.instance_state, "")
}

output "outpost_arn" {
  description = "The ARN of the Outpost the instance is assigned to"
  value       = try(module.ec2-vm.*.outpost_arn, "")
}

output "password_data" {
  description = "Base-64 encoded encrypted password data for the instance. Useful for getting the administrator password for instances running Microsoft Windows. This attribute is only exported if `get_password_data` is true"
  value       = try(module.ec2-vm.*.password_data, "")
}

output "primary_network_interface_id" {
  description = "The ID of the instance's primary network interface"
  value       = try(module.ec2-vm.*.primary_network_interface_id, "")
}

output "private_dns" {
  description = "The private DNS name assigned to the instance. Can only be used inside the Amazon EC2, and only available if you've enabled DNS hostnames for your VPC"
  value       = try(module.ec2-vm.*.private_dns, "")
}

output "public_dns" {
  description = "The public DNS name assigned to the instance. For EC2-VPC, this is only available if you've enabled DNS hostnames for your VPC"
  value       = try(module.ec2-vm.*.public_dns, "")
}

output "public_ip" {
  description = "The public IP address assigned to the instance, if applicable. NOTE: If you are using an aws_eip with your instance, you should refer to the EIP's address directly and not use `public_ip` as this field will change after the EIP is attached"
  value       = try(module.ec2-vm.*.public_ip, "")
}

output "private_ip" {
  description = "The private IP address assigned to the instance."
  value       = try(module.ec2-vm.*.private_ip, "")
}

output "ipv6_addresses" {
  description = "The IPv6 address assigned to the instance, if applicable."
  value       = try(module.ec2-vm.*.ipv6_addresses, [])
}

output "tags_all" {
  description = "A map of tags assigned to the resource, including those inherited from the provider default_tags configuration block"
  value       = try(module.ec2-vm.*.tags_all, {})
}

output "spot_bid_status" {
  description = "The current bid status of the Spot Instance Request"
  value       = try(module.ec2-vm.*.spot_bid_status, "")
}

output "spot_request_state" {
  description = "The current request state of the Spot Instance Request"
  value       = try(module.ec2-vm.*.spot_request_state, "")
}

output "spot_instance_id" {
  description = "The Instance ID (if any) that is currently fulfilling the Spot Instance request"
  value       = try(module.ec2-vm.*.spot_instance_id, "")
}

################################################################################
# IAM Role / Instance Profile
################################################################################

output "iam_role_name" {
  description = "The name of the IAM role"
  value       = try(module.ec2-vm.*.aws_iam_role.name, null)
}

output "iam_role_arn" {
  description = "The Amazon Resource Name (ARN) specifying the IAM role"
  value       = try(module.ec2-vm.*.aws_iam_role.arn, null)
}

output "iam_role_unique_id" {
  description = "Stable and unique string identifying the IAM role"
  value       = try(module.ec2-vm.*.aws_iam_role.unique_id, null)
}

output "iam_instance_profile_arn" {
  description = "ARN assigned by AWS to the instance profile"
  value       = try(module.ec2-vm.*.aws_iam_instance_profile.arn, null)
}

output "iam_instance_profile_id" {
  description = "Instance profile's ID"
  value       = try(module.ec2-vm.*.aws_iam_instance_profile.id, null)
}

output "iam_instance_profile_unique" {
  description = "Stable and unique string identifying the IAM instance profile"
  value       = try(module.ec2-vm.*.aws_iam_instance_profile.unique_id, null)
}
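
Because `module "ec2-vm"` is declared with `count`, each splat expression above evaluates to a list with one element per VM. A possible extra output, sketched here and not part of the commit, turns those addresses into ready-to-paste SSH commands. It assumes the module's `public_ip` output is a single string per instance and that the login user is `ubuntu` (the default for Canonical's Ubuntu AMIs); the private key path is left as a placeholder because the key file name is not shown in this diff.

```hcl
# Hypothetical addition to outputs.tf; not part of this commit.
output "ssh_commands" {
  description = "One SSH command per VM (sketch; assumes the default ubuntu user)"
  value = [
    # Iterate over every module instance created by count.
    for ip in module.ec2-vm[*].public_ip :
    "ssh -i <path-to-private-key> ubuntu@${ip}"
  ]
}
```

After `terraform apply`, running `terraform output ssh_commands` would print the list.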
Lines changed: 4 additions & 0 deletions
@@ -0,0 +1,4 @@
provider "aws" {
  # Environment Variables used for Authentication
  region = var.region
}
Lines changed: 50 additions & 0 deletions
@@ -0,0 +1,50 @@
variable "region" {
  description = "Target AWS region to deploy EC2 in."
  type        = string
  default     = "us-east-1"
}

# Variable to add ingress rules to the security group. Replace the default values with the required ports and CIDR ranges.
variable "ingress_rules" {
  type = list(object({
    from_port   = number
    to_port     = number
    protocol    = string
    cidr_blocks = string
  }))
  default = [
    {
      from_port   = 22
      to_port     = 22
      protocol    = "tcp"
      cidr_blocks = "0.0.0.0/0"

    },
    {
      from_port   = 7860
      to_port     = 7860
      protocol    = "tcp"
      cidr_blocks = "0.0.0.0/0"

    },
    {
      from_port   = 5000
      to_port     = 5000
      protocol    = "tcp"
      cidr_blocks = "0.0.0.0/0"
    },
    {
      from_port   = 5001
      to_port     = 5001
      protocol    = "tcp"
      cidr_blocks = "0.0.0.0/0"
    }
  ]
}

# Variable for how many VMs to build
variable "vm_count" {
  description = "Number of VMs to build."
  type        = number
  default     = 1
}
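
The default `ingress_rules` open every listed port to `0.0.0.0/0`, and the comment on the variable indicates that these defaults are meant to be replaced. One way to do that without editing the file is a variable definitions file. The sketch below assumes a hypothetical `terraform.tfvars` in the example folder and uses `203.0.113.10/32`, an address from the range reserved for documentation, as a stand-in for your own IP.

```hcl
# terraform.tfvars (hypothetical file, not part of this commit)

# Allow SSH only from a single admin address instead of 0.0.0.0/0.
ingress_rules = [
  {
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = "203.0.113.10/32" # placeholder: replace with your own CIDR
  }
]

# Build one VM (matches the default; shown for completeness).
vm_count = 1
```

Ports 7860, 5000, and 5001 from the defaults are omitted in this sketch; add them back, ideally with an equally narrow CIDR, if the examples you run on top of this base need them.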
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
terraform {
  required_version = ">=1.3.0"
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.31"
    }
    cloudinit = {
      source  = "hashicorp/cloudinit"
      version = ">=2.2.0"
    }
  }
}
