Tutorial 06: Data Sources vs Resources
Query existing infrastructure without managing it - reduce state bloat and hardcoded values
Prerequisites
Complete these tutorials first: Tutorial 05: Importing Existing Infrastructure
Brutal Truth Up Front
Not everything needs Terraform management. Data sources let you reference existing infrastructure without importing it into state.
The tradeoff: reduced state complexity vs external dependencies. If someone deletes the VPC your data source queries, Terraform fails at plan time. Understanding when to use data sources vs resources prevents unnecessary state bloat and brittle configurations.
Prerequisites
- Completed Tutorials 01-05
- AWS account with existing resources
- Understanding of state file purpose
What You’ll Build
Resources that reference existing infrastructure via data sources instead of importing or hardcoding values. You’ll see the difference in state files and understand dependency implications.
The Exercise
Step 1: The Problem - Hardcoded Values
Create main.tf with hardcoded IDs:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "app" {
  ami           = "ami-0c55b159cbfafe1f0" # Hardcoded - will break when AMI is deprecated
  instance_type = "t2.micro"
  subnet_id     = "subnet-abc123" # Hardcoded - breaks in different accounts

  tags = {
    Name = "app-server"
  }
}
Problems:
- AMI IDs differ across regions
- Subnet IDs are account-specific
- Code isn’t portable
- No validation that resources exist
Step 2: Solution - Data Sources
Replace with data sources:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# Query the latest Ubuntu AMI
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }

  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

# Query default VPC
data "aws_vpc" "default" {
  default = true
}

# Query subnets in default VPC
data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }
}

# Use data source outputs in resources
resource "aws_instance" "app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t2.micro"
  subnet_id     = tolist(data.aws_subnets.default.ids)[0]

  tags = {
    Name = "app-server"
  }
}

output "ami_id" {
  value = data.aws_ami.ubuntu.id
}

output "ami_name" {
  value = data.aws_ami.ubuntu.name
}

output "subnet_used" {
  value = tolist(data.aws_subnets.default.ids)[0]
}
Apply:
terraform init
terraform apply
Check outputs - you’ll see dynamically discovered values.
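Picking element [0] relies on the order in which the provider returns subnet IDs. The sketch below assumes that order is not guaranteed to stay the same between runs and sorts the IDs first, so the same subnet is chosen each time. The local name app_subnet_id is made up for illustration, and the provider block from Step 2 is assumed to be present.

```hcl
# Assumes the "aws" provider is configured as in Step 2.
data "aws_vpc" "default" {
  default = true
}

data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }
}

locals {
  # Hypothetical local: sort() orders the IDs lexicographically, so the
  # first element is the same on every run while the set of subnets is unchanged.
  app_subnet_id = sort(data.aws_subnets.default.ids)[0]
}

output "subnet_used" {
  value = local.app_subnet_id
}
```

The instance would then use subnet_id = local.app_subnet_id. This only makes the choice repeatable; adding or removing subnets in the VPC can still change which ID sorts first.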
Step 3: Inspect State
terraform state list
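Notice: Data sources appear in the state list with a data. prefix (for example, data.aws_ami.ubuntu). Terraform records them as read-only query results, not as managed resources. Compare: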
terraform state show aws_instance.app
terraform state show data.aws_ami.ubuntu
The instance has full state tracking. The data source only caches query results - Terraform doesn’t manage the AMI.
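Because the cached result is refreshed on each plan, a lookup with most_recent = true can resolve to a newer AMI later, and a changed ami value plans a replacement of the instance. One possible way to guard against that, shown here as a sketch and not as the tutorial's required setup, is to ignore AMI drift after creation:

```hcl
# Assumes the "aws" provider is configured as in Step 2.
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

# This block stands in for the aws_instance.app block from Step 2;
# it is not meant to be declared alongside it.
resource "aws_instance" "app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t2.micro"

  lifecycle {
    # The AMI is resolved when the instance is first created; later changes
    # in the lookup result no longer trigger a replacement.
    ignore_changes = [ami]
  }

  tags = {
    Name = "app-server"
  }
}
```

The tradeoff is that the instance stays on its original image until you replace it deliberately, for example with terraform apply -replace=aws_instance.app.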
Step 4: Common Data Source Patterns
Querying Current AWS Account:
data "aws_caller_identity" "current" {}

output "account_id" {
  value = data.aws_caller_identity.current.account_id
}
Querying Existing Resources by Tag:
data "aws_vpc" "prod" {
  tags = {
    Environment = "production"
  }
}

data "aws_security_group" "db" {
  tags = {
    Name = "database-sg"
  }
}
Querying AWS-Managed Resources:
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "example" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}
The Break (Intentional Failure Scenarios)
Scenario 1: Data Source Returns No Results
Query a non-existent VPC:
data "aws_vpc" "nonexistent" {
  tags = {
    Name = "does-not-exist"
  }
}

# A security group takes a VPC ID, so it can consume the lookup directly
resource "aws_security_group" "app" {
  name   = "app-sg"
  vpc_id = data.aws_vpc.nonexistent.id # Will fail - the lookup matches nothing
}
Run plan:
terraform plan
Error: no matching EC2 VPC found (exact wording varies by provider version). Terraform fails at plan time - it can't create a plan without knowing the VPC ID.
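If you would rather fail with a message of your own, one option is the plural aws_vpcs data source combined with a precondition. This is a sketch under two assumptions: AWS provider 4.x or later, where aws_vpcs returns an empty ids list when nothing matches, and Terraform 1.2 or later for precondition blocks. The resource name guarded and its error text are illustrative.

```hcl
# Plural lookup: yields a (possibly empty) list of IDs.
data "aws_vpcs" "candidates" {
  tags = {
    Name = "does-not-exist"
  }
}

resource "aws_security_group" "guarded" {
  name = "guarded-sg"

  # try() falls back to null when the list is empty; the precondition
  # below is what actually stops the plan in that case.
  vpc_id = try(data.aws_vpcs.candidates.ids[0], null)

  lifecycle {
    precondition {
      condition     = length(data.aws_vpcs.candidates.ids) == 1
      error_message = "Expected exactly one VPC tagged Name=does-not-exist. Check with the owning team whether it was renamed or deleted."
    }
  }
}
```

The plan still fails when the VPC is missing, but the error now says what was expected and who to ask.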
Scenario 2: Data Source Returns Multiple Results
Query AMI without most_recent:
data "aws_ami" "ubuntu" {
  # most_recent = true # Commented out
  owners = ["099720109477"]

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}
Error: Your query returned more than one result. Please try a more specific search criteria.
Data sources that return single values (like aws_ami, aws_vpc) require unique matches.
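A sketch of one way to resolve the ambiguity: restore most_recent so the newest match wins, and add filters so the candidates are at least the kind of image you expect. The architecture and root-device-type filters are additions for illustration and are not part of the original exercise.

```hcl
data "aws_ami" "ubuntu" {
  most_recent = true # Break ties by picking the newest matching image
  owners      = ["099720109477"] # Canonical

  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }

  # Illustrative extra filters that narrow the candidate set
  filter {
    name   = "architecture"
    values = ["x86_64"]
  }

  filter {
    name   = "root-device-type"
    values = ["ebs"]
  }
}
```

Extra filters alone rarely make a wildcard AMI query unique, since several dated builds usually match the name pattern. Here most_recent is what guarantees a single result.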
Scenario 3: External Dependency Changes
Someone deletes the VPC your data source queries. Next plan:
terraform plan
Fails immediately - can’t find VPC. Your infrastructure is now undeployable until the VPC is restored or configuration is updated.
This is the risk of external dependencies.
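One way to soften this dependency is an escape hatch: an optional variable that bypasses the lookup when it is set. The sketch below is a pattern suggestion, not part of the original exercise. The names vpc_id_override and local.vpc_id are hypothetical.

```hcl
variable "vpc_id_override" {
  type        = string
  default     = null
  description = "Hypothetical escape hatch: set this to skip the tag-based VPC lookup."
}

# The lookup only runs when no override is supplied.
data "aws_vpc" "prod" {
  count = var.vpc_id_override == null ? 1 : 0

  tags = {
    Environment = "production"
  }
}

locals {
  # one() returns null for an empty list, so coalesce() falls through
  # to whichever of the two values is actually set.
  vpc_id = coalesce(var.vpc_id_override, one(data.aws_vpc.prod[*].id))
}
```

Resources then reference local.vpc_id. If the tagged VPC disappears, running terraform plan -var 'vpc_id_override=vpc-abc123' lets you point at a replacement without editing the configuration.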
The Recovery
When Data Source Fails
If the infrastructure referenced by a data source is deleted:
Option 1: Create the missing resource
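# Replace data source with resource
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "main"
  }
}

# A subnet inside the new VPC for the instance to use
resource "aws_subnet" "main" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

# Update references
resource "aws_instance" "app" {
  # ...
  subnet_id = aws_subnet.main.id
}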
Option 2: Update data source query
data "aws_vpc" "main" {
  default = true # Fallback to default VPC
}
Converting Data Source to Resource
If you decide to manage previously-queried infrastructure:
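- Remove the data source from HCL
- Add a resource block whose arguments match the existing infrastructure
- Update references from data.aws_vpc.main to aws_vpc.main
- Import the existing infrastructure:
terraform import aws_vpc.main vpc-abc123
- Verify that terraform plan shows no changes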
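On Terraform 1.5 or later, the import step can also be written in configuration as an import block, so the import shows up in terraform plan before anything touches state. A minimal sketch, reusing the placeholder ID vpc-abc123 from above and assuming the real VPC uses the CIDR shown:

```hcl
# Declarative alternative to the terraform import command (Terraform 1.5+).
import {
  to = aws_vpc.main
  id = "vpc-abc123" # Placeholder ID from the steps above
}

resource "aws_vpc" "main" {
  # Assumed value: this must match the existing VPC,
  # otherwise the plan will show changes after import.
  cidr_block = "10.0.0.0/16"
}
```

Once the apply succeeds and the plan is clean, the import block can be deleted from the configuration.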
Exit Criteria
You understand this tutorial if you can:
- Differentiate when to use data sources vs resources
- Query AWS resources using filters and tags
- Handle data source errors gracefully
- Explain state implications of data sources vs resources
- Convert between data sources and resources
Key Lessons
- Data sources query, resources manage - choose based on ownership
- Data sources reduce state complexity - fewer resources to track
- External dependencies are fragile - queried resources can disappear
- Filters must be specific - ambiguous queries fail
- State contains cached queries - data sources refresh on plan
Why This Matters in Production
Use data sources when:
- Querying AWS-managed resources (availability zones, AMIs)
- Referencing infrastructure owned by another team
- Reducing blast radius (not responsible for VPC lifecycle)
- Portability across accounts/regions
Use resources when:
- You own the lifecycle
- Changes require coordination
- Compliance requires audit trail of all managed infrastructure
In FedRAMP High environments:
# Don't manage the entire VPC - networking team owns it
data "aws_vpc" "fedramp" {
  tags = {
    Compliance  = "FedRAMP-High"
    Environment = "production"
  }
}

# Do manage your application subnet
resource "aws_subnet" "app" {
  vpc_id     = data.aws_vpc.fedramp.id
  cidr_block = "10.0.10.0/24"

  tags = {
    Component = "Application"
  }
}
This separates concerns: networking team manages VPC, your team manages subnets within it.
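The subnet CIDR above is still hardcoded. As a sketch, it could instead be derived from the VPC the networking team owns. The example assumes that VPC has a /16 range, and the netnum of 10 is an arbitrary choice that would need to be agreed with whoever allocates address space.

```hcl
data "aws_vpc" "fedramp" {
  tags = {
    Compliance  = "FedRAMP-High"
    Environment = "production"
  }
}

# Alternative to the hardcoded subnet block above, not an addition to it.
resource "aws_subnet" "app" {
  vpc_id = data.aws_vpc.fedramp.id

  # Adds 8 bits to the VPC prefix and takes network number 10.
  # For a 10.0.0.0/16 VPC this evaluates to 10.0.10.0/24.
  cidr_block = cidrsubnet(data.aws_vpc.fedramp.cidr_block, 8, 10)

  tags = {
    Component = "Application"
  }
}
```

If the networking team moves the VPC to a different range, the subnet follows without a code change. Note that this would plan a replacement of the subnet.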
Real-World Pattern
Shared Services Architecture:
# Shared services account has Transit Gateway
data "aws_ec2_transit_gateway" "shared" {
  filter {
    name   = "tag:Name"
    values = ["org-transit-gateway"]
  }
}

# Your account attaches to it
resource "aws_ec2_transit_gateway_vpc_attachment" "this" {
  transit_gateway_id = data.aws_ec2_transit_gateway.shared.id
  vpc_id             = aws_vpc.main.id
  subnet_ids         = aws_subnet.private[*].id
}
You don’t manage the Transit Gateway (owned by networking team), but you do manage your attachment to it.
Data Sources in Modules
When creating reusable modules:
# modules/app-server/main.tf

# Accept VPC ID as input
variable "vpc_id" {
  type = string
}

# Query VPC attributes without managing it
data "aws_vpc" "target" {
  id = var.vpc_id
}

# Use VPC attributes
resource "aws_security_group" "app" {
  vpc_id = data.aws_vpc.target.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [data.aws_vpc.target.cidr_block]
  }
}
This allows modules to work with any VPC without requiring full VPC management.
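For illustration, a root configuration might call the module as sketched below. It assumes the module lives at ./modules/app-server, matching the path in the comment above. The module label app_server is hypothetical.

```hcl
# Root module: look up a VPC you don't manage and hand its ID to the module.
data "aws_vpc" "default" {
  default = true
}

module "app_server" {
  source = "./modules/app-server" # Assumed location of the module shown above
  vpc_id = data.aws_vpc.default.id
}
```

Passing the ID in, instead of having the module discover the VPC itself, keeps the lookup logic in one place and lets the same module be reused with a VPC that is managed as a resource.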
Next Steps
You’ve completed the Foundation track! You now understand:
- State management
- Resource replacement behavior
- Variables and outputs
- Remote state and locking
- Importing existing infrastructure
- Data sources vs resources
Next: Explore the Patterns track for real-world scenarios, or Operations track for production debugging.
Recommended: Tutorial 07: Count vs For_Each - Learn the right way to create multiple similar resources.
Cleanup
terraform destroy