Tutorial 06: Data Sources vs Resources

Brutal Truth Up Front

Not everything needs Terraform management. Data sources let you reference existing infrastructure without importing it into state.

The tradeoff: reduced state complexity vs external dependencies. If someone deletes the VPC your data source queries, Terraform fails at plan time. Understanding when to use data sources vs resources prevents unnecessary state bloat and brittle configurations.

Prerequisites

Completed Tutorials 01-05
AWS account with existing resources
Understanding of state file purpose

What You’ll Build

Resources that reference existing infrastructure via data sources instead of importing or hardcoding values. You’ll see the difference in state files and understand dependency implications.

The Exercise

Step 1: The Problem - Hardcoded Values

Create main.tf with hardcoded IDs:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "app" {
  ami           = "ami-0c55b159cbfafe1f0"  # Hardcoded - will break when AMI is deprecated
  instance_type = "t2.micro"
  subnet_id     = "subnet-abc123"           # Hardcoded - breaks in different accounts
  
  tags = {
    Name = "app-server"
  }
}

Problems:

AMI IDs differ across regions
Subnet IDs are account-specific
Code isn’t portable
No validation that resources exist

Step 2: Solution - Data Sources

Replace with data sources:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# Query the latest Ubuntu AMI
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical
  
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
  
  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

# Query default VPC
data "aws_vpc" "default" {
  default = true
}

# Query all subnets in the default VPC (returns a list of IDs)
data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }
}

# Use data source outputs in resources
resource "aws_instance" "app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t2.micro"
  subnet_id     = tolist(data.aws_subnets.default.ids)[0]
  
  tags = {
    Name = "app-server"
  }
}

output "ami_id" {
  value = data.aws_ami.ubuntu.id
}

output "ami_name" {
  value = data.aws_ami.ubuntu.name
}

output "subnet_used" {
  value = tolist(data.aws_subnets.default.ids)[0]
}

Apply:

terraform init
terraform apply

Check outputs - you’ll see dynamically discovered values.

Step 3: Inspect State

terraform state list

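Notice: Data sources appear in state with a data. prefix (for example, data.aws_ami.ubuntu). Terraform records them in "data" mode rather than "managed" mode: it reads them but never creates, updates, or destroys the underlying object. Compare: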

terraform state show aws_instance.app
terraform state show data.aws_ami.ubuntu

The instance has full state tracking. The data source only caches query results - Terraform doesn’t manage the AMI.

Step 4: Common Data Source Patterns

Querying Current AWS Account:

data "aws_caller_identity" "current" {}

output "account_id" {
  value = data.aws_caller_identity.current.account_id
}
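
The same pattern extends to the current region and partition, which is handy for assembling ARNs without hardcoding anything. A minimal sketch, assuming the ~> 5.0 provider pinned in this tutorial (where aws_region exposes the region as name); it reuses data.aws_caller_identity.current from the block above, and the log group path is a hypothetical value chosen only for illustration:

```hcl
data "aws_region" "current" {}
data "aws_partition" "current" {}

locals {
  # Hypothetical log group path, used only to illustrate ARN assembly.
  app_log_group_arn = "arn:${data.aws_partition.current.partition}:logs:${data.aws_region.current.name}:${data.aws_caller_identity.current.account_id}:log-group:/app/example"
}

output "app_log_group_arn" {
  value = local.app_log_group_arn
}
```

Because every segment comes from a data source, the same code should produce a valid ARN in another account, region, or partition (such as GovCloud).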

Querying Existing Resources by Tag:

data "aws_vpc" "prod" {
  tags = {
    Environment = "production"
  }
}

data "aws_security_group" "db" {
  tags = {
    Name = "database-sg"
  }
}

Querying AWS-Managed Resources:

data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_subnet" "example" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}

The Break (Intentional Failure Scenarios)

Scenario 1: Data Source Returns No Results

Query a non-existent VPC:

data "aws_vpc" "nonexistent" {
  tags = {
    Name = "does-not-exist"
  }
}

resource "aws_subnet" "app" {
  vpc_id     = data.aws_vpc.nonexistent.id  # Never evaluated - the lookup above fails first
  cidr_block = "10.0.1.0/24"
}

Run plan:

terraform plan

Error: no matching VPC found. Terraform fails at plan time - it can’t create a plan without knowing the VPC ID.
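
If you want the failure to carry your own message, one option is to query with the plural aws_vpcs data source and assert the match count yourself. A minimal sketch, assuming Terraform 1.2 or later (for postcondition blocks) and a recent AWS provider in which aws_vpcs returns an empty list rather than an error; the resource name guarded and the CIDR are illustrative:

```hcl
# Plural lookup: yields a list of matching VPC IDs (possibly empty).
data "aws_vpcs" "candidates" {
  tags = {
    Name = "does-not-exist"
  }

  lifecycle {
    # Evaluated during plan, right after the data source is read.
    postcondition {
      condition     = length(self.ids) == 1
      error_message = "Expected exactly one VPC tagged Name=does-not-exist. Check the tag or create the VPC first."
    }
  }
}

resource "aws_subnet" "guarded" {
  # one() unwraps the single-element list validated above.
  vpc_id     = one(data.aws_vpcs.candidates.ids)
  cidr_block = "10.0.1.0/24"
}
```

The plan still fails when nothing matches, but the error now says what was expected and how to fix it.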

Scenario 2: Data Source Returns Multiple Results

Query AMI without most_recent:

data "aws_ami" "ubuntu" {
  # most_recent = true  # Commented out
  owners = ["099720109477"]
  
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

Error: Your query returned more than one result. Please try a more specific search criteria.

Data sources that return single values (like aws_ami, aws_vpc) require unique matches.

Scenario 3: External Dependency Changes

Someone deletes the VPC your data source queries. Next plan:

terraform plan

Fails immediately - can’t find VPC. Your infrastructure is now undeployable until the VPC is restored or configuration is updated.

This is the risk of external dependencies.

The Recovery

When Data Source Fails

If the infrastructure referenced by a data source is deleted:

Option 1: Create the missing resource

# Replace data source with resource
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  
  tags = {
    Name = "main"
  }
}

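# Recreate the subnet the instance needs
resource "aws_subnet" "example" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

# Update references
resource "aws_instance" "app" {
  # ...
  subnet_id = aws_subnet.example.id
}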

Option 2: Update data source query

data "aws_vpc" "main" {
  default = true  # Fallback to default VPC
}
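
The fallback can also be made explicit, so callers can pin a VPC and the default VPC is only queried when they don't. A minimal sketch; the variable name vpc_id, the data source label fallback, and the local name are assumptions made for this example:

```hcl
variable "vpc_id" {
  type        = string
  default     = null
  description = "Optional VPC ID. When null, the default VPC is looked up instead."
}

# Only query the default VPC when no ID was supplied.
data "aws_vpc" "fallback" {
  count   = var.vpc_id == null ? 1 : 0
  default = true
}

locals {
  # coalesce() returns the first non-null argument.
  # one() turns the zero-or-one element list into a value or null.
  effective_vpc_id = coalesce(var.vpc_id, one(data.aws_vpc.fallback[*].id))
}
```

Resources then reference local.effective_vpc_id and never need to know which path supplied it.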

Converting Data Source to Resource

If you decide to manage previously-queried infrastructure:

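Remove the data source from HCL
Add a resource block whose arguments match the existing infrastructure
Update references from the data source address to the resource address
Import existing infrastructure: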

terraform import aws_vpc.main vpc-abc123

Verify plan shows no changes
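
On Terraform 1.5 or later, the import step can also be declared in configuration instead of run as a CLI command. A minimal sketch, reusing the placeholder ID from the command above and assuming the real VPC's CIDR is 10.0.0.0/16:

```hcl
# Declarative import: handled during the next plan/apply.
import {
  to = aws_vpc.main
  id = "vpc-abc123" # placeholder ID, replace with the real VPC ID
}

resource "aws_vpc" "main" {
  # Must match the existing VPC, otherwise the plan will not be clean.
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "main"
  }
}
```

terraform plan should then report the VPC as an import with no other changes. Once applied, the import block can be removed.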

Exit Criteria

You understand this tutorial if you can:

Differentiate when to use data sources vs resources
Query AWS resources using filters and tags
Handle data source errors gracefully
Explain state implications of data sources vs resources
Convert between data sources and resources

Key Lessons

Data sources query, resources manage - choose based on ownership
Data sources reduce state complexity - fewer resources to track
External dependencies are fragile - queried resources can disappear
Filters must be specific - ambiguous queries fail
State contains cached queries - data sources refresh on plan

Why This Matters in Production

Use data sources when:

Querying AWS-managed resources (availability zones, AMIs)
Referencing infrastructure owned by another team
Reducing blast radius (not responsible for VPC lifecycle)
Portability across accounts/regions

Use resources when:

You own the lifecycle
Changes require coordination
Compliance requires audit trail of all managed infrastructure

In FedRAMP High environments:

# Don't manage the entire VPC - networking team owns it
data "aws_vpc" "fedramp" {
  tags = {
    Compliance  = "FedRAMP-High"
    Environment = "production"
  }
}

# Do manage your application subnet
resource "aws_subnet" "app" {
  vpc_id     = data.aws_vpc.fedramp.id
  cidr_block = "10.0.10.0/24"
  
  tags = {
    Component = "Application"
  }
}

This separates concerns: networking team manages VPC, your team manages subnets within it.
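
The hardcoded subnet CIDR above only works if the networking team's VPC happens to cover 10.0.10.0/24. As a variant, the range can be derived from the queried VPC instead. A minimal sketch, with the resource name app_derived and the subnet index 10 chosen for illustration:

```hcl
resource "aws_subnet" "app_derived" {
  vpc_id = data.aws_vpc.fedramp.id

  # Carve a /24 out of the VPC's own range.
  # With a 10.0.0.0/16 VPC, cidrsubnet(..., 8, 10) yields "10.0.10.0/24".
  cidr_block = cidrsubnet(data.aws_vpc.fedramp.cidr_block, 8, 10)

  tags = {
    Component = "Application"
  }
}
```

This assumes the VPC's primary CIDR is a /16 or larger and that the networking team has reserved that index for your team.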

Real-World Pattern

Shared Services Architecture:

# Shared services account has Transit Gateway
data "aws_ec2_transit_gateway" "shared" {
  filter {
    name   = "tag:Name"
    values = ["org-transit-gateway"]
  }
}

# Your account attaches to it
resource "aws_ec2_transit_gateway_vpc_attachment" "this" {
  transit_gateway_id = data.aws_ec2_transit_gateway.shared.id
  vpc_id             = aws_vpc.main.id
  subnet_ids         = aws_subnet.private[*].id
}

You don’t manage the Transit Gateway (owned by networking team), but you do manage your attachment to it.

Data Sources in Modules

When creating reusable modules:

# modules/app-server/main.tf

# Accept VPC ID as input
variable "vpc_id" {
  type = string
}

# Query VPC attributes without managing it
data "aws_vpc" "target" {
  id = var.vpc_id
}

# Use VPC attributes
resource "aws_security_group" "app" {
  vpc_id = data.aws_vpc.target.id
  
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [data.aws_vpc.target.cidr_block]
  }
}

This allows modules to work with any VPC without requiring full VPC management.
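
Calling the module from a root configuration might look like the following. A minimal sketch, assuming the module lives at ./modules/app-server as the file comment above suggests; the module label app_server is illustrative:

```hcl
# Root module: discover a VPC, then hand its ID to the module.
data "aws_vpc" "default" {
  default = true
}

module "app_server" {
  source = "./modules/app-server"
  vpc_id = data.aws_vpc.default.id
}
```

The root decides which VPC to use. The module only needs the ID and looks up the attributes it requires.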

Next Steps

You’ve completed the Foundation track! You now understand:

State management
Resource replacement behavior
Variables and outputs
Remote state and locking
Importing existing infrastructure
Data sources vs resources

Next: Explore the Patterns track for real-world scenarios, or Operations track for production debugging.

Recommended: Tutorial 07: Count vs For_Each - Learn the right way to create multiple similar resources.

Cleanup

terraform destroy
