Tannerthings
foundation 20 minutes beginner

Tutorial 06: Data Sources vs Resources

Query existing infrastructure without managing it - reduce state bloat and dependencies

Published February 19, 2025
Updated December 11, 2025

Brutal Truth Up Front

Not everything needs Terraform management. Data sources let you reference existing infrastructure without importing it into state.

The tradeoff: reduced state complexity vs external dependencies. If someone deletes the VPC your data source queries, Terraform fails at plan time. Understanding when to use data sources vs resources prevents unnecessary state bloat and brittle configurations.

Prerequisites

  • Completed Tutorials 01-05
  • AWS account with existing resources
  • Understanding of state file purpose

What You’ll Build

Resources that reference existing infrastructure via data sources instead of importing or hardcoding values. You’ll see the difference in state files and understand dependency implications.

The Exercise

Step 1: The Problem - Hardcoded Values

Create main.tf with hardcoded IDs:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

resource "aws_instance" "app" {
  ami           = "ami-0c55b159cbfafe1f0"  # Hardcoded - will break when AMI is deprecated
  instance_type = "t2.micro"
  subnet_id     = "subnet-abc123"           # Hardcoded - breaks in different accounts
  
  tags = {
    Name = "app-server"
  }
}

Problems:

  • AMI IDs differ across regions
  • Subnet IDs are account-specific
  • Code isn’t portable
  • No validation that resources exist

Step 2: Solution - Data Sources

Replace with data sources:

terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

# Query the latest Ubuntu AMI
data "aws_ami" "ubuntu" {
  most_recent = true
  owners      = ["099720109477"] # Canonical
  
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
  
  filter {
    name   = "virtualization-type"
    values = ["hvm"]
  }
}

# Query default VPC
data "aws_vpc" "default" {
  default = true
}

# Query subnet in default VPC
data "aws_subnets" "default" {
  filter {
    name   = "vpc-id"
    values = [data.aws_vpc.default.id]
  }
}

# Use data source outputs in resources
resource "aws_instance" "app" {
  ami           = data.aws_ami.ubuntu.id
  instance_type = "t2.micro"
  subnet_id     = tolist(data.aws_subnets.default.ids)[0]
  
  tags = {
    Name = "app-server"
  }
}

output "ami_id" {
  value = data.aws_ami.ubuntu.id
}

output "ami_name" {
  value = data.aws_ami.ubuntu.name
}

output "subnet_used" {
  value = tolist(data.aws_subnets.default.ids)[0]
}

Apply:

terraform init
terraform apply

Check outputs - you’ll see dynamically discovered values.
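
Note that [0] takes whichever subnet ID happens to come first in the list. If a stable choice matters, one option is to sort the IDs before indexing. The following is a minimal sketch rather than part of the original exercise; the local name chosen_subnet_id and the output name are hypothetical.

```hcl
# Sketch: pick a subnet deterministically instead of relying on list order.
# chosen_subnet_id is a hypothetical name introduced for illustration.
locals {
  chosen_subnet_id = sort(data.aws_subnets.default.ids)[0]
}

output "subnet_used_sorted" {
  value = local.chosen_subnet_id
}
```

Sorting does not make the subnet "correct" for your workload; it only makes the selection repeatable between runs.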

Step 3: Inspect State

terraform state list

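Notice: Data sources appear in state under a data. prefix (for example, data.aws_ami.ubuntu). In the state file they are recorded with mode "data" rather than "managed", so Terraform reads them but never creates, updates, or destroys what they point to. Compare: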

terraform state show aws_instance.app
terraform state show data.aws_ami.ubuntu

The instance has full state tracking. The data source only caches query results - Terraform doesn’t manage the AMI.

Step 4: Common Data Source Patterns

Querying Current AWS Account:

data "aws_caller_identity" "current" {}

output "account_id" {
  value = data.aws_caller_identity.current.account_id
}
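
A common use of the account ID is building names that must be globally unique. The sketch below assumes the aws_caller_identity data source above is already declared and that the AWS provider is on the 5.x series pinned earlier; the app-logs prefix and the local and resource names are hypothetical.

```hcl
# Sketch: combine account ID and region into a unique bucket name.
# The "app-logs" prefix, log_bucket_name, and "logs" are hypothetical names.
data "aws_region" "current" {}

locals {
  # name is the region attribute in AWS provider 5.x
  log_bucket_name = "app-logs-${data.aws_caller_identity.current.account_id}-${data.aws_region.current.name}"
}

resource "aws_s3_bucket" "logs" {
  bucket = local.log_bucket_name
}
```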

Querying Existing Resources by Tag:

data "aws_vpc" "prod" {
  tags = {
    Environment = "production"
  }
}

data "aws_security_group" "db" {
  tags = {
    Name = "database-sg"
  }
}

Querying AWS-Managed Resources:

data "aws_availability_zones" "available" {
  state = "available"
}

# Assumes aws_vpc.main is defined elsewhere and the region has at least 3 AZs
resource "aws_subnet" "example" {
  count             = 3
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}
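
Indexing names with count.index fails if the region offers fewer zones than the count. One way to guard against that, sketched here as a variant of the subnet block above (use one or the other, not both), is to cap the count at the number of zones returned. The local name subnet_count is hypothetical, and aws_vpc.main is assumed to exist elsewhere.

```hcl
# Sketch: never request more subnets than there are availability zones.
# subnet_count is a hypothetical local; aws_vpc.main is assumed to exist.
locals {
  subnet_count = min(3, length(data.aws_availability_zones.available.names))
}

resource "aws_subnet" "example" {
  count             = local.subnet_count
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
}
```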

The Break (Intentional Failure Scenarios)

Scenario 1: Data Source Returns No Results

Query a non-existent VPC:

data "aws_vpc" "nonexistent" {
  tags = {
    Name = "does-not-exist"
  }
}

resource "aws_subnet" "app" {
  vpc_id     = data.aws_vpc.nonexistent.id  # Will fail: the lookup matches nothing
  cidr_block = "10.0.1.0/24"
}

Run plan:

terraform plan

Error: no matching VPC found. Terraform fails at plan time - it can’t create a plan without knowing the VPC ID.
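
If you want a clearer failure message than the provider's default, one approach is to query with the plural aws_vpcs data source and assert on the result count. This is a sketch under two assumptions: that aws_vpcs returns an empty list instead of an error when nothing matches (worth confirming for your provider version), and that you are on Terraform 1.2 or later, where precondition blocks are available. The security group name is hypothetical.

```hcl
# Sketch: look up VPCs with the plural data source, then assert on the count.
# Assumes aws_vpcs returns an empty list (not an error) when nothing matches.
data "aws_vpcs" "candidates" {
  tags = {
    Name = "does-not-exist"
  }
}

resource "aws_security_group" "app" {
  name   = "app-sg" # hypothetical name
  vpc_id = try(data.aws_vpcs.candidates.ids[0], null)

  lifecycle {
    precondition {
      condition     = length(data.aws_vpcs.candidates.ids) == 1
      error_message = "Expected exactly one VPC tagged Name=does-not-exist; check the tag or ask the owning team."
    }
  }
}
```

The plan still fails when the VPC is missing, but the message now tells the reader what was expected and where to look.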

Scenario 2: Data Source Returns Multiple Results

Query AMI without most_recent:

data "aws_ami" "ubuntu" {
  # most_recent = true  # Commented out
  owners = ["099720109477"]
  
  filter {
    name   = "name"
    values = ["ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-*"]
  }
}

Error: Your query returned more than one result. Please try a more specific search criteria.

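Data sources that return a single object (like aws_ami, aws_vpc) require the query to match exactly one result. For aws_ami, setting most_recent = true resolves the ambiguity by selecting the newest match.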

Scenario 3: External Dependency Changes

Someone deletes the VPC your data source queries. Next plan:

terraform plan

Fails immediately - can’t find VPC. Your infrastructure is now undeployable until the VPC is restored or configuration is updated.

This is the risk of external dependencies.
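
One way to soften this risk is to give operators an escape hatch: an optional variable that bypasses the lookup entirely. The sketch below is an illustration, not the tutorial's prescribed pattern; the names vpc_id_override, shared, and effective_vpc_id are hypothetical.

```hcl
# Sketch: allow an explicit VPC ID to bypass the tag lookup.
# vpc_id_override, "shared", and effective_vpc_id are hypothetical names.
variable "vpc_id_override" {
  type        = string
  default     = null
  description = "Set to skip the tag-based VPC lookup."
}

data "aws_vpc" "shared" {
  # Only run the lookup when no override is supplied
  count = var.vpc_id_override == null ? 1 : 0

  tags = {
    Environment = "production"
  }
}

locals {
  # With an override, the splat is empty and one() yields null,
  # so coalesce returns the override; otherwise it returns the looked-up ID.
  effective_vpc_id = coalesce(var.vpc_id_override, one(data.aws_vpc.shared[*].id))
}
```

Resources would then reference local.effective_vpc_id, and a run such as terraform plan -var="vpc_id_override=vpc-abc123" skips the lookup without a code change.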

The Recovery

When Data Source Fails

If infrastructure referenced by data source is deleted:

Option 1: Create the missing resource

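# Replace data source with resource
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
  
  tags = {
    Name = "main"
  }
}

# Subnet for the instance, inside the new VPC
resource "aws_subnet" "app" {
  vpc_id     = aws_vpc.main.id
  cidr_block = "10.0.1.0/24"
}

# Update references
resource "aws_instance" "app" {
  # ...
  subnet_id = aws_subnet.app.id
}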

Option 2: Update data source query

data "aws_vpc" "main" {
  default = true  # Fallback to default VPC
}

Converting Data Source to Resource

If you decide to manage previously-queried infrastructure:

  1. Remove data source from HCL
  2. Add resource block
  3. Import existing infrastructure:
terraform import aws_vpc.main vpc-abc123
  4. Verify plan shows no changes
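
On Terraform 1.5 or later, the import step can also be written declaratively with an import block, so the import shows up in the plan before anything is written to state. A minimal sketch, reusing the placeholder ID vpc-abc123 from the command above and assuming the real VPC uses the CIDR shown:

```hcl
# Sketch: declarative import, assuming Terraform 1.5 or later.
# vpc-abc123 is the same placeholder ID used in the CLI command above.
import {
  to = aws_vpc.main
  id = "vpc-abc123"
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16" # must match the real VPC for a clean plan

  tags = {
    Name = "main"
  }
}
```

Once the apply succeeds, the import block can be removed; the resource block stays.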

Exit Criteria

You understand this tutorial if you can:

  • Differentiate when to use data sources vs resources
  • Query AWS resources using filters and tags
  • Handle data source errors gracefully
  • Explain state implications of data sources vs resources
  • Convert between data sources and resources

Key Lessons

  1. Data sources query, resources manage - choose based on ownership
  2. Data sources reduce state complexity - fewer resources to track
  3. External dependencies are fragile - queried resources can disappear
  4. Filters must be specific - ambiguous queries fail
  5. State contains cached queries - data sources refresh on plan

Why This Matters in Production

Use data sources when:

  • Querying AWS-managed resources (availability zones, AMIs)
  • Referencing infrastructure owned by another team
  • Reducing blast radius (not responsible for VPC lifecycle)
  • Portability across accounts/regions

Use resources when:

  • You own the lifecycle
  • Changes require coordination
  • Compliance requires an audit trail of all managed infrastructure

In FedRAMP High environments:

# Don't manage the entire VPC - networking team owns it
data "aws_vpc" "fedramp" {
  tags = {
    Compliance  = "FedRAMP-High"
    Environment = "production"
  }
}

# Do manage your application subnet
resource "aws_subnet" "app" {
  vpc_id     = data.aws_vpc.fedramp.id
  cidr_block = "10.0.10.0/24"
  
  tags = {
    Component = "Application"
  }
}

This separates concerns: the networking team manages the VPC, and your team manages the subnets within it.
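
The subnet range itself can also come from the data source rather than being hardcoded. The sketch below is a variant of the subnet block above; it assumes the VPC CIDR is a /16 and that the tenth /24 inside it is unallocated, neither of which the example can verify for you.

```hcl
# Sketch: derive the subnet range from the queried VPC instead of hardcoding it.
# Assumes the VPC CIDR is a /16 such as 10.0.0.0/16 and that index 10 is free.
resource "aws_subnet" "app" {
  vpc_id     = data.aws_vpc.fedramp.id
  cidr_block = cidrsubnet(data.aws_vpc.fedramp.cidr_block, 8, 10)

  tags = {
    Component = "Application"
  }
}
```

For a VPC of 10.0.0.0/16 this evaluates to 10.0.10.0/24, the same range as the hardcoded version, but it follows the VPC if the networking team ever assigns a different block.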

Real-World Pattern

Shared Services Architecture:

# Shared services account has Transit Gateway
data "aws_ec2_transit_gateway" "shared" {
  filter {
    name   = "tag:Name"
    values = ["org-transit-gateway"]
  }
}

# Your account attaches to it
resource "aws_ec2_transit_gateway_vpc_attachment" "this" {
  transit_gateway_id = data.aws_ec2_transit_gateway.shared.id
  vpc_id             = aws_vpc.main.id
  subnet_ids         = aws_subnet.private[*].id
}

You don’t manage the Transit Gateway (owned by networking team), but you do manage your attachment to it.

Data Sources in Modules

When creating reusable modules:

# modules/app-server/main.tf

# Accept VPC ID as input
variable "vpc_id" {
  type = string
}

# Query VPC attributes without managing it
data "aws_vpc" "target" {
  id = var.vpc_id
}

# Use VPC attributes
resource "aws_security_group" "app" {
  vpc_id = data.aws_vpc.target.id
  
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [data.aws_vpc.target.cidr_block]
  }
}

This allows modules to work with any VPC without requiring full VPC management.
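
A caller might wire the module up as follows. This is a usage sketch: the tag values and the module label app_server are hypothetical, and the source path assumes the module lives at modules/app-server relative to the root configuration, as the comment in the module file suggests.

```hcl
# Root module sketch: look up the VPC once, then pass its ID to the module.
# The tag values and the module label "app_server" are hypothetical.
data "aws_vpc" "prod" {
  tags = {
    Environment = "production"
  }
}

module "app_server" {
  source = "./modules/app-server"
  vpc_id = data.aws_vpc.prod.id
}
```

Keeping the tag-based lookup in the root module and passing only the ID means the module makes no assumptions about how a VPC is discovered.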

Next Steps

You’ve completed the Foundation track! You now understand:

  • State management
  • Resource replacement behavior
  • Variables and outputs
  • Remote state and locking
  • Importing existing infrastructure
  • Data sources vs resources

Next: Explore the Patterns track for real-world scenarios, or Operations track for production debugging.

Recommended: Tutorial 07: Count vs For_Each - Learn the right way to create multiple similar resources.

Cleanup

terraform destroy
