๐Ÿ•PawData628 breeds

For data engineers

From CSV to insights: a practical playbook

This dataset ships as CSV and SQLite with a normalized provenance table. Here is how to wire it into real pipelines with Python, DuckDB, Airflow, dbt and Great Expectations โ€” without turning it into an enterprise archaeology dig.

1. Load with pandas or Polars

The canonical file is dog_breeds.csv (628 rows, 44 columns). Trait columns are on a 1โ€“5 scale, weight in pounds, height in inches.

python๐Ÿ•
import pandas as pd

df = pd.read_csv("data/dog_breeds.csv")

# Quick sanity pass
print(df.shape)               # (628, 44)
print(df["breed_group"].value_counts().head())
print(df["origin"].nunique(), "unique origins")

# Derived lifespan midpoint
df["lifespan_mid"] = (df["min_life_expectancy"] + df["max_life_expectancy"]) / 2
top = df.nlargest(10, "popularity_score")[["name", "breed_group", "origin"]]

Same, but on Polars โ€” faster and typed:

python๐Ÿ•
import polars as pl

df = pl.read_csv("data/dog_breeds.csv")
big_hounds = (
    df.filter((pl.col("breed_group") == "Hound") & (pl.col("size_category") == "large"))
      .select(["name", "origin", "max_weight_male"])
      .sort("max_weight_male", descending=True)
)

2. Query across files with DuckDB

DuckDB reads CSVs in place, no load step. Perfect for joining breeds to sources and images.

sql๐Ÿ•
-- Count sources per breed
SELECT b.name, COUNT(s.source_url) AS source_count
FROM read_csv_auto('data/dog_breeds.csv') b
LEFT JOIN read_csv_auto('data/dog_breed_sources.csv') s
  ON s.breed_key = b.key
GROUP BY b.name
ORDER BY source_count DESC
LIMIT 10;

-- Find breeds missing primary images
SELECT b.key, b.name
FROM read_csv_auto('data/dog_breeds.csv') b
LEFT JOIN (
  SELECT breed_key FROM read_csv_auto('data/dog_breed_images_inventory.csv')
  WHERE is_primary = 1
) i ON i.breed_key = b.key
WHERE i.breed_key IS NULL;

3. Use the shipped SQLite directly

dog_breeds.sqlite ships with tables dogs, dog_sources, dog_images, plus a reviewed_dogs view and dataset_meta.

python๐Ÿ•
import sqlite3, pandas as pd

con = sqlite3.connect("data/dog_breeds.sqlite")

# Widest lifespan ranges โ€” good candidates for outlier review
q = """
SELECT name, min_life_expectancy, max_life_expectancy,
       (max_life_expectancy - min_life_expectancy) AS span
FROM dogs
WHERE min_life_expectancy IS NOT NULL
ORDER BY span DESC
LIMIT 20
"""
outliers = pd.read_sql_query(q, con)

4. Validate with Great Expectations

Before publishing a pipeline downstream, pin your assumptions. A tiny expectation suite goes a long way:

python๐Ÿ•
import great_expectations as gx

ctx = gx.get_context()
src = ctx.sources.add_pandas("dogs")
asset = src.add_csv_asset("dog_breeds", filepath_or_buffer="data/dog_breeds.csv")

batch = asset.build_batch_request()
validator = ctx.get_validator(batch_request=batch)

validator.expect_column_values_to_not_be_null("key")
validator.expect_column_values_to_be_unique("key")
validator.expect_column_values_to_be_between("energy", 1, 5)
validator.expect_column_values_to_be_between("trainability", 1, 5)
validator.expect_column_values_to_match_regex("source_url", r"^https?://")
validator.expect_column_values_to_be_in_set(
    "review_status", ["reviewed", "seeded"]
)
validator.save_expectation_suite("dog_breeds.basic")

5. Orchestrate with Airflow

A realistic daily pipeline: ingest โ†’ validate โ†’ transform โ†’ publish. Below uses the TaskFlow API with Airflow 2.x:

python๐Ÿ•
from datetime import datetime
from airflow.decorators import dag, task

@dag(
    dag_id="dog_breeds_daily",
    start_date=datetime(2026, 4, 17),
    schedule="@daily",
    catchup=False,
    tags=["dogs", "demo"],
)
def dog_breeds_pipeline():

    @task
    def ingest() -> str:
        import shutil
        src = "/opt/airflow/data/dog_breeds.csv"
        dst = "/opt/airflow/warehouse/raw/dog_breeds.csv"
        shutil.copyfile(src, dst)
        return dst

    @task
    def validate(path: str) -> str:
        import pandas as pd
        df = pd.read_csv(path)
        assert df["key"].is_unique, "breed key must be unique"
        assert df[["energy", "trainability"]].max().max() <= 5
        return path

    @task
    def transform(path: str) -> str:
        import duckdb
        duckdb.sql(f"""
            CREATE OR REPLACE TABLE mart_breed_profile AS
            SELECT key, name, breed_group, origin, size_category,
                   (min_life_expectancy + max_life_expectancy) / 2.0 AS lifespan_mid,
                   popularity_score
            FROM read_csv_auto('{path}')
        """)
        duckdb.sql("EXPORT DATABASE '/opt/airflow/warehouse/marts' (FORMAT PARQUET)")
        return "/opt/airflow/warehouse/marts"

    @task
    def publish(_marts: str):
        # push to S3, Snowflake, BigQuery, ...
        print("published โœ…")

    publish(transform(validate(ingest())))

dog_breeds_pipeline()
Scheduling daily on a static reference dataset is obviously overkill โ€” in a real system you would trigger this on new snapshots only (e.g. an S3KeySensor or a dataset URI trigger). The shape of the DAG stays identical.

6. Model with dbt

Layer a few dbt models on top to separate raw, cleaned and mart tables:

sql๐Ÿ•
-- models/staging/stg_breeds.sql
select
    key                                    as breed_key,
    name                                   as breed_name,
    nullif(breed_group, '')                as breed_group,
    nullif(origin, '')                     as origin,
    lower(size_category)                   as size_category,
    (min_life_expectancy + max_life_expectancy) / 2.0 as lifespan_mid_years,
    popularity_score
from {{ source('raw', 'dog_breeds') }};

-- models/marts/dim_breed.sql
select
    b.*,
    s.source_registry,
    s.source_url
from {{ ref('stg_breeds') }} b
left join {{ source('raw', 'dog_breed_sources') }} s
       on s.breed_key = b.breed_key and s.is_primary = 1;

7. SQL exercises with provenance joins

The shipped SQLite normalizes provenance into dog_sources (many per breed) and dog_images (many per breed). That's the perfect playground for one-to-many joins, aggregation and window functions. Try these exercises against dogs โ‹ˆ dog_sources โ‹ˆ dog_images.

1

Breeds backed by only one source

LEFT JOIN dogs to dog_sources, GROUP BY breed and filter HAVING count(*) = 1. These are the rows you want a human to re-review.

sql๐Ÿ•
-- Breeds backed by exactly one source โ€” candidates for re-review
select b.key, b.name, count(s.source_id) as n_sources
from dogs b
left join dog_sources s on s.dog_id = b.id
group by b.key, b.name
having count(s.source_id) = 1
order by b.name;
2

Most-cited source registry

Join dog_sources by source_registry, count distinct dog_id per registry, sort desc. Shows which reference sites anchor the dataset.

sql๐Ÿ•
-- Most-cited source registry across the dataset
select s.source_registry,
       count(distinct s.dog_id) as n_breeds,
       count(*)                 as n_citations
from dog_sources s
group by s.source_registry
order by n_breeds desc
limit 10;
3

Reviewed coverage per breed group

Aggregate dogs by breed_group with sum(review_status = 'reviewed')::float / count(*) and rank with a window function. Spot the groups that still need curation.

sql๐Ÿ•
-- Reviewed coverage per breed group, ranked
select
    breed_group,
    count(*)                                         as n_total,
    sum(case when review_status = 'reviewed' then 1 else 0 end) as n_reviewed,
    round(100.0 * sum(case when review_status = 'reviewed' then 1 else 0 end)
          / count(*), 1)                             as pct_reviewed,
    rank() over (order by
        1.0 * sum(case when review_status = 'reviewed' then 1 else 0 end)
        / count(*) desc)                             as rnk
from dogs
where breed_group is not null
group by breed_group
order by rnk;
4

Image richness per breed

LEFT JOIN dog_images, COUNT(image_url) per breed, then bucket into {0, 1, 2+} with a CASE. Useful to seed a UI with real photos first.

sql๐Ÿ•
-- Image richness per breed, bucketed
with img_counts as (
    select b.id, b.name, count(i.image_url) as n_images
    from dogs b
    left join dog_images i on i.dog_id = b.id
    group by b.id, b.name
)
select
    case
        when n_images = 0 then '0 โ€” no photo'
        when n_images = 1 then '1 โ€” single photo'
        else '2+ โ€” multiple photos'
    end                      as bucket,
    count(*)                 as n_breeds
from img_counts
group by bucket
order by bucket;
5

Origin diversity per group

GROUP BY breed_group, COUNT(DISTINCT origin). Which groups are geographically concentrated and which are truly global?

sql๐Ÿ•
-- Origin diversity per breed group
select breed_group,
       count(*)                     as n_breeds,
       count(distinct origin)       as n_origins,
       round(1.0 * count(distinct origin) / count(*), 2) as diversity_ratio
from dogs
where breed_group is not null and origin is not null
group by breed_group
order by diversity_ratio desc;

8. Analysis ideas

  • Breed clustering. Run k-means / UMAP on the 1โ€“5 trait columns to discover "archetypes" beyond AKC groups.
  • Breed recommender. Weight traits by user answers (apartment? kids? novice owner?) and return a ranked list.
  • Origin mapping. Geocode origin and plot breed density per country โ€” correlate with climate and cold_tolerance.
  • Size vs lifespan regression. Confirm the well-known inverse relation between body mass and lifespan, per group.
  • Provenance audit. Count breeds per source_registry and flag solo-source rows for re-review.
  • Temperament NLP. Tokenize temperament strings and build a co-occurrence graph of descriptors.