Ingesting YouTube Data with Apache Airflow 3.x

3/11/2025

Here’s the deal: you want to pull YouTube video data—title, description, transcript, the works—into DuckDB using Airflow. Most tutorials overcomplicate this. Let’s cut the fluff and get you a pipeline that actually works, is secure, and doesn’t break when you upgrade Airflow.

What You Actually Need

Install the Only Packages That Matter

pip install duckdb youtube-transcript-api requests

Don’t Hardcode Secrets—Use Airflow Connections

If you’re still using os.environ for API keys, stop. Go to Airflow UI → Admin → Connections. Add:

  Conn Id: youtube_api
  Conn Type: Generic
  Password: your YouTube Data API v3 key
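
Prefer the terminal? The same connection can be created with the Airflow CLI (the key is a placeholder):

airflow connections add youtube_api \
    --conn-type generic \
    --conn-password 'YOUR_YOUTUBE_API_KEY'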

The Pipeline: What’s Worth Knowing

The flow is three helpers and one DAG: search YouTube for a topic, fetch each video’s transcript, and write everything to a local DuckDB file.

The Only Code Patterns You Need

Get the API Key (the right way):

from airflow.hooks.base import BaseHook

def get_youtube_api_key():
    # The key lives in the Password field of the 'youtube_api' connection
    return BaseHook.get_connection('youtube_api').password
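
If you manage connections via environment variables instead of the UI, Airflow also accepts a JSON-encoded connection; same idea, key is a placeholder:

export AIRFLOW_CONN_YOUTUBE_API='{"conn_type": "generic", "password": "YOUR_YOUTUBE_API_KEY"}'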

Search YouTube:

import requests

def search_youtube_videos(topic, max_results=10):
    api_key = get_youtube_api_key()
    resp = requests.get(
        "https://www.googleapis.com/youtube/v3/search",
        params={
            "part": "snippet",
            "type": "video",
            "maxResults": max_results,
            "q": topic,  # requests URL-encodes the query for you
            "key": api_key,
        },
        timeout=30,
    )
    resp.raise_for_status()
    return [
        {
            'youtube_video_url': f"https://www.youtube.com/watch?v={item['id']['videoId']}",
            'title': item['snippet']['title'],
            'description': item['snippet']['description'],
            'timestamp': item['snippet']['publishedAt'],
            'video_id': item['id']['videoId'],
        }
        for item in resp.json().get('items', [])
    ]
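
One thing the function above glosses over: the search endpoint caps maxResults at 50 per call, so larger pulls mean following nextPageToken. A minimal sketch (it returns raw API items rather than the mapped dicts above):

def search_youtube_videos_paged(topic, total=150):
    # Collect up to `total` raw results by following nextPageToken across pages
    api_key = get_youtube_api_key()
    videos, page_token = [], None
    while len(videos) < total:
        params = {
            "part": "snippet", "type": "video", "q": topic,
            "maxResults": min(50, total - len(videos)), "key": api_key,
        }
        if page_token:
            params["pageToken"] = page_token
        data = requests.get(
            "https://www.googleapis.com/youtube/v3/search",
            params=params, timeout=30,
        ).json()
        videos.extend(data.get("items", []))
        page_token = data.get("nextPageToken")
        if not page_token:
            break
    return videos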

Get the Transcript (handles missing/disabled):

from youtube_transcript_api import YouTubeTranscriptApi

def get_video_transcript(video_id):
    try:
        # fetch() returns an iterable of snippets (youtube-transcript-api >= 1.0)
        transcript = YouTubeTranscriptApi().fetch(video_id)
        return ' '.join(snippet.text for snippet in transcript)
    except Exception:
        # Transcripts can be disabled, missing, or region-blocked; treat all as absent
        return None
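
Need more control than “best effort or None”? The library also exposes the full transcript list. A sketch, assuming youtube-transcript-api 1.x, that prefers English but falls back to whatever language exists:

def get_transcript_with_fallback(video_id, preferred=("en", "en-US")):
    try:
        transcripts = YouTubeTranscriptApi().list(video_id)
        try:
            t = transcripts.find_transcript(list(preferred))
        except Exception:
            # No preferred language: take the first transcript in any language
            t = next(iter(transcripts))
        return ' '.join(s.text for s in t.fetch())
    except Exception:
        return None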

Ingest to DuckDB:

import duckdb

def ingest_to_duckdb(videos, db_path):
    con = duckdb.connect(db_path)
    con.execute('''
        CREATE TABLE IF NOT EXISTS youtube_videos (
            youtube_video_url TEXT,
            title TEXT,
            description TEXT,
            timestamp TEXT,
            chapters TEXT,
            transcript TEXT
        )
    ''')
    for v in videos:
        # .get() keeps missing fields as NULL instead of the string "None"
        con.execute(
            'INSERT INTO youtube_videos VALUES (?, ?, ?, ?, ?, ?)',
            [
                v['youtube_video_url'],
                v['title'],
                v['description'],
                v['timestamp'],
                v.get('chapters'),
                v.get('transcript'),
            ],
        )
    con.close()
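
Once the data is in, querying it is plain SQL. A quick sanity check (the path is whatever you passed to ingest_to_duckdb):

con = duckdb.connect('/tmp/youtube_videos.duckdb')
for title, chars in con.execute('''
    SELECT title, length(transcript) AS transcript_chars
    FROM youtube_videos
    WHERE transcript IS NOT NULL
    ORDER BY transcript_chars DESC
    LIMIT 5
''').fetchall():
    print(title, chars)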

Modern Airflow DAG (SDK style):

from airflow.decorators import dag
from airflow.providers.standard.operators.python import PythonOperator
import pendulum

@dag(
    dag_id="youtube_topic_to_duckdb",
    start_date=pendulum.datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
    params={"topic": "python programming"},
    description="Search YouTube for a topic and ingest video data into DuckDB",
)
def youtube_topic_to_duckdb():
    def search_and_ingest_youtube_videos(topic, **context):
        # Wire the helpers above together: search, enrich with transcripts, ingest
        videos = search_youtube_videos(topic)
        for v in videos:
            v['transcript'] = get_video_transcript(v['video_id'])
        ingest_to_duckdb(videos, '/tmp/youtube_videos.duckdb')  # adjust path for your setup

    PythonOperator(
        task_id="search_and_ingest",
        python_callable=search_and_ingest_youtube_videos,
        op_args=["{{ params.topic }}"],
    )

youtube_topic_to_duckdb()
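
Trigger it from the CLI with whatever topic you want; the conf payload overrides the declared param:

airflow dags trigger youtube_topic_to_duckdb --conf '{"topic": "duckdb tutorials"}'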

Reality Check: What Sucks, What Doesn’t

What sucks: transcripts are disabled or missing more often than you’d expect, and the Data API quota burns fast (a search call costs 100 of your 10,000 default daily units). What doesn’t: DuckDB. Zero setup, one file, and SQL over transcripts just works.

Who Actually Needs This

Anyone who wants YouTube metadata and transcripts in a local analytics store without standing up a warehouse: content research, topic tracking, or just a structured archive you can query with SQL.

Actionable Advice

  1. Don’t hardcode secrets. Use Airflow Connections.

  2. Use the latest Airflow SDK and standard providers. Don’t copy-paste old DAG patterns.

  3. Handle API failures and missing transcripts. YouTube is flaky; see the retry sketch after this list.

  4. Use DuckDB for local analytics. It’s fast and easy.

  5. If you’re not running Airflow 3.x, upgrade. The new style is worth it.
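
On point 3: the cheapest insurance is task-level retries instead of sprinkling try/except everywhere. The same operator from the DAG above, with retries bolted on:

from datetime import timedelta

PythonOperator(
    task_id="search_and_ingest",
    python_callable=search_and_ingest_youtube_videos,
    op_args=["{{ params.topic }}"],
    retries=3,                          # rerun the whole task on transient failures
    retry_delay=timedelta(minutes=5),
)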


That’s it. You now have a real, modern pipeline for ingesting YouTube data with Airflow. No more excuses.