Raw Record Source

{
  "$type": "site.standard.document",
  "bskyPostRef": {
    "cid": "bafyreicf2pg44dsqwjabzvj5bk53dygz5uu3pqpixk7tcto534lpxmccre",
    "uri": "at://did:plc:25rdn5elo5izoxrmtis34zuk/app.bsky.feed.post/3moh6ib53zfo2"
  },
  "coverImage": {
    "$type": "blob",
    "ref": {
      "$link": "bafkreignhmteea5m7wd2yxydoy73veflegnhg3mp6hpss4nuue5bod2gem"
    },
    "mimeType": "image/webp",
    "size": 249295
  },
  "path": "/gabrielhca/agentic-data-engineering-in-2026-how-to-build-pipelines-that-ai-agents-can-actually-use-4kgg",
  "publishedAt": "2026-06-17T00:50:11.000Z",
  "site": "https://dev.to",
  "tags": [
    "dataengineering",
    "ai",
    "python",
    "data",
    "gabrielh.dev"
  ],
  "textContent": "If you've spent the last few years building data pipelines, you know the drill: ingest, transform, load. Maybe some orchestration on top. Solid work — the kind that keeps dashboards green and analysts happy.\n\nBut something changed in 2026. Your pipeline's new consumer isn't a BI tool or a SQL query. It's an **AI agent** — and agents are a very different kind of hungry.\n\nWelcome to agentic data engineering. Buckle up.\n\n##  What's an \"Agentic\" Data System, Exactly?\n\nLet's back up a second. An **AI agent** is a system that perceives its environment, reasons about it, and takes actions to reach a goal — without needing a human to hold its hand at every step.\n\nThink of it like the difference between a GPS that tells you turn-by-turn directions (traditional AI) and one that books your hotel, reschedules your meeting, and orders food for when you arrive (agentic AI). One follows instructions. The other _acts_.\n\nFor agents to act, they need data. But not just any data — **context-rich, semantically meaningful, machine-readable data**. And that's where data engineers come in.\n\nThe cold truth: most existing data pipelines aren't built for this. They were designed for humans (or human-readable BI tools) as the end consumer. Agents need something different.\n\n##  The Context Engineering Problem\n\nHere's a concrete example. Say you have a `sales` table with a column called `status`. Values: `A`, `B`, `C`.\n\nA human analyst knows that `A = active`, `B = blocked`, `C = churned` because they read the Confluence doc from 2022 (the one that's three Notion migrations out of date). An AI agent? It has no idea. 
It'll guess — and guessing at 2am during an automated pipeline run is a great way to corrupt a report.\n\nThis is the **context engineering problem**: your data is technically correct but semantically opaque.\n\nContext engineering is the practice of designing data systems that embed rich, machine-readable context _alongside_ the data itself. Gartner has already flagged the risk: it predicts that over 40% of agentic AI projects will be canceled by the end of 2027, citing escalating costs, unclear business value, and inadequate risk controls. In practice, many of those failures trace back not to bad models but to **missing data foundations**: bare schemas, unclear ownership, no lineage, inconsistent definitions.\n\nSound familiar?\n\n##  What Agents Actually Need From Your Pipeline\n\nLet's get practical. Here's what makes a data system \"agent-ready\":\n\n###  1. Rich Metadata and Semantic Descriptions\n\nEvery table, column, and field should have a description an agent can read and reason about — not just a name.\n\n\n\n    -- Bad: An agent sees \"status\" and guesses\n    CREATE TABLE sales (\n      id INT,\n      status VARCHAR(1)\n    );\n\n    -- Good: Metadata makes intent explicit\n    COMMENT ON COLUMN sales.status IS\n      'Customer lifecycle status. Values: A=active (paying), B=blocked (payment issue), C=churned (cancelled)';\n\n\nModern data catalogs (like DataHub, Amundsen, or OpenMetadata) can store this metadata in a way agents can query via API. If you're not using one, now is a very good time to start.\n\n###  2. Data Lineage That's Actually Up-to-Date\n\nAn agent running a pipeline needs to understand: where did this data come from? What transformations touched it? If something breaks, what else is affected?\n\nTools like **dbt** generate lineage graphs automatically from your SQL models. Here's a minimal dbt `schema.yml` that documents one model:\n\n\n\n    # models/schema.yml\n    models:\n      - name: customer_lifetime_value\n        description: >\n          Calculates CLV per customer using the last 90 days of transactions.\n          Refreshed daily at 3am UTC. 
Source: raw.transactions joined with dim.customers.\n        columns:\n          - name: customer_id\n            description: Unique identifier. FK to dim.customers.customer_id\n          - name: clv_usd\n            description: Estimated lifetime value in USD. Null if customer has < 3 transactions.\n\n\nThat `description` block? An agent can read it, understand what the model does, and decide whether it's the right source for a given task. Without it, the agent is flying blind.\n\n###  3. Embeddings and Vector-Ready Outputs\n\nThis one trips people up. Traditional pipelines output structured tables. Agentic pipelines often need to _also_ output embeddings — vector representations of your data that LLMs can use for semantic search and RAG (Retrieval-Augmented Generation).\n\nHere's a simple example using Python and the open-source `sentence-transformers` library (a hosted embedding API such as OpenAI's would slot into the same step):\n\n\n\n    from sentence_transformers import SentenceTransformer\n    import pandas as pd\n\n    model = SentenceTransformer(\"all-MiniLM-L6-v2\")\n\n    # Your product catalog as a dataframe\n    df = pd.read_parquet(\"products.parquet\")\n\n    # Build a meaningful text representation for each product\n    df[\"text_repr\"] = df[\"name\"] + \". \" + df[\"description\"] + \". Category: \" + df[\"category\"]\n\n    # Encode all rows in one batched call instead of row by row\n    df[\"embedding\"] = model.encode(df[\"text_repr\"].tolist()).tolist()\n\n    # Persist the embeddings; a loader job can push this file into a vector store\n    # (e.g., pgvector, Pinecone, Weaviate)\n    df[[\"product_id\", \"embedding\"]].to_parquet(\"products_embeddings.parquet\")\n\n\nThe key idea: you're not replacing your existing pipeline — you're **extending** it. The structured table feeds your dashboards. The embeddings feed your agents.\n\n###  4. Schema Drift Detection\n\nHere's a nightmare scenario: an upstream team renames a column. Your pipeline doesn't catch it. The agent downstream starts ingesting garbage. 
Nobody notices until a report goes out with completely wrong numbers.\n\nSchema drift detection is one of the highest-impact agentic data engineering tasks identified in the SIGMOD 2026 Data Agents tutorial. Integrate it into your orchestration:\n\n\n\n    # Using Great Expectations (1.x API) for schema validation\n    import great_expectations as gx\n\n    context = gx.get_context()\n\n    # Define expectations: column \"user_id\" must exist and be non-null\n    suite = context.suites.add(gx.ExpectationSuite(name=\"sales_suite\"))\n    suite.add_expectation(\n        gx.expectations.ExpectColumnToExist(column=\"user_id\")\n    )\n    suite.add_expectation(\n        gx.expectations.ExpectColumnValuesToNotBeNull(column=\"user_id\")\n    )\n\n    # Run validation before anything touches the data.\n    # Assumes a checkpoint named \"sales_checkpoint\" is already configured to run this suite.\n    result = context.checkpoints.get(\"sales_checkpoint\").run()\n    if not result.success:\n        raise ValueError(f\"Schema validation failed: {result}\")\n\n\nFail fast, fail loud. An agent that ingests bad data quietly is worse than a pipeline that crashes.\n\n##  A Mental Model: The Conveyor Belt vs. The Smart Warehouse\n\nHere's an analogy that might help it click.\n\nTraditional data pipelines are like a **conveyor belt in a factory**: raw materials go in one end, finished goods come out the other. Fast, reliable, predictable. But the conveyor belt doesn't know what it's carrying. It doesn't label boxes. It doesn't track where things came from. It just moves.\n\nAn agent-ready data system is more like a **smart warehouse**: every item has a barcode, a location, a history, and a description. Robots can navigate it because everything is labeled and organized. You can ask \"where are all the items from Supplier X that arrived in Q1?\" and get an instant answer.\n\nYour job in 2026? **Build the smart warehouse, not just the conveyor belt.**\n\n##  What to Do This Week\n\nYou don't need to rip out your stack and start over. 
Here's a practical starting point:\n\n  * **Audit your most critical tables**: Do they have column descriptions? Add them in your catalog or directly in dbt.\n  * **Enable lineage tracking**: If you're on dbt, it's already there. Expose it via dbt's generated artifacts (such as `manifest.json`) or push it to DataHub.\n  * **Pick one pipeline to make vector-ready**: Add an embedding generation step as a separate job. Don't break what works — extend it.\n  * **Add a schema validation checkpoint**: Use Great Expectations, Soda, or dbt tests. Run it before anything hits production.\n\n\n\nNone of this takes a week. The column descriptions alone can take an afternoon. But six months from now, when your team is deploying AI agents that actually work because your data is clean and semantically rich? You'll be very glad you started today.\n\n##  Conclusion\n\nThe rise of agentic AI doesn't make data engineers obsolete — it makes the craft harder and more important. Anyone can wire up an LLM to a database. Making that LLM reliably useful for autonomous agents? That requires real data engineering skill.\n\nContext engineering, lineage, schema validation, vector outputs — these aren't buzzwords. They're the new checklist. The engineers who build these foundations now are the ones who'll be building the most interesting systems in 2027.\n\nGo make your pipelines agent-ready. Your future AI coworkers are counting on you.\n\nCheers,\n\nGabriel Henrique Cardoso Antonio\n🔗 gabrielh.dev",
  "title": "Agentic Data Engineering in 2026: How to Build Pipelines That AI Agents Can Actually Use"
}