In this episode, we dive into one of the core engines behind Microsoft Fabric’s modern analytics ecosystem—Dataflows Gen2. We explore how this next-generation data ingestion and transformation tool is shaping the way organizations prepare, refine, and deliver data for analytics.

We start with an overview of Microsoft Fabric, Microsoft’s unified analytics platform that brings together data engineering, data science, real-time analytics, Power BI, Data Factory, and lakehouse capabilities in a single integrated environment. Fabric is designed to simplify complex data estates, and Dataflows Gen2 plays a pivotal role in achieving that simplicity.

Listeners get a clear explanation of what Gen2 actually means: improved architecture, better performance, expanded connectivity, stronger data prep features, and support for more complex data ingestion scenarios. Gen2 builds on traditional Power Query–based dataflows but offers scalability and reliability fit for enterprise-grade pipelines.

The episode then walks through how dataflows work inside Fabric—connecting to almost any data source, ingesting structured or unstructured data, performing visual transformations using Power Query Online, and loading refined data into various destinations such as lakehouses, warehouses, datasets, or downstream pipelines.

We break down the step-by-step process of creating a dataflow: setting up your workspace, connecting to data sources, using Power Query to clean and shape your data, and selecting the right destination depending on your analytics or storage needs. This is followed by a look at transformation best practices, including merging tables, handling incremental refresh, optimizing queries, and ensuring scalability.


You can make your Power BI dataflows reusable through modular design and parameterization, and by avoiding hard-coded logic. Microsoft Fabric’s Dataflows Gen2 gives you a simple, standard, low-code way to ingest and transform data, which makes your transformations reusable and keeps your data consistent across projects. Modularity in Power BI dataflows helps you:

  • Build reusable entities and tables that are easy to troubleshoot and maintain.
  • Keep data transformation logic separate from reports, so you update logic only once.
  • Prepare data in one place, which speeds up development and reduces mistakes.

When you take a reusable approach to dataflows, you save time, make fewer mistakes, and can grow your analytics with confidence.

Key Takeaways

  • Build dataflows in small, modular pieces, so they are faster to update and maintain.
  • Use parameters to control how dataflows behave, so the same steps serve different scenarios without rework.
  • Avoid fixed, hard-coded logic; parameters let your dataflows adapt when needs change.
  • Document each data preparation step; good notes help others understand and trust your dataflows.
  • Keep one central place for your data, so it stays consistent and your team can trust it.

9 Surprising Facts About Power BI Dataflows

  1. Dataflows let you build reusable ETL in the cloud using Power Query — the same engine used in Power BI Desktop — so you can create transformations once and share them across many reports and datasets.
  2. Data stored by dataflows can be persisted as Common Data Model (CDM) folders, making the cleansed data accessible not only to Power BI but also to other Azure services like Synapse and machine learning tools.
  3. With Azure Data Lake Gen2 integration, dataflows can store their CDM output directly in your lake, giving you full control over enterprise-grade storage and governance for self-service ETL artifacts.
  4. Computed entities let you chain dataflows: one entity can reference and transform another entity inside the same or different dataflow, enabling modular ETL pipelines and reducing duplicated logic.
  5. Dataflows support incremental refresh at the entity level, which can drastically reduce processing time and resource usage for large datasets by processing only changed partitions.
  6. Power Query Online in dataflows includes data profiling and column quality indicators in the browser, so you can detect nulls, distinct counts, and distribution without leaving the service.
  7. Linked entities allow referencing entities from other dataflows or workspaces without copying data, enabling logical reuse and easier maintenance across projects and teams.
  8. Dataflows can be scheduled independently from report refreshes, so you can control and optimize when heavy ETL runs occur, improving report performance and reducing unnecessary compute during business hours.
  9. Although often thought of as a BI-only feature, dataflows are increasingly used as an enterprise data preparation layer — acting as a shared, governed source of truth for analytics, data science, and downstream data services.

Dataflow Reusable Principles in Power BI

Modular Design

Modular design makes your dataflows reusable. When you split a dataflow into small parts, each part is easy to change on its own. Dataflows Gen2 in Microsoft Fabric supports this approach: you prepare data in a dataflow once, then use it in many places. This saves time and keeps your logic consistent everywhere.

Parameterization

Parameterization is central to reusability. Parameters control how your dataflow behaves in different scenarios; for example, you can set up parameters for different data sources or environments, so you do not need a new dataflow for each case.

  • Parameterization lets you use the same logic for development, testing, and production.
  • You can manage connections and settings for each environment.
  • You avoid repeated work, and your dataflow is reusable across your whole team.
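
As a minimal sketch of this idea in Power Query M, a single environment value can drive which source a query connects to (the server and database names below are hypothetical, not from any real tenant):

```m
// Sketch: one environment value drives the data source.
// Switch "TEST" to "PROD" without touching any downstream logic.
let
    Environment = "TEST",
    ServerName =
        if Environment = "PROD" then "prod-sql.contoso.com"
        else "test-sql.contoso.com",
    // Connect to the environment-specific server; database name is an assumption.
    Source = Sql.Database(ServerName, "SalesDB"),
    Orders = Source{[Schema = "dbo", Item = "Orders"]}[Data]
in
    Orders
```

In a real dataflow you would promote `Environment` to a parameter query so it can be changed without editing the query itself.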

Avoiding Hard-Coded Logic

Avoiding hard-coded logic keeps your dataflows flexible. Hard-coded values tie a dataflow to one source or one scenario. Instead, use parameters and small, composable steps. Dataflows Gen2 supports this: you can connect to many data sources and drive behavior through configurable settings, so your dataflows stay reusable as your needs change.

Tip: Always keep your data transformation logic separate from your reporting logic. This makes updates simple and keeps your dataflow reusable for new projects.

Power BI Dataflows: Why Reusability Matters

Definition and Benefits

Understanding why reusability matters is the foundation of good data projects. Reusable dataflows give you a strong base for analytics: Power BI dataflows let you define transformation logic once and use it many times, keeping your data consistent and easy to find.

Reusable dataflows have many good points:

  • Every report uses the same transformation logic, so all reports follow the same rules.
  • Business logic lives in fewer places, which makes updates quick and simple.
  • You have one central source for your data that everyone on your team can trust.
  • You save time because you do not repeat preparation steps for each report.
  • You make fewer mistakes, because consistent data means fewer errors.

With Dataflows Gen2, you get even more: you can connect to many sources, prepare your data in one place, and share curated datasets with reports and dashboards. This helps your team work together and keeps analytics strong.

Note: Preparing data in one place makes it easier to turn raw data into useful insights.

Common Pitfalls

If you do not plan for reusability, problems follow. Copying and pasting logic into different reports causes mistakes, wastes time, and makes it hard to keep your data consistent.

Some common problems are:

  • Creating new dataflows for every project instead of reusing one.
  • Hard-coding values that should be configurable.
  • Leaving transformation steps undocumented.
  • Preparing data in many scattered places instead of one.

Dataflows Gen2 helps you avoid these problems. It gives you a cloud tool for ingesting, transforming, and storing data in one place, which you can share with your whole team. That keeps your analytics smooth and easy to trust.

Building Reusable Dataflows


Designing for Reuse

You can build reusable dataflows by following a few simple steps. Open the Power BI service and go to your workspace. Click New Item and pick Dataflow. Add a table and choose Power BI dataflow (Legacy). Confirm your credentials and click Next. Pick the tables you want and start transforming your data. Working through each step in order helps you build solid data pipelines.

Power Query Online lets you do the ETL process by yourself. You can get, change, and load data without waiting for IT. This tool helps you work on your own and makes dataflows easy to use again. You can make dataflows that do one job well. For example, you can build a staging dataflow that only brings in raw data. Other dataflows can use this staging dataflow to do more changes.

When you want to reuse dataflows, think about modularity. Break your dataflows into small pieces. Each piece should do just one thing. This makes it easy to fix or update one part without changing everything. You can use Dataflows Gen2 to manage these pieces. This way helps you work faster and lets your team help themselves.

Tip: Only bring in the data you need. Do not load everything just in case. This keeps your dataflows quick and simple to handle.

Using Parameters and Templates

Parameters and templates let you use dataflows across many workspaces. Set up Power Query parameters or Fabric variables to handle item IDs, so you can move dataflows to new workspaces without rewriting your logic each time.

You can also use parameters in Power BI Desktop to choose where your dataflow’s output goes, then save the file as a template. Reusing the dataflow in a new environment then takes just a few clicks, with fewer mistakes, and you can switch between test and production easily.

  • Power Query Parameters and Fabric Variables help you set up each workspace.
  • Templates let you use the same setup for new dataflows.
  • You can move your dataflows from building to testing to real use with less work.
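
A small sketch of what a per-workspace parameter might look like in Power Query M, assuming a parameter query named `EnvPrefix` and a hypothetical folder layout (both are illustrative, not a documented Fabric convention):

```m
// Sketch: a parameter query named "EnvPrefix" holds "DEV_", "TEST_", or "PROD_".
// This consuming query builds its source path from that parameter, so the same
// template works unchanged in every workspace.
let
    // Backslashes are literal in M text values; the folder path is an assumption.
    BasePath = "C:\Data\" & EnvPrefix & "Exports",
    Source = Folder.Files(BasePath),
    // Keep only CSV exports from the folder listing.
    CsvOnly = Table.SelectRows(Source, each [Extension] = ".csv")
in
    CsvOnly
```

Because `EnvPrefix` is the only environment-specific value, promoting the dataflow from dev to test to production is a one-parameter change.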

Using parameters and templates makes your dataflows flexible. You can help many teams and projects without starting over. This is important for self-service data engineering.

Single Source of Truth

A single source of truth means you keep one main set of data. This helps everyone trust the results. You can use Dataflows Gen2 to make one semantic model. All your reports and dashboards use this model. This lowers mistakes and keeps your data the same.

When you keep your data in one place, you make governance better. You can see who uses the data and how they use it. This helps you follow rules and manage data the right way. Dataflows Gen2 makes it easy to handle data classification and joining. You can set rules for sensitive data and make sure everyone follows them.

Note: A single source of truth helps your team make good choices. Everyone uses the same numbers and words.

Documentation Practices

Good documentation keeps your dataflows easy to use for a long time. You should write down every step you take to prepare your data. This helps others know what you did. Give each step a clear name. Add notes to explain why you made certain choices.

  • Say who owns and updates each dataflow.
  • Add a description to your dataflow and each table.
  • Keep track of which reports and dashboards use your dataflow.
  • Use clear names for each step you make.

Here is a table of best ways to document and organize:

| Best Practice | Description |
| --- | --- |
| Templates | Use templates to keep your dataflows consistent and easy to reuse. |
| Version control | Track changes with tools like Azure DevOps or GitHub. |
| Separate semantic and reporting layers | Keep one main model for all reports to avoid duplication and improve security. |

Naming rules are important too. Use names like DF_HubSpot_Contacts_Daily. Add DF_ at the start for dataflows. Show the environment, like DEV_, TEST_, or PROD_. Pick a style for writing names and use it everywhere. This makes your dataflows easy to find and understand.

How you set up your workspace matters for reusability. Put your dataflows in different workspaces. This helps people work on their own and lets you control security for each team. Do not put all dataflows in one workspace. Use different workspaces for different jobs. This helps you grow your data engineering and support more users.

Tip: Self-service works best when everyone can find, use, and trust the dataflows they need.

Examples & Pitfalls in Power BI Dataflows


Refactoring One-Off Dataflows

Sometimes, you start with a dataflow for just one report. This can cause confusion and more work later. Refactoring means you change messy pipelines into simple building blocks. For example, Trident Advisory Group had many pipelines for different reports. They put these together into one dataflow that gathers raw event logs. Now, business analysts can make new reports without waiting for IT. This change made things faster and easier. Reports that took three hours now finish in fifteen minutes. Your team also uses the same logic for every report.

A table shows the main benefits:

| Benefit | Description |
| --- | --- |
| Efficiency improvement | Refresh cycles drop from 3 hours to 15 minutes, freeing up bandwidth. |
| Self-service analytics | Business analysts can build ad-hoc reports directly on the dataflow. |
| Data governance | Centralizing data transformation logic ensures consistent measures across reports. |

When you refactor, check how your changes affect downstream apps, and be careful when consolidating data that has no clear owner. This helps keep your data safe and organized.

Mistakes to Avoid

It is easy to make mistakes when trying to reuse dataflows. One is copying logic into every report instead of using one shared dataflow, which causes errors and makes updates hard. Another is hard-coding values that should be configurable, which reduces flexibility. And leaving your steps undocumented can confuse your team.

Dataflows Gen2 helps you avoid these problems. Its visual, low-code tools let you build pipelines with Power Query, which is easy to use yet supports advanced transformations when needed. Remember to configure refresh schedules for your reports, and when you promote work between environments, Dataflows Gen2 makes deployment easy.

Tip: Keep your data transformation logic in one place. This helps your team trust the data and keeps everything organized.


You can make your Power BI dataflows reusable by using modular design, parameters, and clear documentation. First, look at your current dataflows. Then, change them with Dataflows Gen2 in Microsoft Fabric. If you use best practices, you get long-term benefits:

  • You manage data with fewer models.
  • More people use Power BI tools.
  • Reports run faster and you do less governance work.

Reusable dataflows help your team finish work quickly and trust the results.

Creating Dataflows with Power Query in Power BI and Power Query Online

What is a Power BI dataflow and how does it relate to power query?

A Power BI dataflow is a collection of tables created and managed in the Power BI service that uses Power Query to extract, transform, and load data from various sources; it centralizes data preparation so datasets in Power BI and downstream reports can reuse the same transformed data, reducing duplication and supporting larger data solutions.

How do I create a dataflow in the Power BI service?

To create a dataflow, open a workspace in the Power BI service, choose “Create” then “Dataflow,” use the online Power Query editor to connect to data from various sources (on-premises, Azure, or cloud), define entities and transformations, and save; you can also create multiple dataflows to modularize ETL for a data warehouse or large data volumes.

Can I use dataflows in Power BI Desktop and how do they connect to datasets?

Yes, you can use dataflows in Power BI Desktop by connecting to Power BI dataflows as a data source; once connected, your Power BI report uses the dataflow data as a source and you can combine it with other data sources to build datasets in Power BI and publish reports back to the Power BI service.

What are common use cases for Power BI dataflows and create dataflows scenarios?

Common use cases include centralizing common data preparation logic across multiple reports, creating a single source of truth for the enterprise, preparing data for a data warehouse or Azure Data Lake Storage Gen2, handling large data volumes with scheduled refresh, and enabling reuse across multiple Power BI datasets and Power Platform apps.

How does dataflow refresh work and how do I schedule a dataflow refresh?

Dataflow refresh runs the Power Query transformations defined in the dataflow to update entities with the latest data; you can configure refresh schedule settings in the Power BI service to run on a timetable or use Power Automate for event-driven refreshes; note that refresh frequency and parallelism can depend on your Power BI Premium subscription or capacity.

What are the differences between dataflows and datasets in Power BI?

Dataflows focus on upstream ETL using Power Query Online to create reusable tables stored as dataflow data, while datasets are optimized for analytics and modeling in Power BI (tables, relationships, measures); datasets typically consume dataflows as sources so multiple datasets and reports can share the same prepared data.

Can dataflows store data in Azure and what is ADLS Gen2 support?

Yes, Power BI dataflows can store entities in Azure Data Lake Storage Gen2 (ADLS Gen2) when the workspace or tenant is configured to use your own ADLS Gen2 account, enabling big data scenarios, direct consumption by Azure Data Factory or other tools, and integration with a broader data warehouse architecture.

Do I need Power BI Premium to use dataflows and what features require premium?

Dataflows are available in the Power BI service for Pro users with basic capabilities, but features like larger storage, higher refresh frequency, linked entities across workspaces, and compute isolation for large data volumes often require Power BI Premium or Premium Per User to ensure scalable performance and enterprise-grade refresh schedules.

How do dataflows handle on-premises data and connectors within the Power BI service?

To connect to on-premises data sources from a dataflow, configure and use the on-premises data gateway in the Power BI service, select the appropriate connector in Power Query Online, and provide credentials; this enables dataflows to extract data from on-premises databases and refresh into the cloud for use by datasets and reports.

What is the difference between using a dataflow and using Azure Data Factory for ETL?

Dataflows are designed for business-centric self-service ETL with Power Query transformations integrated into Power BI and the Power Platform, ideal for reuse across datasets and reports; Azure Data Factory targets enterprise-scale orchestration and complex pipelines for a data warehouse, big data processing, and advanced transformations across many systems, though both can be used together when integrating data into ADLS Gen2.

How do I manage multiple dataflows and reuse transformations across workspaces?

You can organize dataflows within workspaces in the Power BI service, use Linked Entities to reuse tables from another dataflow, create a new dataflow that references existing ones, and manage access via workspace permissions so multiple Power BI solutions and Power BI reports can consistently consume shared data prepared within the dataflows in Power BI.

🚀 Want to be part of m365.fm?

Then stop just listening… and start showing up.

👉 Connect with me on LinkedIn and let’s make something happen:

  • 🎙️ Be a podcast guest and share your story
  • 🎧 Host your own episode (yes, seriously)
  • 💡 Pitch topics the community actually wants to hear
  • 🌍 Build your personal brand in the Microsoft 365 space

This isn’t just a podcast — it’s a platform for people who take action.

🔥 Most people wait. The best ones don’t.

👉 Connect with me on LinkedIn and send me a message:
"I want in"

Let’s build something awesome 👊

Summary

Running a dataflow as a short, inflexible pipeline is risky. In this episode, “Is Your Dataflow Reusable, or a One-Trick Disaster?”, I dig into how many “working” dataflows are secretly ticking time bombs: fragile against schema drift, hardcoding, and scale. I share a 60-second checklist you can run on yours right now to see whether it’s durable or destined for failure.

You’ll hear how modular queries, parameterization, Delta Lake with schema control, and smart partitioning can turn a brittle flow into a resilient asset. We’ll also cover when to move heavy logic out of dataflows altogether and into notebooks or your lakehouse layer.

By the end, you’ll know whether your pipeline is safe for production, or just this week’s disaster waiting to happen.

What You’ll Learn

  • Why many dataflows “working today” are fragile tomorrow
  • How to modularize transformations and use reusable M functions
  • Why you must parameterize values instead of hardcoding them
  • The role of Delta Lake (schema enforcement, ACID, versioning) in stability
  • Best practices for partitioning and incremental refresh
  • How to detect when your flow is outgrowing its design and you need to offload to Spark / notebook layers

Full Transcript

Picture this: your lakehouse looks calm, clean Delta tables shining back at you. But without partitioning, schema enforcement, or incremental refresh, it’s not a lakehouse—it’s a swamp. And swamps eat performance, chew through storage, and turn your patience into compost. I’ve seen it happen in more tenants than I care to count.

Here’s the fix: stick around, because I’ll give you a 60‑second checklist you can run against any dataflow—parameters, modular queries, Delta targets, and partitioning. Dataflows Gen2 use Power Query/M, so the same rules about modular queries and functions still apply.

Subscribe at m365.show, grab the checklist, and let’s see if your “working” dataflow is actually a time bomb.

Why Your 'Working' Dataflow is Actually a Time Bomb

The real issue hiding in plain sight is this: your dataflow can look fine today and still be hanging by a thread. Most people assume that if it refreshes without error, it’s “done.” But that’s like saying your car is road‑worthy because it started once and the check engine light is off. Sure, it ran this morning—but what happens when something upstream changes and the entire pipeline starts throwing fits? That silent culprit is schema drift. Add one column, shift field order, tweak a data type, and your flow can tip over with no warning.

For most admins, this is where the blind spot kicks in. The obsession is always: “Did it refresh?” If yes, gold star. They stop there. But survival in the real world isn’t just about refreshing once; it’s about durability when change hits. And change always shows up—especially when you’re dealing with a CRM that keeps sprouting fields, an ERP system that can’t maintain column stability, or CSV files generously delivered by a teammate who thinks “metadata” is just a suggestion. That’s why flex and modularity aren’t buzzwords—they’re guardrails. Without them, your “fixed” pipe bursts as soon as the water pressure shifts.

And the fallout is never contained to the person who built the flow. Schema drift moves like a chain reaction in a chemical lab. One new field upstream, and within minutes you’ve got a dashboard graveyard. There’s Finance pushing panic because their forecast failed. Marketing complaints stack up because ad spend won’t tie out. The exec team just wants a slide with charts instead of cryptic error codes. You—the admin—are stuck explaining why a “tiny change” now has 20 dashboards flashing red. That’s not user error; that’s design fragility.

Here’s the blunt truth: Dataflows Gen2, and really every ETL process, is built on assumptions—the existence of a column, its data type, order, and consistency. Break those assumptions, and your joins, filters, and calculations collapse. Real‑world schemas don’t sit politely; they zigzag constantly. So unless your dataflow was built to absorb these changes, it’s fragile by default. Think of it like relying on duct tape to hold the plumbing: it works in the moment, but it won’t survive the first surge of pressure.

The smart move isn’t hope. It’s defense. If schema drift has already burned you, there’s a quick diagnostic: run the 60‑second checklist. First, does your flow enforce schema contracts or land data in a Delta table where schema evolution is controlled? Second, does it include logic to ingest new columns dynamically instead of instantly breaking? Third, are your joins coded defensively—validating types, handling nulls—rather than assuming perfect input? If you can’t check those boxes, then you’re not done; you’ve just delayed failure.
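
The “joins coded defensively” item in that checklist can be sketched in Power Query M roughly like this (table names, column names, and the three-column contract are assumptions for illustration):

```m
// Sketch: enforce a column contract and types before joining, so a new upstream
// column or a null key does not silently break the merge.
let
    RawOrders = Source,  // upstream query; name is an assumption
    // 1. Keep only the columns this flow contracts on; unexpected extras are
    //    ignored and missing ones become null instead of throwing.
    Contracted = Table.SelectColumns(
        RawOrders, {"OrderID", "CustomerID", "Amount"}, MissingField.UseNull),
    // 2. Force the types the join and downstream calculations assume.
    Typed = Table.TransformColumnTypes(Contracted, {
        {"OrderID", Int64.Type}, {"CustomerID", type text}, {"Amount", type number}}),
    // 3. Drop rows with a null join key instead of letting them poison the merge.
    ValidKeys = Table.SelectRows(Typed, each [CustomerID] <> null),
    // 4. Left-outer join against a "Customers" query (also an assumed name).
    Joined = Table.NestedJoin(
        ValidKeys, {"CustomerID"}, Customers, {"CustomerID"}, "Customer", JoinKind.LeftOuter)
in
    Joined
```

Steps 1 and 2 are the schema contract; step 3 is the null handling; anything upstream can drift without the merge falling over.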

And before you think, “Great, so everything’s doomed,” there’s mitigation available. Fabric supports strategies like dynamic schema handling in Mapping Dataflows and parameterizing queries so they adapt without rewrites. CloudThat and others highlight how dynamic schema detection plus metadata repositories for mappings can reduce the fragility. Those aren’t silver bullets, but they keep your pipelines from detonating every time a developer adds a field on Friday afternoon.

One important caveat: even a “healthy” Dataflow Gen2 design has limits. They don’t handle massive datasets as well as Spark, and wide joins or deep transformations can turn refreshes into crawl speed. If you know volumes will hit high scale, offload the heavy work to Spark notebooks and keep Dataflows for lighter prep. Key2Consulting and CloudThat both call this out in practice. Treat Dataflows as one tool in the kit—not the hammer for every job.

Bottom line, a so‑called working dataflow that can’t weather schema drift or large‑scale growth isn’t reliable. It’s fragile, adding silent debt into your system. And disposable pipelines, stacked on top of each other, create a tower of quick fixes nobody wants to maintain.

That puts us at the next layer of the problem: the bad habits baked in from the very start. Think your setup looks clean? Let’s run it against the three sins that turn “working” pipelines into a nonstop ticket machine.

The Three Deadly Sins of Dataflow Design

Here’s where most dataflows go sideways: the three deadly sins of design. They’re simple, they’re common, and they guarantee headaches—hardcoding values, piling on spaghetti logic, and ignoring the fact that scale exists. We’ve all slipped into them because in the moment they look like shortcuts. The problem is when those “shortcuts” snake into production and you’re left with fragile pipelines no one wants to untangle.

First up: hardcoding. You’re tired, the refresh is failing, so you paste a file path or static date directly into your query. It works. For now. But what you’ve actually done is cement brittle assumptions into your pipeline. The second someone moves that file, renames a table, or asks for the same logic in a different workspace, the entire thing snaps. A better fix is dead simple—centralize values. Either store them as parameters inside your dataflow or, if you’re managing multiple environments, put your config data in a metadata table. SQL or Cosmos DB both work fine for this. Then your flows don’t care which folder or server they’re pointing at—you just swap the parameter, and everything still refreshes.
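
A rough sketch of that metadata-table fix in Power Query M, assuming a small config query named `ConfigTable` with `Environment` and `FilePath` columns and a parameter query `EnvName` (all hypothetical names):

```m
// Sketch: resolve the file path from a config table instead of hardcoding it.
// Moving the file or switching environments becomes a config-row change,
// not a query rewrite.
let
    // Find the config row for the current environment (EnvName is a parameter query).
    ConfigRow = Table.SelectRows(ConfigTable, each [Environment] = EnvName){0},
    // Load the CSV named by the config row; 65001 = UTF-8.
    Source = Csv.Document(
        File.Contents(ConfigRow[FilePath]),
        [Delimiter = ",", Encoding = 65001]),
    Promoted = Table.PromoteHeaders(Source, [PromoteAllScalars = true])
in
    Promoted
```

The same pattern works whether `ConfigTable` is backed by SQL, Cosmos DB, or even a maintained Excel sheet; the queries never learn a concrete path.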

Sin two: spaghetti logic. It usually starts clean—three steps to connect, transform, and load. Fast forward a few months and you’ve got twenty chained queries full of nested merges, conditional splits, and mystery filters no one admits to writing. At that point, your dataflow feels less like logic and more like a plate of noodles that shocks you if you pick the wrong one. Debugging is guesswork, collaboration is impossible, and governance goes out the window because nobody can even explain where the fields came from. The fix? Break the work into named, single-purpose queries. Use functions in Power Query M for reusable bits like date handling or path parsing. Yes, Dataflows Gen2 supports this, but remember: reusing those blocks across workspaces has limits. If you need true reuse across your tenant, build the canonical version in a single “source of truth” dataflow or push complex transformations down into your lakehouse or notebook layer. Bottom line—write logic in chunks that humans can read tomorrow, not in one monster chain nobody can ever touch again.

Last sin: ignoring scale. On demo-sized test data, everything looks magical. Four thousand rows? Instant refresh. Then the real dataset drops—four million rows with concurrent refreshes—and suddenly your dataflow is standing in quicksand. It backs up the refresh queue, hogs compute, and everything else grinds to a halt. This isn’t Fabric being weak; it’s your design never accounting for production volume. Small joins turn into bottlenecks, wide transformations chew through memory, and incremental refresh gets ignored until jobs start timing out. If you actually want things to run, test with production-like volumes early. Use coalesce to cut down partition counts and repartition strategically so the engine isn’t juggling thousands of tiny chunks. And give yourself a hard rule: if your refresh times balloon past usable, it’s time to either push the heavy transformations into a Spark notebook or tune the partitioning until they behave. Test at scale, or production will test you instead.

Here’s the kicker—these sins don’t live in isolation. Hardcode enough values and you’ll be rewriting every time the environment shifts. Let spaghetti logic grow and you’re one step away from a full black-box nobody understands. Ignore scale, and eventually workloads pile up until the whole refresh ecosystem collapses. Each one of these mistakes adds debt; mix them and you’re trading resilience for fragility at compound interest.

Fixing them isn’t about perfection—it’s about giving yourself guardrails. Safe defaults like parameters, modular queries, and realistic testing keep your pipelines stable enough to survive the normal chaos of changing schemas and growing datasets. The trick now is turning those guardrails into your standard operating mode, so you build flows that adapt instead of collapse.

And that’s the bridge to the real differentiator—the design habits that actually make dataflows reusable instead of disposable.

The Secret Sauce: Modularity and Parameterization

So how do you actually keep a dataflow from turning into a throwaway experiment? The answer comes down to two things: modularity and parameterization. Skip those and you’re not building pipelines—you’re cobbling together one-offs that collapse the minute requirements shift. A reusable dataflow is one that can drop into another project, adapt with minimal tweaks, and still function. Anything else is just glorified copy-paste maintenance.

Modularity starts with carving transformations into standalone steps. Picture a block of logic that standardizes customer names. If you bury it inside a bloated 50-step chain, it’s stuck there forever. Pull that same logic into a separate function, and suddenly it’s a reusable tool across any data source. That’s the shift: building small, utility-style steps with a single clear purpose, instead of fusing everything into one unreadable chain. Power Query M makes this easy—you can write custom functions with a defined input and output, then reuse them across queries in Dataflows Gen2. The habit to build here is simple: every complex transform should be a named function with a clear input/output signature and a single parameter bag. If you can’t explain it in one sentence, split it.
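
The “standardize customer names” block described above might look like this as a standalone M function, saved as its own query (the name `fnCleanCustomerName` and the exact cleaning rules are assumptions):

```m
// Sketch: the name-cleaning logic as a reusable function query.
// Any query in the dataflow can call it instead of repeating the steps inline.
let
    fnCleanCustomerName = (name as nullable text) as nullable text =>
        if name = null then null
        // Strip control characters, trim whitespace, then title-case the result.
        else Text.Proper(Text.Trim(Text.Clean(name)))
in
    fnCleanCustomerName
```

A consuming query then applies it to a column in one step, e.g. `Table.TransformColumns(Source, {{"CustomerName", fnCleanCustomerName, type text}})`, and every caller inherits future rule changes automatically.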

Parameters are the other half of the equation, and yet they’re treated like garnish by most admins. Without them, you end up hardcoding connector strings, file paths, or query dates into every pipeline. It feels faster in the moment, but the second you switch environments or adjust schedules, you’re hacking code again. Put all those values into parameters—pipeline-level parameters in Fabric, or better yet, pull them from a centralized metadata store if you need cross-workspace consistency. Then a path or key change becomes a one-line tweak instead of 15 near-identical rewrites. CloudThat and others call this metadata-driven design, and it’s the antidote to Groundhog Day maintenance.
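Here is a minimal sketch of the idea, assuming a hypothetical parameter bag keyed by environment. In Fabric these values would live in pipeline-level parameters or a centralized metadata table, not a Python dict, but the shape is the same: resolve every environmental value from one place.

```python
# Hypothetical metadata: every environment-specific value lives here,
# so a path or key change is a one-line tweak, not 15 near-identical rewrites.
PARAMS = {
    "dev":  {"source_path": "Files/landing/dev/",  "lookback_days": 1},
    "prod": {"source_path": "Files/landing/prod/", "lookback_days": 1},
}

def build_source_path(env: str, table: str) -> str:
    """Resolve a connector path from parameters instead of hardcoding it."""
    return PARAMS[env]["source_path"] + table

build_source_path("prod", "sales")   # "Files/landing/prod/sales"
```

Switching environments now means changing one lookup key, which is the whole antidote to Groundhog Day maintenance.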

A practical example: handling date filters. Without parameters, you’re littering static “today minus one” logic in multiple queries. Change the rule, and you’re chasing it everywhere. Parameterize that filter once—or wrap it in a custom function—and use it consistently. Next time the business decides “yesterday” should actually mean “last business day,” you adjust it in one place. It’s the same principle Excel power users learned when they stopped copy-pasting macros across 15 files. Update once, fix everywhere. That’s reusability in action.
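The date-filter rule, sketched as one function you define once and call everywhere. The rule names and the weekend handling here are illustrative assumptions, not Fabric built-ins:

```python
from datetime import date, timedelta

def reporting_cutoff(run_date: date, rule: str = "yesterday") -> date:
    """One place to define what 'yesterday' means for every query.
    'last_business_day' rolls weekend dates back to Friday."""
    if rule == "yesterday":
        return run_date - timedelta(days=1)
    if rule == "last_business_day":
        d = run_date - timedelta(days=1)
        while d.weekday() >= 5:        # 5 = Saturday, 6 = Sunday
            d -= timedelta(days=1)
        return d
    raise ValueError(f"unknown rule: {rule}")

# Run on a Monday: "yesterday" is Sunday, "last_business_day" is Friday.
```

When the business redefines "yesterday," you change the rule string in one place and every dependent query inherits the fix.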

When you approach dataflows this way, they start working like Lego kits. You snap in the reusable “clean customer” piece, click it onto the “validate transaction” piece, and then add the “calculate totals” tile. Building a new workflow doesn’t require duct tape or rewriting—it’s just assembly. The bigger payoff is maintenance. When a policy changes, you patch the block once and every dependent dataflow inherits it automatically. No black-box chain reactions, no endless debugging.

Now, here’s the caveat: Dataflows Gen2 doesn’t give you unlimited reusability across workspaces. Functions and queries don’t magically sync tenant-wide. If you truly want global reuse and version control, push heavy shared logic down into the lakehouse or a Fabric notebook. Keep workspace-level dataflows small, modular, and parameterized, then centralize the big, shared transformations at the storage or notebook layer. Think of it as a hybrid model: flexibility up top, governance underneath.

The reason governance teams cheer for this approach is that modularity plus parameters make your dataflows transparent. Functions and parameter sets are easier to audit, easier to document, and far less prone to mystery logic creeping in. No more spaghetti queries nobody can explain—just clean building blocks that spell out what they do. And since duplication is gone, the volume of break/fix tickets drops like a rock.

So the rule is clear: modularize your transformations into functions, parameterize every environmental value, and keep shared logic centralized when reuse must cross workspaces. Do this, and dataflows stop being disposable hacks and start becoming durable assets. Skip it, and you’re signing up for copy-paste chaos.

But even the cleanest logic needs a reliable foundation to stand on. And in Fabric, that foundation is your lakehouse. If the storage layer underneath isn’t governed properly, everything built on top will eventually sink.

Taming the Lakehouse with Delta Lake

Which brings us to the piece that gives your lakehouse actual discipline: Delta Lake. In Fabric, Delta isn’t optional—it’s the difference between governed storage and a heap of CSVs disguised as a “platform.” What Delta brings to the table are three things you can’t fake: ACID transactions, schema enforcement, and versioning with time travel. ACID means no more half-written data corrupting reports—every write is atomic, consistent, isolated, and durable. Schema enforcement blocks sneaky mismatched data types or surprise columns before they poison your pipeline. And version logs with time travel give you the rollback and audit trail when compliance asks, “What did this dataset look like last quarter?” That’s not theory; that’s operational survival.

The opposite is what you see all the time: admins drag and drop CSVs into OneLake, declare victory, and forget that governance is more than shared storage. On the surface, it looks centralized. Reality? A junk pile. No schema validation, no transaction logs, no audit history. Querying across that is like fishing blindfolded in mud—you might catch something, but most of what you reel in is garbage. And when schema drift hits—say someone renames or reorders columns—you’re left with silent corruption until an angry manager points out that one report shows 112 sales and another swears it’s 109. Without ACID, schema enforcement, and time travel, you can’t even tell where the truth went.

Delta Lake fixes this by putting hard structure around the chaos. Every write produces a transaction log. Every change produces a trackable version. Queries hit governed tables instead of raw files, so when formats shift they break fast and visibly instead of corrupting weeks of output. With schema enforcement, if your vendor slips in two new fields, the job fails where you can see it, not downstream in someone’s quarterly board deck. And with time travel, you can literally query yesterday’s or last week’s snapshot—perfect for audits, debugging, or recovering from the intern who wrote a join that ate half the dataset.
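To see why fail-fast beats silent drift, here is the principle of schema enforcement as a toy Python check. This is an illustration of the idea only, not Delta's implementation; Delta does this natively through its transaction log and table metadata:

```python
# Hypothetical governed contract for a table.
EXPECTED_SCHEMA = {"customer_id": int, "amount": float}

def enforce_schema(row: dict) -> dict:
    """Fail loudly at write time, the way Delta's schema enforcement does,
    instead of letting surprise columns corrupt downstream reports.
    Toy sketch for illustration."""
    extra = set(row) - set(EXPECTED_SCHEMA)
    if extra:
        raise ValueError(f"unexpected columns: {sorted(extra)}")
    for col, typ in EXPECTED_SCHEMA.items():
        if not isinstance(row.get(col), typ):
            raise TypeError(f"bad type for column '{col}'")
    return row

# A vendor slipping in a new field fails on day one, where you can see it:
# enforce_schema({"customer_id": 1, "amount": 9.5, "promo_code": "X"})  # ValueError
```

The job stops at the write, you adjust the mapping in minutes, and nobody discovers the problem in a quarterly board deck.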

Here’s the metaphor: dumping flat files into storage is a junk drawer. Technically everything’s “saved,” but you won’t find what you need without blood pressure spiking. Delta Lake is a cabinet that actually locks, labels, and versions. You can track changes, retrieve old entries, and trust that the count doesn’t change depending on which light switch is on. Which one would you rather bet your finance reports on?

And if someone on your team says, “We’ll skip Delta, we don’t need the overhead,” point them at audit time. Without transactional logs and schema enforcement, you’ll have inconsistent row counts between dashboards nobody can reconcile, and a forensic process as painful as piecing together a shredded Excel workbook. If compliance ever knocks, you’ll wish you had the logs Delta generates for free.

Real-life example: a pipeline documented on LinkedIn used a gateway to pull CSVs directly into OneLake and then leaned on dynamic schema options. That works—but the wise part was landing them in Delta tables, which gave them controlled schema handling and versioning instead of betting on flat files to play nice forever. It’s the difference between “refresh fails safe and we fix in five minutes” versus “three weeks of corrupted financials and loss of trust.”


And it’s not just governance—Delta helps you keep performance sane too. Queries don’t need to churn through countless micro-files just to return a month of data. You can partition data by logical attributes like date or region, which slashes the amount scanned. On top of that, Fabric supports performance practices for Delta such as OPTIMIZE, which merges those tiny files into efficient chunks; VACUUM, which clears out obsolete data files; and Z-ordering, which clusters data around heavily queried columns so filtering doesn’t drag. They’re simple to run, and they stop your storage from bloating while keeping queries snappy.
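A small helper that builds those maintenance statements. The SQL shapes (OPTIMIZE ... ZORDER BY, VACUUM ... RETAIN) follow Delta Lake's documented syntax; the table and column names are placeholders, and in Fabric you would run the resulting strings via spark.sql in a notebook:

```python
def delta_maintenance_sql(table: str, zorder_cols=None, retain_hours: int = 168):
    """Build the standard Delta maintenance statements: OPTIMIZE to compact
    micro-files (optionally Z-ordered on hot filter columns) and VACUUM to
    clear obsolete data files past the retention window."""
    optimize = f"OPTIMIZE {table}"
    if zorder_cols:
        optimize += f" ZORDER BY ({', '.join(zorder_cols)})"
    vacuum = f"VACUUM {table} RETAIN {retain_hours} HOURS"
    return [optimize, vacuum]

# delta_maintenance_sql("sales", ["order_date"])
# -> ["OPTIMIZE sales ZORDER BY (order_date)", "VACUUM sales RETAIN 168 HOURS"]
```

Schedule these on a cadence and your storage stops bloating while queries stay snappy.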

I’ve seen the cost of ignoring this. One client thought solid dataflow logic alone gave them safety. Then a vendor dropped extra columns in the weekly file, nobody noticed, and it corrupted weeks of downstream reporting. It wasn’t malice—it was just a lack of schema enforcement. Once we flipped to Delta, the file failed loudly on day one, they adjusted the mapping, and trust in the reports was intact. Same engineers, same tools—the only difference was using Delta’s structure instead of flat file faith.

Bottom line—Delta Lake isn’t extra credit. It’s insurance baked into your storage. Skip it, and you’ll find out the hard way during a compliance check or a performance crunch. Use it, and your lakehouse stops being a liability and becomes a reliable engine you can actually build on.

Now, governance is only half the battle. Even with Delta in place, sloppy design choices around partitioning or brute-force reloads will drag you back into slow query hell. And that’s where a lot of admins still bleed time.

Performance Pitfalls: Partitioning and Incremental Refresh Done Right

Partitioning and incremental refresh are the two basics that decide if your pipeline flies or crawls. Partitioning means you chop data into segments that match how it’s usually queried—say by month or region—so Fabric only pulls the slices it needs. Incremental refresh skips the brute force reloads and processes only new or changed data since the last run. Together, they’re the reason some queries feel instant while others make you wonder if the server died.

Partitioning is hands down the sharpest tool to speed up large datasets, but only if you pick the right grain. Align partitions with real query patterns. If most reports are month-based, partition by month. If rollups are by region, slice by region. What you don’t do is carve up by random columns like user ID or churn out daily folders for ten years of data. That’s how you create hundreds of tiny files that waste more time being managed than actually read. Imagine cutting a pizza into a hundred shards—it’s technically partitioned, but good luck eating it. The overhead buries any benefit. Bottom line: pick natural query keys, avoid high-cardinality junk, and always test your approach at production scale. Something slick on 10,000 rows can crash and burn once you load 10 million.
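Here is the pruning idea as a toy sketch: group rows by a month key so a month-based query touches exactly one slice. Pure Python for illustration only; in a lakehouse the same effect comes from Delta partitioning on write:

```python
from collections import defaultdict

def partition_by_month(rows):
    """Group rows into month partitions matching how reports actually query.
    The 'order_date' field and YYYY-MM grain are illustrative choices."""
    parts = defaultdict(list)
    for r in rows:
        parts[r["order_date"][:7]].append(r)   # "YYYY-MM" partition key
    return parts

def read_month(parts, month):
    """A month filter scans one partition; everything else is skipped."""
    return parts.get(month, [])

rows = [{"order_date": "2024-01-15"}, {"order_date": "2024-02-02"}]
parts = partition_by_month(rows)
# read_month(parts, "2024-01") touches 1 row, not the whole table
```

Swap the month key for a high-cardinality column like user ID and you get the hundred-shard pizza: technically partitioned, practically unreadable.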

Even if partitions are perfect, full reloads will grind things down. That’s why incremental refresh matters. Instead of repeatedly hauling every row across the wire, it just picks up what changed. Think of it like a phone update. Nobody wipes the OS to reinstall it whole every time; they patch what’s new. Skip incremental refresh and you’re basically re-flashing the entire phone, night after night. Real-world difference? We’ve seen nightly jobs processing tens of millions of rows drop from multi-hour marathons down to minute-level updates just by flipping to incremental mode. Same hardware, no code changes, completely different outcome.
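The watermark pattern behind incremental refresh, sketched in a few lines of Python. The field names are hypothetical, and Fabric manages this bookkeeping for you when you configure incremental refresh; this just shows the mechanic:

```python
def incremental_refresh(all_rows, last_watermark):
    """Process only rows modified since the last run, then advance
    the watermark. Toy version of the patch-not-reflash pattern."""
    new_rows = [r for r in all_rows if r["modified"] > last_watermark]
    new_watermark = max((r["modified"] for r in new_rows), default=last_watermark)
    return new_rows, new_watermark

rows = [
    {"id": 1, "modified": "2024-05-01"},
    {"id": 2, "modified": "2024-05-03"},
]
# Only the row changed after the watermark gets processed.
changed, wm = incremental_refresh(rows, "2024-05-02")
```

Every run hauls only the delta across the wire, which is exactly why those nightly marathons collapse into minute-level updates.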

And here’s the catch: piling on more capacity won’t fix poor design here. You can burn through budget on bigger machines, but if you’re still slicing data into toothpick partitions and reloading everything nightly, performance will sink. Capacity helps only if the design underneath is efficient. Partitioning tied to behavior, plus refresh that patches instead of rebuilds, solves more than renting a larger server ever will.

There are some easy practices to keep partitions healthy too. If you’re landing data in Delta tables, don’t let them rot into thousands of micro-files. Run OPTIMIZE to compact them into efficient blocks. And when writing out, use coalesce to reduce partition counts where possible. Repartitioning has its place, but coalesce is faster for merging down to a practical number without reshuffling the whole dataset. These tiny tweaks pay off in faster queries and smoother refresh cycles.

The net effect is refreshes that don’t feel like punishment. Instead of watching jobs chew compute for hours, you get predictable performance where dashboards pop when users actually open them. That’s the goal—not magic, not brute force, just applying partitioning and incremental refresh with common sense and a few guardrails.

Do these right, and you’ve taken fragile batch jobs and turned them into pipelines that scale without bankrupting your capacity budget. Miss them, and you’ll keep wondering why performance tanks every time volume doubles. At this point, the pattern is clear: good dataflows aren’t about hacks or shortcuts. They’re designed to stay consistent even as the landscape shifts. And that leads us to the last piece of the puzzle—building for resilience so they don’t just work once, but keep working when everything around them inevitably changes.

Conclusion

Here’s the bottom line. Reusable dataflows don’t come from hope—they come from three habits practiced every time. One: modularize and parameterize so you’re not hand-editing the same logic fifteen times. Two: land everything in Delta Lake so ACID transactions, schema enforcement, and version logs keep your data trustworthy. Three: tune partitioning and use incremental refresh so jobs run fast instead of burning hours and credits. Those three cover 90% of what keeps pipelines durable.

Want the quick win? Subscribe to the podcast and leave me a review—otherwise I’ll assume you enjoy spending weekends debugging CSVs. Don’t make me call your boss.

Treat the lakehouse like an asset with governance, and you’ll watch maintenance tickets drop instead of multiply.



This is a public episode. If you'd like to discuss this with other subscribers or get access to bonus episodes, visit m365.show/subscribe


Founder of m365.fm, m365.show and m365con.net

Mirko Peters is a Microsoft 365 expert, content creator, and founder of m365.fm, a platform dedicated to sharing practical insights on modern workplace technologies. His work focuses on Microsoft 365 governance, security, collaboration, and real-world implementation strategies.

Through his podcast and written content, Mirko provides hands-on guidance for IT professionals, architects, and business leaders navigating the complexities of Microsoft 365. He is known for translating complex topics into clear, actionable advice, often highlighting common mistakes and overlooked risks in real-world environments.

With a strong emphasis on community contribution and knowledge sharing, Mirko is actively building a platform that connects experts, shares experiences, and helps organizations get the most out of their Microsoft 365 investments.