Spyglass MTG Blog

Microsoft Fabric: Optimizing Custom Spark Pools

Written by Patrick Gilbert | Jul 16, 2025 3:11:48 PM

Microsoft Fabric provides many tools to ingest, process, and refine your data, all while storing it in the same environment. For massive datasets, many traditional tools are inadequate for these tasks, or at best, very time- and resource-intensive. The Microsoft Fabric Runtime, based on Apache Spark, avoids these typical issues by making Spark available for large-scale data processing and analysis.

However, you may still run into errors caused by configuration problems, or bottlenecks caused by insufficient nodes or high traffic. Solving these issues can be as simple as creating a custom Spark pool and adjusting a few of its most important settings. The following sections cover these settings to give you a starting point for configuring a Spark pool that fits your needs.

Spark Pools 

Spark pools in Fabric are very simple at their core: a set of metadata that defines the resource requirements and related characteristics applied when you start a Spark session. By itself, a Spark pool does not consume any resources or cost anything; charges begin only once a Spark job is executed.

Typically, your Spark session will take only around 5-10 seconds to start if you have changed no settings and added no libraries. But this is unrealistic for many use cases, and the start time grows quickly as you add custom settings and libraries to the session.

By creating a custom Spark pool, you can tell Spark what resources your jobs require, namely the number of nodes, the node size, and how the pool scales. Two key settings, Autoscaling and High concurrency mode, will greatly improve your Spark experience and even enable some cost-saving measures.

Autoscaling 

An important setting for Spark pools is Autoscaling, which scales compute resources up or down depending on current activity. You set a minimum and maximum number of nodes, and the system then adds or removes nodes within that range as needed. It is also possible to set up a single-node Spark pool by setting your minimum number of nodes to one, which runs your driver and executor on a single node. This is optimal for small workloads, especially if autoscaling is not enabled.

High concurrency mode 

High concurrency mode allows users to share Spark sessions across items, letting multiple items run on the same session. This enables users to swap between different notebooks seamlessly, without having to reinitialize a Spark session. Only the initial Spark session is billed, meaning subsequent sessions that share the initial session do not incur additional costs. This setting is great for teams working simultaneously on multiple notebooks. Session sharing does have a few requirements:

  • Sessions should be within a single user boundary 
  • Sessions should have the same default lakehouse configuration (see the sketch after this list) 
  • Sessions should have the same Spark compute properties 

Summary 

Data processing becomes more complex as your datasets grow, and the Microsoft Fabric Runtime offers a straightforward answer with Spark. But to get the most out of Spark, it is important to be mindful of your Spark pool settings, what best suits your needs, and what fits within your capacity. If you have more questions about Microsoft Fabric and how best to manage your data, contact Spyglass!