Migrating SAS to Databricks: DATA Steps to PySpark and Delta Lake

April 8, 2026 · 18 min read · MigryX Team

SAS has been the backbone of enterprise analytics for decades. From financial institutions running credit-risk models to pharmaceutical companies processing clinical trial data, SAS programs power mission-critical workloads across every regulated industry. But the economics of SAS licensing — per-core pricing that escalates with server capacity — combined with a shrinking talent pool and the industry's shift toward open-source analytics, have pushed organizations to seek alternatives. Databricks, built on Apache Spark and anchored by Delta Lake, has emerged as the leading destination for SAS migration.

The appeal is structural, not cosmetic. PySpark provides a native fit for SAS DATA step logic at distributed scale. MLflow replaces SAS analytics procedures with a modern experiment-tracking and model-deployment framework. The Lakehouse architecture unifies data warehousing and data science into a single platform, eliminating the need for separate SAS servers, SAS Grid clusters, and standalone databases. And the cost model shifts from perpetual per-core licensing to elastic, pay-per-use compute that scales to zero when idle.

This article provides a comprehensive technical guide to migrating SAS programs to Databricks, covering every major SAS construct and its Databricks equivalent, with detailed code examples and architectural guidance.

Why Migrate from SAS to Databricks?

Before diving into the technical mapping, it is worth understanding the strategic forces driving SAS-to-Databricks migration. These are not abstract trends — they are concrete business pressures that affect budget, hiring, and competitive positioning.

Licensing Cost Reduction

SAS licensing operates on a per-core, annual-renewal model that penalizes organizations for scaling infrastructure. A typical SAS deployment with Base SAS, SAS/STAT, SAS/ETS, SAS Enterprise Miner, and SAS Grid Manager can cost millions per year for a mid-sized analytics team. Databricks pricing is consumption-based: you pay for the compute hours you actually use, and clusters auto-terminate when jobs complete. Organizations routinely report 40-60% cost reductions after migrating from SAS to Databricks, even accounting for the migration investment.

PySpark as a Native Fit for SAS Logic

SAS DATA steps process data row by row with an implicit loop, applying conditional logic, creating variables, and writing output datasets. PySpark DataFrames express the same transformations using column-level operations that Spark distributes across a cluster. The conceptual mapping is direct: a SAS DATA step that reads, transforms, and outputs data becomes a PySpark chain of spark.read, .withColumn(), .filter(), and .write. The programming model is different — columnar rather than row-based — but the intent and output are identical.
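As a minimal sketch of that chain (table and column names here are illustrative):

# Minimal sketch of the DATA-step-to-DataFrame chain; names are illustrative.
from pyspark.sql import functions as F

orders_clean = (
    spark.read.table("catalog.raw.orders")           # set raw.orders;
    .filter(F.col("amount") > 0)                     # where amount > 0;
    .withColumn("order_year", F.year("order_date"))  # order_year = year(order_date);
)
orders_clean.write.mode("overwrite").saveAsTable("catalog.work.orders_clean")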

MLflow Replaces SAS Analytics

SAS procedures like PROC LOGISTIC, PROC REG, PROC MIXED, and SAS Enterprise Miner provide statistical modeling and machine learning in a proprietary, closed environment. MLflow on Databricks offers experiment tracking, model versioning, model registry, and deployment — all integrated with PySpark, scikit-learn, TensorFlow, and PyTorch. The shift from SAS analytics to MLflow is not just a tool change; it unlocks the entire open-source ML ecosystem while maintaining the governance and reproducibility that enterprise analytics demands.

Lakehouse Architecture

SAS environments typically consist of SAS servers connected to relational databases, with SAS datasets stored on network file systems. Data moves between systems through PROC EXPORT, PROC IMPORT, and ODBC/JDBC connections. The Databricks Lakehouse eliminates this architecture by storing all data in Delta Lake tables on cloud object storage (S3, ADLS, GCS), governed by Unity Catalog, and accessible from notebooks, SQL endpoints, and BI tools. There is no separate data warehouse, no SAS library file system, and no batch export/import cycle.

Talent and Community

The SAS programmer workforce is aging. Universities have shifted curricula from SAS to Python and R. Hiring a SAS developer in 2026 is significantly more expensive than hiring a Python/PySpark developer with equivalent experience. Migrating to Databricks positions organizations to recruit from a much larger, younger, and more cost-effective talent pool.

SAS to Databricks migration — automated end-to-end by MigryX

Architecture Comparison: SAS Server vs. Databricks Clusters

Understanding the architectural differences between SAS and Databricks is essential for planning a migration. SAS operates on a shared-server model, while Databricks operates on an elastic-cluster model. The implications touch compute, storage, orchestration, and security.

In a traditional SAS deployment, a SAS Application Server runs on one or more physical or virtual machines. SAS programs execute sequentially or in parallel (via SAS Grid) on these servers, reading data from SAS datasets (on shared file systems) or databases (via SAS/ACCESS engines). SAS Metadata Server manages security, library definitions, and program metadata. SAS programs are submitted interactively through SAS Enterprise Guide, SAS Studio, or batch submission.

In Databricks, computation happens on ephemeral Apache Spark clusters that spin up when a job starts and terminate when it completes. Notebooks provide an interactive development environment similar to SAS Studio but with multi-language support (Python, SQL, Scala, R). Data is stored in Delta Lake tables on cloud object storage, governed by Unity Catalog. Databricks Workflows orchestrate job execution with scheduling, retries, and alerting.

| SAS Component | Databricks Equivalent | Key Difference |
|---|---|---|
| SAS Application Server | Databricks Cluster | Ephemeral, auto-scaling vs. persistent, fixed-capacity |
| SAS Grid Manager | Databricks Job Clusters | Automatic cluster provisioning per job vs. manual grid configuration |
| SAS Metadata Server | Unity Catalog | SQL-based governance with column-level lineage |
| SAS Enterprise Guide / SAS Studio | Databricks Notebooks | Multi-language, collaborative, version-controlled |
| SAS Library (LIBNAME) | Unity Catalog Schema | Three-level namespace: catalog.schema.table |
| SAS Dataset (.sas7bdat) | Delta Lake Table | ACID transactions, time travel, schema evolution |
| SAS/ACCESS Engines | Spark Connectors / JDBC | Native connectors for hundreds of data sources |
| SAS Stored Processes | Databricks Jobs / SQL Endpoints | REST API access, parameterized execution |
| SAS Batch Scheduling | Databricks Workflows | DAG-based orchestration with dependency management |

MigryX: Idiomatic Code, Not Line-by-Line Translation

The difference between MigryX and manual migration is not just speed — it is code quality. MigryX generates idiomatic, platform-optimized code that leverages native features of your target platform. A SAS DATA step does not become a clunky row-by-row loop — it becomes a clean, vectorized DataFrame operation. A PROC SQL query does not become a literal translation — it becomes an optimized query that takes advantage of your platform’s pushdown capabilities.

Comprehensive SAS-to-Databricks Mapping Table

The following table maps every major SAS construct to its Databricks equivalent. This serves as a reference for migration planning and is the foundation for MigryX's automated conversion engine.

| SAS Construct | Databricks / PySpark Equivalent | Notes |
|---|---|---|
| DATA step | PySpark DataFrame transformations | Row-by-row logic becomes columnar operations with withColumn(), filter(), when() |
| PROC SQL | Databricks SQL / spark.sql() | ANSI SQL with Delta Lake extensions; nearly 1:1 syntax migration |
| SAS Macros (%macro/%mend) | Python functions | Macro variables become function parameters; %IF/%THEN becomes if/else |
| PROC SORT | .orderBy() / .sort() | Distributed sorting; also supports .sortWithinPartitions() |
| PROC MEANS / PROC FREQ | .groupBy().agg() | F.mean(), F.count(), F.stddev(), F.sum() with groupBy() |
| SAS Datasets (.sas7bdat) | Delta Lake Tables | ACID transactions, schema enforcement, time travel |
| SAS Libraries (LIBNAME) | Unity Catalog Schemas | catalog.schema.table namespace replaces libref.dataset |
| PROC MODEL / Enterprise Miner | MLflow + Feature Store | Experiment tracking, model registry, feature engineering |
| FORMAT / INFORMAT | PySpark types + UDFs | cast(), to_date(), date_format() for type conversions |
| PROC EXPORT | Delta table write / .write.csv() | Write to Delta, Parquet, CSV, JSON, or external systems |
| MERGE statement | Delta MERGE INTO | Atomic upsert with WHEN MATCHED / WHEN NOT MATCHED |
| BY-group processing | .groupBy() | FIRST. and LAST. logic via Window functions with row_number() |
| PROC TRANSPOSE | pivot() / unpivot() | Native PySpark pivot operations |
| PROC LOGISTIC / PROC REG | scikit-learn / SparkML + MLflow | Full ML ecosystem with experiment tracking |
| PROC REPORT / PROC PRINT | display() / Databricks SQL Dashboards | Interactive visualization in notebooks or dashboards |
| SAS/CONNECT (RSUBMIT) | Databricks Workflows (multi-task) | Remote execution replaced by distributed job orchestration |
| Retained variables | Window functions / lag() | F.lag() and F.lead() replace DATA step RETAIN |
| Arrays | List comprehensions + select() | Python loops over column lists replace SAS arrays |
| Hash objects | broadcast join / map lookup | Spark broadcast joins replace SAS hash table lookups |

Code Examples: SAS DATA Step to PySpark

The most common migration task is converting SAS DATA steps to PySpark DataFrame transformations. The following examples demonstrate the pattern for increasingly complex scenarios.

Basic DATA Step: Read, Transform, Output

# SAS Original:
# data work.customers_clean;
#   set raw.customers;
#   where age >= 18 and status = 'ACTIVE';
#   full_name = catx(' ', first_name, last_name);
#   age_group = ifc(age >= 65, 'Senior',
#               ifc(age >= 30, 'Adult', 'Young Adult'));
#   signup_year = year(signup_date);
# run;

# PySpark Equivalent on Databricks:
from pyspark.sql import functions as F

customers = spark.table("catalog.raw.customers")

customers_clean = (
    customers
    .filter(
        (F.col("age") >= 18) &
        (F.col("status") == "ACTIVE")
    )
    .withColumn("full_name",
        F.concat_ws(" ", F.col("first_name"), F.col("last_name"))
    )
    .withColumn("age_group",
        F.when(F.col("age") >= 65, "Senior")
         .when(F.col("age") >= 30, "Adult")
         .otherwise("Young Adult")
    )
    .withColumn("signup_year", F.year(F.col("signup_date")))
)

customers_clean.write.mode("overwrite").saveAsTable("catalog.work.customers_clean")

BY-Group Processing with FIRST. and LAST.

SAS DATA step BY-group processing uses the automatic variables FIRST.variable and LAST.variable to detect group boundaries. In PySpark, those boundary flags translate to Window functions with row_number() (a sketch follows the example below); when the DATA step only emits one summary row per group, as here, a groupBy() aggregation is more direct.

# SAS Original:
# proc sort data=transactions; by customer_id transaction_date; run;
# data work.first_last_txn;
#   set transactions;
#   by customer_id;
#   if first.customer_id then first_txn_date = transaction_date;
#   if last.customer_id then last_txn_date = transaction_date;
#   retain first_txn_date;
#   if last.customer_id then output;
# run;

# PySpark Equivalent:
first_last_txn = (
    spark.table("catalog.bronze.transactions")
    .groupBy("customer_id")
    .agg(
        F.min("transaction_date").alias("first_txn_date"),
        F.max("transaction_date").alias("last_txn_date"),
        F.count("*").alias("txn_count"),       # extra aggregates come for free in the same pass
        F.sum("amount").alias("total_amount")
    )
)

first_last_txn.write.mode("overwrite").saveAsTable("catalog.silver.customer_txn_summary")
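When the migrated logic genuinely needs row-level FIRST./LAST. flags — for example, to drive per-row conditionals rather than a summary — a sketch of the boundary-flag pattern looks like this:

# Sketch: explicit FIRST./LAST. flags via row_number() over ascending/descending windows.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_asc = Window.partitionBy("customer_id").orderBy("transaction_date")
w_desc = Window.partitionBy("customer_id").orderBy(F.desc("transaction_date"))

flagged = (
    spark.table("catalog.bronze.transactions")
    .withColumn("is_first", F.row_number().over(w_asc) == 1)   # first.customer_id
    .withColumn("is_last", F.row_number().over(w_desc) == 1)   # last.customer_id
)

# Replicates "if last.customer_id then output;"
last_rows = flagged.filter(F.col("is_last"))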

RETAIN Statement and Running Totals

The SAS RETAIN statement carries variable values across iterations of the DATA step loop. In PySpark, this maps to Window functions with cumulative aggregation.

# SAS Original:
# data work.running_totals;
#   set transactions;
#   by customer_id;
#   retain running_balance 0;
#   if first.customer_id then running_balance = 0;
#   running_balance = running_balance + amount;
# run;

# PySpark Equivalent:
from pyspark.sql.window import Window

w_running = (
    Window
    .partitionBy("customer_id")
    .orderBy("transaction_date")
    .rowsBetween(Window.unboundedPreceding, Window.currentRow)
)

running_totals = (
    spark.table("catalog.bronze.transactions")
    .withColumn("running_balance", F.sum("amount").over(w_running))
)

running_totals.write.mode("overwrite").saveAsTable("catalog.silver.running_totals")
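RETAIN is also used to reference the previous row's values rather than to accumulate. That pattern maps to F.lag(), as in this sketch (the first row of each group gets null):

# Sketch: F.lag() for RETAIN-style references to the previous row.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_prev = Window.partitionBy("customer_id").orderBy("transaction_date")

with_change = (
    spark.table("catalog.bronze.transactions")
    .withColumn("prev_amount", F.lag("amount").over(w_prev))              # value from the prior row
    .withColumn("amount_change", F.col("amount") - F.col("prev_amount"))  # null on first row per group
)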

Code Examples: PROC SQL to Databricks SQL

SAS PROC SQL is remarkably close to standard SQL, making it one of the easiest constructs to migrate. Databricks SQL supports ANSI SQL with extensions for Delta Lake operations, window functions, and semi-structured data.

-- SAS Original:
-- proc sql;
--   create table work.revenue_summary as
--   select
--     region,
--     product_line,
--     count(distinct customer_id) as unique_customers,
--     sum(revenue) as total_revenue,
--     mean(revenue) as avg_revenue,
--     calculated total_revenue / sum(sum(revenue)) as pct_total format=percent8.2
--   from sales.transactions
--   where transaction_date between '01JAN2025'd and '31DEC2025'd
--   group by region, product_line
--   having calculated total_revenue > 100000
--   order by total_revenue desc;
-- quit;

-- Databricks SQL Equivalent:
CREATE OR REPLACE TABLE catalog.gold.revenue_summary AS
SELECT
    region,
    product_line,
    COUNT(DISTINCT customer_id) AS unique_customers,
    SUM(revenue) AS total_revenue,
    AVG(revenue) AS avg_revenue,
    SUM(revenue) / SUM(SUM(revenue)) OVER () AS pct_total
FROM catalog.silver.transactions
WHERE transaction_date BETWEEN '2025-01-01' AND '2025-12-31'
GROUP BY region, product_line
HAVING SUM(revenue) > 100000
ORDER BY total_revenue DESC;

Key differences: SAS uses calculated to reference derived columns within the same query; Databricks SQL uses window functions or subqueries. SAS date literals ('01JAN2025'd) become ISO format strings. SAS formats (format=percent8.2) become formatting in the presentation layer.
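When a SAS query leans heavily on calculated, a common restructuring is a CTE that names the derived columns once. A sketch via spark.sql(), with illustrative names:

# Sketch: a CTE replaces repeated SAS "calculated" references.
revenue_pct = spark.sql("""
    WITH grouped AS (
        SELECT region, product_line, SUM(revenue) AS total_revenue
        FROM catalog.silver.transactions
        GROUP BY region, product_line
    )
    SELECT *,
           total_revenue / SUM(total_revenue) OVER () AS pct_total
    FROM grouped
    WHERE total_revenue > 100000
""")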

SAS PROC SQL Subqueries and Derived Tables

-- SAS Original:
-- proc sql;
--   select a.customer_id, a.customer_name, b.total_orders
--   from customers a
--   inner join (
--     select customer_id, count(*) as total_orders
--     from orders
--     group by customer_id
--   ) b on a.customer_id = b.customer_id
--   where b.total_orders > (select avg(order_count) from order_summary);
-- quit;

-- Databricks SQL Equivalent:
SELECT a.customer_id, a.customer_name, b.total_orders
FROM catalog.silver.customers a
INNER JOIN (
    SELECT customer_id, COUNT(*) AS total_orders
    FROM catalog.silver.orders
    GROUP BY customer_id
) b ON a.customer_id = b.customer_id
WHERE b.total_orders > (SELECT AVG(order_count) FROM catalog.gold.order_summary);

SAS Macros to Python Functions

SAS macros are a text-substitution system that generates SAS code at compile time. They are powerful but notoriously difficult to debug and maintain. Python functions provide a cleaner, more testable replacement with full access to the language's control flow, exception handling, and package ecosystem.

# SAS Original:
# %macro process_monthly(dsn=, start_date=, end_date=);
#   data &dsn._processed;
#     set &dsn.;
#     where date between "&start_date."d and "&end_date."d;
#     month_key = put(date, yyyymm6.);
#   run;
#   proc means data=&dsn._processed noprint nway;
#     class month_key;
#     var revenue units;
#     output out=&dsn._monthly sum= mean= / autoname;
#   run;
# %mend;
# %process_monthly(dsn=sales, start_date=01JAN2025, end_date=31DEC2025);

# Python/PySpark Equivalent on Databricks:
def process_monthly(table_name: str, start_date: str, end_date: str) -> None:
    """Process a table by month, computing revenue and unit aggregates."""
    df = spark.table(f"catalog.bronze.{table_name}")

    processed = (
        df
        .filter(
            (F.col("date") >= start_date) &
            (F.col("date") <= end_date)
        )
        .withColumn("month_key", F.date_format(F.col("date"), "yyyyMM"))
    )

    monthly = (
        processed
        .groupBy("month_key")
        .agg(
            F.sum("revenue").alias("revenue_sum"),
            F.mean("revenue").alias("revenue_mean"),
            F.sum("units").alias("units_sum"),
            F.mean("units").alias("units_mean")
        )
        .orderBy("month_key")
    )

    monthly.write.mode("overwrite").saveAsTable(
        f"catalog.silver.{table_name}_monthly"
    )

# Call the function
process_monthly("sales", "2025-01-01", "2025-12-31")

The Python version is easier to read, test, and maintain. Parameters are typed. The function can be unit-tested without a SAS runtime. Error handling uses standard Python try/except rather than SAS macro error trapping with %sysfunc and &syscc.
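A sketch of what that error handling can look like around the function above:

# Sketch: standard Python exception handling replaces SAS macro error trapping.
try:
    process_monthly("sales", "2025-01-01", "2025-12-31")
except Exception as exc:
    # In SAS, this would mean inspecting &syscc after each step.
    print(f"Monthly processing failed: {exc}")
    raise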

SAS MERGE to Delta MERGE INTO

The SAS MERGE statement combines datasets by a common key, handling matched and unmatched records. Delta Lake's MERGE INTO provides atomic upsert semantics that go beyond what SAS MERGE offers, including conditional updates, deletes, and schema evolution.

# SAS Original:
# proc sort data=existing; by customer_id; run;
# proc sort data=updates; by customer_id; run;
# data work.merged;
#   merge existing(in=a) updates(in=b);
#   by customer_id;
#   if a and b then update_flag = 'U';
#   else if b then update_flag = 'I';
# run;

# Databricks Delta MERGE Equivalent:
from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "catalog.silver.customers")
source = spark.table("catalog.bronze.customer_updates")

target.alias("t").merge(
    source.alias("s"),
    "t.customer_id = s.customer_id"
).whenMatchedUpdate(set={
    "name": "s.name",
    "email": "s.email",
    "phone": "s.phone",
    "address": "s.address",
    "updated_at": "current_timestamp()",
    "update_flag": "'U'"
}).whenNotMatchedInsert(values={
    "customer_id": "s.customer_id",
    "name": "s.name",
    "email": "s.email",
    "phone": "s.phone",
    "address": "s.address",
    "created_at": "current_timestamp()",
    "updated_at": "current_timestamp()",
    "update_flag": "'I'"
}).execute()

Delta MERGE INTO is superior to SAS MERGE in every measurable dimension: it is atomic (ACID-compliant), handles concurrent writes safely, supports conditional logic within MATCHED and NOT MATCHED clauses, and automatically maintains audit columns. SAS MERGE, by contrast, requires pre-sorted input and provides no transactional guarantees.
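A sketch of that conditional logic, reusing target and source from above; the s.is_deleted column is an illustrative assumption about the source feed:

# Sketch: conditional MATCHED clauses, which SAS MERGE cannot express.
target.alias("t").merge(
    source.alias("s"),
    "t.customer_id = s.customer_id"
).whenMatchedDelete(
    condition="s.is_deleted = true"             # remove rows the source flags as deleted
).whenMatchedUpdate(
    condition="s.updated_at > t.updated_at",    # update only when the source row is newer
    set={"email": "s.email", "updated_at": "s.updated_at"}
).whenNotMatchedInsertAll(
).execute()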

SAS Procedures to PySpark Equivalents

PROC SORT to .orderBy()

# SAS: proc sort data=sales; by region descending revenue; run;
# PySpark:
sales_sorted = spark.table("catalog.silver.sales").orderBy("region", F.desc("revenue"))

# SAS: proc sort data=sales nodupkey; by customer_id; run;
# PySpark (remove duplicates):
sales_deduped = (
    spark.table("catalog.silver.sales")
    .dropDuplicates(["customer_id"])
)
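One caveat: dropDuplicates() keeps an arbitrary row per key, while NODUPKEY keeps the first observation in sorted order. To pin down which row survives, a row_number() window makes the choice explicit — a sketch keeping the earliest transaction per customer:

# Sketch: deterministic NODUPKEY — keep the earliest row per customer.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w_dedup = Window.partitionBy("customer_id").orderBy("transaction_date")

sales_deduped = (
    spark.table("catalog.silver.sales")
    .withColumn("rn", F.row_number().over(w_dedup))
    .filter(F.col("rn") == 1)
    .drop("rn")
)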

PROC MEANS / PROC FREQ to .groupBy().agg()

# SAS: proc means data=sales n mean std min max sum;
#        var revenue units;
#        class region;
#      run;
# PySpark:
summary = (
    spark.table("catalog.silver.sales")
    .groupBy("region")
    .agg(
        F.count("revenue").alias("n"),
        F.mean("revenue").alias("revenue_mean"),
        F.stddev("revenue").alias("revenue_std"),
        F.min("revenue").alias("revenue_min"),
        F.max("revenue").alias("revenue_max"),
        F.sum("revenue").alias("revenue_sum"),
        F.count("units").alias("units_n"),
        F.mean("units").alias("units_mean"),
        F.sum("units").alias("units_sum")
    )
)

# SAS: proc freq data=sales; tables region * product_line / chisq; run;
# PySpark (cross-tabulation):
crosstab = (
    spark.table("catalog.silver.sales")
    .groupBy("region", "product_line")
    .agg(F.count("*").alias("frequency"))
    .orderBy("region", "product_line")
)
display(crosstab)
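The cross-tabulation above covers the frequency table but not the chi-square statistic requested by the / chisq option. Assuming the contingency table fits in driver memory, one way to compute it is with scipy on a pandas crosstab:

# Sketch: chi-square test for the PROC FREQ "/ chisq" option.
# Assumes the contingency table fits in driver memory.
import pandas as pd
from scipy.stats import chi2_contingency

pdf = spark.table("catalog.silver.sales").select("region", "product_line").toPandas()
contingency = pd.crosstab(pdf["region"], pdf["product_line"])
chi2, p_value, dof, _ = chi2_contingency(contingency)
print(f"chi2={chi2:.2f}, p={p_value:.4f}, dof={dof}")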

PROC TRANSPOSE to pivot()

# SAS: proc transpose data=monthly_sales out=pivoted prefix=month_;
#        by region;
#        id month;
#        var revenue;
#      run;
# PySpark:
pivoted = (
    spark.table("catalog.silver.monthly_sales")
    .groupBy("region")
    .pivot("month")
    .agg(F.sum("revenue"))
)

pivoted.write.mode("overwrite").saveAsTable("catalog.gold.pivoted_sales")
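The SAS prefix=month_ option prefixes each output column name; pivot() does not, but a rename pass before the write replicates it (a sketch):

# Sketch: replicate PROC TRANSPOSE's prefix= by renaming the pivoted columns.
pivoted_prefixed = pivoted.select(
    "region",
    *[F.col(c).alias(f"month_{c}") for c in pivoted.columns if c != "region"]
)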

MigryX precision parser — Deep AST-level analysis ensures every construct is understood before conversion begins

Platform-Specific Optimization by MigryX

MigryX maintains deep knowledge of every target platform’s strengths and best practices. When converting to Snowflake, it leverages Snowpark and native SQL functions. When targeting Databricks, it uses PySpark DataFrame operations optimized for distributed execution. When generating dbt models, it follows dbt best practices for modularity and testability. This platform awareness is what makes MigryX output production-ready from day one.

SAS Libraries and Datasets to Unity Catalog and Delta Lake

In SAS, libraries are defined with LIBNAME statements that point to directories or database connections. Datasets are physical files within those library paths. Unity Catalog replaces this model with a three-level namespace: catalog.schema.table.

# SAS Library definitions:
# libname raw '/data/raw';
# libname silver oracle user=myuser password=mypwd path=mydb;
# libname gold '/data/gold';

# Databricks Unity Catalog equivalent:
# No LIBNAME needed — tables are referenced directly:
# catalog.raw.customers      (replaces raw.customers)
# catalog.silver.customers   (replaces silver.customers)
# catalog.gold.customers     (replaces gold.customers)

# Access control via SQL GRANT:
# GRANT SELECT ON SCHEMA catalog.gold TO `analysts`;
# GRANT ALL PRIVILEGES ON SCHEMA catalog.silver TO `data_engineers`;

Every SAS dataset maps to a Delta Lake table. SAS dataset options (KEEP=, DROP=, WHERE=, RENAME=) map to PySpark operations: .select() for KEEP/DROP, .filter() for WHERE, .withColumnRenamed() for RENAME. SAS formats and informats map to PySpark type casts and format functions.
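A sketch of that dataset-option mapping on a single input (option values are illustrative):

# Sketch: SAS dataset options expressed as PySpark operations.
# SAS: set raw.customers(keep=customer_id name status
#                        where=(status='ACTIVE')
#                        rename=(name=customer_name));
from pyspark.sql import functions as F

customers = (
    spark.table("catalog.raw.customers")
    .select("customer_id", "name", "status")        # KEEP=
    .filter(F.col("status") == "ACTIVE")            # WHERE=
    .withColumnRenamed("name", "customer_name")     # RENAME=
)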

Medallion Architecture: Organizing the Migration

The Databricks Medallion Architecture (Bronze, Silver, Gold) provides a natural framework for organizing migrated SAS programs. This layered approach maps cleanly to the typical SAS data flow: raw ingestion, cleaning and transformation, and reporting-ready output.

Medallion Architecture Mapping

The Medallion Architecture also provides a migration sequencing strategy: migrate Bronze programs first (data ingestion is lowest risk), then Silver (transformation logic is the core of most SAS programs), and finally Gold (reporting and analytics). This incremental approach allows validation at each layer before proceeding to the next.

SAS Analytics to MLflow on Databricks

SAS Enterprise Miner and SAS/STAT procedures (PROC LOGISTIC, PROC REG, PROC MIXED, PROC GENMOD) provide statistical modeling in a proprietary environment. MLflow on Databricks replaces this with an open-source experiment-tracking and model-management platform.

# SAS Original:
# proc logistic data=model_data;
#   model churn(event='1') = tenure monthly_charges total_charges
#                             contract_type payment_method;
#   output out=scored predicted=pred_prob;
# run;

# Databricks MLflow Equivalent:
import mlflow
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

model_data = spark.table("catalog.gold.model_data").toPandas()
X = model_data[["tenure", "monthly_charges", "total_charges",
                 "contract_type_encoded", "payment_method_encoded"]]
y = model_data["churn"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

with mlflow.start_run(run_name="churn_logistic_v1"):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    pred_prob = model.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, pred_prob)

    mlflow.log_metric("auc_roc", auc)
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.sklearn.log_model(model, "churn_model",
                              registered_model_name="churn_prediction")

    print(f"AUC-ROC: {auc:.4f}")

MLflow provides capabilities that SAS Enterprise Miner cannot match: automatic experiment tracking, model versioning with lineage, A/B testing of model versions, and deployment to REST endpoints, batch scoring jobs, or streaming inference — all without additional licensing.
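For batch scoring, the registered model from the example above can be applied to a Spark DataFrame through a pyfunc UDF. A sketch — note that the sklearn pyfunc flavor returns the model's predict() output (class labels), not probabilities:

# Sketch: batch scoring with the registered model via a pyfunc UDF.
import mlflow.pyfunc

churn_udf = mlflow.pyfunc.spark_udf(
    spark, model_uri="models:/churn_prediction/1", result_type="double"
)

scored = spark.table("catalog.gold.model_data").withColumn(
    "churn_pred",  # sklearn pyfunc returns predict() output (class labels)
    churn_udf("tenure", "monthly_charges", "total_charges",
              "contract_type_encoded", "payment_method_encoded")
)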

FORMAT and INFORMAT to PySpark Types

SAS formats and informats control how data is displayed and read. In PySpark, type conversions use cast(), to_date(), to_timestamp(), and date_format().

# SAS formats and their PySpark equivalents:
# FORMAT date_col DATE9.         → F.date_format(F.col("date_col"), "ddMMMyyyy")
# FORMAT amount DOLLAR12.2       → F.concat(F.lit("$"), F.format_number(F.col("amount"), 2))
# FORMAT pct PERCENT8.2          → F.concat(F.round(F.col("pct") * 100, 2).cast("string"), F.lit("%"))
# INFORMAT char_date ANYDTDTE.   → F.to_date(F.col("char_date"), "yyyy-MM-dd")  # use the actual source pattern
# INPUT(char_num, 8.)            → F.col("char_num").cast("double")

df = (
    spark.table("catalog.bronze.raw_data")
    .withColumn("transaction_date", F.to_date(F.col("date_str"), "MM/dd/yyyy"))
    .withColumn("amount", F.col("amount_str").cast("decimal(12,2)"))
    .withColumn("quantity", F.col("qty_str").cast("integer"))
    .withColumn("created_ts", F.to_timestamp(F.col("ts_str"), "yyyy-MM-dd HH:mm:ss"))
)

PROC EXPORT to Delta Table Writes

SAS PROC EXPORT writes datasets to external formats (CSV, Excel, databases). In Databricks, the primary target is Delta Lake tables, but PySpark supports writing to any format including CSV, Parquet, JSON, and external databases.

# SAS: proc export data=gold.report outfile='/output/report.csv' dbms=csv replace; run;
# PySpark (Spark writes a directory of part files; coalesce(1) produces a single file):
spark.table("catalog.gold.report").coalesce(1).write.mode("overwrite").csv("/mnt/output/report", header=True)

# SAS: proc export data=gold.report dbms=xlsx outfile='/output/report.xlsx' replace; run;
# PySpark (via pandas; requires the openpyxl package):
spark.table("catalog.gold.report").toPandas().to_excel("/tmp/report.xlsx", index=False)

# Primary pattern: write to Delta Lake (no export needed)
result_df.write.mode("overwrite").saveAsTable("catalog.gold.report")
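Writing to an external relational database — the SAS/ACCESS use case — goes through the JDBC writer. A sketch with placeholder connection details:

# Sketch: JDBC export to an external database; URL, table, and secret names are placeholders.
(
    spark.table("catalog.gold.report")
    .write.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/reporting")
    .option("dbtable", "public.report")
    .option("user", "report_writer")
    .option("password", dbutils.secrets.get(scope="jdbc", key="report_pwd"))
    .mode("overwrite")
    .save()
)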

How MigryX Automates SAS-to-Databricks Migration

MigryX provides automated conversion of SAS programs to Databricks-ready PySpark notebooks. The platform parses SAS source code — DATA steps, PROC SQL, macro definitions, PROC SORT, PROC MEANS, MERGE statements, and all supporting constructs — and generates equivalent PySpark code with Delta Lake integration.

The MigryX engine handles the most challenging aspects of SAS migration, from macro expansion and implicit DATA step semantics to custom formats and BY-group logic.

For organizations with thousands of SAS programs, manual migration is impractical. A typical enterprise SAS estate contains 5,000-20,000 programs with complex interdependencies, macro libraries, and custom formats. MigryX processes the entire estate, generates converted code, and produces validation reports that compare source and target output row counts, column profiles, and statistical summaries.
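As an illustration of the kind of parity check such validation reports encode (table and file paths are placeholders, with the SAS output exported for comparison):

# Illustrative parity check between migrated output and a SAS reference export.
from pyspark.sql import functions as F

migrated = spark.table("catalog.silver.customers")
reference = spark.read.parquet("/mnt/validation/sas_export/customers")  # placeholder path

assert migrated.count() == reference.count(), "row count mismatch"

m = migrated.agg(F.sum("balance"), F.countDistinct("customer_id")).first()
r = reference.agg(F.sum("balance"), F.countDistinct("customer_id")).first()
assert m == r, "column profile mismatch"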

Key Takeaways

Migrating from SAS to Databricks is not just a technology change — it is a fundamental shift in how organizations approach analytics. The move from proprietary, per-core licensing to open-source, elastic compute unlocks both cost savings and architectural flexibility. PySpark provides a natural home for SAS DATA step logic. Delta Lake surpasses SAS datasets in every dimension: ACID transactions, time travel, schema evolution, and concurrent access. And the Databricks ecosystem — MLflow, Unity Catalog, Databricks SQL — delivers capabilities that would require multiple SAS products and additional licenses to replicate.

The technical mapping is clear. The economic case is compelling. The question is not whether to migrate, but how to execute efficiently. For organizations with large SAS estates, automated conversion with MigryX reduces migration timelines from years to months, preserving institutional logic while unlocking the full potential of the Databricks Lakehouse.

Why MigryX Delivers Superior Migration Results

The challenges described throughout this article are exactly what MigryX was built to solve. Here is how MigryX transforms this process:

MigryX combines precision AST parsing with Merlin AI to deliver 99% accurate, production-ready migration — turning what used to be a multi-year manual effort into a streamlined, validated process. See it in action.

Ready to migrate from SAS to Databricks?

See how MigryX converts SAS DATA steps, PROC SQL, macros, and procedures to production-ready PySpark notebooks with Delta Lake and MLflow integration.

Explore SAS Migration   Schedule a Demo