Apache Spark is a parallel processing framework that supports in-memory processing to boost the performance of big-data analytics applications. Apache Spark in Azure Synapse Analytics is one of Microsoft's implementations of Apache Spark in the cloud. Azure Synapse makes it easy to create and configure a serverless Apache Spark pool in Azure. Spark pools in Azure Synapse are compatible with Azure Storage, Azure SQL Database, and Azure Data Lake Storage Gen2, so you can use Spark pools to process your data stored in Azure.
In this article, I will show you how to connect to a relational database such as Azure SQL Database and query its data from within a Synapse notebook. Authentication in this exercise is done with a service principal. You will need the following prerequisites:
- Azure Synapse workspace (MS guide is here)
- Azure Synapse Apache Spark pool (MS guide is here)
- Service principal in Azure AD (MS guide is here)
The very first step in this exercise is to add the service principal to the Azure SQL DB as an external user. To do this, log in to the Azure SQL DB using Azure Data Studio or SSMS with an Azure Active Directory account that has owner rights on the database, and execute the following T-SQL on the source Azure SQL DB:
```sql
CREATE USER [my-synapse-serv-prin] FROM EXTERNAL PROVIDER;
GO

-- Grant read access through the db_datareader fixed database role
EXEC sp_addrolemember 'db_datareader', 'my-synapse-serv-prin';
GO
```
Note that the external user name must match the name of the service principal in Azure AD.
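To confirm the user was created correctly, you can query the database principals catalog view (the principal name below matches the example above):

```sql
-- Verify that the external (Azure AD) user now exists in the database
SELECT name, type_desc, authentication_type_desc
FROM sys.database_principals
WHERE name = 'my-synapse-serv-prin';
```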
Next, we need to create a linked service that points to the source Azure SQL DB. In the Synapse workspace, go to the Manage tab and create a new linked service. Below is an example of the connection details.
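For reference, a linked service of this kind corresponds roughly to a JSON definition like the following sketch. The linked service name matches the one referenced later in the notebook; the server, database, IDs, and secret are placeholders you would replace with your own values:

```json
{
  "name": "AzureSqlDatabase_service_principal",
  "properties": {
    "type": "AzureSqlDatabase",
    "typeProperties": {
      "connectionString": "Server=tcp:azuresqlserver.database.windows.net,1433;Database=My_Azure_SQLDB_Name;",
      "servicePrincipalId": "<application-client-id>",
      "servicePrincipalKey": {
        "type": "SecureString",
        "value": "<client-secret>"
      },
      "tenant": "<tenant-id>"
    }
  }
}
```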
Now we have set up everything needed to create a Synapse notebook that connects to, queries, and works with data from relational databases.
From the Develop tab in the Synapse workspace, create a new notebook, attach it to the Spark pool you created in the earlier prerequisite steps, and choose Python as the programming language.
In the first cell, we create a connection to the Azure SQL DB using the linked service we created earlier.
```python
%%pyspark
# Get an access token for the linked service via the Synapse TokenLibrary.
# In PySpark, TokenLibrary is reached through the JVM gateway.
token_library = spark._jvm.com.microsoft.azure.synapse.tokenlibrary.TokenLibrary
access_token = token_library.getConnectionString("AzureSqlDatabase_service_principal")
# print(access_token)

url = "jdbc:sqlserver://azuresqlserver.database.windows.net:1433"

# Read the table through the Apache Spark connector for SQL Server
customerDF = spark.read \
    .format("com.microsoft.sqlserver.jdbc.spark") \
    .option("url", url) \
    .option("dbtable", "dbo.tb_Customer") \
    .option("databaseName", "My_Azure_SQLDB_Name") \
    .option("accessToken", access_token) \
    .option("encrypt", "true") \
    .option("hostNameInCertificate", "*.database.windows.net") \
    .load()
```
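Once the DataFrame is loaded, you can work with it like any other Spark DataFrame. As a quick sanity check (assuming the table and linked service names from the example above), you might preview the data and register it for Spark SQL queries:

```python
# Inspect the inferred schema and preview a few rows
customerDF.printSchema()
customerDF.show(10)

# Register the DataFrame as a temp view so it can be queried with Spark SQL
customerDF.createOrReplaceTempView("customer")
spark.sql("SELECT COUNT(*) AS customer_count FROM customer").show()
```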
In this article we saw how to connect to relational data sources within Azure Synapse notebooks. Notebooks are also widely used in data preparation, data visualization, machine learning, and other big data scenarios. A Synapse Studio notebook is a web interface for you to create files that contain live code, visualizations, and narrative text. Notebooks are a good place to validate ideas and use quick experiments to get insights from your data.
With a Synapse Studio notebook, you can:
- Get started with zero setup effort.
- Keep data secure with built-in enterprise security features.
- Analyze data across raw formats (CSV, TXT, JSON, etc.), processed file formats (Parquet, Delta Lake, ORC, etc.), and SQL tabular data files against Spark and SQL.
- Be productive with enhanced authoring capabilities and built-in data visualization.