Azure Databricks Environment Setup
2021-11-07
0. Intro
Today I spent some time exploring the Azure Databricks service, and I found the website UI not very convenient for writing code (too many mouse operations).
So I decided to set up Databricks with my favourite IDE, VS Code, so that I can use my local editor to interact with the cluster and run code.
This tutorial aims to help you with the environment setup process (I had a lot of fun with it, so I hope this post can help someone else trying to do the same thing, or my future self who has completely forgotten the steps).
1. Context
Databricks vs Apache Spark
- Databricks is an Analytics Platform based on Apache Spark
- Apache Spark is an Analytics Engine.
Databricks
Databricks is an Apache Spark-based Unified Analytics Platform, optimized for the cloud.
Because it is built on Apache Spark, it has all of Spark's features, supports multiple languages and libraries, and so on.
In short, it is going to do the heavy lifting for us to use Spark.
Azure Databricks
It’s just Databricks running on the Azure platform.
- It is an Azure service that allows you to build Apache Spark-based applications.
- It is an analytics platform that runs on the cloud and lets you build Spark applications easily.
Diagram
To help us understand the terminology, I’ve drawn this diagram to show how the different pieces relate to each other.
- we need to create a Databricks service in the Azure portal; it is a workspace with a unique URL and ID
- we can create notebooks in the Databricks workspace; notebooks are essentially instructions written in different languages to interact with the cluster and storage mediums (e.g., DBFS)
- we can create many clusters in the workspace, and we can also create a pool to share resources among different clusters
- a cluster is the Databricks runtime; it comes in different versions, you can specify how many workers you want, and you pay for the cluster’s computational resources (DBU is the measurement unit)
- you can attach your notebook to any cluster (or no cluster at all if you don’t need to run anything); once it’s attached, when you run the notebook instructions, for example, load a csv file from DBFS and create a table, the attached cluster does the work based on those instructions (see the sketch after this list)
- databases and tables are created and stored in the cluster
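To make the notebook/cluster/DBFS example above concrete, here is a minimal PySpark sketch of loading a CSV from DBFS and saving it as a table. The file path and table name are placeholders I made up for illustration; in a notebook the spark session already exists, so getOrCreate() simply reuses it.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read a CSV file from DBFS (hypothetical path) into a DataFrame.
df = spark.read.csv("dbfs:/FileStore/tables/sample.csv", header=True, inferSchema=True)

# Persist it as a table; the attached cluster does the actual work.
df.write.mode("overwrite").saveAsTable("sample_table")

spark.sql("SELECT COUNT(*) FROM sample_table").show()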
2. Databricks CLI
Workflow
Create workspace and generate token
Note: you also need to create a cluster (go to Compute -> Cluster -> Create); I am using this runtime version:
7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)
Install CLI on local machine
pip install databricks-cli
databricks --version
Note: I had an issue at this step on macOS, but it works fine on my Windows machine. Solved: use
pip3 install databricks-cli
Configure profile
# powershell
databricks configure --token
# provide your workspace url, e.g., https://adb-xxxxxxxxx.azuredatabricks.net/
# provide your token
# check the profile
get-content ~/.databrickscfg
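If the profile was written correctly, the file should look roughly like this (the host and token values below are placeholders, not real credentials):
[DEFAULT]
host = https://adb-xxxxxxxxx.azuredatabricks.net/
token = <your-token>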
Test
databricks fs ls
You should be able to see some folders.
3. Connect with IDE
Reference: https://docs.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect
1. Install some packages
- Anaconda: Anaconda | Individual Edition
- After installation, open the application and launch cmd.exe, then create a dbconnect environment in that window:
conda create --name dbconnect python=3.7 conda
- Winutils: run the following commands in an admin-mode PowerShell window; they download the Hadoop 2.7 winutils.exe and add HADOOP_HOME to the environment variables:
New-Item -Path "C:\Hadoop\Bin" -ItemType Directory -Force
Invoke-WebRequest -Uri https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe -OutFile "C:\Hadoop\Bin\winutils.exe"
[Environment]::SetEnvironmentVariable("HADOOP_HOME", "C:\Hadoop", "Machine")
2. Install databricks-connect
pip uninstall pyspark
pip install -U "databricks-connect==7.3.*" # or X.Y.* to match your cluster version.
# configure (refer to the official doc to find identifiers)
databricks-connect configure
# test connection
databricks-connect test
3. Download IDE extensions
Databricks Connect - Azure Databricks | Microsoft Docs
For people using VS Code, I find the steps pretty straightforward: you need to use Python 3.7 as the interpreter and change some configs in the preferences UI; then you can test the connection by running an example like the one below (just create a new .py file and right-click to select run):
from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

# databricks-connect routes this session to the remote cluster
spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)

# list the DBFS root and the secret scopes to verify the connection
print(dbutils.fs.ls("dbfs:/"))
print(dbutils.secrets.listScopes())
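If the connection works, a quick sanity check is to run a small computation; databricks-connect ships the job to the cluster and returns the result locally. This snippet is just an illustrative throwaway, not part of the official test:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Build a small DataFrame and aggregate it on the remote cluster.
df = spark.range(0, 1000)
print(df.selectExpr("sum(id) AS total").collect())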
4. Troubleshooting
- conda command not found: open the Anaconda application and launch cmd from there
- hadoop is not found: check the environment variables; HADOOP_HOME should point to the directory one level above the Bin folder, e.g., C:\Hadoop is correct, C:\Hadoop\Bin is wrong (a quick check is sketched below).
- hadoop is still not found: close your IDE and terminal, then restart them.
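If you are not sure what the interpreter actually sees, a quick check (my own habit, not an official step) is to print the variable from Python:
import os

# Should print C:\Hadoop (the parent of the Bin folder), not C:\Hadoop\Bin.
print(os.environ.get("HADOOP_HOME"))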