Today I spent some time exploring the Azure Databricks service, and I found the website UI not very convenient for writing code (too many mouse operations).
So I decided to set up Databricks with my favourite IDE - VS Code, so that I can use my local editor to interact with the cluster and run code.
This tutorial aims to help you with the environment setup process (I had a lot of fun with it, so I hope this post can help someone else trying to do the same thing, or my future self who has completely forgotten the steps).
Databricks vs Apache Spark
- Databricks is an analytics platform based on Apache Spark.
- Apache Spark is an analytics engine.
Databricks is an Apache Spark based Unified Analytics Platform, optimized for the cloud.
Because Databricks is based on Apache Spark, it has all of Spark's features, supports multiple languages and libraries, and so on.
In short, it is going to do the heavy lifting for us to use Spark.
Azure Databricks is just Databricks running on the Azure platform.
- It is an Azure service that allows you to build Apache Spark-based applications.
- It is an analytics platform that runs on the cloud and allows you to build Spark applications easily.
To help us understand the terminologies, I’ve drawn this diagram to show the relationship of different pieces.
- we need to create a Databricks service in the Azure portal; it is a workspace with a unique URL and ID
- we can create notebooks in the Databricks workspace; notebooks are essentially instructions, written in different languages, to interact with the cluster and storage mediums (e.g., DBFS)
- we can create many clusters in the workspace, and we can also create a pool to share resources among different clusters
- a cluster is the Databricks runtime; it has different versions, and you can specify how many workers you want to have. You will have to pay for the cluster's computational resources (DBU is the measurement unit)
- you can attach your notebook to any cluster (or no cluster at all if you don't need to run anything); once it's attached, when you run the notebook instructions, for example, load a CSV file from DBFS and create a table, the attached cluster will do the job based on those instructions
- database and tables are created and stored in the cluster
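Since you pay per DBU, it can help to sanity-check a cluster's cost before spinning one up. Here is a rough back-of-the-envelope calculator; all the rates below are hypothetical placeholders, not real Azure prices (check the Azure Databricks pricing page for your region and tier):

```python
# Rough hourly cluster cost estimate. Azure Databricks bills DBU
# consumption plus the cost of the underlying VMs; every number used
# here is a made-up example, not an actual price.
def estimate_hourly_cost(workers, dbu_per_node_hour, dbu_price, vm_price):
    """Total hourly cost for a driver node plus `workers` worker nodes."""
    nodes = workers + 1                          # the driver counts too
    dbu_cost = nodes * dbu_per_node_hour * dbu_price
    vm_cost = nodes * vm_price
    return dbu_cost + vm_cost

# e.g., 2 workers, 0.75 DBU/node/hour, $0.40 per DBU, $0.30 per VM-hour
print(round(estimate_hourly_cost(2, 0.75, 0.40, 0.30), 2))
```

Autoscaling and pools change the picture (idle pool instances still incur VM cost), so treat this as an upper-bound sketch for a fixed-size cluster.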
2. Databricks CLI
Create a workspace and generate a token
Note: you also need to create a cluster (go to Compute -> Cluster -> Create). I am using this version:
7.3 LTS (includes Apache Spark 3.0.1, Scala 2.12)
Install CLI on local machine
pip install databricks-cli
databricks --version
Note: I had an issue at this step on macOS, but it works fine on my Windows machine. Solved: use
pip3 install databricks-cli
# powershell
databricks configure --token
# provide your workspace url, e.g., https://adb-xxxxxxxxx.azuredatabricks.net/
# provide your token
# check the profile
get-content ~/.databrickscfg
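If you'd rather not remember the platform-specific command, the profile file is plain INI and can be inspected with Python's standard library. This is just my own convenience check (it works the same on Windows and macOS), not part of the Databricks CLI:

```python
# Inspect ~/.databrickscfg (written by `databricks configure --token`)
# using only the standard library.
import configparser
import os

cfg = configparser.ConfigParser()
cfg.read(os.path.expanduser("~/.databrickscfg"))

# The CLI stores the workspace URL and token under [DEFAULT]
# unless you configured a named profile.
print("host:", cfg["DEFAULT"].get("host"))
```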
databricks fs ls
You should be able to see some folders.
3. Connect with IDE
1. Install some packages
Anaconda: Anaconda | Individual Edition
after installation, open the application and launch cmd.exe, then create a dbconnect environment in that window
conda create --name dbconnect python=3.7
conda activate dbconnect
Winutils: run the following commands in an admin-mode PowerShell window; they will download the Hadoop 2.7 winutils binary and add it to the environment variables
New-Item -Path "C:\Hadoop\Bin" -ItemType Directory -Force
Invoke-WebRequest -Uri https://github.com/steveloughran/winutils/raw/master/hadoop-2.7.1/bin/winutils.exe -OutFile "C:\Hadoop\Bin\winutils.exe"
[Environment]::SetEnvironmentVariable("HADOOP_HOME", "C:\Hadoop", "Machine")
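A wrong HADOOP_HOME is easy to miss and only surfaces later as a confusing "hadoop is not found" error, so a tiny check like this can save a restart cycle. The helper below is my own sketch, not part of Hadoop or Databricks:

```python
# Sanity-check the winutils setup: HADOOP_HOME must point to the folder
# *above* Bin (e.g., C:\Hadoop, not C:\Hadoop\Bin), and Bin\winutils.exe
# must exist under it.
import os

def winutils_ok(hadoop_home):
    """Return True if winutils.exe is where Spark expects it."""
    return os.path.isfile(os.path.join(hadoop_home, "Bin", "winutils.exe"))

print(winutils_ok(os.environ.get("HADOOP_HOME", r"C:\Hadoop")))
```

If this prints False, re-check the environment variable before blaming databricks-connect itself.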
2. Install databricks-connect
pip uninstall pyspark
pip install -U "databricks-connect==7.3.*" # or X.Y.* to match your cluster version
# configure (refer to the official doc to find identifiers)
databricks-connect configure
# test connection
databricks-connect test
3. Download IDE extensions
Databricks Connect - Azure Databricks | Microsoft Docs
For people using VS Code, I found the steps pretty straightforward: use Python 3.7 as the interpreter and change some configs in the preferences UI, then test the connection by running an example like the one below (just create a new .py file and right-click to select run):
from pyspark.sql import SparkSession
from pyspark.dbutils import DBUtils

spark = SparkSession.builder.getOrCreate()
dbutils = DBUtils(spark)

print(dbutils.fs.ls("dbfs:/"))
print(dbutils.secrets.listScopes())
- conda command not found: open the Anaconda application and launch cmd from there
- Hadoop is not found: check the environment variables; HADOOP_HOME should point to the directory one level above the Bin folder, e.g., C:\Hadoop is correct, C:\Hadoop\Bin is wrong.
- Hadoop is still not found: close your IDE and terminal, then restart them.