What does a Data Engineer do?
2021-09-25
1. Why do we need Data Engineers?
Goal:
- help other people do their jobs (provide business insights)
Input:
- Raw Data from different sources
Output:
- Data for Data Analysts to use (DAs use BI tools to visualize the data)
- Data for Data Scientists to build models and make predictions
Process:
- E(xtract): use API connections to fetch data
- T(ransform): clean, map, and validate the data
- L(oad): load the data into a database (see the sketch after this list)
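To make the three steps concrete, here is a minimal ETL sketch in Python. The API endpoint, the field names, and the SQLite target are all assumptions for illustration, not a real pipeline.

```python
import sqlite3

import requests


def extract() -> list[dict]:
    # E: fetch raw records from the (hypothetical) source API
    resp = requests.get("https://api.example.com/orders", timeout=30)
    resp.raise_for_status()
    return resp.json()


def transform(raw: list[dict]) -> list[tuple]:
    # T: clean, map, and validate the raw records
    rows = []
    for rec in raw:
        if rec.get("amount") is None:  # validate: drop incomplete records
            continue
        rows.append((rec["id"], rec["customer"], float(rec["amount"])))
    return rows


def load(rows: list[tuple]) -> None:
    # L: write the cleaned rows into the target database
    con = sqlite3.connect("warehouse.db")
    con.execute(
        "CREATE TABLE IF NOT EXISTS orders (id TEXT, customer TEXT, amount REAL)"
    )
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    con.commit()
    con.close()


if __name__ == "__main__":
    load(transform(extract()))
```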
2. Data Storage Optimizations
Data Warehouse
Stores structured data from different sources in a way that suits analytics purposes.
- the Data Engineer needs to interact with other teams to come up with a good data warehouse design
- improves query speed (the warehouse is optimized to run complex analytics queries, as in the sketch below)
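As an illustration of the kind of query a warehouse is designed for, here is a hedged sketch: the star-schema tables (fact_orders, dim_customers) and their columns are hypothetical.

```python
import sqlite3

# Hypothetical star-schema query: monthly revenue per region,
# joining a fact table to a dimension table.
QUERY = """
SELECT d.region,
       strftime('%Y-%m', f.order_date) AS month,
       SUM(f.amount)                   AS revenue
FROM fact_orders f
JOIN dim_customers d ON d.customer_id = f.customer_id
GROUP BY d.region, month
ORDER BY month, revenue DESC;
"""

con = sqlite3.connect("warehouse.db")
for region, month, revenue in con.execute(QUERY):
    print(region, month, revenue)
con.close()
```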
Data Lake
Stores all the raw data.
ETL then becomes ELT:
- extract data from different sources
- load the raw data into the data lake
- transform the data later (the Data Scientists decide how to transform it)
The Data Engineer does the EL part and the Data Scientists do the T part; a sketch of the EL part follows below.
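A minimal EL sketch, assuming the same hypothetical API and using a local directory to stand in for the data lake (in practice this would be S3, GCS, HDFS, etc.):

```python
import json
import pathlib
from datetime import date

import requests

LAKE = pathlib.Path("datalake/raw/orders")


def extract_and_load() -> None:
    # E: fetch the raw payload from the (hypothetical) source
    resp = requests.get("https://api.example.com/orders", timeout=30)
    resp.raise_for_status()
    # L: land it in the lake untouched, partitioned by ingestion date;
    # the T step is left to the Data Scientists downstream
    partition = LAKE / f"dt={date.today().isoformat()}"
    partition.mkdir(parents=True, exist_ok=True)
    (partition / "orders.json").write_text(json.dumps(resp.json()))


if __name__ == "__main__":
    extract_and_load()
```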
3. Data Scientist
- builds models to forecast the future (e.g. sales for the next quarter)
- accesses both the Data Warehouse and the Data Lake
The Data Engineer needs to create custom ETL pipelines (ad-hoc tasks) and provide the Data Lake (raw data) for the Data Scientists.
4. Big Data
Characteristics
- Volume
- Variety
- Veracity
- Velocity
Velocity - Data Streaming
Related to the E(xtract) part of ETL: in the big data world, new data is generated in real time.
In traditional ETL, we fetched batch data from the source through API requests; this is called Synchronous Communication.
In the big data world, we need Asynchronous Communication instead, by adopting the pub-sub pattern:
- data is divided into different topics
- subscribers receive data whenever new data is published
Common technologies:
- Kafka (implements the pub-sub pattern; see the sketch below)
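A minimal pub-sub sketch using the kafka-python client; the broker address and the topic name are assumptions for illustration:

```python
import json

from kafka import KafkaConsumer, KafkaProducer

# Publisher side: new events are written to a topic as they happen.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("orders", {"id": 1, "amount": 9.99})
producer.flush()

# Subscriber side: consumers receive events whenever they are published,
# instead of polling the source via synchronous API requests.
consumer = KafkaConsumer(
    "orders",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    print(message.value)  # each new record arrives in (near) real time
```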
Volume - Distributed Storage and Computing
- a server cluster is used to store the data
- scalability is key
- redundancy is built in to keep the data safe
- we need a big data processing framework to interact with the distributed data
Common technologies:
- Hadoop (stores data across the cluster)
- Apache Spark (big data ETL/ELT processing framework; see the sketch below)
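A minimal PySpark sketch of distributed processing over the raw files in the lake; the paths and column names are assumptions carried over from the earlier sketches:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-elt").getOrCreate()

# Spark reads the raw files in parallel across the cluster
# (the path could just as well point at HDFS or S3).
raw = spark.read.json("datalake/raw/orders/*/orders.json")

# The transform is written once and distributed automatically.
clean = (
    raw.where(F.col("amount").isNotNull())
       .groupBy("customer")
       .agg(F.sum("amount").alias("total_spent"))
)

clean.write.mode("overwrite").parquet("datalake/curated/customer_totals")
spark.stop()
```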
5. Summary
(Big) Data Engineer
Works with ETL/ELT processes to consume data from different data sources and load it into the Data Warehouse and Data Lake for business use.
The design of the data warehouse should suit the end users (Data Analysts, Data Scientists, Machine Learning Engineers, etc.).
Data Scientists
Consume data from the Data Warehouse and the Data Lake, develop models, and make predictions.
Data Analysts
Consume data from a BI interface (linked to the Data Warehouse) and develop reports.
Machine Learning Engineers
Make use of the ETL output to produce real-time recommendations for users.