
ETL Data Flow

Extract, Transform, and Load (ETL) for Softbank Data Lake

We use the AWS Cloud Platform (around 30 services) to move 100 GB of IoT data daily from diverse databases (MySQL, Cassandra, Salesforce, etc.) into the Softbank Data Lake.

  • Python
  • Luigi
  • AWS Cloud Platform
  • AWS EMR (Elastic MapReduce)
  • AWS EC2 (Elastic Compute Cloud)
  • AWS S3 (Simple Storage Service)
  • PySpark / Spark
  • Pandas
  • Pytest
  • Git
  • MySQL
  • Salesforce
  • Data Lake

Details

The SBRE Data Lake has stored about 85 TB of data since 2016, so that interested parties (the data science team, other SBR teams, and external partners) can access and analyze it.

The Data Sources

Data are fetched from four diverse sources: from Cassandra via API, from MySQL via SQL query, from Salesforce via API, and from DynamoDB via API. All Data Lake data are stored in AWS S3 buckets.
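
As a rough illustration of one such extract step, a minimal sketch (assuming a hypothetical `sensor_readings` table, bucket name, and connection string; the real pipeline's names are not shown here) might pull a day's rows from MySQL via SQL and land them in S3 as CSV:

```python
# Sketch of an extract step: MySQL -> CSV -> S3.
# Table, columns, credentials, bucket, and key prefix are hypothetical placeholders.
import io

import boto3
import pandas as pd
from sqlalchemy import create_engine, text


def extract_mysql_to_s3(date: str) -> None:
    # Read the day's rows with a plain SQL query.
    engine = create_engine("mysql+pymysql://etl_user:***@mysql-host/iot")
    df = pd.read_sql(
        text("SELECT * FROM sensor_readings WHERE DATE(created_at) = :d"),
        engine,
        params={"d": date},
    )

    # Serialize to CSV in memory and upload to the Data Lake bucket.
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    boto3.client("s3").put_object(
        Bucket="sbre-data-lake",  # hypothetical bucket name
        Key=f"raw/mysql/sensor_readings/{date}.csv",
        Body=buf.getvalue().encode("utf-8"),
    )
```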

The Cloud Architecture

Every day, Data Lake data are produced by 3 AWS EC2 clusters. For certain big steps, EMR clusters are launched separately to take the heavy load off the EC2 clusters. A bastion host is used for security control.
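
One way such a heavy step can be offloaded is by launching a transient EMR cluster with boto3, as in the sketch below. The instance types, counts, IAM roles, region, and Spark script path are hypothetical; the real cluster configuration is not shown here.

```python
# Sketch: run a heavy Spark transform on a transient EMR cluster
# that terminates itself once the step finishes.
import boto3

emr = boto3.client("emr", region_name="ap-northeast-1")  # hypothetical region

response = emr.run_job_flow(
    Name="daily-heavy-transform",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 5,
        "KeepJobFlowAliveWhenNoSteps": False,  # shut down after the step
    },
    Steps=[{
        "Name": "transform-iot-data",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://sbre-etl-code/transform_iot.py"],  # hypothetical
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```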

The Luigi Flow

There are around 30 to 40 tasks producing data from the different databases every day. We use Luigi to manage the complexity of the dependencies between all of these tasks.
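
The sketch below shows how Luigi expresses such dependencies: a downstream task declares the upstream task in `requires()`, and Luigi schedules them in order and skips work whose output already exists. Task names, S3 paths, and the in-task logic are hypothetical placeholders, not the real flow.

```python
# Minimal Luigi dependency sketch (hypothetical tasks and paths).
import datetime

import luigi
from luigi.contrib.s3 import S3Target


class ExtractMySQL(luigi.Task):
    """Pull one day of raw rows out of MySQL (placeholder logic)."""
    date = luigi.DateParameter()

    def output(self):
        return S3Target(f"s3://sbre-data-lake/raw/mysql/{self.date}.csv")

    def run(self):
        with self.output().open("w") as out:
            out.write("device_id,value\n")  # stands in for the real extract


class TransformReadings(luigi.Task):
    """Runs only after ExtractMySQL has produced its output."""
    date = luigi.DateParameter()

    def requires(self):
        return ExtractMySQL(date=self.date)

    def output(self):
        return S3Target(f"s3://sbre-data-lake/curated/readings/{self.date}.csv")

    def run(self):
        with self.input().open("r") as src, self.output().open("w") as out:
            out.write(src.read())  # stands in for the real transform


if __name__ == "__main__":
    luigi.build([TransformReadings(date=datetime.date.today())],
                local_scheduler=True)
```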

The Monitoring


A simple monitoring project has been built to alert on any issues or crashes in the daily ETL process.
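
One way failure alerting can hook into Luigi is through its event handlers, as sketched below: a handler registered for the FAILURE event fires whenever any task raises, and can forward the error to a notification channel. The SNS topic ARN and message format are hypothetical; the actual monitoring project may use a different channel.

```python
# Sketch: alert on Luigi task failures via an SNS topic (hypothetical ARN).
import boto3
import luigi

sns = boto3.client("sns")


@luigi.Task.event_handler(luigi.Event.FAILURE)
def notify_failure(task, exception):
    # Called by Luigi whenever any task in the daily flow fails.
    sns.publish(
        TopicArn="arn:aws:sns:ap-northeast-1:123456789012:etl-alerts",  # hypothetical
        Subject=f"ETL task failed: {task}",
        Message=str(exception),
    )
```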
