
ETL Data Flow

Extract, Transform, and Load (ETL) for Softbank Data Lake

We use the AWS Cloud Platform (around 30 services) to move 100 GB of IoT data daily from diverse databases (MySQL, Cassandra, Salesforce, etc.) into the Softbank Data Lake.

  • Python
  • Luigi
  • AWS Cloud Platform
  • AWS EMR (Elastic MapReduce)
  • AWS EC2 (Elastic Compute Cloud)
  • AWS S3 (Simple Storage Service)
  • PySpark / Spark
  • Pandas
  • Pytest
  • Git
  • MySQL
  • Salesforce
  • Data Lake

Details

The SBRE Data Lake has stored about 85 TB of data since 2016, so that interested parties (the data science team, other SBR teams, and external partners) can access and analyze it.

The Data Sources

Data are fetched from four diverse sources: from Cassandra via API, from MySQL via SQL query, from Salesforce via API, and from DynamoDB via API. All Data Lake data are stored in AWS S3 buckets.
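
As a rough illustration of one such extract step, a minimal sketch (assuming a hypothetical `sensor_readings` table, bucket name, and connection string; the real pipeline's names are not shown here) might pull a day's rows from MySQL via SQL and land them in S3 as CSV:

```python
# Sketch of an extract step: MySQL -> CSV -> S3.
# Table, columns, credentials, bucket, and key prefix are hypothetical placeholders.
import io

import boto3
import pandas as pd
from sqlalchemy import create_engine, text


def extract_mysql_to_s3(date: str) -> None:
    # Read the day's rows with a plain SQL query.
    engine = create_engine("mysql+pymysql://etl_user:***@mysql-host/iot")
    df = pd.read_sql(
        text("SELECT * FROM sensor_readings WHERE DATE(created_at) = :d"),
        engine,
        params={"d": date},
    )

    # Serialize to CSV in memory and upload to the Data Lake bucket.
    buf = io.StringIO()
    df.to_csv(buf, index=False)
    boto3.client("s3").put_object(
        Bucket="sbre-data-lake",  # hypothetical bucket name
        Key=f"raw/mysql/sensor_readings/{date}.csv",
        Body=buf.getvalue().encode("utf-8"),
    )
```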

The Cloud Architecture

Every day, Data Lake data are produced by 3 AWS EC2 clusters. For certain big steps, EMR clusters are launched separately to take the heavy load off the EC2 clusters. A bastion host is used for security control.
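
One way such a heavy step can be offloaded is by launching a transient EMR cluster with boto3, as in the sketch below. The instance types, counts, IAM roles, region, and Spark script path are hypothetical; the real cluster configuration is not shown here.

```python
# Sketch: run a heavy Spark transform on a transient EMR cluster
# that terminates itself once the step finishes.
import boto3

emr = boto3.client("emr", region_name="ap-northeast-1")  # hypothetical region

response = emr.run_job_flow(
    Name="daily-heavy-transform",
    ReleaseLabel="emr-5.30.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 5,
        "KeepJobFlowAliveWhenNoSteps": False,  # shut down after the step
    },
    Steps=[{
        "Name": "transform-iot-data",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://sbre-etl-code/transform_iot.py"],  # hypothetical
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```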

The Luigi Flow

There are around 30 to 40 tasks producing data from the different databases every day. We use Luigi to manage the complexity of the dependencies between all of these tasks.
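
The sketch below shows how Luigi expresses such dependencies: a downstream task declares the upstream task in `requires()`, and Luigi schedules them in order and skips work whose output already exists. Task names, S3 paths, and the in-task logic are hypothetical placeholders, not the real flow.

```python
# Minimal Luigi dependency sketch (hypothetical tasks and paths).
import datetime

import luigi
from luigi.contrib.s3 import S3Target


class ExtractMySQL(luigi.Task):
    """Pull one day of raw rows out of MySQL (placeholder logic)."""
    date = luigi.DateParameter()

    def output(self):
        return S3Target(f"s3://sbre-data-lake/raw/mysql/{self.date}.csv")

    def run(self):
        with self.output().open("w") as out:
            out.write("device_id,value\n")  # stands in for the real extract


class TransformReadings(luigi.Task):
    """Runs only after ExtractMySQL has produced its output."""
    date = luigi.DateParameter()

    def requires(self):
        return ExtractMySQL(date=self.date)

    def output(self):
        return S3Target(f"s3://sbre-data-lake/curated/readings/{self.date}.csv")

    def run(self):
        with self.input().open("r") as src, self.output().open("w") as out:
            out.write(src.read())  # stands in for the real transform


if __name__ == "__main__":
    luigi.build([TransformReadings(date=datetime.date.today())],
                local_scheduler=True)
```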

The Monitoring


A simple monitoring project has been built to alert on any issues or crashes in the daily ETL process.
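
One way failure alerting can hook into Luigi is through its event handlers, as sketched below: a handler registered for the FAILURE event fires whenever any task raises, and can forward the error to a notification channel. The SNS topic ARN and message format are hypothetical; the actual monitoring project may use a different channel.

```python
# Sketch: alert on Luigi task failures via an SNS topic (hypothetical ARN).
import boto3
import luigi

sns = boto3.client("sns")


@luigi.Task.event_handler(luigi.Event.FAILURE)
def notify_failure(task, exception):
    # Called by Luigi whenever any task in the daily flow fails.
    sns.publish(
        TopicArn="arn:aws:sns:ap-northeast-1:123456789012:etl-alerts",  # hypothetical
        Subject=f"ETL task failed: {task}",
        Message=str(exception),
    )
```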
