
Data ETL Framework
SoftBank Robotics, 2017
We use 30 services on the AWS Cloud Platform to deliver 100 GB of IoT data daily from diverse databases (MySQL, Cassandra, Salesforce, etc.) to the SoftBank Data Lake.
Details
The SBRE Data Lake
The Data Lake has accumulated about 85 TB of data since 2016 so that interested parties (the data science team, other SBR teams, or external partners) can access and analyze it.
The Data Sources
Data are fetched from four sources: Cassandra via API, MySQL via SQL queries, Salesforce via API, and DynamoDB via API. All Data Lake data are stored in AWS S3 buckets.
The Cloud Architecture
Every day, the Data Lake data are produced by 3 AWS EC2 clusters. For certain large steps, EMR clusters are triggered separately to ease the heavy load on the EC2 clusters. A Bastion host is used for security control.
The Luigi Flow
Around 30 to 40 tasks run every day to produce data from the different databases. We use Luigi to manage the dependencies among all of these tasks.
The Monitoring
A simple monitoring project has been built to raise alerts on any issues or crashes in the daily ETL process.
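The source does not describe how the monitoring works, so the following is only one possible sketch: it assumes that each daily task leaves an output file behind and that a missing file signals a problem. The function names `find_missing_outputs` and `check_daily_run`, and the use of a logger as the alert channel, are hypothetical.

```python
import logging
from pathlib import Path

logger = logging.getLogger("etl_monitor")


def find_missing_outputs(expected_paths):
    """Return the expected output paths that do not exist on disk."""
    return [p for p in expected_paths if not Path(p).exists()]


def check_daily_run(expected_paths, alert=logger.error):
    """Send one alert listing missing outputs; return True if none are missing."""
    missing = find_missing_outputs(expected_paths)
    if missing:
        # `alert` is any callable that accepts a message string,
        # for example a function that posts to a chat channel or sends email.
        alert("Daily ETL incomplete; missing outputs: " + ", ".join(missing))
    return not missing
```

Passing a different `alert` callable would let the same check report to another channel without changing the check itself.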