Sharing knowledge

Data Engineering in Cloud Environments


Zorica Karapancheva

Data Engineer at Data Masters

Cloud Computing

In the second part of our blog series on Data Engineering, we look at implementing Data Engineering in cloud environments.
Previously, we discussed the benefits of Data Engineering and how it facilitates a company's work. Since many companies store their data in the cloud, it is important to look at the possibilities this type of data administration offers.

The key processes of Data Engineering cover complete data administration: gathering data from all relevant sources, transforming the extracted data for further analysis, and creating a data warehouse with an optimal architecture. Additionally, depending on where and how the data is stored, Data Engineers need knowledge of on-premise database administration as well as cloud computing.

Modern technologies such as big data analytics and artificial intelligence demand great processing power and scalable warehouses.
Local infrastructures are limited in that regard, so cloud computing companies offer scalable alternatives.

Dell reports that companies that invest in Big Data, Cloud Computing, mobility, and security enjoy up to 53% faster revenue growth than their competitors.
Cloud platforms such as Amazon Web Services (AWS), Google Cloud Platform, and Microsoft Azure are the most widely used by companies on a global scale.
They provide different on-demand services, scalability and elasticity, easy maintenance, security, automation, and flexibility.

Amazon Web Services

Amazon Web Services (AWS) is one of the leading providers of cloud computing services, offering more than 200 different services to its clients. Because these services span so many domains, AWS can cover a company's complete cloud computing needs and the related business processes. The platform provides five categories of Data Engineering tools:

  • Data Ingestion Tools
  • Data Storage Tools
  • Data Integration Tools
  • Data Warehouse Tools
  • Data Visualization Tools

Data Ingestion Tools

Data ingestion is one of the key processes in Data Engineering. In this phase, we ingest raw data from heterogeneous sources, including different databases, mobile devices, sensors, etc. The goal is to create a data lake that contains all relevant data for a company. To ingest data from different types of sources, AWS offers Amazon Kinesis Firehose, AWS Snowball, and AWS Storage Gateway.

Amazon Kinesis Firehose transfers data directly to the warehouse in real time. Additionally, this tool can transform, encrypt, and compress the data before it is delivered.
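
As a rough illustration of sending a single event to a delivery stream, here is a minimal sketch in Python. It assumes a client object that exposes the `put_record` call of the boto3 Firehose client (for example `boto3.client("firehose")`). The stream name, the event fields, and the `RecordingClient` stand-in are hypothetical and exist only so the sketch runs without AWS credentials.

```python
import json


def encode_record(event: dict) -> bytes:
    """Serialize one event as newline-terminated JSON so records stay separable downstream."""
    return (json.dumps(event, separators=(",", ":"), sort_keys=True) + "\n").encode("utf-8")


def send_event(client, stream_name: str, event: dict) -> dict:
    """Send one event through a Firehose-style client and return its response."""
    return client.put_record(
        DeliveryStreamName=stream_name,
        Record={"Data": encode_record(event)},
    )


class RecordingClient:
    """Hypothetical stand-in that records calls instead of contacting AWS."""

    def __init__(self):
        self.calls = []

    def put_record(self, **kwargs):
        self.calls.append(kwargs)
        return {"RecordId": str(len(self.calls))}


if __name__ == "__main__":
    client = RecordingClient()
    send_event(client, "example-stream", {"sensor": "s-1", "temperature": 21.5})
    print(client.calls[0]["Record"]["Data"])
```

Passing the client in as a parameter keeps the encoding logic testable on its own. In a real setup, the stand-in would be replaced by an actual Firehose client.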

If the client's data is stored in on-premise databases, it can be moved to the cloud using AWS Snowball.

When data needs to be imported on a daily basis and is also needed for on-premise operations, AWS offers a service called AWS Storage Gateway.

Data Storage Tools

After data ingestion, the data needs to be stored in a data lake. AWS offers different solutions depending on the type of data and the way it is used. The most widely used storage service is Amazon S3, which allows building data lakes and storing large quantities of data in different forms. S3 is a scalable and fast solution in which the stored data can be organized as a data lake that is logically divided into buckets and folders. It integrates with other AWS services and allows an uninterrupted flow of data between them.
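
To make the idea of buckets and folders more concrete, the following sketch builds object keys for a raw zone of a data lake. The layout (`raw/<source>/<table>/year=/month=/day=`), the bucket name, and the function names are assumptions chosen for illustration, not a layout prescribed by AWS.

```python
from datetime import date


def raw_object_key(source: str, table: str, ingest_date: date, filename: str) -> str:
    """Build an object key; the slashes in the key are what appear as folders."""
    return (
        f"raw/{source}/{table}/"
        f"year={ingest_date:%Y}/month={ingest_date:%m}/day={ingest_date:%d}/"
        f"{filename}"
    )


def s3_uri(bucket: str, key: str) -> str:
    """Combine a bucket name and an object key into an s3:// URI."""
    return f"s3://{bucket}/{key}"


if __name__ == "__main__":
    key = raw_object_key("crm", "orders", date(2023, 1, 5), "part-0001.json")
    print(s3_uri("example-data-lake", key))
```

Keeping the source system and the ingestion date in the key makes it possible to read or reprocess one slice of the lake without scanning everything else.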

Data Integration Tools

Another important aspect of Data Engineering is data integration. The data integration services combine data from different sources through a centralized ETL (Extract, Transform, Load) process. Here the different data sources are analyzed, and the data is extracted and transformed in order to generate appropriate schemas for its use.
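
The three ETL steps can be sketched in plain Python. This is only an illustration of the pattern, not how any AWS service implements it. The two sample sources, their column names, and the target schema are invented for the example.

```python
import csv
import io
import json

# Two hypothetical sources that describe customers in different shapes.
CSV_SOURCE = "customer_id,full_name,signup\n1,Ana Petrova,2023-01-05\n"
JSON_SOURCE = '[{"id": 2, "name": "John Smith", "signup_date": "2023-02-10"}]'


def extract():
    """Read the raw rows from each source without changing them."""
    csv_rows = list(csv.DictReader(io.StringIO(CSV_SOURCE)))
    json_rows = json.loads(JSON_SOURCE)
    return csv_rows, json_rows


def transform(csv_rows, json_rows):
    """Map both shapes onto one schema: customer_id, name, signup_date."""
    unified = []
    for row in csv_rows:
        unified.append({
            "customer_id": int(row["customer_id"]),
            "name": row["full_name"],
            "signup_date": row["signup"],
        })
    for row in json_rows:
        unified.append({
            "customer_id": int(row["id"]),
            "name": row["name"],
            "signup_date": row["signup_date"],
        })
    return unified


def load(rows, target: list) -> int:
    """Append the standardized rows to the target and report how many were loaded."""
    target.extend(rows)
    return len(rows)


if __name__ == "__main__":
    warehouse = []
    load(transform(*extract()), warehouse)
    print(warehouse)
```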

We use AWS Glue as the key data integration service. Glue gathers data from different sources and transforms it. After the transformation, a data schema is generated that describes all the entities, attributes, and types, for subsequent loading into the data lake or warehouse. AWS Glue is a powerful service that offers many functionalities for data extraction and transformation in order to arrive at a standardized data schema. The service includes a Data Catalog, a centralized place for all metadata.

The AWS Glue functionalities can be summarized as follows: extracting and transforming data with Glue Jobs, generating a standardized schema with a Crawler, and creating a database using the generated Data Catalog.
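
To give a feel for what generating a schema from data means, here is a much simplified sketch of schema inference over a few records. It is not the Glue Crawler's algorithm. The type names (`bigint`, `double`, `string`, `boolean`) and the widening rules are assumptions made for the example.

```python
def infer_type(value) -> str:
    """Map a Python value to a column type name."""
    # bool is checked first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    return "string"


def infer_schema(records) -> dict:
    """Derive one type per column, widening when records disagree."""
    schema = {}
    for record in records:
        for column, value in record.items():
            if value is None:
                continue  # missing values carry no type information
            found = infer_type(value)
            previous = schema.get(column)
            if previous is None or previous == found:
                schema[column] = found
            elif {previous, found} == {"bigint", "double"}:
                schema[column] = "double"  # integers fit into a double column
            else:
                schema[column] = "string"  # fall back to the most general type
    return schema


if __name__ == "__main__":
    sample = [
        {"id": 1, "price": 10, "name": "a", "active": True},
        {"id": 2, "price": 9.5, "name": None, "active": False},
    ]
    print(infer_schema(sample))
```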

Data Warehouse Tools

The AWS services for building data warehouses generate a repository of structured data from different sources. The difference between these tools and an Amazon S3 data lake is that data lakes ingest data in its original form and for general use. In contrast, data warehouses store data for a specific need, using a standardized schema that is optimized for how the data is used.
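
The contrast can be sketched with two toy classes: a lake that stores whatever bytes it receives, and a warehouse that checks every row against a fixed schema before accepting it. Both classes and the sample schema are hypothetical and only illustrate the difference in behaviour.

```python
WAREHOUSE_SCHEMA = {"order_id": int, "amount": float, "currency": str}


class DataLake:
    """Keeps objects in their original form; no structure is enforced on write."""

    def __init__(self):
        self.objects = {}

    def put(self, key: str, payload: bytes) -> None:
        self.objects[key] = payload


class Warehouse:
    """Accepts only rows that match the declared schema."""

    def __init__(self, schema: dict):
        self.schema = schema
        self.rows = []

    def insert(self, row: dict) -> None:
        if set(row) != set(self.schema):
            raise ValueError(f"expected columns {sorted(self.schema)}, got {sorted(row)}")
        for column, expected_type in self.schema.items():
            if not isinstance(row[column], expected_type):
                raise TypeError(f"column {column!r} must be {expected_type.__name__}")
        self.rows.append(row)


if __name__ == "__main__":
    lake = DataLake()
    lake.put("raw/orders/1.json", b'{"order_id": "A-1", "note": "free text"}')

    warehouse = Warehouse(WAREHOUSE_SCHEMA)
    warehouse.insert({"order_id": 1, "amount": 19.99, "currency": "EUR"})
    print(len(lake.objects), len(warehouse.rows))
```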

Amazon Redshift is a solution that offers petabyte-scale storage for structured and semi-structured data. It uses a standardized data schema that is optimized for Business Intelligence analysis. With AWS Glue we extract the transformed data from Amazon S3 and load it into Amazon Redshift for further parallel processing of large quantities of data.

Data Visualization Tools

Data visualization tools include Business Intelligence tools that visualize the data for further use. Data from the data warehouse or data lake serves as input for these services, which can generate reports and graphs and give a general overview of the data.

From the broad range of AWS services, Amazon QuickSight is a tool that easily generates BI dashboards. It can be used on different devices, and its reports and graphs can be integrated into web applications and portals. It is also worth noting that, when doing Data Engineering on AWS, Amazon Redshift can be integrated with many other Business Intelligence and Business Analytics tools.

Use Case

Companies often generate different types of data from different sources, from which they need to draw key points and insights that have value for the business. As an example of Data Engineering in cloud environments, we will take a look at a case that Data Masters worked on for a client from the United Kingdom. The case study is available here.

The client company collected large amounts of data on a daily basis from several different sources. Poor data administration led to significant data losses and fed inconsistent information to their Artificial Intelligence models. The solution was built on AWS because the client was already actively using the AWS cloud platform.

In the first phase of the project, it was necessary to distinguish the different sources of data. The process started with extracting data from on-premise databases to Amazon S3. Considering the different sources and types of data, we created a hierarchy for storing raw data in Amazon S3. In this particular case, the main challenge arose from the multiple data formats and from data recorded at different granularities.
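
As an illustration of the granularity problem, the sketch below rolls fine-grained readings up to daily totals so they can be compared with a source that only reports per day. The sample data and the choice of summing the values are assumptions. The project itself may have aligned granularities differently.

```python
from collections import defaultdict
from datetime import datetime


def to_daily_totals(readings) -> dict:
    """Aggregate (ISO timestamp, value) pairs into one total per calendar day."""
    totals = defaultdict(float)
    for timestamp, value in readings:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        totals[day] += value
    return dict(totals)


if __name__ == "__main__":
    per_minute = [
        ("2023-01-05T10:00:00", 1.5),
        ("2023-01-05T10:01:00", 2.5),
        ("2023-01-06T00:00:00", 4.0),
    ]
    print(to_daily_totals(per_minute))
```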

Because standardized data schemas were needed, we decided to use AWS Glue. In addition to transforming data, Glue can extract data directly from the client's on-premise databases.

In the second phase, data extraction and transformation, we used AWS Glue. The service provides Glue Jobs, which are scripts that run on Spark or as plain Python. With these scripts, we connected to an on-premise database and loaded its data into Amazon S3. Other scripts extracted the data from Amazon S3, transformed it using Python DataFrames, and saved it as partitioned files, which is important for optimizing data loading. From the transformed data we generated tables within the service, since a Glue Crawler automatically generates the data schema.
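
What saving data as partitioned files looks like can be sketched with the standard library alone. This is a stand-in for the DataFrame-based Glue Jobs described above, not their actual code. The `column=value` directory naming, the file name `part-0000.csv`, and the sample rows are assumptions.

```python
import csv
import tempfile
from collections import defaultdict
from pathlib import Path


def write_partitioned(rows, partition_column: str, base_dir) -> list:
    """Write one CSV file per distinct partition value and return the file paths."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[partition_column]].append(row)

    written = []
    for value, group in sorted(groups.items()):
        partition_dir = Path(base_dir) / f"{partition_column}={value}"
        partition_dir.mkdir(parents=True, exist_ok=True)
        path = partition_dir / "part-0000.csv"
        # The partition value lives in the directory name, so it is left out of the file.
        columns = [name for name in group[0] if name != partition_column]
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=columns, extrasaction="ignore")
            writer.writeheader()
            writer.writerows(group)
        written.append(path)
    return written


if __name__ == "__main__":
    sample = [
        {"day": "2023-01-05", "id": 1},
        {"day": "2023-01-06", "id": 2},
        {"day": "2023-01-05", "id": 3},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        for path in write_partitioned(sample, "day", tmp):
            print(path.relative_to(tmp))
```

A query that filters on the partition column then only needs to read the matching directories, which is why partitioning speeds up loading.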

In the next phase of the project, we had partitioned tables of data that needed to be accessed with SQL queries, which is what Amazon Athena provides. Athena is integrated with the tables created in AWS Glue and runs the queries against them. For optimal and quick access to the data in this case, we built a database using the Data Vault methodology. Using SQL scripts in Amazon Athena, we generated the database structure from the tables created in the previous step. This methodology gave us a logical division of the data and avoided redundancies in the architecture.
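
The project built the Data Vault with SQL scripts in Athena. The Python sketch below only illustrates the underlying idea of separating business keys (a hub) from descriptive attributes (a satellite). Hashing the business key with MD5, the column names, and the sample rows are assumptions for the example, not details from the case study.

```python
import hashlib


def hash_key(*business_keys: str) -> str:
    """Derive a deterministic key from one or more normalized business keys."""
    normalized = "||".join(key.strip().upper() for key in business_keys)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def build_hub_and_satellite(rows, business_key: str, record_source: str):
    """Split rows into a hub (one entry per business key) and a satellite (attributes)."""
    hub = {}
    satellite = []
    for row in rows:
        key = hash_key(str(row[business_key]))
        # The hub stores each business key once, however often it occurs in the source.
        hub.setdefault(key, {
            "hub_key": key,
            business_key: row[business_key],
            "record_source": record_source,
        })
        attributes = {name: value for name, value in row.items() if name != business_key}
        satellite.append({"hub_key": key, **attributes})
    return list(hub.values()), satellite


if __name__ == "__main__":
    customers = [
        {"customer_no": "C-1", "city": "Skopje"},
        {"customer_no": "C-1", "city": "London"},
        {"customer_no": "C-2", "city": "Leeds"},
    ]
    hub_rows, satellite_rows = build_hub_and_satellite(customers, "customer_no", "crm")
    print(len(hub_rows), len(satellite_rows))
```

Because the hub holds each business key only once and the changing attributes live in the satellite, repeated source rows do not duplicate the key data.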

Following these steps, we built the Data Vault by extracting data from different sources and transforming it properly. Even though the process may seem complex, it can be automated with AWS Step Functions, which makes the developer's job easier.
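
A workflow of this kind is described to Step Functions as a state machine definition in JSON. The sketch below assembles such a definition in Python: a chain of Glue Jobs followed by one Athena query. The job names, the query, the output location, and the state names are hypothetical. The two `Resource` values are the Step Functions service integrations for starting a Glue job run and an Athena query and waiting for them to finish.

```python
import json

FINAL_STATE = "Build Data Vault"


def build_pipeline_definition(glue_jobs, athena_query: str, athena_output: str) -> dict:
    """Chain the given Glue jobs in order, then run one Athena query as the last step."""
    names = [f"Run {job}" for job in glue_jobs]
    states = {}
    for index, job in enumerate(glue_jobs):
        next_state = names[index + 1] if index + 1 < len(names) else FINAL_STATE
        states[names[index]] = {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": job},
            "Next": next_state,
        }
    states[FINAL_STATE] = {
        "Type": "Task",
        "Resource": "arn:aws:states:::athena:startQueryExecution.sync",
        "Parameters": {
            "QueryString": athena_query,
            "ResultConfiguration": {"OutputLocation": athena_output},
        },
        "End": True,
    }
    return {
        "Comment": "Illustrative pipeline: Glue jobs followed by an Athena query",
        "StartAt": names[0] if names else FINAL_STATE,
        "States": states,
    }


if __name__ == "__main__":
    definition = build_pipeline_definition(
        ["extract_onprem", "transform_raw"],
        "SELECT 1",
        "s3://example-bucket/athena-results/",
    )
    print(json.dumps(definition, indent=2))
```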

The solution we built with the aforementioned AWS services enabled the client to use its data optimally. The Data Vault methodology contributed to better results in the client's work; the client also used fewer data storage resources and saved money.

Advantages of Cloud Data Engineering

Proper data administration has many advantages: in addition to appropriate data storage, it grants optimized access to the data and correct extraction of key business value. Since companies generate large quantities of data on a daily basis, managing that data through cloud platforms is crucial and beneficial for them.

Data Engineering on cloud platforms provides scalable administration of large amounts of data, utilizing the advantages of the services built into those platforms. As the use case showed, just a few of the many available services provide a full data flow without manually managing the data warehouse (and without additional calculations). Proper data administration yields positive results, new business knowledge, and many benefits for companies.