Introduction
With the growing number of Internet-connected devices generating log data, there is a need for scalable solutions that can ingest, process, and analyze this time series data in real time. AWS provides managed services well-suited for building real-time analytics architectures to tackle these challenges.
In this blog post, we will explore an architecture that leverages AWS services to ingest log data from devices in real time, process it for insights, and store it for analytics and visualizations. We will discuss design considerations and best practices, as well as the advantages and disadvantages of this approach.
Challenge
The main challenge that organizations face when analyzing large volumes of data in real time is handling the scale of data ingestion and processing with low latency. Data is often generated continuously, and any delay in analysis can impact the ability to monitor for issues or take timely action. The inability to handle massive amounts of incoming data can also lead to data loss.
Solution
The solution aims to build a scalable, cost-effective data ingestion, processing, and analytics pipeline on AWS. Raw data is ingested from tracked devices and processed downstream to build real-time visualizations. It is also kept in long-term storage so that historical data can be analyzed for further insights over time.
The workflow is divided into several stages: ingestion, processing, storage, and visualization. Let's explore each one.
Data ingestion
Amazon Kinesis Data Streams continuously collects the raw data sent by the tracking devices and retains it for a specified period.
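As a rough illustration of the ingestion step, the sketch below shows how a device-side producer might shape a record before sending it to the stream. The stream name, device ID, and payload fields are hypothetical; the returned dictionary uses the parameter names of the Kinesis PutRecord API (StreamName, Data, PartitionKey). The actual send is left as a comment because it assumes boto3 and AWS credentials are available.

```python
import json
import time


def build_kinesis_record(stream_name, device_id, reading, timestamp_ms=None):
    """Build keyword arguments for a Kinesis Data Streams PutRecord call.

    The payload layout (device_id, timestamp_ms, plus the reading fields)
    is an assumption made for this example.
    """
    if timestamp_ms is None:
        timestamp_ms = int(time.time() * 1000)
    payload = {"device_id": device_id, "timestamp_ms": timestamp_ms, **reading}
    return {
        "StreamName": stream_name,
        # Data must be bytes; the trailing newline keeps records separable
        # if they are later concatenated into one file.
        "Data": (json.dumps(payload, sort_keys=True) + "\n").encode("utf-8"),
        # Records sharing a partition key go to the same shard, which keeps
        # the readings of one device in order.
        "PartitionKey": device_id,
    }


if __name__ == "__main__":
    record = build_kinesis_record(
        "device-logs", "device-42", {"heart_rate": 71}, timestamp_ms=1700000000000
    )
    print(record)
    # With boto3 installed and credentials configured, this could be sent with:
    #   boto3.client("kinesis").put_record(**record)
```

Using the device ID as the partition key is one possible choice; a stream with a few very chatty devices might need a different key to spread load across shards.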
Data processing
Amazon Managed Service for Apache Flink (AMSAF, formerly known as Amazon Kinesis Data Analytics) reads the data stream from Kinesis Data Streams in real time. Using Apache Flink, it performs operations such as cleaning and transforming the data and detecting anomalies. Another use case for AMSAF is to continuously enrich the data stream with context by integrating with S3 and leveraging the metadata stored there. For example, if a healthcare company is streaming data from medical devices to track patients' vital signs, an appropriate application of AMSAF would be to analyze the stream in real time for metrics that exceed a threshold and take appropriate action. The stream could also be enriched with context by mapping raw sensor readings to standardized medical terminology or codes, making the records easier for downstream systems to consume.
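The per-record logic described above (clean, check against a threshold, enrich) can be sketched in plain Python. This is not Flink API code; it only illustrates what a Flink operator in such a job might compute for each record. The threshold values, the metric names, and the code mapping are all made up for the example, and the in-memory dictionary stands in for metadata that the article suggests keeping in S3.

```python
# Hypothetical alert ranges per metric: (lowest normal value, highest normal value).
THRESHOLDS = {"heart_rate": (40, 130), "spo2": (90, 100)}

# Placeholder lookup table standing in for reference metadata loaded from S3.
# These codes are illustrative, not a real medical terminology.
METRIC_CODES = {"heart_rate": "HR", "spo2": "SPO2"}


def process_reading(reading):
    """Return an enriched copy of one reading, or None if it is malformed."""
    metric = reading.get("metric")
    value = reading.get("value")
    # Cleaning: drop readings with an unknown metric or a non-numeric value.
    if metric not in THRESHOLDS or not isinstance(value, (int, float)):
        return None
    low, high = THRESHOLDS[metric]
    return {
        **reading,
        "code": METRIC_CODES[metric],        # enrichment
        "alert": not (low <= value <= high),  # threshold check
    }


def process_stream(readings):
    """Lazily clean, enrich, and flag an iterable of readings."""
    for reading in readings:
        result = process_reading(reading)
        if result is not None:
            yield result


if __name__ == "__main__":
    sample = [
        {"device_id": "d1", "metric": "heart_rate", "value": 150},
        {"device_id": "d1", "metric": "spo2", "value": 97},
        {"device_id": "d2", "metric": "spo2", "value": "n/a"},
    ]
    for row in process_stream(sample):
        print(row)
```

A real job would typically replace the fixed ranges with windowed or per-patient baselines, which is where a stream processor such as Flink earns its place.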
Amazon Kinesis Data Firehose buffers incoming data from the data stream. Whenever the specified buffer size or buffering interval is reached, the data is delivered to an S3 bucket. Like AMSAF, Firehose can also transform the data, in its case by invoking a Lambda function that you specify, for example to convert raw JSON data into CSV or TSV. In contrast to AMSAF, however, Firehose delivers data only in near real time because of this buffering.
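A transformation Lambda for Firehose receives a batch of base64-encoded records and must return each one with its original recordId, a result status, and the re-encoded data. The sketch below converts JSON records to CSV lines along those lines. The field names in CSV_FIELDS are assumptions about the payload, not something the pipeline prescribes.

```python
import base64
import csv
import io
import json

# Assumed payload fields, in the column order wanted in the CSV output.
CSV_FIELDS = ["device_id", "timestamp_ms", "metric", "value"]


def to_csv_line(payload):
    """Serialize the selected fields of one JSON object as a single CSV line."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow(
        [payload[field] for field in CSV_FIELDS]
    )
    return buffer.getvalue()


def lambda_handler(event, context):
    """Firehose data-transformation handler: JSON in, CSV out."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            line = to_csv_line(payload)
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": base64.b64encode(line.encode("utf-8")).decode("utf-8"),
            })
        except (ValueError, KeyError, TypeError):
            # Malformed input: hand the record back unchanged and flag it,
            # so that Firehose can treat it as a failed record.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Returning "ProcessingFailed" instead of raising keeps one bad record from failing the whole batch; whether failed records should be dropped or inspected later is a design decision for the specific pipeline.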
AWS Glue runs the ETL jobs and is used for more complex transformations. A Glue crawler infers a schema from the data stored in S3 and records it in the Glue Data Catalog, which the ETL jobs then use to operate on the data. The ETL process gives the unstructured and semi-structured data from the data lake a defined structure and then loads it into the data warehouse. Going back to the healthcare company example, Glue can be used to encrypt or discard sensitive patient data by using some of its pre-built transformations. Note that if the data is encrypted in S3, the Glue service needs access to the KMS key to read and transform it. Glue ETL jobs can be triggered on a schedule or in response to an event, such as an object being uploaded to S3.
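To make the idea of handling sensitive fields during ETL more concrete, here is a small plain-Python sketch of the kind of row-level rule such a job might apply. It does not use the Glue API or Glue's built-in transforms. The field names are hypothetical, and salted SHA-256 hashing is used here as a simple stand-in for pseudonymizing an identifier; it is not encryption and cannot be reversed with a key.

```python
import hashlib

# Hypothetical field names for the healthcare example.
DROP_FIELDS = {"patient_name", "address"}  # discarded entirely
HASH_FIELDS = {"patient_id"}               # replaced by a salted hash


def redact(record, salt):
    """Return a copy of the record with sensitive fields dropped or hashed."""
    cleaned = {}
    for key, value in record.items():
        if key in DROP_FIELDS:
            continue
        if key in HASH_FIELDS:
            # The same input and salt always give the same digest, so rows
            # belonging to one patient can still be joined after redaction.
            salted = (salt + str(value)).encode("utf-8")
            cleaned[key] = hashlib.sha256(salted).hexdigest()
        else:
            cleaned[key] = value
    return cleaned


if __name__ == "__main__":
    row = {"patient_id": 7, "patient_name": "Jane Doe", "heart_rate": 70}
    print(redact(row, salt="example-salt"))
```

In practice the salt would be treated as a secret and stored outside the job code.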
Data storage
Amazon Timestream stores the stream of clean, processed, and transformed time series data, to be further leveraged for creating real-time dashboards and analysis.
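As a sketch of what writing to Timestream might look like, the snippet below maps one processed reading to the record shape used by the Timestream WriteRecords API (Dimensions, MeasureName, MeasureValue, MeasureValueType, Time, TimeUnit). The input field names are assumptions. Records are grouped into batches of at most 100, since WriteRecords limits how many records one call may carry; the database and table names in the comment are placeholders.

```python
def to_timestream_record(reading):
    """Map one processed reading to a Timestream record dictionary."""
    return {
        # Dimensions identify the series; here, one series per device.
        "Dimensions": [{"Name": "device_id", "Value": str(reading["device_id"])}],
        "MeasureName": reading["metric"],
        # Timestream expects measure values and timestamps as strings.
        "MeasureValue": str(reading["value"]),
        "MeasureValueType": "DOUBLE",
        "Time": str(reading["timestamp_ms"]),
        "TimeUnit": "MILLISECONDS",
    }


def batch(items, size=100):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    readings = [
        {"device_id": "d1", "metric": "heart_rate", "value": 71.5,
         "timestamp_ms": 1700000000000},
    ]
    records = [to_timestream_record(r) for r in readings]
    for chunk in batch(records):
        print(chunk)
        # With boto3 installed and credentials configured, a chunk could be written with:
        #   boto3.client("timestream-write").write_records(
        #       DatabaseName="devices", TableName="vitals", Records=chunk)
```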
Amazon Redshift provides storage for the large amounts of historical data that the devices produce over time. As previously mentioned, the ETL job transforms the data into the desired schema and loads it into the warehouse, where complex analytic queries are run. Afterward, the data can be further separated into different data marts and integrated with Business Intelligence (BI) tools such as Amazon QuickSight, Power BI, and Tableau.
Amazon S3 is used as a data lake or landing zone where all the raw data from the stream is stored over time. It houses the data before ETL jobs are run to transform it into the desired shape. However, it is more than temporary storage for data waiting to be picked up by Glue: many use cases call for a data lake where data is available in its raw, unstructured, or semi-structured format. One example is using Amazon SageMaker to build machine learning models directly from the data lake without needing to move the data elsewhere first. Tools like SageMaker Data Wrangler and SageMaker Clarify help prepare and explore the data stored in S3.
Data visualization & BI
Amazon Managed Service for Grafana supports time-series data sources and allows users to query and visualize the data and create alerts in real time. The dashboards show the metrics as they come in from the stream, so users can analyze the data and take timely action if necessary. In a healthcare context, doctors, nurses, and other medical personnel might need to observe patient vitals in real time and make critical decisions based on that information.
Amazon QuickSight uses the data from the data marts to create visualizations that help business analysts draw insights from their data. Since Amazon Redshift stores historical data, QuickSight dashboards can show trends over time, build forecasts, and model what-if scenarios with the available data.
It is important to note that a data warehouse (Amazon Redshift) and ETL jobs (AWS Glue) can sometimes be a disadvantage compared to querying the data directly in S3 with a service like Amazon Athena. The reason is the additional overhead: setting up ETL transformations, loading the data into the data warehouse, creating data marts by subsetting the data into logical units, and so on. Athena runs directly against data in S3, which makes the process simpler. Another reason to use Athena is to adhere to a completely serverless architecture. While Redshift does offer a serverless version, it still has some limitations compared to the provisioned version.
However, Redshift is better suited to complex analytics on large structured datasets, and managing metadata and schemas is easier overall because Redshift has its own catalog. The data warehouse provides high concurrency for thousands of simultaneous queries using its massively parallel processing (MPP) architecture. It remains a valid architectural choice, depending on the concrete business case and requirements.
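To give a feel for the Athena alternative discussed above, the sketch below prepares a request for Athena's StartQueryExecution API that computes hourly averages directly over data in S3. The table name (readings), its columns, the database name, and the results bucket are all hypothetical; the request keys (QueryString, QueryExecutionContext, ResultConfiguration) follow the API's parameter names. The query assumes a table over the raw data has already been defined in the catalog.

```python
# Hypothetical table and columns; timestamp_ms is assumed to hold epoch milliseconds.
HOURLY_AVG_SQL = """
SELECT device_id,
       date_trunc('hour', from_unixtime(timestamp_ms / 1000)) AS reading_hour,
       avg(value) AS avg_value
FROM readings
WHERE metric = 'heart_rate'
GROUP BY 1, 2
ORDER BY 1, 2
""".strip()


def build_athena_request(database, output_location):
    """Build keyword arguments for an Athena StartQueryExecution call."""
    return {
        "QueryString": HOURLY_AVG_SQL,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


if __name__ == "__main__":
    request = build_athena_request("device_lake", "s3://example-athena-results/")
    print(request["QueryString"])
    # With boto3 installed and credentials configured, the query could be started with:
    #   boto3.client("athena").start_query_execution(**request)
```

No cluster or load step is involved here, which is the simplicity argument for Athena; the same aggregation in Redshift would first require the ETL and load steps described earlier.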
Conclusion
In summary, as more and more devices generate data, there is a growing need for scalable solutions to handle the real-time ingestion, processing, and analysis of this data. Examples of such data are webpage clickstreams, server logs, sensors, medical equipment, etc. AWS offers services that are suited for creating real-time data processing and analytics pipelines to deal with these tasks.
This blog post has explored an architecture that uses AWS services to capture log data from devices in real time, build continuous visualizations, store the data over time, and perform further transformations and historical analysis to gain more complete insights.
About Several Clouds
At Several Clouds, we are passionate about the public cloud as well as the DevOps culture and practices. We assist in modernizing legacy systems and do cloud migrations to achieve more secure, adaptive, and cost-effective environments.
Our AWS architects all possess deep knowledge and extensive experience throughout the full cycle of building business cases, planning, architecting, implementation, and building playbooks and runbooks to help you with:
Cloud adoption and migrations
Cloud training and talent transformation
Building secure and compliant cloud environments
Implementing DevOps and DevSecOps practices
Big Data and Machine Learning
Serverless and Cloud-Native Development
Learn more at https://www.severalclouds.com/