– Presales Team
Introduction
Traditional extract, transform, load (ETL) solutions have, by necessity, evolved into real-time data streaming solutions as digital businesses have increased both the speed at which they execute transactions and the volume of data they need to share across systems.
Traditional ETL solutions fall short in the following areas:
- Real-time data ingestion
- Real-time data delivery
All the activities of the transformation phase of ETL, such as data cleansing, enrichment, and processing, need to be performed more frequently as the number of data sources and the volume of data skyrocket. Traditional batch ETL processes also make it harder to gain real-time business insights by feeding data into machine learning and AI algorithms.
Real-time data streaming using a framework like Apache Kafka offers the following advantages over batch ETL:
- Automatically extract, transform, and load data as continuous, real-time streams (a minimal sketch follows this list)
- Streamline operational efforts and reduce manual work
- Deliver up-to-date data, whether it comes from hundreds of millions of daily events across different devices, locations, cloud platforms, or physical servers
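The sketch below illustrates the first point under stated assumptions: a consumer that transforms and forwards events continuously rather than in batch windows. The topic names, broker address, placeholder transform, and the use of the kafka-python client are illustrative assumptions, not details of the solution described here.

```python
# Minimal sketch: continuous extract-transform-load with kafka-python.
# Topic names, broker address, and the transform step are assumptions.
import json

from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "raw-events",                                   # assumed source topic
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# The loop runs continuously: each event is transformed and loaded downstream
# as it arrives, with no batch window.
for message in consumer:
    event = message.value
    event["normalized_text"] = event.get("text", "").strip().lower()  # placeholder transform
    producer.send("transformed-events", value=event)
```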
Solution Overview
Abzooba has developed a real-time streaming framework to facilitate the ingestion and processing of streaming unstructured data.
The framework allows ingestion from several sources such as:
- Social Media sources
- Paid subscription sources
- Publications and News Articles, etc.
The framework comes with prebuilt processors to facilitate the following tasks (a chaining sketch follows this list):
- Clean & Ingest
- Remodel & Enrich
- Implement machine learning pipelines, such as a recommender pipeline
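As a rough illustration of how such prebuilt processors could be chained, the sketch below composes two hypothetical stages into a tiny pipeline. The processor names, interfaces, and sample document are assumptions and do not reflect the framework's actual API.

```python
# Hypothetical chaining of prebuilt processors; names and interfaces are assumptions.
from typing import Callable, Dict, List

Processor = Callable[[Dict], Dict]

def clean_and_ingest(doc: Dict) -> Dict:
    # Stand-in for the Clean & Ingest processor: trim whitespace.
    doc["text"] = doc.get("text", "").strip()
    return doc

def remodel_and_enrich(doc: Dict) -> Dict:
    # Stand-in for the Remodel & Enrich processor: derive simple metadata.
    doc["word_count"] = len(doc["text"].split())
    return doc

def run_pipeline(doc: Dict, stages: List[Processor]) -> Dict:
    # Apply each processor in order; new stages can be plugged in without
    # changing the existing ones.
    for stage in stages:
        doc = stage(doc)
    return doc

enriched = run_pipeline({"text": "  Breaking news from a paid feed  "},
                        [clean_and_ingest, remodel_and_enrich])
```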
The data streaming pipeline architecture additionally offers scalability in the following ways:
- Grow both horizontally and vertically to support a high number of users and a large volume of content
- Support multiple data science projects simultaneously
Architecture Overview
The key pillars of the architecture of the solution are as follows:
- Event Sourcing pipeline – Kafka
- Real-time streaming
- Ingestion and transformation of data
- Analytics processes, triggered directly from pipeline
- Storage Tiers (a routing sketch follows this list)
- Hot: RedisDB, In memory shared cache
- Warm: SQL Database, Document Database
- Cold: S3, Data Lake
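A minimal sketch of how a document might be routed across the three tiers is shown below. The client setup, key scheme, table, and bucket names are assumptions for illustration, with SQLite standing in for the warm SQL database; this is not the solution's actual persistence code.

```python
# Sketch of hot / warm / cold tier routing; all names and clients are assumptions.
import json
import sqlite3      # stand-in for the warm SQL / document database

import boto3        # cold tier: S3 / data lake
import redis        # hot tier: in-memory shared cache

hot = redis.Redis(host="localhost", port=6379)
warm = sqlite3.connect("warm_tier.db")
warm.execute("CREATE TABLE IF NOT EXISTS documents (id TEXT PRIMARY KEY, body TEXT)")
cold = boto3.client("s3")

def persist(doc: dict) -> None:
    payload = json.dumps(doc)
    # Hot: short-lived cache for data the current processes need immediately.
    hot.setex(f"doc:{doc['id']}", 3600, payload)
    # Warm: queryable store for recent, frequently accessed documents.
    warm.execute("INSERT OR REPLACE INTO documents VALUES (?, ?)", (doc["id"], payload))
    warm.commit()
    # Cold: durable, low-cost archive.
    cold.put_object(Bucket="data-lake-archive", Key=f"docs/{doc['id']}.json", Body=payload)
```

The time-to-live on the hot tier and the choice of warm and cold stores are where the tunable cost-versus-performance tradeoff noted below would be adjusted.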
The architecture was envisioned with the following considerations in mind:
- Right-sized, tunable tradeoff between cost and performance
- Horizontal and Vertical on-demand scaling of workloads
- Cloud agnostic platform, hybrid cloud capability
- Modularity, pluggability, and upgradeability
- Event Driven Architecture (EDA)
Event Driven Architecture enables organizations to fully leverage the power of real-time data streaming frameworks like Apache Kafka in the following ways (a decoupling sketch follows this list):
- EDA is particularly well suited to the loosely coupled structure of complex engineered systems. Components can remain autonomous, being capable of coupling and decoupling into different networks in response to different events. Thus, components can be used and reused by many different networks
- Since events are recorded as they occur, enterprises have access to all the data and context they need to make the best decisions
- EDA is scalable because it is implemented using a modern distributed, and fault-tolerant architecture. Loosely coupled computing nodes work together to form a cohesive event-processing engine with unlimited scale. This decoupled and distributed environment gives the organization the power and peace of mind to handle any set of workloads
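The sketch below illustrates the decoupling point with two independent consumer groups reacting to the same topic. The topic name, group names, and use of the kafka-python client are assumptions for illustration.

```python
# Sketch of EDA decoupling: independent consumer groups over one event stream.
# Topic and group names are assumptions.
from kafka import KafkaConsumer

def build_consumer(group_id: str) -> KafkaConsumer:
    return KafkaConsumer(
        "article-events",
        bootstrap_servers="localhost:9092",
        group_id=group_id,              # each group tracks its own offsets
        auto_offset_reset="earliest",   # events are recorded, so history can be replayed
    )

analytics_consumer = build_consumer("analytics-service")
indexing_consumer = build_consumer("search-indexer")

# Either service can be added, removed, or scaled out (more consumers in its
# group) without any change to the producer or to the other service.
```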
Ingestion & Streaming of Unstructured Data:
- Events are first ingested from raw data sources and fed into a Kafka stream to be processed by the standardization module
- The standardization module standardizes the data, since it is picked up from multiple sources in multiple formats
- Standardized data is then analyzed by Cognitive Data Enrichment, which comprises Named Entity Recognition, Document Relevancy, and Topic Mapping processes; the outputs from these processes are fed to corresponding Kafka streams (K1, K2, …, K10), as sketched below
- The Document Assembler combines all three outputs from the Kafka streams and pushes the document for further enrichment
- Enriched data is received by the Cognitive Content Coordinator process, which persists the data in S3, a SQL database, and Redis2
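The sketch below illustrates the enrichment fan-out step: a standardized document yields one message per enrichment output on separate Kafka streams, which the Document Assembler can later join by document id. The topic names (K1, K2, K3 here), placeholder enrichment results, and broker address are assumptions for illustration.

```python
# Sketch of the Cognitive Data Enrichment fan-out; topics and results are assumptions.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def enrich_and_fan_out(doc: dict) -> None:
    doc_id = doc["id"]
    # Placeholder results; real NER, relevancy, and topic-mapping models go here.
    producer.send("K1-named-entities", {"id": doc_id, "entities": []})
    producer.send("K2-document-relevancy", {"id": doc_id, "relevancy": 0.0})
    producer.send("K3-topic-mapping", {"id": doc_id, "topics": []})
    producer.flush()   # the assembler downstream joins the three streams on "id"
```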
Machine Learning Pipeline (Recommender pipeline for article similarity):
- In parallel, the normalized data is ingested and processed by the Document Similarity and Document Clustering processes
- The article is updated with the similarity score and clustering information as those results are received by the content coordinator
- The Cognitive API process is invoked by the user interface to fetch recommendations for users (see the sketch after this list)
- Once the data is received by the Recommender, it is cached in memory (Redis2) to serve recommendations quickly
- The Cognitive API captures the click-stream data and previous recommendations for the user and records them in Kafka streams, which in turn are ingested by the Cognitive User Coordinator process
- The Cognitive User Coordinator persists the data in the storage layers for faster recommendations
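The sketch below illustrates the recommendation read path and click-stream capture described above. The cache key scheme, topic name, and local Redis and Kafka endpoints are assumptions for illustration.

```python
# Sketch of the Cognitive API read path and click-stream capture; names are assumptions.
import json

import redis
from kafka import KafkaProducer

cache = redis.Redis(host="localhost", port=6379)    # in-memory cache (Redis2 in the architecture)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def get_recommendations(user_id: str) -> list:
    # Fast path: recommendations cached by the Recommender are served from memory.
    cached = cache.get(f"recs:{user_id}")
    return json.loads(cached) if cached else []

def record_click(user_id: str, article_id: str) -> None:
    # Click-stream event recorded for the Cognitive User Coordinator to ingest.
    producer.send("user-clickstream", {"user_id": user_id, "article_id": article_id})
```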
Benefits
- Data is persisted in three different layers, according to the relevancy and needs of the processes currently being executed, delivering cost benefits without impacting performance
- Processes have been split based on their computational and persistence requirements, freeing CPU and memory for the processes that need them most
- Processes like the content and user coordinators manage persistence and latency, thereby reducing memory and CPU usage
- EDA provides options to scale the solution with the volume of articles to be processed and the number of end users