Building a Data Mining Solution
– Narender Kumar
Let’s assume we have a huge amount of sales data in our data lake. Finding and downloading one specific slice of it, for example the records where milk sales exceeded 500 litres in January 2019, can be very difficult.
In this blog, we will learn how to build a scalable solution with a simple GUI that digs into the huge volume of data stored in a data lake and returns the specific data we need.
We can develop the GUI as a web page that takes inputs such as product type, sales limit, and date, and returns a URL that we can click to download the matching data.
We implemented this solution using Azure services, but similar services from any cloud provider or open-source project can be used as well.
1. Metadata:
First, we need to create metadata about the data we have. We need to gather the information that can be used to map the items in the GUI to the underlying data. We can create a few Hive tables to hold this metadata.
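To make the idea of a metadata row concrete, here is a minimal Python sketch. The schema (path, product, month_start, max_sale_ltr) and the file paths are assumptions made up for illustration; the actual Hive tables may look quite different.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class FileMetadata:
    """One metadata row describing a file in the data lake (hypothetical schema)."""
    path: str            # location of the file in the data lake
    product: str         # product type the file covers
    month_start: date    # first day of the month the file covers
    max_sale_ltr: float  # highest single sale recorded in the file, in litres


def matches(row, product, min_sale_ltr, year, month):
    """Return True if the file may contain sales above the limit for that month."""
    return (
        row.product == product
        and row.max_sale_ltr > min_sale_ltr
        and row.month_start.year == year
        and row.month_start.month == month
    )


# Illustrative catalog; the rows are invented for this example.
catalog = [
    FileMetadata("/sales/2019/01/milk.parquet", "milk", date(2019, 1, 1), 720.0),
    FileMetadata("/sales/2019/01/bread.parquet", "bread", date(2019, 1, 1), 310.0),
    FileMetadata("/sales/2019/02/milk.parquet", "milk", date(2019, 2, 1), 480.0),
]

# Milk sales above 500 litres in January 2019.
hits = [row.path for row in catalog if matches(row, "milk", 500, 2019, 1)]
print(hits)
```

The point of the sketch is that the GUI fields (product, limit, date) map directly onto metadata columns, so a search never has to scan the raw files.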
2. Indexes of metadata:
We need to create indexes on the metadata so that we can query it quickly and present the results in the GUI.
We can push the metadata to an Azure SQL server and create indexes on top of it using Azure Search. Azure Search provides a REST API that web services can use to query the metadata.
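As a rough sketch of what a query against the Azure Search REST API could look like, the Python below builds (but does not send) a search request with an OData filter. The service name, index name, field names, and the placeholder key are all assumptions for illustration, not values from our deployment.

```python
import json
import urllib.request

API_VERSION = "2020-06-30"  # one published version of the Search REST API


def build_filter(product, min_sale_ltr, year, month):
    """Build an OData filter for one product, a sales limit, and one month.

    The field names (product, max_sale_ltr, month_start) are hypothetical.
    """
    start = f"{year:04d}-{month:02d}-01T00:00:00Z"
    next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
    end = f"{next_year:04d}-{next_month:02d}-01T00:00:00Z"
    safe_product = product.replace("'", "''")  # escape quotes for OData strings
    return (
        f"product eq '{safe_product}' and max_sale_ltr gt {min_sale_ltr} "
        f"and month_start ge {start} and month_start lt {end}"
    )


def build_search_request(service, index, api_key, odata_filter):
    """Prepare a POST request for the index's docs/search endpoint."""
    url = (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/search?api-version={API_VERSION}"
    )
    body = json.dumps(
        {"search": "*", "filter": odata_filter, "select": "path"}
    ).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, method="POST", headers=headers)


# Hypothetical service and index names; the key is a placeholder.
request = build_search_request(
    "my-service", "sales-metadata", "<query-key>",
    build_filter("milk", 500, 2019, 1),
)
print(request.full_url)
print(request.data.decode("utf-8"))
```

Passing the prepared request to urllib.request.urlopen would send it; the matching documents come back as JSON under the "value" key.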
3. GUI and web jobs:
We can create a web page that takes the user’s search criteria and returns downloadable links. The web service takes the user inputs and queries the indexes in Azure Search using the REST API. Based on the results from Azure Search, we can run a job that copies the relevant data from the data lake to an Azure blob. Azure Active Directory can be used for authentication.
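The flow from search results to download links can be sketched in Python as below. This is only an outline under stated assumptions: the "path" field and the copy_to_blob callable are hypothetical, and the real copy step (data lake to blob, plus generating a link) is deliberately left as an injected function rather than a specific SDK call.

```python
from typing import Callable, Iterable, List


def run_download_job(hits: Iterable[dict],
                     copy_to_blob: Callable[[str], str]) -> List[str]:
    """Copy each matching file once and collect its download link.

    hits: search results, each with a hypothetical "path" field.
    copy_to_blob: copies one data lake path to blob storage and returns a URL.
    """
    links = []
    seen = set()
    for hit in hits:
        path = hit["path"]
        if path in seen:  # skip duplicates so a file is copied only once
            continue
        seen.add(path)
        links.append(copy_to_blob(path))
    return links


def fake_copy(path: str) -> str:
    """Stand-in for the real copy step; it only fabricates a placeholder URL."""
    return "https://example.invalid/downloads" + path


links = run_download_job(
    [{"path": "/sales/2019/01/milk.parquet"},
     {"path": "/sales/2019/01/milk.parquet"}],
    fake_copy,
)
print(links)
```

Keeping the copy step behind a callable lets the web job be tested without touching storage, and lets the storage backend be swapped for another provider.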
4. Usage Insights:
We can use Azure Application Insights and Log Analytics to build application usage dashboards that show how our application is being used and which data is searched for most frequently.
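To illustrate the kind of question such a dashboard answers, the Python sketch below logs each search as a structured JSON line and counts the most frequently searched products. The event shape and field names are assumptions for illustration; in the actual setup the telemetry service would do the collection and aggregation.

```python
import json
from collections import Counter


def search_event(user, criteria):
    """Serialize one search as a JSON log line (hypothetical event shape)."""
    return json.dumps(
        {"event": "search", "user": user, "criteria": criteria},
        sort_keys=True,
    )


def top_products(log_lines, n=3):
    """Return the n most frequently searched products with their counts."""
    counts = Counter(
        json.loads(line)["criteria"]["product"] for line in log_lines
    )
    return counts.most_common(n)


# Invented sample events.
log = [
    search_event("u1", {"product": "milk"}),
    search_event("u2", {"product": "milk"}),
    search_event("u1", {"product": "bread"}),
]
print(top_products(log, 1))
```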
With this architecture, we got a solution that any user can work with, even without a technical background. In our case, it has greatly helped our data scientists improve their efficiency and save time. They can now spend their valuable time analyzing the specific data instead of digging through the data lake to find it.