– xpresso Team
Introduction to Data Management
Data Management can be seen as the practice of handling the end-to-end lifecycle of data, from its creation to retirement. This includes obtaining, validating, storing, and analyzing the required data to ensure it is readily available and reliable. Typically, we deal with enterprise data that is characterized by wide variations in volume, velocity, variety, and veracity.
Volume refers to the size of the data sets that are analyzed and processed; for example, all credit card transactions within a continent constitute a high-volume data set. Velocity refers to the speed at which data is produced; for example, social media posts. Variety refers to the number of sources and the diversity of the data, whether structured, semi-structured, or unstructured; for example, CCTV audio and video files generated at various locations in a city. Veracity refers to the quality of the data and the provenance of its sources. High-veracity data is valuable for analysis and contributes meaningfully to the overall results, whereas low-veracity data contains noise and can reduce analysis accuracy. Handling such data calls for distributed processing techniques and advanced tools (analytics and specialist algorithms) rather than traditional storage and processing capabilities.
A typical ML project has to go through various steps before actually building the model, viz. fetching, exploring, visualizing, and versioning the data. Let’s take a look at some of these steps in greater detail below.
Obtaining data from diverse sources requires high-end data connectivity. Efficient data connectivity enables relevant, personalized engagements for businesses with their consumers. It also connects variegated data sets and applications, including data from different ecosystems. With data connectivity, every consumer interaction can be actionable, pertinent, and measurable. A major part of our data transformation and enterprise MLOps journey, orchestrated by xpresso.ai, involves setting up the required infrastructure and collecting data on a continual basis from diverse sources.
Around a decade ago, these data sources were limited to databases and Excel files. Today, we derive our data from relational and non-relational databases, diverse flat files, spreadsheets, social networks, website content, scanned documents including PDFs, file systems, images, audio (including podcasts), and video clips; the list goes on. As data grows more complex, the data scientist has little time to focus on how to scrape data sources and derive the meaningful information needed for analysis. Standard connectors, typically maintained by data engineering, are used for scraping databases and file systems. However, enterprises have to contend with an ever-increasing list of data sources, so the need arises for a uniform, expandable mechanism that can handle new sources quickly.
xpresso.ai has specialized Python libraries and connectors which support both structured and unstructured data, enabling access to diverse data sources.
How does it work?
You begin by importing the xpresso.ai libraries.
Structured Dataset objects reflect data contained in a single Excel sheet, database table, or CSV file.
Unstructured Dataset objects reflect data contained in a set of binary files (e.g., images, videos, etc.).
Databases currently supported: MySQL, MS SQL Server, MongoDB, and Cassandra
Here is sample code to import data from an MS SQL database:
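The original code listing is not reproduced in this excerpt. Purely as a sketch of the pattern such a connector follows (connect, query a table, materialize the rows as a structured Dataset object), here is a self-contained Python example. It uses the standard-library `sqlite3` module as a stand-in for an MS SQL connection, and the `StructuredDataset` class and `import_table` function are hypothetical illustrations, not the xpresso.ai SDK.

```python
import sqlite3


# Hypothetical stand-in for a structured Dataset object:
# the contents of a single database table, held as a list of dicts.
class StructuredDataset:
    def __init__(self, rows):
        self.rows = rows

    def __len__(self):
        return len(self.rows)


def import_table(conn, table):
    """Pull every row of `table` into a StructuredDataset."""
    conn.row_factory = sqlite3.Row  # rows become name-addressable
    cursor = conn.execute(f"SELECT * FROM {table}")
    return StructuredDataset([dict(r) for r in cursor.fetchall()])


# Demo: an in-memory SQLite database stands in for the MS SQL server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
dataset = import_table(conn, "sales")
print(len(dataset), dataset.rows[0]["amount"])  # 2 9.5
```

Against a real MS SQL Server the connection line would be replaced by the appropriate driver call, but the import pattern stays the same.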
File systems currently supported: NFS and HDFS (support for AWS, GCP, and Azure is under development)
Here is more sample code, to import data from a file system:
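Again, the original listing is not reproduced here. As a hedged sketch of the unstructured-data pattern (scan a mount point, collect a set of binary files into a Dataset object), here is a self-contained example; a temporary directory stands in for an NFS/HDFS mount, and `UnstructuredDataset` and `import_files` are hypothetical names, not the xpresso.ai SDK.

```python
import tempfile
from pathlib import Path


# Hypothetical stand-in for an unstructured Dataset object:
# a set of binary files collected from a file-system path.
class UnstructuredDataset:
    def __init__(self, files):
        self.files = files  # list of (name, bytes) pairs


def import_files(root, pattern="*"):
    """Read every file under `root` matching `pattern` into the dataset."""
    paths = sorted(Path(root).glob(pattern))
    return UnstructuredDataset([(p.name, p.read_bytes()) for p in paths])


# Demo: a temporary directory stands in for an NFS/HDFS mount point.
with tempfile.TemporaryDirectory() as root:
    Path(root, "clip1.bin").write_bytes(b"\x00\x01")
    Path(root, "clip2.bin").write_bytes(b"\x02\x03\x04")
    dataset = import_files(root, "*.bin")
    print([name for name, _ in dataset.files])  # ['clip1.bin', 'clip2.bin']
```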
Data sources handled include:
- Relational databases (MySQL, SQL Server, Oracle, PostgreSQL); NoSQL databases (Cassandra, MongoDB)
- Social media – Twitter, Instagram, YouTube, Facebook (structured data); LinkedIn, Glassdoor (unstructured data)
- Cloud file systems – AWS S3, etc.
- FTP / SFTP servers
- Applications such as Salesforce, JIRA, etc.
- Big Data sources such as cloud (private, public, third-party platforms), web, IoT
After obtaining the data, a data scientist can explore, visualize, transform (e.g., clean) the data, extract features from it, and version it. These activities often have to be automated to accommodate prescheduled runs. If the data source is updated, an automatic trigger can also be fired through continuous monitoring. xpresso.ai provides tools for all these activities, available either in code or as intuitive elements in a drag-and-drop interface.
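The steps above — clean, extract features, version — can be sketched as a small pipeline. The function names below are illustrative, not the xpresso.ai API; the versioning step tags the data with a content hash so that a rerun on identical data reproduces the same version id, which is one common (assumed) approach to data versioning.

```python
import hashlib
import json


def clean(rows):
    """Transform step: drop rows with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def extract_features(rows):
    """Feature step: add a derived column."""
    return [{**r, "amount_sq": r["amount"] ** 2} for r in rows]


def version(rows):
    """Versioning step: deterministic content hash of the data,
    so identical data always yields the same version id."""
    payload = json.dumps(rows, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]


raw = [{"amount": 2.0}, {"amount": None}, {"amount": 3.0}]
cleaned = clean(raw)
featured = extract_features(cleaned)
tag = version(featured)
print(len(featured), featured[0]["amount_sq"], len(tag))  # 2 4.0 12
```

In a scheduled run, the same chain would simply be re-executed on the refreshed source, and the version tag would change only if the data did.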
In addition, these runs need to be stored for error tracking and auditing. xpresso.ai supports all these features, saving the analyst the tedious work of coding and monitoring such pipelines. Thus, numerous internal and external data feeds within enterprises can be processed simultaneously. A wide variety of data, arriving at high velocity and in huge volumes, can be seamlessly merged and consolidated. All of this is achievable at a fraction of the usual cost and in much less time, with the help of connectors and reusable components.