– xpresso Team
What is Data Versioning?
Versioning is linked to a change due to reprocessing, correcting or appending additional data in the structure, contents, or condition of the entity in question – be it documents, software, pieces of code, data science models or any other collection of information.
Specifically, data versioning refers to the process of uniquely identifying data, similar to categorizing code, to enable pulling and using any version of the data as required. In simpler words, it is a method to track changes associated with fluctuating data.
Why Data Versioning?
When data is identified uniquely, data scientists can determine whether and how data has changed, which version of a dataset they are working with, and understand if a newer version of a dataset is available. Also, with proper versioning, data scientists can understand data provenance, what processing step caused the data to change, and how that propagates across the data processing pipeline. Explicit versioning allows for repeatability in experiments, enables comparisons, and prevents confusion.
Let us understand in greater detail why data versioning is important:
- Ensure better training data: ML comprises rapid experimentation and iteration, and training models on data. Thus, training on incorrect data can have disastrous results for the outcomes of a ML project.
- Track data schema: Enterprise data is usually obtained in batches and often minor changes in the ML schema are applied over the course of a project. With proper versioning, you can easily track and evolve the data schema over time. You can also understand whether these changes are backward and forward compatible.
- Continual model training: In production environments, data is refreshed periodically and may trigger a new execution of the model training pipeline. When such automated retraining occurs, it is important to have data versioned for tracking model efficacy.
- Enhance traceability and reproducibility: Data scientists must be able to track, identify the provenance of data, and point out which version of a dataset reinforces the outcomes of their experiments. They should be able to re-run the entire ML pipeline and reproduce the exact results each time as it is critical input for the modelling process. Thus, it is imperative that the original training data is always available. Hence, from a reproducibility / traceability perspective, proper versioning is critical.
- Auditing: Proper versioning ensures that integrity of data-based activities is upheld by identifying when modifications are made. By monitoring and analyzing activities of both users and models, intentional and accidental lapses in user behavior can be identified. Data science auditors can thus analyze the effect of data changes on model accuracy, and determine best ML practices for the enterprise.
How it works?
Data versioning libraries, available with xpresso.ai, enable you to version control your datasets. These datasets and models are stored in a customized data repository. The xpresso.ai Control Center enables you to view the stored datasets and models via the ‘Data Version’ explorer, thus, providing a rich yet simplified user experience.
Whenever we create a new project in xpresso.ai, a repo for data versioning with the same name as the project is also created. That means as a developer or pm, one can only access a repo if he has either developer or owner permission to its corresponding project.
Based on how you want to use the Data Versioning features, it can be accessed either through the Control Center GUI or using a Python library.
Let us take a look at how it is done using python libraries. Every feature related to versioning are available by importing the VersionControllerFactory library.
The various methods supported under this library are:
- List repositories
- Create branch
- Push Dataset
- List commits
- List Datasets
- Pull Dataset
We also provide a very interactive user interface to access the versioning system using our Control center.
Using xpresso.ai Control center, you can view repo and datasets, create branch, push dataset and even download them in to your local system.
Similar to managing code versions in Bitbucket or Git, models, training sets, and test datasets can be automatically versioned, pushed, pulled, listed, and compared easily using xpresso.ai. Thus, for example, if the training data goes through various transformation and feature extraction stages prior to training, data scientists can maintain versions of each of these stages. They can even experiment with different transformation techniques, versioning the output of each, prior to deciding which one works best for them.
So, whether you are using data to visualize and explore, or clean, or transform to use within your machine learning algorithms, it is advisable to always version your data for better traceability, reproducibility and audit. xpresso.ai has a very strong set of features that will help you achieve this.