
25.05.2022


MLOps: How DVC smartly manages your data sets for training your machine learning models on top of Git


This article belongs to a series about MLOps tools and practices for data and model experiment tracking. In the first part, we explained why data and model experiment tracking is important, and how tools like DVC and MLflow can address this challenge. Today, we’ll see how Data Version Control (DVC) smartly manages your data sets for training your machine learning models on top of Git.

By Samson ZHANG, Data Scientist at LittleBigCode

What are we talking about? DVC is an MLOps tool that works on top of Git repositories and has a command line interface and workflow similar to Git's. It is designed to tackle the challenge of data set traceability and reproducibility when training data-driven models.

Why do we need DVC?

All data-driven models require data to be trained. Creating and managing the data sets used for training takes a lot of time and storage space. Depending on the project, there can be up to thousands of versions of the data set used to train the models. This can quickly become muddled when multiple users alter and update the data, which can greatly jeopardize the traceability and reproducibility of experiments.

In your career as a data scientist, you have probably run into data version tracking issues while exploring and cleaning a data set, just like me.

For instance, I often worked on computer vision problems with thousands of image and annotation files. Counting the raw noisy data, the cleaned data and the preprocessed data, there are already 3 different versions to keep. And that is without even keeping track of the results of intermediate processing steps!

Without DVC, a possible approach would be zipping files and storing hashes (file content checksum), and locations in Git commits. The data set would be fully duplicated for each version. It would be complicated to update and to keep track of. Just imagine the work you would have to do each and every time you have new data to add or wrong labels to correct!

This iterative process on the data set can be applied to many data science projects and it is not scalable without proper tools.

DVC was created to handle exactly this iterative process in an efficient way.

Why use DVC for data version management instead of other tools such as Git or MLflow?

MLflow is not designed to track a large number of large files (for instance, thousands of images), as it does not optimize storage against file duplication. Tracking data set versions with MLflow would be inefficient. MLflow itself does not guarantee the reproducibility of a data set used during an experiment run, unless you save the whole data set during each run, which is not scalable.

Git is unsuited to versioning large files in general (especially data sets). Furthermore, saving your data set with your source code can be a serious security risk, as anybody who works on the code can access potentially sensitive data (even worse for a public Git repository).

Those are the main reasons for using an additional type of tool for data versioning, such as DVC, to improve your ML experiment tracking. DVC complements MLflow and Git in order to provide a complete ML tracking experience.

Technically, DVC is a file-versioning tool that can work with any type of data (image, text, video) because it simply saves files. That does not mean it is well suited to versioning complex data types for ML purposes, such as large video files, because DVC tracks file versions with whole-file hashes (content checksums). For instance, modifying a few seconds of a 1h-long video file (several GB) results in 2 full 1h-long video files being stored, which implies a lot of duplication.
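To make the video example concrete, here is a minimal sketch of whole-file checksumming. It is not DVC code: the `checksum` helper and the byte strings standing in for a large file are assumptions made for illustration, and MD5 is used only as an example of a content checksum.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum of the whole file content (illustrative helper, not a DVC API)."""
    return hashlib.md5(data).hexdigest()


original = bytes(1_000_000)             # stand-in for a large video file
buffer = bytearray(original)
buffer[500_000:500_010] = b"\xff" * 10  # a tiny local modification
edited = bytes(buffer)

# Count how many bytes actually differ between the two versions.
changed = sum(a != b for a, b in zip(original, edited))

print(checksum(original) == checksum(edited))  # False
print(changed, "of", len(original), "bytes differ")
```

Only 10 bytes differ, yet the two checksums are unrelated, so a tool that identifies files by whole-file checksum has to keep both versions in full.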

How does it work?

You can version data sets in your Git repository by storing only small *.dvc metafiles (text) tracked by Git commits (cf. figure 1). Like Git, DVC optimizes versioning by storing only the minimal quantity of information needed to describe the data across all the data set versions in a repository. The same file appearing in multiple data set versions is stored only once.

Figure 1. How DVC works. Source: DVC.org

Project structure

A DVC repository is a Git repository that tracks DVC files. Setting up a DVC repository and versioning data with it is easy.

Let’s take a look at the composition of a DVC repository:

For a Git repository to also be a DVC repository, only 2 elements are needed:

► a .dvc/ subdirectory at the project’s root. This directory mainly contains customizable config files. By default, it also contains the DVC repository cache;

► *.dvc files

DVC files (*.dvc) are the entry points for versioning data. They are metafiles used by DVC to point to the data in a storage space. DVC files and a URI of the data storage space (local file system, AWS, Azure, GCP…) are the only information needed for versioning data sets. You can think of *.dvc files as indexes: they are light and easily versionable addresses that point to the actual data stored in a better-suited storage space (cloud, local remote storage).

This means that *.dvc files have to be tracked by Git in order to track different versions of a data set. Conversely, if the .dvc file pointing to a data set version is not tracked by Git, that version can become inaccessible (the storage is not designed to be accessed without .dvc files), even though the data still exists in the storage.
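The pointer idea can be sketched in a few lines. This is a simplified stand-in, not DVC's actual metafile format or code: the `track` and `restore` helpers, the `.pointer.json` suffix and the file names are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, storage: Path) -> Path:
    """Copy data_file into storage under its checksum and write a small pointer file."""
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    storage.mkdir(parents=True, exist_ok=True)
    shutil.copy(data_file, storage / digest)
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_file.name, "md5": digest}))
    return pointer


def restore(pointer: Path, storage: Path) -> Path:
    """Rebuild the data file next to its pointer from the stored content."""
    meta = json.loads(pointer.read_text())
    target = pointer.with_name(meta["path"])
    shutil.copy(storage / meta["md5"], target)
    return target


def demo() -> dict:
    with tempfile.TemporaryDirectory() as tmp:
        workspace = Path(tmp) / "workspace"
        storage = Path(tmp) / "storage"
        workspace.mkdir()
        data = workspace / "labels.csv"
        data.write_text("image,label\ncat1.jpg,cat\n")

        pointer = track(data, storage)
        data.unlink()  # the workspace copy is gone, only the pointer remains
        restored = restore(pointer, storage)
        return {
            "pointer_bytes": pointer.stat().st_size,
            "restored_text": restored.read_text(),
            "stored_objects": len(list(storage.iterdir())),
        }


result = demo()
print(result)
```

The pointer file is the only thing a version control system would need to track. If it is lost, the content is still in the storage directory, but nothing says which stored object belongs to which file.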

Basic commands

Like Git, DVC is configurable (remote storage, scope) and has “add”, “push”, “pull”, “checkout” commands for managing your data files. DVC is compatible with all the main cloud providers: Google Cloud, Microsoft Azure and AWS S3, and it does not have any infrastructure requirements.

How DVC manages data set versions and avoids duplication

The local DVC cache (see “DVC Cache structure” in the DVC documentation) contains all the versioned data sets without file duplicates between versions. This cache can be anywhere on the local system. The cached files are made available in the Git repository workspace through a user-specified file link type (copy, reflink, hardlink or symlink), so that the project can access them.

By default, DVC tries reflink and falls back to copy, so on most file systems the copy strategy is used. For more details about the file link types, check out the dedicated section “Configure your cache” of this article.

Figure 2. DVC workflow, cache and storage

In order to explain how DVC cache optimizes storage space by avoiding files duplication between different versions of a data set, let us look at an example:

Figure 3. DVC cache workflow

Each “dvc add” command records a new version of the data set in the cache (cf. figure 3). Each file is saved only once (for any version), and the absence of duplicates is ensured by comparing file checksums between versions. An internal database maps each file to each data version it belongs to.
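The deduplication described above can be illustrated with a toy content-addressed cache. This is a minimal sketch under stated assumptions, not DVC's implementation: the `DedupCache` class, the version names and the file contents are hypothetical.

```python
import hashlib


class DedupCache:
    """Toy content-addressed cache: one blob per distinct file content."""

    def __init__(self):
        self.blobs = {}     # checksum -> file content
        self.versions = {}  # version name -> {relative path: checksum}

    def add(self, version: str, files: dict) -> None:
        """Record a data set version, storing only contents not seen before."""
        manifest = {}
        for path, content in files.items():
            digest = hashlib.md5(content).hexdigest()
            self.blobs.setdefault(digest, content)
            manifest[path] = digest
        self.versions[version] = manifest

    def checkout(self, version: str) -> dict:
        """Rebuild the files of a version from its manifest."""
        return {p: self.blobs[d] for p, d in self.versions[version].items()}


v1 = {"cat1.jpg": b"cat-1", "dog1.jpg": b"dog-1", "labels.csv": b"cat1,dog1"}
# v2 adds one image and modifies the labels file; the other files are unchanged.
v2 = {**v1, "cat2.jpg": b"cat-2", "labels.csv": b"cat1,dog1,cat2"}

cache = DedupCache()
cache.add("v1", v1)
cache.add("v2", v2)

naive_copies = len(v1) + len(v2)
print(len(cache.blobs), "blobs stored instead of", naive_copies)
```

The two unchanged images are stored once and shared by both versions, while each version can still be rebuilt in full from its manifest.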

Even though DVC is built on top of Git, DVC does not have a history system like Git. DVC itself handles no explicit branching logic or commit dependencies. DVC only reasons about file presence and content, by checking hashes, to determine data versions. The dependency logic between data versions is handled by the Git history. This means that you can create different data set versions on different Git branches, and DVC maps each file ever tracked in the repository to every commit or branch that tracks it.

DVC best practices

After some experimentation with DVC, there are a few good practices I think one should adopt when using it:

• Use DVC only for data-related tasks such as data set versioning and data processing routines, not for logging experiment metrics and model weights. Even though the DVC documentation indicates that its versioning capability supports those features, DVC, unlike MLflow, is not designed to compare the performance of experiment runs without many tricks and without saving unnecessary files to the code base.

• Use a data registry whenever possible in order to centralize data sets that can be shared between different projects. A DVC data registry is basically a Git repository that only contains DVC files (no code) and that can version as many data sets as your organization has. An example of a data registry setup is developed in the next section of this article.

• Unless you explicitly want to share your project’s DVC configuration, such as a remote storage URL for a data registry, never use the shared project configuration (.dvc/config). Prefer your project’s private configuration, .dvc/config.local, by passing the --local argument to your configuration-modifying commands. Most of the time, your configuration depends on your local workspace (cache location/type) and you might need to use secrets for cloud remote storage (Azure credentials,…). There is no reason to use .dvc/config for that.

• Write descriptive Git commits when versioning data sets; otherwise it can become hard to track meaningful changes in the data sets. This applies to software engineering in general.

• Configure the DVC repository cache and avoid the default settings when possible. Use an external cache if you have limited storage in your Git repository workspace. If you do not need to edit your DVC-tracked files in place, change your cache link type from copy (default) to reflink, hardlink or symlink to save space (see “Large Dataset Optimization” in the DVC documentation). Most of the time, the best cache configuration is {reflink,hardlink,symlink} with an external cache directory on a large disk, for a hardware setup that combines a small SSD with a large HDD/SSD.

• Use DVC Git hooks to automate common routines (post-checkout, pre-commit, pre-push). In a DVC repository, use the “dvc install” command to set up the hooks.

Go further in setting up your DVC projects!

Set up a data set registry

DVC is simple to use as it is a thin layer over Git repositories. One can directly use an existing Git repository, build a DVC repository on top of it and version data for that Git project. That Git repository would then be the main entry point for accessing data set versions. This can be enough for experimenting by yourself on a single project, but it quickly becomes inefficient when you want to share the data sets with other projects. In practice, data sets are often reused in multiple projects. This is why you should set up a data registry (cf. figure 4) whenever possible.

Figure 4. Data registry in DVC. Source : Data Version Control · DVC

1 • Let’s start from scratch. First create a new Git repository and initialize the DVC repository on top of it with “dvc init”. I recommend creating a new conda environment for it. (Highly recommended) Configure the Git hooks for DVC with “dvc install”.

2 • Set up the remote storage for your DVC repository with “dvc remote add”. It can be either a local file storage or a cloud remote storage. (Optional) Install the DVC dependencies for cloud remote storage if necessary (for example, for Azure).

For local file storage, adding the remote modifies the content of .dvc/config in your repository, which represents your project’s shared configuration. You can check the “dvc config” documentation for more configuration options.

For Azure storage, assuming you have sufficient credentials, the setup modifies your .dvc/config file for the remote URL and .dvc/config.local, which stores the credentials only locally. You can find other examples for setting up your Azure storage in the “dvc remote modify” documentation, but it is recommended to use a SAS token.

3 • Commit your remote storage config.

4 • Download your first data set version to track with DVC. Here the data is downloaded from a public DVC data registry using the DVC CLI, but you can retrieve data however you want.

5 • Track your data set with DVC and commit the *.dvc files with Git to version your data set. Check the content of cats_vs_dogs.dvc: it contains the metadata DVC needs in order to track your data set in the remote.

6 • Push your data set to the remote storage. At this point, you can already use this DVC repository in other projects. If you set up a cloud remote storage and push your Git repository to GitHub/GitLab, you can even share your data registry with anybody.

7 • Modify your data set by adding new images and create a new data set version (like steps 3, 4 and 5). If you check the status of your DVC repository with “dvc status”, you will be informed of the changes, so add and commit them.

8 • Your data set registry is set up. Import or use data from a DVC data registry with the “dvc get” and “dvc import” commands.
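As a rough recap of steps 1 to 6 for a local file storage, the sketch below lists the corresponding CLI calls without executing anything. The remote name, the storage path and the commit messages are hypothetical placeholders, not values from this article, and step 4 is left out because it depends on where your data comes from.

```python
import shlex

# Hypothetical placeholders: adapt them to your own setup.
REMOTE_NAME = "storage"
REMOTE_PATH = "/path/to/dvc-storage"
DATASET = "cats_vs_dogs"


def registry_setup_commands(remote_name=REMOTE_NAME,
                            remote_path=REMOTE_PATH,
                            dataset=DATASET):
    """Return the CLI calls for steps 1 to 6 as argument lists (nothing is run)."""
    return [
        ["git", "init"],                                           # step 1
        ["dvc", "init"],
        ["dvc", "install"],                                        # Git hooks
        ["git", "commit", "-m", "Initialize DVC"],
        ["dvc", "remote", "add", "-d", remote_name, remote_path],  # step 2
        ["git", "add", ".dvc/config"],                             # step 3
        ["git", "commit", "-m", "Configure DVC remote storage"],
        ["dvc", "add", dataset],                                   # step 5
        ["git", "add", f"{dataset}.dvc", ".gitignore"],
        ["git", "commit", "-m", f"Track {dataset} with DVC"],
        ["dvc", "push"],                                           # step 6
    ]


commands = registry_setup_commands()
for cmd in commands:
    print(shlex.join(cmd))
```

Each argument list can be passed to `subprocess.run(cmd, check=True)` from the repository root once Git and DVC are installed. Step 7 repeats the “dvc add”, “git commit” and “dvc push” calls after the data set has changed.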

Configure your cache

The DVC cache is a content-addressable storage (by default in .dvc/cache), which adds a layer of indirection between code and data (cf. DVC Cache structure). It is the DVC cache that stores the data sets in the local working environment.

Cache type

Figure 5. The impact of using different file link types on local storage space used with a 100 GB data set

Reflink, hardlink and symlink file link types are particularly useful when you do not want the DVC project cache in the same subdirectory as your source code, or when you are short on space in your workspace partition (cf. figure 5 and “Large Dataset Optimization” in the DVC documentation). These link types save space because they avoid having your DVC-tracked data set duplicated both in your cache and in your local Git repository. Note that you will usually not work in an environment that provides reflink.

Most of the time, for large data sets (>1 GB), you will want to use reflink, hardlink or symlink, although in-place editing is not available with hardlink and symlink. Depending on the task, you might need to edit the data in place, for instance for manual data preprocessing or label fixes. In that case, prefer an editable link type such as reflink or copy over hardlink and symlink.
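The difference between the link types can be observed with plain file system calls. The sketch below is independent of DVC and assumes a POSIX system where hard links and symbolic links are allowed. The file names are hypothetical, and reflink is left out because it requires a file system that supports it.

```python
import os
import shutil
import tempfile
from pathlib import Path


def demo_links():
    """Compare copy, hardlink and symlink views of a single cached file."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cached = root / "cache_blob"
        cached.write_bytes(b"original")

        copy, hard, sym = root / "ws_copy", root / "ws_hard", root / "ws_sym"
        shutil.copy(cached, copy)  # independent duplicate: uses extra space
        os.link(cached, hard)      # second name for the same data on disk
        os.symlink(cached, sym)    # small file that redirects to the cache

        shares_data = {p.name: os.path.samefile(cached, p) for p in (copy, hard, sym)}

        copy.write_bytes(b"edited")  # editing the copy in place
        cache_after_copy_edit = cached.read_bytes()

        hard.write_bytes(b"edited")  # editing through the hard link
        cache_after_hard_edit = cached.read_bytes()

        return shares_data, cache_after_copy_edit, cache_after_hard_edit


shares_data, after_copy, after_hard = demo_links()
print(shares_data)
print(after_copy, after_hard)
```

Editing the copy leaves the cached content untouched, whereas writing through the hard link silently changes the cached file itself. This illustrates why in-place editing is not offered for hardlink and symlink.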

You can always switch between different cache types depending on the stage of your machine learning project.

For instance, if you need to manually edit your data in place, switch back to an editable link type and re-link the workspace files with “dvc checkout”.

When you no longer need to edit your data set manually, or when you want to move on to other stages of your project (training, evaluation), simply switch to one of the lighter link types to save space.

Cache location

Most of the time, we run machine learning projects on a laptop or desktop with several physical disks and limited resources: typically an SSD with fast I/O where your code lives, and an HDD with large storage space for your data sets.

By default, the cache is located inside your Git repository (in .dvc/cache).

If you configured the copy cache type for your project, your cache can take up a large share of your disk. You may need to change your cache type to a lighter one: reflink, hardlink or symlink.

Or you can simply move your cache to another partition that has more resources.
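The cache settings discussed in this section map to a handful of CLI calls. The sketch below only builds the command lines and does not run them. The `cache_commands` helper and the cache path are hypothetical, and the --local flag follows the earlier recommendation to keep workspace-specific settings out of .dvc/config.

```python
import shlex


def cache_commands(link_types=None, cache_dir=None):
    """Build the DVC calls that change the cache location and/or link type."""
    cmds = []
    if cache_dir is not None:
        # Point the repository at a cache directory on another partition.
        cmds.append(["dvc", "cache", "dir", "--local", cache_dir])
    if link_types:
        # Link types are tried in order, e.g. reflink, then hardlink, then symlink.
        cmds.append(["dvc", "config", "--local", "cache.type", ",".join(link_types)])
        # Re-create the workspace files with the newly configured link type.
        cmds.append(["dvc", "checkout", "--relink"])
    return cmds


light = cache_commands(link_types=["reflink", "hardlink", "symlink"],
                       cache_dir="/mnt/large_disk/dvc-cache")
editable = cache_commands(link_types=["reflink", "copy"])

for cmd in light + editable:
    print(shlex.join(cmd))
```

As far as I know, changing the cache directory only updates the configuration, so files already in the old cache may have to be moved separately.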

About data set merge conflicts with DVC

DVC tracks data set versions by leveraging Git’s version control functionality. When it comes to merge conflicts, DVC has no built-in conflict resolution feature, so it also relies on Git for conflict resolution. As we know, only *.dvc files are tracked with Git, which means that when merge conflicts occur, only the metafiles are compared. This is generally insufficient, because the data set versions themselves are not compared. Merging two branches can therefore produce a conflict in a *.dvc file.

The 3 situations we will usually face

To illustrate, let’s say that two people, P1 and P2, are working together on a machine learning project with an image data set. P1 works on branch B1 and P2 works on branch B2. The initial data set data/ (tracked by data.dvc) they are working on is version D1:

• First situation: only one of P1 and P2 changes the data set. P1 changes the data set and creates a version D2 of the data set on branch B1. P2 finishes working on a new feature and has not modified data set D1. P2 needs to merge B1 into B2 and resolve the conflict caused by the difference between the data set versions. As only one of the branches modified the original data set D1, P2 can simply replace their version of data.dvc (on branch B2) with the version from branch B1.

• Second situation: both P1 and P2 only add images to the data set, with no overlap. P1 only adds new images to data set D1 and creates D2 on B1. P2 also only adds new images to data set D1 and creates D3 on B2. In addition, the subsets of images they each added are disjoint. In this case, the merge can rely on the Git merge driver described in the DVC documentation (Merge conflicts, append-only data sets).

• Third situation: both P1 and P2 change the data set (removals, additions, modifications). P1 changes the data set and creates a version D2 of the data set on branch B1. P2 also changed data set D1 and created a version D3 on branch B2. P2 needs to merge B1 into B2 and resolve the conflict caused by the difference between data set versions D2 and D3. Here, no assumption is made about the type of modification: there can be removals, additions and modifications of any file in the data set. If you really want to merge all the modifications from both branches, this is the most complicated situation. Neither Git nor DVC can help directly. You need to merge the data sets manually.
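The three situations can be told apart by comparing file listings. The sketch below is not a DVC feature: it is a hypothetical helper that performs a three-way comparison of {path: checksum} manifests, where `base` stands for D1, `ours` for the version on B2 and `theirs` for the version on B1. All names and checksum values are placeholders.

```python
def merge_manifests(base: dict, ours: dict, theirs: dict):
    """Three-way merge of {path: checksum} manifests.

    Returns (merged, conflicts). Conflicting paths are left out of `merged`
    and have to be resolved by hand.
    """
    merged, conflicts = {}, []
    for path in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (including "both removed it")
            winner = o
        elif o == b:      # only their side changed this path
            winner = t
        elif t == b:      # only our side changed this path
            winner = o
        else:             # both sides changed it differently
            conflicts.append(path)
            continue
        if winner is not None:  # None means the file was removed
            merged[path] = winner
    return merged, conflicts


d1 = {"cat1.jpg": "h1", "dog1.jpg": "h2"}

# First situation: only one branch changed the data set.
d2 = {**d1, "cat2.jpg": "h3"}
first = merge_manifests(d1, d1, d2)

# Second situation: both branches only add disjoint files.
d3 = {**d1, "dog2.jpg": "h4"}
second = merge_manifests(d1, d3, d2)

# Third situation: both branches modify the same file differently.
third = merge_manifests(d1, {**d1, "cat1.jpg": "h5"}, {**d1, "cat1.jpg": "h6"})

print(first, second, third, sep="\n")
```

In the first two situations the merged manifest is unambiguous (the changed version, or the union of the additions). In the third, the helper can only report which files need a manual decision.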

About hyperparameter tracking with DVC

DVC can also track metrics files and hyperparameters, but MLflow is better suited for this. For example, DVC can track experiment results by committing metrics files (to DVC and the Git repository) and compare different versions (different commits) of a metrics file using DVC Studio. However, one usually wants to keep the number of tools as low as possible and to avoid duplicating results across several tools, such as the Git repository (DVC) and a remote server (MLflow).

In addition, DVC and MLflow take different approaches to versioning metrics:

DVC tracks experiment metrics with a commit made after training the model and generating the results;
MLflow tracks experiment results using the current commit that was used to train the models.

DVC’s approach to metrics tracking is lighter than MLflow’s, as DVC saves metrics files directly in the Git repository, whereas MLflow saves metrics on a remote server. That said, the two approaches conflict when used at the same time for metrics tracking: MLflow would track an experiment at commit N and DVC would track its results at commit N+1, which is not practical. MLflow has the advantage of an autologging feature that makes metrics tracking easy and transparent (in addition to logging artifacts, among other features). With DVC, one would have to handle logging metrics to files manually, which is inconvenient.

For the sake of simplicity, and because this series of articles focuses on experiment tracking using DVC and MLflow, I would recommend using MLflow rather than DVC for metrics tracking. However, it is worth knowing that tracking experiment metrics with DVC is technically possible (with some effort).

Conclusion

DVC is a lightweight file version control tool built on top of Git’s version control capabilities and designed to version data sets. It has an optimized cache system that avoids file duplication between different data set versions. Using a third-party tool such as DVC makes it possible to decouple the raw data sets used to train machine learning models from the code, by committing small metafiles that describe the data sets and are tracked by a Git repository. DVC can also be used for data preprocessing pipelines. Its data registry feature is particularly useful for managing the sharing of data sets between different data science projects.

A summary of DVC’s features, pros and cons:

This article belongs to a series of articles about MLOps tools and practices for data and model experiment tracking. The series consists of four articles:

PART 1: Introduction to data and model experiment tracking

PART 2 (this article): MLOps: How DVC smartly manages your data sets for training your machine learning models on top of Git

PART 3 (available soon): MLOps: How MLflow effortlessly tracks your experiments and helps you compare them

PART 4 (available soon): Use case: Effortlessly track your model experiments with DVC and MLflow

Feel free to read the other articles if you are already familiar with the concepts!
