Transverse/Data Science Tools

Document classification tool

The client's Data Scientists were reinventing the wheel on every text classification project: the same preprocessing steps, the same models, the same functions rewritten each time. We developed a generic Python package, interpretable and accessible to non-specialists, which is now used as the standard on several projects.

Problem

A significant share of the client's Data Science projects relied on classifying text documents. With each new project, the Data Scientists repeated the same steps: preprocessing, vectorization, training, evaluation. The code was neither shared nor reusable, which slowed down each project, created inconsistencies between approaches, and left the results unusable by non-technical users. These common building blocks needed to be industrialized into a generic, reliable and accessible tool.

Solution

What we built

We assigned one Data Scientist to design and publish a Python document classification package, usable by Data Scientists and by less technical users alike.

Step 1 — Preprocessing pipelines. Development of standardized, configurable pipelines for cleaning and transforming text data, covering all the standard NLP preprocessing steps.
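As an illustration of what a configurable cleaning step can look like, here is a minimal sketch using only the Python standard library. The names `CleaningConfig` and `clean_text`, and the specific switches, are assumptions made for this example; the package's actual options are not described here.

```python
import re
import unicodedata
from dataclasses import dataclass
from typing import Optional


@dataclass
class CleaningConfig:
    """Hypothetical switches for the cleaning steps (illustrative only)."""
    lowercase: bool = True
    strip_accents: bool = True
    remove_punctuation: bool = True
    stopwords: frozenset = frozenset()


def clean_text(text: str, config: Optional[CleaningConfig] = None) -> str:
    """Apply the enabled cleaning steps in a fixed order and return the result."""
    config = config or CleaningConfig()
    if config.lowercase:
        text = text.lower()
    if config.strip_accents:
        # Decompose accented characters, then drop the combining marks.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
    if config.remove_punctuation:
        # Replace anything that is neither a word character nor whitespace.
        text = re.sub(r"[^\w\s]", " ", text)
    # Splitting and re-joining also collapses repeated whitespace.
    tokens = [tok for tok in text.split() if tok not in config.stopwords]
    return " ".join(tokens)


example = clean_text(
    "Résumé: the FINAL report!",
    CleaningConfig(stopwords=frozenset({"the"})),
)
```

Keeping every switch in one configuration object is one way to let a non-specialist turn steps on or off without touching the cleaning code itself.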

Step 2 — Scikit-learn integration. The package is built natively on scikit-learn objects to ensure compatibility with the existing ecosystem. Users can customize each stage of the pipeline without breaking the rest.
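The sketch below shows, under stated assumptions, how building on scikit-learn objects makes individual stages swappable. `TextCleaner`, `build_pipeline`, the step names and the toy documents are hypothetical and do not come from the package; only the scikit-learn classes and `Pipeline.set_params` are real library API.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


class TextCleaner(BaseEstimator, TransformerMixin):
    """Hypothetical cleaning stage: normalizes whitespace, optionally lowercases."""

    def __init__(self, lowercase=True):
        self.lowercase = lowercase

    def fit(self, X, y=None):
        return self  # stateless: nothing to learn

    def transform(self, X):
        cleaned = [" ".join(doc.split()) for doc in X]
        return [doc.lower() for doc in cleaned] if self.lowercase else cleaned


def build_pipeline():
    """Default pipeline: cleaning, TF-IDF vectorization, linear classifier."""
    return Pipeline([
        ("clean", TextCleaner()),
        ("vectorize", TfidfVectorizer()),
        ("classify", LogisticRegression(max_iter=1000)),
    ])


# Toy data, invented for the example.
docs = [
    "Invoice for consulting services",
    "Payment due on invoice 42",
    "Employment contract terms",
    "Contract signed by both parties",
]
labels = ["invoice", "invoice", "contract", "contract"]

pipeline = build_pipeline()
# Swap one stage and tune another without rebuilding the pipeline.
pipeline.set_params(classify=MultinomialNB(), vectorize__ngram_range=(1, 2))
pipeline.fit(docs, labels)
prediction = pipeline.predict(["invoice payment"])[0]
```

Because each stage follows the standard `fit`/`transform` contract, replacing the classifier leaves the cleaning and vectorization stages untouched.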

Step 3 — Interpretability and optimization. Integration of an interpretability module (LIME, SHAP) to explain predictions, and of a hyperparameter search module (Optuna) to automatically optimize model performance.

Step 4 — Publication and adoption. Internal publication of the package, demonstrations to the Data Scientist teams, and iterative addition of features based on the needs reported by users.
