Document classification tool
The client's data scientists were reinventing the wheel on every text classification project: the same preprocessing steps, the same models, the same functions rewritten each time. We developed a generic Python package, interpretable and accessible to non-specialists, which is now used as standard on several projects.

Problem
A significant share of the client's data science projects involved classifying text documents. On each new project, the data scientists repeated the same steps: preprocessing, vectorization, training, evaluation. The code was neither shared nor reusable, which slowed down each project, created inconsistencies between approaches, and left the results out of reach of non-technical users. These common building blocks needed to be industrialized into a generic, reliable and accessible tool.

Solution
What we built
We assigned one data scientist to design and publish a Python package for document classification, intended for data scientists and less technical users alike.
Step 1: Preprocessing pipelines. Standardized, configurable pipelines for cleaning and transforming text data, covering all the classic NLP steps.
Step 2: scikit-learn integration. The package builds directly on scikit-learn objects to stay compatible with the existing ecosystem. Users can customize each stage of the pipeline without breaking the rest.
Step 3: Interpretability and optimization. An interpretability module (LIME, SHAP) explains predictions, and a hyperparameter search module (Optuna) automatically tunes model performance.
Step 4: Publication and adoption. The package was published internally and demonstrated to the data science teams, with features added iteratively in response to user feedback.
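To give a feel for what Steps 1 and 2 can look like in practice, here is a minimal sketch of a configurable cleaning step plugged into a standard scikit-learn pipeline. It is not the client's package: the names `clean_text`, `clean_corpus` and `build_pipeline` are hypothetical, and the choice of TF-IDF features with logistic regression is an assumption made for illustration only.

```python
import re
import unicodedata

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def clean_text(text, lowercase=True, strip_accents=True):
    """Hypothetical cleaning step: normalize case, accents and whitespace."""
    if lowercase:
        text = text.lower()
    if strip_accents:
        # Decompose accented characters, then drop the combining marks.
        text = unicodedata.normalize("NFKD", text)
        text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Replace punctuation with spaces, then collapse repeated whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(texts, lowercase=True, strip_accents=True):
    """Apply clean_text to every document in a list of strings."""
    return [clean_text(t, lowercase, strip_accents) for t in texts]


def build_pipeline():
    """Assemble cleaning, vectorization and classification as named stages."""
    return Pipeline([
        # Illustrative options; kw_args makes the cleaning step configurable.
        ("clean", FunctionTransformer(
            clean_corpus,
            kw_args={"lowercase": True, "strip_accents": True},
        )),
        ("vectorize", TfidfVectorizer()),
        ("classify", LogisticRegression(max_iter=1000)),
    ])


if __name__ == "__main__":
    # Tiny made-up corpus, for demonstration only.
    docs = [
        "Invoice for your order",
        "Payment receipt attached",
        "Meeting agenda for Monday",
        "Agenda: team meeting notes",
    ]
    labels = ["finance", "finance", "meeting", "meeting"]

    pipeline = build_pipeline()
    # Customize a single stage by name without touching the others.
    pipeline.set_params(vectorize__ngram_range=(1, 2))
    pipeline.fit(docs, labels)
    print(pipeline.predict(["Agenda for the next meeting"]))
```

Because each stage is a named scikit-learn step, a user can change one option with `set_params` (here the n-gram range of the vectorizer) or swap a whole stage, and the rest of the pipeline keeps working unchanged.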