Cross-functional / Data Science Tools

Document classification tool

The client's Data Scientists were reinventing the wheel on every text classification project: the same preprocessing steps, the same models, the same functions rewritten each time. We developed a generic Python package, interpretable and accessible to non-specialists, that is now the standard on several projects.

Problem

A large share of the client's Data Science projects relied on classifying text documents. On each new project, the Data Scientists repeated the same steps: preprocessing, vectorization, training, evaluation. The code was neither shared nor reusable, which slowed down every project, created inconsistencies between approaches, and left the results out of reach for non-technical staff. These common building blocks needed to be industrialized into a generic, reliable and accessible tool.

Solution

What we built

We assigned one Data Scientist to design and publish a Python document classification package, intended for Data Scientists and less technical users alike.

Step 1 — Preprocessing pipelines. Development of standardized, configurable pipelines for cleaning and transforming text data, covering all the classic NLP steps.
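To make the idea of a configurable cleaning pipeline concrete, here is a minimal sketch using only the Python standard library. It is not the package's actual code: the function names (strip_accents, build_pipeline, make_remove_stopwords) and the toy stopword list are hypothetical, chosen for illustration.

```python
import re
import unicodedata
from typing import Callable, Iterable, List

# Hypothetical convention: a preprocessing step maps a string to a string.
Step = Callable[[str], str]


def strip_accents(text: str) -> str:
    """Remove diacritics (for example 'é' becomes 'e') via Unicode decomposition."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lowercase(text: str) -> str:
    return text.lower()


def remove_punctuation(text: str) -> str:
    """Replace every character that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", " ", text)


def collapse_whitespace(text: str) -> str:
    return re.sub(r"\s+", " ", text).strip()


def make_remove_stopwords(stopwords: Iterable[str]) -> Step:
    """Build a step that drops the given tokens (configurable per project)."""
    stopset = frozenset(stopwords)

    def remove(text: str) -> str:
        return " ".join(tok for tok in text.split() if tok not in stopset)

    return remove


def build_pipeline(steps: List[Step]) -> Step:
    """Chain steps in order; the list itself is the configuration."""

    def run(text: str) -> str:
        for step in steps:
            text = step(text)
        return text

    return run


clean = build_pipeline([
    strip_accents,
    lowercase,
    remove_punctuation,
    collapse_whitespace,
    make_remove_stopwords({"the", "a", "of"}),  # toy stopword list
])

print(clean("  The Résumé of a  Data-Scientist!! "))  # resume data scientist
```

Because the pipeline is just an ordered list of functions, a project can drop, reorder or add steps without touching the others.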

Step 2 — Scikit-learn integration. The package builds natively on scikit-learn objects to ensure compatibility with the existing ecosystem. Users can customize each stage of the pipeline without breaking the rest.
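As an illustration of what building on scikit-learn objects makes possible, the sketch below assembles a text classifier as a standard Pipeline and then customizes a single stage with set_params. The factory name make_text_classifier, the step names and the toy documents are assumptions made for this example, not the package's real interface.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def make_text_classifier() -> Pipeline:
    """Hypothetical default pipeline: TF-IDF features, then a linear model."""
    return Pipeline([
        ("vectorizer", TfidfVectorizer(lowercase=True)),
        ("classifier", LogisticRegression(max_iter=1000)),
    ])


# Toy corpus, invented for the example.
docs = [
    "invoice payment due amount",
    "payment received invoice total",
    "meeting agenda for monday",
    "agenda and minutes of the meeting",
]
labels = ["finance", "finance", "admin", "admin"]

# Default configuration.
model = make_text_classifier()
model.fit(docs, labels)

# Customize two stages by name; the rest of the pipeline is untouched.
custom = make_text_classifier().set_params(
    vectorizer__ngram_range=(1, 2),  # nested parameter of one step
    classifier=LinearSVC(),          # swap the whole estimator
)
custom.fit(docs, labels)

print(model.predict(["invoice amount total"]))
print(custom.predict(["meeting agenda"]))
```

Since the result is an ordinary scikit-learn estimator, it also works with existing tools such as cross_val_score or GridSearchCV.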

Step 3 — Interpretability and optimization. Integration of an interpretability module (LIME, SHAP) to explain predictions, and of a hyperparameter search module (Optuna) to optimize model performance automatically.

Step 4 — Publication and adoption. Internal publication of the package, demonstrations to the Data Science teams, and iterative addition of features based on needs reported by users.

Projects in the same category

Supply Chain Optimization Application

A pharmaceutical distribution player had to rethink its entire supply chain: pharmacy assortment, inventory management and delivery channels. Operational research alone was no longer enough. We built the application that transformed its operations.

Social media trend detection

In a market where consumer behaviors change faster than decision cycles, a client needed to anticipate trends instead of reacting to them. We built the platform that turns social media noise into actionable signals.

Yield of agricultural fields

The climate subsidiary of a French insurance leader needed to predict field yields across Germany to price a new drought insurance offer. Internal data was not enough. We built the predictive models that made the product marketable.

Customer needs analysis platform

A large French group needed to understand the current and future needs of its customers by aggregating a massive volume of consumer data. After an 8-week POC, we industrialized a complete platform that is deployed across all French entities and is now being extended to the group level.

Skin scoring and analysis

A major player in luxury cosmetics had developed a skin scoring algorithm. The problem: no one could verify what the AI based its diagnosis on. We built the visualization system that makes predictions transparent and deployable in stores and on mobile.

Beaconing detection

A large group's security team needed to identify beaconing signals: regular network traffic sent by potentially compromised machines to attacker-controlled servers. The volume of logs made manual detection impossible. We built the anomaly detection system capable of processing this data at scale.