Classification of web pages

A player in the banking sector had hundreds of thousands of uncategorized web pages and no way to link the content viewed to the products relevant to each user. We developed a text classification model that reaches 95% accuracy from a small training sample, paving the way for large-scale personalized product recommendations.

Problem

The client had hundreds of thousands of web pages across its sites, none of them reliably classified. Without categorized content, users' connection logs cannot be exploited to understand their journeys and recommend the right products. The technical challenge: build an efficient and interpretable classification model from a very small training sample (fewer than 1,500 annotated pages out of a corpus of several hundred thousand).

Solution
What we built
We deployed a team of two Data Scientists and one Data Engineer to design the classification pipeline end to end.
Step 1 — Scraping and data preparation. Automated extraction of page content, then cleaning and standardization of the text to remove HTML noise and non-informative elements.
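A minimal sketch of this kind of extraction and cleaning step is shown below; the actual crawler, URL lists, and filtering rules are internal to the project, and the use of requests and BeautifulSoup here is an assumption.

```python
# Illustrative extraction/cleaning step (not the project's actual pipeline).
import re
import requests
from bs4 import BeautifulSoup

def extract_clean_text(url: str) -> str:
    """Fetch a page and return its visible text, stripped of HTML noise."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Drop non-informative elements: scripts, styles, navigation, footers.
    for tag in soup(["script", "style", "nav", "footer", "header"]):
        tag.decompose()
    text = soup.get_text(separator=" ")
    # Normalize whitespace and lowercase for the downstream encoding step.
    return re.sub(r"\s+", " ", text).strip().lower()
```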
Step 2 — Semantic encoding. Transformation of the text into vector representations usable by models, testing several approaches: TF-IDF as a baseline, then Word2Vec and Doc2Vec to capture semantics beyond exact keywords.
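The sketch below illustrates the two encoding families that were compared, using scikit-learn and gensim; the hyperparameters are placeholders, not the project's actual settings.

```python
# Two encoding approaches: sparse TF-IDF baseline vs. dense Doc2Vec vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = ["open a savings account online", "compare mortgage loan rates"]

# Baseline: TF-IDF keys on word frequency, with no notion of synonymy.
tfidf = TfidfVectorizer(max_features=20000)
X_tfidf = tfidf.fit_transform(docs)

# Doc2Vec: dense document vectors that capture semantics beyond keywords.
tagged = [TaggedDocument(words=d.split(), tags=[i]) for i, d in enumerate(docs)]
d2v = Doc2Vec(vector_size=100, min_count=1, epochs=20)
d2v.build_vocab(tagged)
d2v.train(tagged, total_examples=d2v.corpus_count, epochs=d2v.epochs)
X_d2v = [d2v.infer_vector(d.split()) for d in docs]
```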
Step 3 — Deep Learning modeling. Development of a bidirectional recurrent neural network (Bidirectional LSTM) that classifies pages into the categories defined by the business. The model uses the context of the text in both reading directions to maximize comprehension of the content.
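A minimal Keras version of such an architecture might look like the following; the vocabulary size, embedding dimension, and number of categories are placeholder values, not the project's.

```python
# Minimal Bidirectional LSTM text classifier (illustrative architecture only).
import tensorflow as tf
from tensorflow.keras import layers

VOCAB_SIZE = 20000   # assumed vocabulary size
NUM_CLASSES = 12     # assumed number of business-defined categories

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 128),
    # Reads the token sequence in both directions, as described above.
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.5),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```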
Step 4 — Interpretability. Implementation of heatmaps that let business teams visualize which words and passages drove each classification, so they can verify that the model relies on the right signals.
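The attribution method used in the project is not specified here; one common way to build such word-level heatmaps is occlusion, scoring each word by how much the predicted class probability drops when it is masked. A sketch under that assumption (`encode` is a hypothetical helper matching the encoding step above):

```python
# Occlusion-based word importances for a fitted Keras classifier.
import numpy as np

def word_importances(model, encode, tokens, target_class):
    """Score each token by the probability drop when it is removed.

    `encode` (hypothetical) turns a token list into a model-ready batch,
    applying the same indexing/padding as at training time."""
    base = float(model.predict(encode(tokens), verbose=0)[0][target_class])
    scores = []
    for i in range(len(tokens)):
        masked = tokens[:i] + tokens[i + 1:]
        p = float(model.predict(encode(masked), verbose=0)[0][target_class])
        scores.append(base - p)  # bigger drop = more influential word
    return np.array(scores)  # map scores to a color scale to render the heatmap
```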