Classification of Web Pages

Context and problem statement

With hundreds of thousands of web pages and reliable knowledge of its data, Société Générale wants to classify every page of its websites and use user connection logs to improve its product recommendation processes.

Goals

  • Prove the added value of deep learning for text classification, even on small samples (fewer than 1,500 web pages)
  • Provide an interpretable model: the business must be able to understand the model's choices
  • Classify the pages into several categories defined by the business

Our intervention

2 Data Scientists and 1 Data Engineer working in Scrum mode

  • Web scraping and data set cleaning
  • Preparation of the data (standardization, etc.)
  • Data encoding using TF-IDF, Word2Vec, Doc2Vec
  • Modeling using a bidirectional LSTM sequential neural network
  • Model interpretability via heatmaps
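
To make the encoding step more concrete, here is a minimal sketch of TF-IDF in plain Python. It is not the project's actual code: the sample pages, the whitespace tokenizer, and the smoothed IDF formula are all assumptions chosen for illustration. In practice a library implementation would likely be used.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase + whitespace split; real pages need HTML cleaning first.
    return text.lower().split()


def tf_idf(documents):
    """Return one sparse TF-IDF vector (dict term -> weight) per document, plus the IDF table."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    # Assumption: one common smoothed IDF variant, log((1 + N) / (1 + df)) + 1.
    idf = {
        term: math.log((1 + n_docs) / (1 + df)) + 1.0
        for term, df in doc_freq.items()
    }

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        # Term frequency is the relative count of the term within the document.
        vectors.append(
            {term: (count / total) * idf[term] for term, count in counts.items()}
            if total
            else {}
        )
    return vectors, idf


# Hypothetical page texts, for illustration only.
pages = [
    "open a savings account online",
    "compare savings account rates",
    "home loan simulator",
]
vectors, idf = tf_idf(pages)
```

Terms shared by several pages (here "savings" and "account") receive a lower IDF than terms specific to one page (such as "loan"), which is what makes the representation useful for separating categories.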

Results

95% accuracy and a 90% F-measure on the test set
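
As a reminder of what these two metrics measure, the sketch below computes accuracy and an F-measure on a toy example. The labels and predictions are invented, and macro-averaging over categories is an assumption: the figures above do not specify which averaging was used.

```python
def accuracy(y_true, y_pred):
    # Share of pages whose predicted category matches the true one.
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    # Assumption: unweighted mean of per-category F1 scores (macro-averaging).
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if precision + recall
            else 0.0
        )
        scores.append(f1)
    return sum(scores) / len(scores)


# Hypothetical categories and predictions, for illustration only.
y_true = ["loans", "savings", "cards", "loans", "savings", "cards"]
y_pred = ["loans", "savings", "cards", "savings", "savings", "cards"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")
```

The F-measure can sit below accuracy, as it does in the reported results, when some categories are predicted less reliably than others.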

Technical environment

HDFS
Python
Spark
PySpark
H2O
Keras
Gensim

Together with our customers, we build solutions that change and facilitate their daily lives.
