Context and challenge
With hundreds of thousands of web pages and reliable knowledge of its data, Société Générale wanted to classify every page of its websites and improve its product recommendation processes based on user connection logs.
Goals
- Prove the added value of deep learning for text classification, even on small samples (<1,500 web pages)
- Provide an interpretable model: the business must be able to understand the model's choices
- Classify the web pages into several categories defined by the business
Our intervention
2 Data Scientists and 1 Data Engineer working in Scrum mode
- Web scraping and data set cleaning
- Data preparation (standardization, etc.)
- Data encoding using TF-IDF, Word2Vec, and Doc2Vec (see the encoding sketch after this list)
- Modeling using a bidirectional LSTM sequential neural network (see the model sketch after this list)
- Model interpretability via heatmaps (see the heatmap sketch after this list)
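
A minimal sketch of the encoding step, assuming `pages` holds the cleaned page texts (a placeholder variable). It shows TF-IDF weighting and Doc2Vec embeddings with Gensim; Word2Vec works analogously via gensim.models.Word2Vec. The parameters shown are illustrative, not the project's actual settings.

```python
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

pages = ["open a savings account online", "corporate banking services overview"]  # placeholder texts
tokens = [text.split() for text in pages]

# TF-IDF: sparse bag-of-words vectors reweighted by term rarity
dictionary = Dictionary(tokens)
bow = [dictionary.doc2bow(doc) for doc in tokens]
tfidf = TfidfModel(bow)
X_tfidf = [tfidf[doc] for doc in bow]

# Doc2Vec: dense document embeddings learned over the corpus
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(tokens)]
d2v = Doc2Vec(vector_size=300, min_count=1, epochs=40)
d2v.build_vocab(tagged)
d2v.train(tagged, total_examples=d2v.corpus_count, epochs=d2v.epochs)
X_d2v = [d2v.infer_vector(doc) for doc in tokens]
```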
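The classifier itself can be sketched in Keras as follows, assuming padded token sequences of length MAX_LEN over a VOCAB_SIZE-word vocabulary and NUM_CLASSES page categories (all placeholder values; the project's actual architecture and hyperparameters are not detailed here).

```python
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dropout, Dense

VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 20000, 300, 8  # placeholder values

model = Sequential([
    Input(shape=(MAX_LEN,)),
    Embedding(VOCAB_SIZE, 128),          # learned word embeddings
    Bidirectional(LSTM(64)),             # reads the page text in both directions
    Dropout(0.5),                        # regularization, useful on a small dataset
    Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
# model.fit(X_train, y_train, validation_data=(X_val, y_val), epochs=10, batch_size=32)
```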
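One simple way to produce a word-level heatmap for a prediction is occlusion: mask each token in turn and measure the drop in the predicted class probability. This is only an illustrative approach, not necessarily the method used on the project; `model` is the classifier sketched above and `encoded_page` a hypothetical padded token sequence.

```python
import numpy as np

def occlusion_heatmap(model, sequence, pad_id=0):
    """Return one importance score per token position in a 1-D integer sequence."""
    base = model.predict(sequence[None, :], verbose=0)[0]
    target = int(np.argmax(base))                       # predicted class
    scores = np.zeros(len(sequence), dtype=float)
    for i in range(len(sequence)):
        masked = sequence.copy()
        masked[i] = pad_id                              # occlude one token
        prob = model.predict(masked[None, :], verbose=0)[0][target]
        scores[i] = base[target] - prob                 # importance = probability drop
    return scores

# Example: render `scores` as a colour heatmap over the page's words.
# scores = occlusion_heatmap(model, np.array(encoded_page))
```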
Results
95% accuracy and a 90% F-measure on the test set
Technical environment
HDFS
Python
Spark
PySpark
H2O
Keras
Gensim