Document classification tool

Context and problem statement

A large share of the client's data science projects rely on text documents. To process and model this data, data scientists repeatedly use the same tools, models, and functions.

To make these steps faster, more generic, and accessible to people who are not data scientists, we developed a shared package that performs the standard tasks involved in document classification with machine learning.

Goals

Intuitive and very high-level user API

Support scikit-learn models and methods

Ability to customize every stage of the pipeline

Generic code making it easy for other data scientists to contribute

Be able to interpret predictions

Be able to search for hyperparameters
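To illustrate the goals above, here is a minimal sketch of what a high-level, customizable wrapper around scikit-learn objects could look like. It is not the package's actual API: the `build_classifier` helper, the step names, and the toy corpus are all hypothetical and chosen for illustration only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy corpus, invented for illustration only.
texts = [
    "invoice payment due next month",
    "payment received for invoice",
    "server crash after the update",
    "cannot login to the server",
]
labels = ["billing", "billing", "support", "support"]


def build_classifier(**step_params):
    """Hypothetical helper: return a text classification pipeline.

    Keyword arguments override the settings of individual steps.
    """
    pipeline = Pipeline([
        ("vectorizer", TfidfVectorizer()),
        ("classifier", LogisticRegression(max_iter=1000)),
    ])
    # scikit-learn's "<step>__<parameter>" convention lets the caller
    # customize any stage without touching the pipeline definition.
    pipeline.set_params(**step_params)
    return pipeline


# One call builds the whole pipeline; two steps are customized here.
model = build_classifier(vectorizer__ngram_range=(1, 2), classifier__C=10.0)
model.fit(texts, labels)
prediction = model.predict(["the server is down"])
```

Because the wrapper returns a plain scikit-learn `Pipeline`, any scikit-learn method (`fit`, `predict`, `get_params`, and so on) remains available to more advanced users.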

Our intervention

1 Data Scientist

Development of data pre-processing pipelines
Managing the link with underlying scikit-learn objects (integration)
Development of the interpretability module
Development of the hyperparameter search module
Publishing the package and presenting demos to other data scientists
Adding features based on the specific needs of data scientists

Results

Package published.
Objectives achieved.
Used by several projects (including the support ticket classification project).

Technical environment

Python (scikit-learn, pandas, optuna, lime, shap, nltk, MLflow, plotly)
Pytest
Git/GitHub
