Context and problem
Kairos is a research assistant.
Objective: take a user request, enrich it, crawl the results, and present the top matches along with their similarity scores and their content (topic extraction).
Goals
The goal is to select the most interesting content to read, then navigate through the documents and understand their subjects and how they relate to one another (topic/similarity clustering).
Our intervention
1 Data Scientist
- Similarity is measured by comparing word and document embeddings.
- To compare a document to the query, the words are embedded and the cumulative energy (cost) required to move the words of the query onto the document's matching words is computed.
- Clustering on embeddings gives consistent results. Extracting the most important sentences makes it easier to separate the texts, even more so once unnecessary sentences are removed.
- For topic extraction, a multilabel topic classifier will be much more robust and efficient in the long run.
- Using an unsupervised generative model such as LDA (Latent Dirichlet Allocation) speeds up the labeling needed for that classifier.
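The query-to-document comparison described above can be illustrated with a minimal sketch. It is not the project's implementation: it assumes a relaxed variant in which each query word is moved to its closest document word, and it uses hypothetical 2-D toy vectors (TOY_VECTORS) in place of trained embeddings. The function names cumulative_cost and rank_documents are illustrative only.

```python
import math

# Hypothetical 2-D toy embeddings, for illustration only.
# A real system would load trained word vectors instead.
TOY_VECTORS = {
    "neural": (0.9, 0.1),
    "network": (0.8, 0.2),
    "deep": (0.85, 0.15),
    "learning": (0.7, 0.3),
    "recipe": (0.1, 0.9),
    "cooking": (0.2, 0.8),
}


def euclidean(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cumulative_cost(query_words, doc_words, vectors=TOY_VECTORS):
    """Sum, over the query words, of the distance to the closest document word.

    Assumption: this is a relaxed version of the transport cost, where every
    query word moves entirely to its nearest document word. Words without a
    vector are ignored; if either side has no known word, the cost is infinite.
    """
    known_query = [w for w in query_words if w in vectors]
    known_doc = [w for w in doc_words if w in vectors]
    if not known_query or not known_doc:
        return math.inf
    return sum(
        min(euclidean(vectors[q], vectors[d]) for d in known_doc)
        for q in known_query
    )


def rank_documents(query, documents, vectors=TOY_VECTORS):
    """Return document names ordered from lowest to highest cumulative cost."""
    scored = [
        (cumulative_cost(query.split(), text.split(), vectors), name)
        for name, text in documents.items()
    ]
    return [name for _, name in sorted(scored)]


if __name__ == "__main__":
    docs = {"ml": "deep learning", "food": "cooking recipe"}
    print(rank_documents("neural network", docs))
```

A lower cost means the document is closer to the query, so the ranking lists the best match first. Gensim, which is part of the project's stack, exposes a wmdistance method on its word vectors that solves the full transport problem; the source does not say whether that method was used here.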
Results
All features have been developed.
The project has achieved all the defined objectives and is now in the industrialization phase.
Technical environment
Python, Keras, TensorFlow, Gensim, spaCy, NLTK, Docker, Scrapy, BeautifulSoup, Git