Practical example with the Datathon2020 Article (content) recommender case
If you work on Data Science projects, you have most probably faced the problem of choosing the right approach and steps for a project with strict, short deadlines. It also comes up when you work with freelancers whose point of view differs from yours. In this article we share the steps we followed, based on our own experience over the past few days in the field of retail and e-commerce.
We participated in a time-limited competition, where timing is the key to success. Last week Datathon2020 took place and we decided to work on one of the given projects. The organizers provide a lot of free data and a specific case to work on. The case we liked most was the Article recommender case. It is in our field of interest, and what we learned will definitely help us in the future.
1. Case overview of our Data Science project
The first step is to define the problem and make a list of everything you have. Also try to identify the biggest obstacles you are likely to meet along the way.
The main objective of the Article recommender case is to optimize the suggestions shown to readers of online articles. The final goal is to engage each user with the topics closest to their interests.
The main idea is to predict the next best article for the visitor, so that visitors can follow the topics most important to them without having to search for them explicitly.
The model is evaluated on the same users, using data from the following time period. The training dataset covers the past 30 days, and the evaluation dataset covers the next 1 day. There are many articles and even more visitors: over 500,000,000 data points in total. One specific difficulty is that all of the article text is in Bulgarian (written in the Cyrillic alphabet).
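To rehearse this setup locally before the real evaluation, one option is to hold out the last day of the training data as a validation set. The snippet below is only a minimal sketch of that idea. The event layout `(user_id, article_id, timestamp)` and the function name `split_last_day` are our own assumptions for illustration, not the format of the Datathon files.

```python
from datetime import datetime, timedelta


def split_last_day(events, holdout=timedelta(days=1)):
    """Split (user_id, article_id, timestamp) events by time.

    Everything in the final `holdout` window goes to validation,
    everything earlier goes to training.
    """
    if not events:
        return [], []
    # The cutoff is measured back from the most recent event.
    cutoff = max(ts for _, _, ts in events) - holdout
    train = [e for e in events if e[2] <= cutoff]
    valid = [e for e in events if e[2] > cutoff]
    return train, valid


# Hypothetical toy events, only to show the call.
events = [
    ("u1", "a1", datetime(2020, 5, 1, 10, 0)),
    ("u1", "a2", datetime(2020, 5, 29, 9, 0)),
    ("u2", "a3", datetime(2020, 5, 30, 8, 30)),
]
train, valid = split_last_day(events)
```

Splitting by time rather than at random keeps the validation data strictly after the training data, which mirrors how the case itself is evaluated.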
2. Main interest
Does the project actually fit your interests, and is it what you want to work on? If not, consider changing it.
This matters to us because we work with different websites, some of which have blogs or a whole article section, and we see this as an opportunity for them to grow engagement and involve their audience.
We find this case particularly interesting because it shifts the focus from the sheer number of views to the personal preferences of the website's visitor.
Fortunately, using these steps we managed to find a great working strategy.
The next and most important step whenever you begin a new project, even before making a specific plan, is research.
In our case the top priority was to research the different types of recommenders that already exist and are used worldwide. Finding a specific model that supports the Cyrillic alphabet took especially long. We chose one built for the Russian language.
4. Data exploration
The next important step is to go back to the problem and analyze in detail what exactly you have and how you can use it to achieve the final goal. If you have the option, gather even more data or knowledge.
After choosing the model, we realized that the given data alone was not enough to build a really good, personalized recommender. So we took the risk and spent even more of our time collecting additional data. We did this by scraping the website_link, title, subtitle, text, date_of_posting and hashtags. It was not clear which field would describe an article best, so we kept all of them in order to compare them and choose one when needed.
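As an illustration of what one scraped record could look like, here is a minimal sketch using only the Python standard library. The field names come from the list above. The page layout (`h1` for the title, `h2` for the subtitle, `p` for the text, a `time` tag with a `datetime` attribute, and `a class="tag"` for hashtags) is purely an assumption for the example, as are the names `Article`, `ArticleParser` and `parse_article`. A real site would need its own selectors.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Article:
    website_link: str
    title: str = ""
    subtitle: str = ""
    text: str = ""
    date_of_posting: str = ""
    hashtags: list = field(default_factory=list)


class ArticleParser(HTMLParser):
    """Fills an Article from a hypothetical page layout (see lead-in)."""

    def __init__(self, link):
        super().__init__()
        self.article = Article(website_link=link)
        self._target = None  # name of the field currently being filled

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "h1":
            self._target = "title"
        elif tag == "h2":
            self._target = "subtitle"
        elif tag == "p":
            self._target = "text"
        elif tag == "a" and attrs.get("class") == "tag":
            self._target = "hashtags"
        elif tag == "time":
            self.article.date_of_posting = attrs.get("datetime") or ""

    def handle_endtag(self, tag):
        if tag in ("h1", "h2", "p", "a"):
            self._target = None

    def handle_data(self, data):
        data = data.strip()
        if not data or self._target is None:
            return
        if self._target == "hashtags":
            self.article.hashtags.append(data)
        elif self._target == "text":
            # Paragraphs are joined into one text field.
            self.article.text = (self.article.text + " " + data).strip()
        else:
            setattr(self.article, self._target, data)


def parse_article(link, html):
    parser = ArticleParser(link)
    parser.feed(html)
    parser.close()
    return parser.article
```

Keeping every field in one record makes it easy to compare later which of them describes the article best.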
To develop a stable and well-thought-out architecture, you had better hold a brainstorming session. It helps you explore the best options you have and the different ways to achieve the goal.
In our brainstorming session we decided to represent each article as a vector, because this way the articles can be separated into different clusters. To do this, we chose the BERT model.
The user is the other important part of the equation. Every visitor has specific interests that fall into particular clusters of articles. To capture a person's patterns and how they change over time, we decided to use Recurrent Neural Networks. So our architecture combined BERT and an RNN.
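To make the idea of "articles as vectors that fall into clusters" concrete, here is a small sketch. The two-dimensional vectors are toy stand-ins for real BERT embeddings, which are much higher-dimensional, and the nearest-centroid assignment by cosine similarity is just one simple way to group vectors. It is an illustration under those assumptions, not necessarily the clustering used in the project.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def assign_clusters(vectors, centroids):
    """For each article vector, return the index of the most similar centroid."""
    return [
        max(range(len(centroids)), key=lambda k: cosine(vec, centroids[k]))
        for vec in vectors
    ]


# Toy article vectors (stand-ins for BERT embeddings) and two cluster centers.
article_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0]]
centroids = [[1.0, 0.0], [0.0, 1.0]]
labels = assign_clusters(article_vectors, centroids)
```

Here the first two articles point in roughly the same direction and end up in the same cluster, while the third lands in the other one.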
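The sketch below shows, in plain Python, how a recurrent network can turn a visitor's reading history (a sequence of article vectors) into a ranking of candidate articles. It is a minimal Elman-style RNN with random, untrained weights. The class `TinyRNN`, the function `rank_candidates` and the toy vectors are our own illustrative assumptions, training is left out entirely, and the variant actually used in the project may differ.

```python
import math
import random


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]


class TinyRNN:
    """Elman-style RNN with random (untrained) weights, for illustration only."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)

        def rand(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.hidden = hidden
        self.w_in = rand(hidden, dim)      # input -> hidden
        self.w_rec = rand(hidden, hidden)  # hidden -> hidden
        self.w_out = rand(dim, hidden)     # hidden -> article space

    def encode(self, history):
        """Run over the reading history and return the final hidden state."""
        h = [0.0] * self.hidden
        for x in history:
            from_input = matvec(self.w_in, x)
            from_state = matvec(self.w_rec, h)
            h = [math.tanh(p + q) for p, q in zip(from_input, from_state)]
        return h

    def predict_vector(self, history):
        """Project the final state back into the article vector space."""
        return matvec(self.w_out, self.encode(history))


def rank_candidates(model, history, candidates):
    """Order candidate article ids by dot product with the predicted vector."""
    target = model.predict_vector(history)
    scores = {
        article_id: sum(a * b for a, b in zip(target, vec))
        for article_id, vec in candidates.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


model = TinyRNN(dim=2, hidden=4)
history = [[1.0, 0.1], [0.9, 0.2]]                  # articles the visitor read
candidates = {"a3": [0.1, 1.0], "a4": [1.0, 0.0]}   # articles to rank
ranking = rank_candidates(model, history, candidates)
```

With trained weights, the predicted vector would move toward the articles a visitor is likely to read next. With the random weights used here, the ranking only demonstrates the data flow.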
6. Evaluation metrics
One of the most important things after developing the architecture is to decide how to test it all and how to measure the results.
We decided to expose some hyperparameters in the architecture for tuning and chose Mean Reciprocal Rank (MRR) and Mean Average Precision (MAP) as our evaluation metrics.
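For readers who have not met these metrics, here is a short sketch of how they can be computed. For each user we assume a ranked list of recommended article ids and a set of articles the user actually read. Average precision is normalized by the number of relevant articles, which is one common convention. The function names are our own.

```python
def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0 if none is found."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked, relevant):
    """Mean of precision@k taken at each position k that holds a relevant item."""
    hits = 0
    total = 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


def mean_metric(metric, rankings, relevants):
    """Average a per-user metric over all users (gives MRR or MAP)."""
    scores = [metric(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(scores) / len(scores) if scores else 0.0


# Toy data: two users, their recommendations and what they really read.
rankings = [["a", "b", "c"], ["b", "a", "c"]]
relevants = [{"a"}, {"a", "c"}]
mrr = mean_metric(reciprocal_rank, rankings, relevants)      # 0.75
map_score = mean_metric(average_precision, rankings, relevants)
```

MRR only rewards how early the first correct article appears, while MAP also takes into account every other relevant article in the list, so the two together give a fuller picture.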
Once everything is decided, it is time for the hard work and the coding itself. The project can be divided into smaller tasks so that everyone can work in parallel.
In our case one person scraped the data, another worked on the BERT model and a third on the RNN model. This helped us do the best we could in the given amount of time. Once the scraping was ready, we could feed its output directly into BERT, and the output of BERT directly into the RNN.
Do not forget to use the evaluation metrics you chose in step 6. Depending on the results, go back to any of the previous steps in order to build a better model.
To evaluate our model, we used the last two hours of the 48-hour Datathon. Of course not everything was perfect, but for the time we had, we saw where the problems were. At least we could present them and had an idea of how the model could be improved.
Once the evaluation is complete, it is time for the final testing. The goal of testing is to put the model in a real-life situation and wait for the results to come in.
Our model was tested by the experts at the Datathon, and the results were quite satisfying for the short time we were given.
And last but not least: don’t forget to have fun and be happy with everything you are doing!
If you have any questions on this or another of our topics, feel free to write to us! We will be happy to answer!
We are open for business inquiries. And if you are interested in our work, follow us for more interesting topics and be part of our next projects!