Part 2 | Hammer Goldmine | Implementation
Last week we explained our goal: building an internal data warehouse that stays up to date without any human assistance. Such a data warehouse will help us find valuable information for our market insights, and find it faster. But before it can help us, several steps in the processing pipeline must be automated so that it runs smoothly.
First, data must be collected from the relevant sources periodically. Determining which types of data to include is essential to the success of the implementation.
We have decided (for the time being) to focus on unstructured data in the form of news articles. The amount of news output today is almost unlimited.
During our projects we do not want to be limited to a few news sources; we want to explore the full breadth of available information to find hidden gems that give us deeper insights. Traditionally, this is very time-consuming, which is why we expect that processing this type of data will give us the largest time gain during a qualitative analysis.
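To make the periodic collection step a bit more concrete, the sketch below shows one possible shape of such a collector, assuming the news sources expose RSS feeds. The feed URLs and the hourly polling interval are placeholders for this example, not our actual configuration.

```python
# Minimal sketch of a periodic news collector (feed URLs are placeholders).
import time
import feedparser  # widely used RSS/Atom parser

FEEDS = [
    "https://example.com/markets/rss",        # placeholder feed URL
    "https://example.org/industry-news/rss",  # placeholder feed URL
]

def collect_once():
    """Fetch the latest entries from every configured feed."""
    articles = []
    for url in FEEDS:
        feed = feedparser.parse(url)
        for entry in feed.entries:
            articles.append({
                "source": url,
                "title": entry.get("title", ""),
                "link": entry.get("link", ""),
                "published": entry.get("published", ""),
                "summary": entry.get("summary", ""),
            })
    return articles

if __name__ == "__main__":
    # Poll the feeds once per hour; a real deployment would use a scheduler.
    while True:
        batch = collect_once()
        print(f"Collected {len(batch)} articles")
        time.sleep(3600)
```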
Once the data is collected, it must be stored properly so that it can be retrieved quickly whenever we need it in one of our projects. An important consideration when storing the data is how far to standardise it. Standardisation is a double-edged sword: applying little standardisation makes it harder to run different data entries through the same pipeline, but it keeps the data quality high and the footprint small.
Since we are working with unstructured data, we are building a very adaptable model anyway. This means we can apply a low level of standardisation and still use a tight pipeline.
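To illustrate what such a low level of standardisation could look like in practice, the sketch below maps every collected entry onto a small common schema while keeping the original payload untouched, so no information is lost. The schema and field names are assumptions made for this example, not our actual data model.

```python
# Sketch of light standardisation: a small common schema plus the raw entry.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Document:
    source: str                 # where the entry came from
    title: str                  # human-readable title
    text: str                   # full unstructured body
    fetched_at: str             # ISO timestamp of collection
    raw: dict = field(default_factory=dict)  # original entry, kept unmodified

def standardise(entry: dict, source: str) -> Document:
    """Map a raw collected entry onto the common schema."""
    return Document(
        source=source,
        title=entry.get("title", ""),
        text=entry.get("summary", ""),
        fetched_at=datetime.now(timezone.utc).isoformat(),
        raw=entry,
    )
```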
Once the data is stored properly, we can focus on how to annotate it with information that helps us identify entries relevant to a specific project. For now, we have decided to categorise the data into bins, each containing entries related to a specific market intelligence field. This allows us to retrieve any data relevant to the most pressing information need at any point during a project.
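As a rough illustration of what retrieval by bin could look like, the sketch below filters stored documents on their labels; the label names and documents are made up for the example.

```python
# Sketch of label-based retrieval: each document carries a set of labels
# and a project query simply filters on them (labels are illustrative).
def find_relevant(documents, wanted_labels):
    """Return all documents tagged with at least one of the wanted labels."""
    wanted = set(wanted_labels)
    return [doc for doc in documents if wanted & set(doc.get("labels", []))]

documents = [
    {"title": "Chip supplier expands capacity",    "labels": ["supply-chain"]},
    {"title": "Competitor launches new product",   "labels": ["competitors", "products"]},
    {"title": "New EU regulation announced",       "labels": ["regulation"]},
]

# During a project we ask only for the bins that matter right now.
for doc in find_relevant(documents, ["competitors", "regulation"]):
    print(doc["title"])
```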
These labels are added with an active learning approach. During the development phase, an expert annotates a small amount of data before it is run through a machine learning model. After each iteration, the expert reviews the model's output before the data is run through the model again. In this way, the model gradually learns the labels we want to use in our data warehouse. The goal is to maximise the model's workload and minimise the expert's, so that adding a new label to the data takes only a short time. Eventually the expert can be left out of the loop entirely and the model labels new data on its own. With this approach we have already added several labels to the data warehouse and are continuously adding new ones.
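For readers curious what such a loop could look like, here is a minimal sketch of an active-learning iteration with uncertainty sampling, using scikit-learn as a stand-in; the model choice, batch size and expert step are illustrative assumptions, not a description of our actual tooling.

```python
# Sketch of an active-learning loop with uncertainty sampling.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def ask_expert(texts):
    """Placeholder for the expert review step (manual labelling)."""
    return [input(f"Label for: {t[:60]} ") for t in texts]

def active_learning(labelled_texts, labels, unlabelled_texts, rounds=3, batch=5):
    vectoriser = TfidfVectorizer()
    for _ in range(rounds):
        # Retrain the model on everything labelled so far.
        model = LogisticRegression(max_iter=1000)
        X = vectoriser.fit_transform(labelled_texts)
        model.fit(X, labels)

        # Score the unlabelled pool and pick the entries the model is
        # least certain about; only these go back to the expert.
        X_pool = vectoriser.transform(unlabelled_texts)
        confidence = model.predict_proba(X_pool).max(axis=1)
        uncertain = [int(i) for i in np.argsort(confidence)[:batch]]

        new_texts = [unlabelled_texts[i] for i in uncertain]
        labelled_texts += new_texts
        labels += ask_expert(new_texts)
        unlabelled_texts = [t for i, t in enumerate(unlabelled_texts)
                            if i not in set(uncertain)]
    return model, vectoriser
```

The key idea sits in the middle of the loop: only the entries the model is least confident about go back to the expert, which is what keeps the expert's workload small while the model does the bulk of the labelling.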
In the coming weeks, we will highlight the individual parts of the Hammer Goldmine pipeline and how we expect them to improve our workflow. In the next blog we will discuss part 3 of Hammer Goldmine: the usability of the database and web interface.