Back in September, we ran a workshop at PyCon UK titled ‘Natural Language Processing in 10 Lines of Code’. Due to its popularity, we decided to share the tutorial here.
In our previous post, we used some basic techniques to analyse Pride and Prejudice and extract some interesting insights about the characters in the book. In this post, we are going to apply the same analysis techniques to a dataset of real events: the RAND Terrorism dataset, a collection of 40,000 news articles from 1968 to 2009 reporting on terrorist activity.
Here are some questions we are going to try to answer during this analysis:
Who are the terrorist groups and other persons mentioned in each article?
What locations are mentioned in each article? (Hint: in spaCy, a location entity simply carries a different label from a person entity.)
With all of this information, is it possible to plot a figure expressing the relationships between locations and terrorists?
You can find instructions on how to install everything needed for this tutorial in the workshop repository on the Cytora GitHub.
For speed, we have preprocessed the dataset for this task, reducing it to 10,033 articles and removing extraneous fields.
THINGS TO CONSIDER WHEN USING REAL DATA
We are going to use the exact same approach to analyse this dataset as we used in our last post to analyse Pride and Prejudice. However, when using real data there can be some pitfalls and quirks to consider due to inconsistent data quality.
In the previous task, we used only person entities to identify characters. For this task, you will need to expand your selection to groups and organisations using spaCy's “ORG” label. Don’t assume this label will only give you terrorist groups: the UN, for example, is mentioned many times in this dataset.
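The filtering step above can be sketched as a small helper. In the real pipeline the `(text, label)` pairs would come from spaCy, e.g. `[(ent.text, ent.label_) for ent in nlp(article).ents]`; here the function is kept independent of spaCy so it works on any list of such pairs:

```python
def actor_entities(entities):
    """Given (text, label) pairs from spaCy's NER, keep the entities that
    can name an actor in an article: people (PERSON) and
    groups/organisations (ORG). Locations (GPE) and others are dropped."""
    return [text for text, label in entities if label in ("PERSON", "ORG")]
```

Note that, as discussed above, “ORG” will happily return the UN alongside any terrorist group, so further filtering may be needed downstream.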
Another factor to consider is that a terrorist group might be named differently depending on the author of the article. ‘Al-Qaeda’, ‘Al Qaeda’ and ‘Alqaeda’ all appear in this dataset. We know that these names all refer to the same entity, but spaCy does not. If you want to group together similar names for more consistent data, you could do so using pattern replacement methods.
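One minimal sketch of such pattern replacement, using the standard-library `re` module, maps known spelling variants onto a single canonical name before counting mentions. The alias table below is illustrative, not exhaustive:

```python
import re

# Map regex patterns of known spelling variants to one canonical name.
# 'Al-Qaeda', 'Al Qaeda' and 'Alqaeda' all appear in the dataset.
ALIASES = {
    r"\bal[- ]?qa'?eda\b": "Al-Qaeda",
}

def normalise(text):
    """Rewrite every known alias in `text` to its canonical form."""
    for pattern, canonical in ALIASES.items():
        text = re.sub(pattern, canonical, text, flags=re.IGNORECASE)
    return text
```

Running the normaliser over each article before entity counting means all three spellings contribute to the same tally.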
By extracting the mentions and locations of particular terrorist groups from each article, we can examine terrorist activity by location to understand the risks posed by a certain group over the landscape we are interested in.
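The group-by-location tally can be sketched as an article-level co-occurrence count. For brevity this sketch uses a substring test in place of the spaCy entity extraction described above, and the group and location lists are a small illustrative subset, not the full set used in the tutorial:

```python
from collections import Counter

GROUPS = ["Al-Qaeda", "Hamas", "Taliban"]        # illustrative subset
LOCATIONS = ["Afghanistan", "Gaza", "Iraq"]      # illustrative subset

def cooccurrence_counts(articles):
    """For each (group, location) pair, count how many articles
    mention both. `articles` is an iterable of article strings."""
    counts = Counter()
    for article in articles:
        for group in GROUPS:
            if group in article:
                for location in LOCATIONS:
                    if location in article:
                        counts[(group, location)] += 1
    return counts
```

The resulting counts form exactly the groups × locations matrix visualised in the next section.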
GLOBAL INCIDENTS BY TERRORIST GROUP
Using Seaborn, we can create this visualisation of the output of our analysis:
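A heat map of the mention counts can be produced with `seaborn.heatmap`. The matrix values below are placeholder numbers standing in for the real co-occurrence counts; the row and column subsets are likewise illustrative:

```python
import matplotlib
matplotlib.use("Agg")  # render without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Placeholder groups x locations matrix; in the tutorial these values
# come from the entity co-occurrence counts computed above.
data = pd.DataFrame(
    [[1000, 0, 50], [0, 300, 0], [20, 0, 400]],
    index=["Taliban", "Hamas", "Al-Qaeda"],
    columns=["Afghanistan", "Gaza", "Iraq"],
)

ax = sns.heatmap(data, annot=True, fmt="d", cmap="Blues")
ax.set(xlabel="Location", ylabel="Terrorist group")
plt.tight_layout()
plt.savefig("mentions_heatmap.png")
```

Annotated cells (`annot=True`) make it easy to read off exact counts, which is what the insights below rely on.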
From this visualisation we can extract some key insights:
The Taliban is mentioned in relation to Afghanistan over 1,000 times, and the capital city of Kabul 155 times
Hamas is linked most closely to Gaza and Israel
Hamas and Palestine are frequently mentioned together, though not as frequently as Hamas and Israel
As this dataset only covers 1968 to 2009, Islamic State (ISIS) is not mentioned
If we look at the first column compared to all others, we can see that Al-Qaeda have the widest spread of mentions, being identified in 12 of the 13 areas that we chose to inspect
Despite Al-Qaeda originating in Afghanistan, the three locations most frequently mentioned in relation to it are all in Iraq, or are Iraq itself
As this dataset spans a large stretch of time (1968 to 2009), we could infer that Al-Qaeda were not reported as a terrorist group, at least in this dataset, until their activity in Iraq during the final 15 years of the period covered
It is important to keep in mind that there could be potential data bias from the curators of this dataset, for example, a US-based nonprofit group may have vested interests in a particular inference from this data.
In this analysis, we did not consider adding the subgroups and offshoots of each terrorist group. Doing so might yield a more accurate representation of activity. We could also use the raw unprocessed data, which includes the date for each report, to slice the articles by decade, creating a series of heat maps which analyse terrorist group mentions over time.
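The decade-slicing idea can be sketched with a small bucketing helper, assuming each record from the raw dataset is available as a `(year, text)` pair (the exact field layout of the raw data is an assumption here):

```python
from collections import defaultdict

def by_decade(dated_articles):
    """Bucket (year, text) pairs by decade, e.g. 1972 -> 1970.
    Returns a dict mapping decade start year to a list of article texts."""
    buckets = defaultdict(list)
    for year, text in dated_articles:
        buckets[(year // 10) * 10].append(text)
    return dict(buckets)
```

Running the co-occurrence analysis on each bucket would then yield one heat map per decade.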