The ICO Data Superset ID
One of the unique benefits of ICO Data Solution is the ICO Data Superset ID. This ID links every piece of data to a single unique identifier, regardless of the data source. To create such an ID, we use intelligent matching and merging. How does this work? Let's say we have three data sources: <A>, <B> and <C>. The first step is to clean and normalize all data so that the same type of data is always in the same format. This covers strings, amounts, dates, and booleans, but also categories, statuses, and classes. After that, we apply intelligent matching and merging. We do this for each data source individually, as well as for each merged data set.
We use a set of data attributes to determine whether multiple data entries refer to the same ICO. One of these attributes is, for example, the ICO website URL. We strip the website URL down to its bare domain and compare this value with other entries. A shared domain is a very good indicator of a match.
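
To make the domain comparison concrete, here is a minimal Python sketch. The function name bare_domain and the exact stripping rules (scheme, "www." prefix, port, and path removed) are our own assumptions for illustration, not a description of the production logic.

```python
from urllib.parse import urlparse


def bare_domain(url: str) -> str:
    """Reduce a website URL to a lowercase host without scheme, 'www.', port, or path.

    Hypothetical helper: the stripping rules are assumptions for illustration.
    """
    url = url.strip().lower()
    # urlparse only recognises the host when a scheme (or a leading //) is present.
    if "://" not in url:
        url = "//" + url
    host = urlparse(url).hostname or ""
    if host.startswith("www."):
        host = host[4:]
    return host


def same_domain(url_a: str, url_b: str) -> bool:
    """Two entries match on this attribute when both have the same non-empty domain."""
    domain_a, domain_b = bare_domain(url_a), bare_domain(url_b)
    return domain_a != "" and domain_a == domain_b
```

Note that this sketch treats subdomains such as token.example.io as distinct from example.io. Collapsing those would require a public suffix list, which is left out here.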
Another attribute we use for matching is the project/ICO name. Here an exact one-to-one match is not sufficient. Therefore, we use so-called fuzzy matching to determine the distance between two strings. The smaller the distance, the more likely it is that we have a match. Besides these two, we use other attributes as well.
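
As an illustration of fuzzy name matching, the sketch below uses the Levenshtein edit distance. The text above does not say which distance metric is used, so treat the metric, the lowercasing, and the normalization by string length as assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Number of single-character insertions, deletions, or substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def name_distance(name_a: str, name_b: str) -> float:
    """Edit distance scaled to 0.0 (identical) .. 1.0 (completely different)."""
    a, b = name_a.strip().lower(), name_b.strip().lower()
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

Scaling by the longer name keeps short and long names comparable, so one cut-off value can be applied to all pairs.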
Each evaluated attribute that results in an attribute match adds to an ICO match score, and when a certain ICO match score threshold is reached we identify the two ICOs as a match.
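
The scoring step could be sketched as follows. The attribute names, the weights, and the threshold are hypothetical values chosen for the example. The actual attributes and numbers are not given in this article.

```python
from typing import Mapping

# Hypothetical weights: how much each attribute match adds to the ICO match score.
ATTRIBUTE_WEIGHTS = {"domain": 60, "name": 40}

# Hypothetical threshold at which two entries count as the same ICO.
MATCH_THRESHOLD = 70


def match_score(attribute_matches: Mapping[str, bool]) -> int:
    """Sum the weights of all attributes that matched; unknown attributes add nothing."""
    return sum(
        ATTRIBUTE_WEIGHTS.get(attribute, 0)
        for attribute, matched in attribute_matches.items()
        if matched
    )


def is_ico_match(attribute_matches: Mapping[str, bool]) -> bool:
    """Two entries are a match once the score reaches the threshold."""
    return match_score(attribute_matches) >= MATCH_THRESHOLD
```

With these example numbers, a domain match alone (60) stays below the threshold, while a domain match plus a name match (100) counts as an ICO match.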
Once we have our list of ICO matches, the intelligent merging can begin.
When we want to merge sources <A>, <B> and <C>, the easiest case would of course be that data set <A> only contains attributes that are not in <B>, and <B> only contains attributes that are not in <C>. This would be easier, but it would not help us get better-quality data. We like to have multiple sources for the same attribute, so we can merge intelligently.
For each attribute, we have defined the list of sources that can supply it, as long as the value is not missing. We then created a ranking of these sources for each attribute. The higher a source is ranked, the more we trust it to have the correct value.
When we now merge <A>, <B> and <C>, we check for each attribute whether a value is present in each of the sources and apply the ranking to determine the weight of this value. The value with the most weight ends up in the final data set.
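
A minimal sketch of ranked merging is shown below. It assumes the simplest reading of the rule: for each attribute, the highest-ranked source that has a non-missing value wins. The attribute names, the rankings, and the definition of "missing" (None or an empty string) are hypothetical.

```python
from typing import Any, Dict, List

# Hypothetical ranking per attribute, most trusted source first.
SOURCE_RANKING: Dict[str, List[str]] = {
    "website": ["A", "B", "C"],
    "start_date": ["C", "A"],
}


def is_missing(value: Any) -> bool:
    """Assumption: None and the empty string count as missing."""
    return value is None or value == ""


def merge_records(records: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Merge matched records (keyed by source name) into one row."""
    merged: Dict[str, Any] = {}
    for attribute, ranked_sources in SOURCE_RANKING.items():
        merged[attribute] = None
        for source in ranked_sources:
            value = records.get(source, {}).get(attribute)
            if not is_missing(value):
                merged[attribute] = value  # best-ranked present value wins
                break
    return merged
```

A different reading of "weight" would let several lower-ranked sources that agree on a value outvote a single higher-ranked source. This sketch does not implement that variant.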
After we have matched and merged all available data sources, each row gets its ICO Data Superset ID. Since we still have the keys of the linked original data sources, we can process updates from all individual data sources very easily.
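
The link between source keys and the Superset ID can be pictured as a simple lookup table. In the sketch below, the use of a random UUID as the ID format and the (source, key) tuple layout are assumptions for illustration only.

```python
import uuid
from typing import Dict, List, Tuple

# (source name, key of the row in that source); hypothetical layout.
SourceKey = Tuple[str, str]


def assign_superset_ids(match_groups: List[List[SourceKey]]) -> Dict[SourceKey, str]:
    """Give every group of matched source rows one shared ID and index it by source key."""
    index: Dict[SourceKey, str] = {}
    for group in match_groups:
        superset_id = str(uuid.uuid4())  # assumed ID format
        for source_key in group:
            index[source_key] = superset_id
    return index


# Rows a-1 (source A) and b-7 (source B) were matched; c-3 (source C) stands alone.
index = assign_superset_ids([[("A", "a-1"), ("B", "b-7")], [("C", "c-3")]])
```

When source B later delivers an update for row b-7, looking up ("B", "b-7") in the index returns the Superset ID of the merged row that needs to be refreshed.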