Record Linkage
Due: Friday, Feb 26th at 4pm
You must work alone on this assignment.
In this assignment, you will take a pair of datasets containing restaurant names and addresses and link
them, i.e., find records in the two datasets that refer to the same restaurant. This task is non-trivial when
there are discrepancies in the names and addresses of the same restaurants across the two data sources.
This assignment will give you experience in probabilistic record linkage, a technique that you may find
valuable for your project and when working with real-world data in the future.
Here is an overview of the tasks you will perform. Some of these will make more sense after you have read
the full description.
- The restaurant datasets are tables of names and addresses provided by the restaurant review companies Zagat and Fodor’s, and are in the files zagat.csv and fodors.csv. Your first task is to create a pandas dataframe for each dataset consisting of four columns: (i) the sequentially-numbered index from the file, (ii) the restaurant name, (iii) the city, and (iv) the address (except for the city). Call the dataframes zagat and fodors.
- For record linkage, you will use a training dataset of known links, available in the file known_links.csv. For a given row, the first column indicates the index of a restaurant in the Zagat dataset, and the second column indicates the index of the same restaurant in the Fodor’s dataset. Your next task is to read this dataset into a pandas dataframe and generate a dataframe of the three string (non-index) columns from the Zagat dataset, and the three columns from the Fodor’s dataset, for the same restaurants. Call a pair from the known links a match, and the dataframe matches.
- You will also need a dataset of pairs that are not matches. We will call such a pair an unmatch. You will create this by choosing random pairs from zagat and fodors. Call this the unmatches dataframe. (The terms match and unmatch follow the convention in record linkage.)
- You will use the Jaro-Winkler distance to measure the similarity between a field in a pair of records (one each from zagat and fodors). We have provided a set of histograms that show the distribution of these scores, for each of the fields (name, city, and address), for matches and for unmatches. You should examine these histograms to understand the rationale for the algorithm we will adopt.
- To link records, you will use (a simplified version of) the probabilistic record linkage approach. In this part, you will represent the distances between the fields of a pair of restaurants as a tuple of three values. Each field will be rated as having a “low”, “medium”, or “high” similarity across the two datasets. (It turns out that categorizing the similarity values into three buckets works better than the two that would result from a simple cutoff.) Since each tuple has three slots, and each can be one of three values, there are 27 possible tuples. Your next task is to count the occurrences of each tuple for the matches and the unmatches. Here you will use the matches and unmatches dataframes constructed earlier.
- Next, you will place the 27 different possible tuples into three categories: match_tuples, possible_tuples, and unmatch_tuples. If a test pair maps to a tuple in the first set, we classify it as a match; if it maps to a tuple in the second set, we classify it as a possible match; otherwise, we classify it as an unmatch. (We will trust the decisions made by our algorithm for the clear matches and unmatches, but possible matches need to be inspected manually.)
- It will then be time to find matches. You will iterate through all potential pairs from the two datasets, for each rank the similarities of the three fields and generate the corresponding tuple, and then determine the category to which it belongs. Based on the set, you will mark the pair as a match, possible match, or unmatch. You will print the number of each type you find, and also return all matches, possible matches, and unmatches from your overall function.
- Iterating through all possible pairs is time consuming in general. In the final step, you will repeat the above algorithm, but only consider potential pairs that also have the same value for city.
Create restaurant dataframes
The restaurant datasets are tables of names, cities, and addresses provided by the restaurant review companies Zagat and Fodor’s, and are in the files zagat.csv and fodors.csv. Your first task is to create a pandas dataframe for each consisting of four columns: (i) the index number from the csv file, (ii) the restaurant name, (iii) the city, and (iv) the address (except for the city).

The review companies actually provide the restaurant information as a single string. We have split this
string into the name, city, and address using regular expressions for you. Because an automated process
was used, there may be restaurants for which this extraction process was performed imperfectly. This is
not an ideal situation, but it is realistic, and the data is in sufficiently good shape to use for the probabilistic algorithm we will employ.
It is best not to name the restaurant name column name, because pandas dataframes have a predefined
attribute called name, and you get unexpected results if you use the dataframename.columnname style
of referring to a column.
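As a starting point, here is a minimal sketch of loading the two files. It assumes the CSVs have no header row and list the columns in the order index, name, city, address; adjust the names= argument (or use header=0) to match the actual files. The column names themselves are our own choice, not required names (note restaurant_name rather than name, per the warning above).

    import pandas as pd

    # Hypothetical column names; the files' actual layout may differ.
    COLS = ["index", "restaurant_name", "city", "address"]

    zagat = pd.read_csv("zagat.csv", names=COLS)
    fodors = pd.read_csv("fodors.csv", names=COLS)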
Create the matches dataframe
For record linkage, you will use a training dataset of known links, available in the file known_links.csv.
It contains 50 manually-linked pairs of restaurants. In other words, a human has chosen some of the rows
in one dataset, and determined the corresponding rows in the other dataset. The first restaurant in the pair
is from Zagat and the second from Fodor’s. Rows in the file correspond to these pairs, and the two columns give the indices for a restaurant within the corresponding files.
Using this file and the other dataframes, create a new dataframe called matches with six columns: the
name, city, and address from Zagat, and the corresponding three fields from Fodor’s.
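One possible shape for this step is sketched below. The pair_rows helper and the z_/f_ column prefixes are our own conventions, not required names, and we assume known_links.csv has no header row.

    FIELDS = ["restaurant_name", "city", "address"]

    def pair_rows(z_rows, f_rows):
        """Place the Zagat fields and Fodor's fields of paired rows
        side by side, yielding a six-column dataframe."""
        z_part = z_rows[FIELDS].reset_index(drop=True)
        f_part = f_rows[FIELDS].reset_index(drop=True)
        z_part.columns = ["z_" + c for c in FIELDS]
        f_part.columns = ["f_" + c for c in FIELDS]
        return pd.concat([z_part, f_part], axis=1)

    links = pd.read_csv("known_links.csv", names=["z_index", "f_index"])
    matches = pair_rows(zagat.loc[links["z_index"]],
                        fodors.loc[links["f_index"]])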
Create the unmatches dataframe
It turns out that for record linkage, you also need a dataset of unmatches. For this, choose 1000 random
restaurants from zagat and pair them with 1000 random restaurants from fodors. To make random
choices, the pandas sample function for sampling a dataframe is suitable.
Technically, the random pairs may include some pairs that actually match, but the overwhelming number
will be unmatches, and we will ignore this minor source of error.
Why are we using 1000 random pairs for unmatches, when we only have 50 for matches? Because we can, and the more pairs we have, the better the accuracy of our record linkage. Of course, having more matches would also have been helpful, but that requires expensive manual work.
The random choices you make will influence the rest of your results. This makes it hard to debug a program if the bug only shows up for some random choices. Furthermore, making the same random choices in your program as we do in ours makes it easier for us to check your results. Therefore, we want you to seed the random generator with a fixed number. Specifically, use the following calls to choose the samples:

    zs = zagat.sample(1000, replace = True, random_state = 1234)
    fs = fodors.sample(1000, replace = True, random_state = 5678)

These calls will each always return the same 1000 rows because the random state is fixed through the seeds 1234 and 5678. In normal circumstances, once your code is working, you would remove these hard-coded seeds. But we ask that you leave them in to facilitate testing of your submitted code.
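Given the hypothetical pair_rows helper sketched in the previous section, building the unmatches dataframe is then one line:

    # Pair the i-th sampled Zagat row with the i-th sampled Fodor's row.
    unmatches = pair_rows(zs, fs)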
Explore Jaro-Winkler scores in histograms
To link records, we take a potential pair, one each from zagat and fodors, and decide if their strings are
sufficiently similar. Our results are better when the strings are broken up into natural components (like
name, city, and address), which we have done for you.
When the components are strings, a natural measure of similarity, or rather the distance between two
strings, is the edit distance, also known as the Levenshtein distance. It is defined as the minimum number
of single-character insertions, deletions, and substitutions required to convert one string to another. In
practice, however, the Levenshtein distance is computationally intensive, and a related measure, called the
Jaro-Winkler distance is used instead.
The exact definition of the Jaro-Winkler distance is somewhat technical (but for those interested, available
here). Also, although it is called a distance, it actually measures the similarity between two strings: a complete match (“Medici” and “Medici”) has a score of 1, and a complete mismatch (“Medici” and “Subway”)
has a score of 0.
You will use the Python library jellyfish to compute the Jaro-Winkler distance between two strings. Install it using the standard:

    pip3 install --user --upgrade jellyfish

To find the Jaro-Winkler distance, use, e.g.,:

    import jellyfish
    score = jellyfish.jaro_winkler("medici", "noodles etc.")
To make sense of the algorithm we will be using for record linkage, you need to gain an understanding of
the distribution of the Jaro-Winkler distance between names, cities, and addresses of pairs of restaurants,
both when the pairs are matches and when they are unmatches.
Here is a link to a PDF file containing six histograms, labelled appropriately. The first pair of histograms
is for the Jaro-Winkler distances between names for restaurants, one each for matches and unmatches.
The second and third are similar, except for cities and addresses.
Estimate Tuple probabilities given match or unmatch
Observe that the probability of a match does not increase monotonically with the similarity for a field;
rather, there are clusters in which this probability is high. For instance, for the address field, the probability is lower in the range 0.80-0.95 than just outside this range. The clusters are even more pronounced for
the unmatch pairs.
The above implies that a single “cut-off” on the similarity for a field, or even a group of fields, will not serve our purpose. So we break up the range of similarity values into discrete chunks, which we will call “low”, “medium”, and “high”, and determine the probability of a match for each combination of field-wise chunks separately.
We have adopted (a simplified version of) the probabilistic record linkage approach proposed by Fellegi
and Sunter. Provided in utils.py is a simple utility function get_jw_category() that takes a Jaro-Winkler distance and returns the string “low”, “medium”, or “high”, essentially breaking the range of the Jaro-Winkler score into three blocks. We apply this function to all three fields: names, cities, and addresses.
Thus for any pair of restaurants, we can create a tuple (name_category, city_category, address_category)
that represents the similarity of the fields for that pair. Whether, during record linkage, the pair should be
classified as a match or unmatch, or something in between, will depend on whether that tuple was most
closely associated with matches or unmatches when we trained the algorithm, and our tolerance for error.
Specifically, we determine whether a tuple should be classified as a match, possible match, or unmatch,
based on estimates of the probability that a matched pair results in that tuple as well as the probability
that an unmatched pair results in that tuple. Formally, assume that (r1, r2) is a potential pair, t(r1, r2) is the tuple formed from its field similarities, and T is the set of all possible tuples. For every w ∈ T we need estimates for two quantities:

- m(w), defined as P[t(r1, r2) = w | (r1, r2) is a match], and
- u(w), defined as P[t(r1, r2) = w | (r1, r2) is an unmatch].

Compute estimates for the former by iterating through all the pairs in matches, determining their tuples, and counting the frequency of each of the 27 possible tuples (combinations of the three similarity levels in the three different positions) during this process. The relative frequency for a tuple w is the estimate of the probability m(w) for w given a matching pair. Similarly, find an estimate of u(w), the probability given an unmatch, by iterating through the tuples for all pairs in unmatches.
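As a concrete illustration, here is a sketch of these frequency estimates under the conventions of the earlier sketches (the z_/f_ column names and FIELDS are our hypothetical layout; get_jw_category() is the provided utility):

    import itertools
    import jellyfish
    from utils import get_jw_category

    CATEGORIES = ["low", "medium", "high"]
    ALL_TUPLES = list(itertools.product(CATEGORIES, repeat=3))  # 27 tuples

    def similarity_tuple(z_vals, f_vals):
        """Similarity tuple for one pair: a category per field."""
        return tuple(get_jw_category(jellyfish.jaro_winkler(z, f))
                     for z, f in zip(z_vals, f_vals))

    def tuple_probabilities(pairs_df):
        """Estimate m(w) or u(w) as relative frequencies over pairs_df."""
        counts = {w: 0 for w in ALL_TUPLES}
        for _, row in pairs_df.iterrows():
            z_vals = [row["z_" + f] for f in FIELDS]
            f_vals = [row["f_" + f] for f in FIELDS]
            counts[similarity_tuple(z_vals, f_vals)] += 1
        return {w: c / len(pairs_df) for w, c in counts.items()}

    m = tuple_probabilities(matches)    # estimates of m(w)
    u = tuple_probabilities(unmatches)  # estimates of u(w)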
Partition Tuples into match, possible match, and unmatch sets

How should we decide which tuples belong in which of the three different categories: match_tuples, possible_tuples, and unmatch_tuples? It depends on our tolerance for two kinds of errors:

- false positive rate, the probability that we classify an actual unmatch as a match, and
- false negative rate, the probability that we classify an actual match as an unmatch.

Assume that μ is the maximum false positive rate and λ is the maximum false negative rate we are willing to accept, and, given these are satisfied, we want to maximize the number of our matches.

In order to classify which tuples should be associated with matches, unmatches, and possible matches, we do the following:

- Assign all tuples w that have m(w) = 0 and u(w) = 0 to possible_tuples. Intuition: We have not seen either matches or unmatches with these similarity levels in our training data, so our algorithm cannot make any inferences about them, and human intervention will be required.
- Sort the rest of the 27 tuples such that the tuples with u(w) = 0 are at the front of the list, and the remaining are in decreasing (non-increasing) order of m(w)/u(w). Intuition: We want a ranked order of how likely a tuple is to be associated with a match. If we have never seen a tuple associated with an unmatch in our training data, then we will assume that any pair with the same tuple is a match. We prioritize these “especially good” tuples in our sorted order. After those, we rank the tuples in terms of the ratio of how often they have been associated with matches versus the number of times they have been associated with unmatches, and put the ones most likely to be associated with matches earlier.
- Starting at the first tuple in this sorted order, find the last tuple w1 such that the cumulative sum of u(w) values (for w from the first tuple to w1) is at most μ. Add all tuples from the first to w1 (inclusive) to match_tuples. Intuition: We want to limit ourselves to no more than a specified false positive rate (μ). We have sorted the tuples in order of their being most likely to be associated with matches. We take as many as we can, starting with the best ones, until we are at the point where taking another one would likely cause us to exceed our false positive rate, because the likelihood that a pair is actually an unmatch has become too high.
- Starting at the last tuple (the first tuple in reverse order), find the furthest-from-last tuple w2 such that the cumulative sum of m(w) values is at most λ. Add tuples from the last to w2 (inclusive) to unmatch_tuples. Intuition: We choose the worst tuples, in terms of being likely to be associated with a match, and assign them to be used for unmatches. If we overdo this, then we might exceed our false negative rate, so we stop just before the point where this would happen.
- Add any remaining tuples (among the ones that were sorted) to possible_tuples. Intuition: We are unwilling to let the algorithm make any decisions about pairs that have these tuples, because we have already used up our false positive and false negative budgets. Any such restaurant pairs will need to be looked at by a human.

There is a subtlety in this description: the above only works when w1 happens to be ahead of w2. If they “cross over” instead, then this means that some of the tuples could sensibly be used to assign matches or unmatches, and either way, we would not exceed our targets for false positives or false negatives.

What should we do if faced with this situation? It comes down to our priorities. If we want to prioritize having more matches (and we do), then we should count all of these ambiguous tuples as being associated with matches. Remember, in this scenario, this is okay, since we still will not exceed our false positive rate in the process.
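A sketch of this partitioning under the same assumptions as before (m and u are dicts over the 27 tuples, as in the earlier sketch). Because the forward pass runs first, tuples in a cross-over region are counted as matches, as the text prescribes:

    def partition_tuples(m, u, mu, lambda_):
        """Split the 27 tuples into match, possible, and unmatch sets."""
        possible = {w for w in ALL_TUPLES if m[w] == 0 and u[w] == 0}
        rest = [w for w in ALL_TUPLES if w not in possible]
        # Tuples with u(w) == 0 first, then decreasing m(w)/u(w).
        rest.sort(key=lambda w: (u[w] != 0, -(m[w] / u[w]) if u[w] else 0))

        match_set, unmatch_set = set(), set()
        cum_u, i = 0.0, 0
        # Forward pass: take tuples while cumulative u(w) stays within mu.
        while i < len(rest) and cum_u + u[rest[i]] <= mu:
            cum_u += u[rest[i]]
            match_set.add(rest[i])
            i += 1
        cum_m, j = 0.0, len(rest) - 1
        # Backward pass: take tuples while cumulative m(w) stays within
        # lambda_, never crossing into tuples already taken as matches.
        while j >= i and cum_m + m[rest[j]] <= lambda_:
            cum_m += m[rest[j]]
            unmatch_set.add(rest[j])
            j -= 1
        possible |= set(rest[i:j + 1])  # anything left in the middle
        return match_set, possible, unmatch_set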
Find matches, possible matches, and unmatches
Once you have the three sets for given maximum false positive and false negative rates, simply iterate through every pair in zagat and fodors to create three DataFrames of matches, possible matches, and unmatches. All three DataFrames should have six columns: the name, city, and address from Zagat, followed by the name, city, and address from Fodor’s. Return these three DataFrames from the find_matches function, in the order above.
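A sketch of this final pass, reusing the hypothetical helpers from the previous sections (similarity_tuple, FIELDS, and the z_/f_ column convention):

    def classify_pairs(zagat, fodors, match_set, possible_set):
        """Label every Zagat/Fodor's pair and return three six-column
        DataFrames: matches, possible matches, unmatches."""
        buckets = {"match": [], "possible": [], "unmatch": []}
        for _, z_row in zagat.iterrows():
            for _, f_row in fodors.iterrows():
                w = similarity_tuple([z_row[f] for f in FIELDS],
                                     [f_row[f] for f in FIELDS])
                if w in match_set:
                    label = "match"
                elif w in possible_set:
                    label = "possible"
                else:
                    label = "unmatch"
                buckets[label].append(list(z_row[FIELDS]) +
                                      list(f_row[FIELDS]))
        cols = ["z_" + f for f in FIELDS] + ["f_" + f for f in FIELDS]
        return tuple(pd.DataFrame(buckets[k], columns=cols)
                     for k in ("match", "possible", "unmatch"))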
Block on City
Iterating through all potential pairs is computationally prohibitive for large datasets. So the number of potential pairs is often reduced by applying a constraint, such as only considering pairs that have the same value for a blocking key. Implement a version of the record linkage algorithm that only considers the restaurants that have the exact same city in both datasets.
This version will be extremely similar to the previous one; thus, you should have one implementation that
simply takes whether to block on city, or not, as a parameter, and proceeds accordingly.
Note: Do not change how you assess the training data in the version of the algorithm that performs blocking. Compute the frequencies of different tuples, and the partitioning of the 27 tuples, exactly as before.
The only change should be that, when you iterate through all pairs of restaurants, you disregard any pair
where the two entries do not have identical cities.
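In terms of the hypothetical classify_pairs sketch above, blocking would amount to one extra parameter and one extra check at the top of the inner loop (elided parts unchanged):

    def classify_pairs(zagat, fodors, match_set, possible_set,
                       block_on_city=False):
        ...
        for _, z_row in zagat.iterrows():
            for _, f_row in fodors.iterrows():
                # When blocking, skip pairs whose cities are not identical.
                if block_on_city and z_row["city"] != f_row["city"]:
                    continue
                ...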
Task checklist
Write Python code, with appropriate auxiliary functions and documentation, that implements the following function:

    find_matches(mu, lambda_, block_on_city)

which takes mu, the maximum false positive rate, lambda_, the maximum false negative rate, and block_on_city, a boolean that indicates whether to block on the city or not. It should return a 3-tuple containing DataFrames with the matches, possible matches, and unmatches it found, in that order.

In particular, ensure you:

- Create zagat, fodors, matches, and unmatches.
- Determine relative frequencies for the 27 tuples corresponding to pairs from matches and unmatches.
- Sort tuples and create the sets match_tuples, possible_tuples, and unmatch_tuples.
- Generate matches, possible matches, and unmatches for all potential pairs from zagat and fodors, possibly after blocking on city.
- Return the generated matches, possible matches, and unmatches in DataFrames.
Getting started
We have seeded your repository with a directory for this assignment. To pick it up, change to your
cmsc12200-win-21-username directory (where the string username should be replaced with your username), run git pull to make sure that your local copy of the repository is in sync with the server, and
then run git pull upstream master to pick up the distribution.
Submission
Follow the usual submission instructions.
Acknowledgement
This assignment was originally designed and written by Amitabh Chaudhary.