Record Linkage
Due: Friday, Feb 26th at 4pm
You must work alone on this assignment.
In this assignment, you will take a pair of datasets containing restaurant names and addresses and link
them, i.e., find records in the two datasets that refer to the same restaurant. This task is non-trivial when
there are discrepancies in the names and addresses of the same restaurants across the two data sources.
This assignment will give you experience in probabilistic record linkage, a technique that you may find
valuable for your project and when working with real-world data in the future.
Here is an overview of the tasks you will perform. Some of these will make more sense after you have read
the full description.

  1. The restaurant datasets are tables of names and addresses provided by the restaurant review companies Zagat and Fodor’s, and are in the files zagat.csv and fodors.csv. Your first task is to create a
    pandas dataframe for each dataset consisting of four columns: (i) the sequentially-numbered index
    from the file, (ii) the restaurant name, (iii) the city, and (iv) the address (except for the city). Call the
    dataframes zagat and fodors.
  2. For record linkage, you will use a training dataset of known links, available in the file
    known_links.csv. For a given row, the first column indicates the index of a restaurant in the Zagat
    dataset, and the second column indicates the index of the same restaurant in the Fodor’s dataset. Your
    next task is to read this dataset into a pandas dataframe and generate a dataframe of the three string
    (non-index) columns from the Zagat dataset, and the three columns from the Fodor’s dataset, for the
    same restaurants. Call a pair from the known links a match, and the dataframe matches.
  3. You will also need a dataset of pairs that are not matches. We will call such a pair an unmatch. You will
    create this by choosing random pairs from zagat and fodors. Call this the unmatches dataframe. (The
terms match and unmatch follow the convention in record linkage.)
  4. You will use the Jaro-Winkler distance to measure the similarity between a field in a pair of records
    (one each from zagat and fodors). We have provided a set of histograms that show the distribution of these scores, for each of the fields (name, city, and address), for matches and for unmatches.
    You should examine these histograms to understand the rationale for the algorithm we will adopt.
  5. To link records, you will use (a simplified version of) the probabilistic record linkage approach. In this
    part you will represent the distances between the fields of a pair of restaurants as a tuple of three values. Each field will be rated as having a “low”, “medium”, or “high” similarity across the two datasets.
    (It turns out that categorizing the similarity values into three buckets works better than the two that
    would result from a simple cutoff.) Since each tuple has three slots, and each can be one of three values, there are 27 possible tuples. Your next task is to count the occurrence of each tuple for the matches and the unmatches. Here you will use the matches and unmatches dataframes constructed earlier.
  6. Next, you will place the 27 different possible tuples into three categories: match_tuples,
    possible_tuples, and unmatch_tuples. If a test pair maps to a tuple in the first set, we classify it as a
    match; if it maps to a tuple in the second set, we classify it as a possible match; else we classify it as an
    unmatch. (We will trust the decisions made by our algorithm for the clear matches and unmatches, but
    possible matches need to be inspected manually.)
  7. It will then be time to find matches. You will iterate through all potential pairs from the two datasets,
    for each rank the similarities of the three fields and generate the corresponding tuple, and then determine the category to which it belongs. Based on that set, you will mark it as a match, possible match, or
    unmatch. You will print the number of each type you find, and also return all matches, possible matches, and unmatches from your overall function.
  8. Iterating through all possible pairs is time consuming in general. In the final step, you will repeat the
    above algorithm, but only consider potential pairs that also have the same value for city.
    Create restaurant dataframes
    The restaurant datasets are tables of names, cities, and addresses provided by the restaurant review companies Zagat and Fodor’s, and are in the files zagat.csv and fodors.csv. Your first task is to create a pandas dataframe for each consisting of four columns: (i) the index number from the csv file, (ii) the restaurant
    name, (iii) the city, and (iv) the address (except for the city).
    The review companies actually provide the restaurant information as a single string. We have split this
    string into the name, city, and address using regular expressions for you. Because an automated process
    was used, there may be restaurants for which this extraction process was performed imperfectly. This is
    not an ideal situation, but it is realistic, and the data is in sufficiently good shape to use for the probabilistic algorithm we will employ.
    It is best not to name the restaurant name column name, because pandas dataframes have a predefined
    attribute called name, and you get unexpected results if you use the dataframename.columnname style
    of referring to a column.
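
    A minimal loading sketch, assuming each csv file has no header row and its columns appear in the order index, name, city, address (check the actual files before relying on this):

        import pandas as pd

        def load_restaurants(filename):
            # "restaurant_name" avoids the predefined pandas attribute "name".
            return pd.read_csv(filename, header=None,
                               names=["index", "restaurant_name", "city", "address"])

        zagat = load_restaurants("zagat.csv")
        fodors = load_restaurants("fodors.csv")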
    Create the matches dataframe
    For record linkage, you will use a training dataset of known links, available in the file known_links.csv.
    It contains 50 manually-linked pairs of restaurants. In other words, a human has chosen some of the rows
    in one dataset, and determined the corresponding rows in the other dataset. The first restaurant in the pair
    is from Zagat and the second from Fodor’s. Rows in the file correspond to these pairs, and the two columns give the indices for a restaurant within the corresponding files.
    Using this file and the other dataframes, create a new dataframe called matches with six columns: the
    name, city, and address from Zagat, and the corresponding three fields from Fodor’s.
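
    One way to build it, as a sketch: look up the linked rows by index and lay them side by side. This assumes known_links.csv has two header-less columns of indices; the z_/f_ column names are my convention, not a requirement:

        import pandas as pd

        links = pd.read_csv("known_links.csv", header=None,
                            names=["z_index", "f_index"])

        cols = ["restaurant_name", "city", "address"]
        # Assumes the dataframes' row labels line up with the files' index column.
        z_part = zagat.loc[links["z_index"], cols].reset_index(drop=True)
        f_part = fodors.loc[links["f_index"], cols].reset_index(drop=True)
        z_part.columns = ["z_name", "z_city", "z_address"]
        f_part.columns = ["f_name", "f_city", "f_address"]
        matches = pd.concat([z_part, f_part], axis=1)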
    Create the unmatches dataframe
    It turns out that for record linkage, you also need a dataset of unmatches. For this, choose 1000 random
    restaurants from zagat and pair them with 1000 random restaurants from fodors. To make random
    choices, the pandas sample function for sampling a dataframe is suitable.
    Technically, the random pairs may include some pairs that actually match, but the overwhelming number
    will be unmatches, and we will ignore this minor source of error.
    Why are we using 1000 random pairs for unmatches, when we only have 50 for matches? Because we can,
    and the more pairs we have, the better the accuracy of our record linkage. Of course, having more matches
    would also have been helpful, but that requires expensive manual work.
    The random choices you make will influence the rest of your results. This makes it hard to debug a program if the bug only shows up for some random choices. Furthermore, making the same random choices
    in your program as we do in ours makes it easier for us to check your results. Therefore, we want you to
    seed the random generator with a fixed number. Specifically, use the following calls to choose the samples:

        zs = zagat.sample(1000, replace = True, random_state = 1234)
        fs = fodors.sample(1000, replace = True, random_state = 5678)

    These calls will each always return the same 1000 rows because the random state is fixed through the
    seeds 1234 and 5678. In normal circumstances, once your code is working, you would remove these hardcoded seeds. But we ask that you leave them in to facilitate testing of your submitted code.
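
    Given zs and fs, assembling unmatches can mirror the matches construction (again a sketch, with the same hypothetical column names):

        cols = ["restaurant_name", "city", "address"]
        zs = zs[cols].reset_index(drop=True)
        fs = fs[cols].reset_index(drop=True)
        zs.columns = ["z_name", "z_city", "z_address"]
        fs.columns = ["f_name", "f_city", "f_address"]
        unmatches = pd.concat([zs, fs], axis=1)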
    Explore Jaro-Winkler scores in histograms
    To link records, we take a potential pair, one each from zagat and fodors, and decide if their strings are
    sufficiently similar. Our results are better when the strings are broken up into natural components (like
    name, city, and address), which we have done for you.
    When the components are strings, a natural measure of similarity, or rather the distance between two
    strings, is the edit distance, also known as the Levenshtein distance. It is defined as the minimum number
    of single-character insertions, deletions, and substitutions required to convert one string to another. In
    practice, however, the Levenshtein distance is computationally intensive, and a related measure, called the
    Jaro-Winkler distance, is used instead.
    The exact definition of the Jaro-Winkler distance is somewhat technical (but for those interested, available
    here). Also, although it is called a distance, it actually measures the similarity between two strings: a complete match (“Medici” and “Medici”) has a score of 1, and a complete mismatch (“Medici” and “Subway”)
    has a score of 0.
    You will use the Python library jellyfish to compute the Jaro-Winkler distance between two strings. Install
    it using the standard:

        pip3 install --user --upgrade jellyfish

    To find the Jaro-Winkler distance, use, e.g.:

        import jellyfish
        score = jellyfish.jaro_winkler("medici", "noodles etc.")
    To make sense of the algorithm we will be using for record linkage, you need to gain an understanding of
    the distribution of the Jaro-Winkler distance between names, cities, and addresses of pairs of restaurants,
    both when the pairs are matches and when they are unmatches.
    Here is a link to a PDF file containing six histograms, labelled appropriately. The first pair of histograms
    is for the Jaro-Winkler distances between names for restaurants, one each for matches and unmatches.
    The second and third are similar, except for cities and addresses.
    Estimate Tuple probabilities given match or unmatch
    Observe that the probability of a match does not increase monotonically with the similarity for a field;
    rather, there are clusters in which this probability is high. For instance, for the address field, the probability is lower in the range 0.80-0.95 than just outside this range. The clusters are even more pronounced for
    the unmatch pairs.
    The above implies that a single “cut-off” on the similarity for a field, or even a group of fields, will not serve
    our purpose. So we break up the range of similarity values into discrete chunks, which we will call “low”,
    “medium”, and “high”, and determine the probability of a match for each combination of field-wise
    chunks separately.
    We have adopted (a simplified version of) the probabilistic record linkage approach proposed by Fellegi
    and Sunter. Provided in utils.py is a simple utility function get_jw_category() that takes a Jaro-Winkler distance and returns the string “low”, “medium”, or “high”, essentially breaking the range of the Jaro-Winkler score into three blocks. We apply this function to all three fields: names, cities, and addresses.
    Thus for any pair of restaurants, we can create a tuple (name_category, city_category, address_category)
    that represents the similarity of the fields for that pair. Whether, during record linkage, the pair should be
    classified as a match or unmatch, or something in between, will depend on whether that tuple was most
    closely associated with matches or unmatches when we trained the algorithm, and our tolerance for error.
    Specifically, we determine whether a tuple should be classified as a match, possible match, or unmatch,
    based on estimates of the probability that a matched pair results in that tuple as well as the probability
    that an unmatched pair results in that tuple. Formally, assume that (r1, r2) is a potential pair, t(r1, r2) is
    the tuple formed from its field similarities, and T is the set of all possible tuples. For every w ∈ T we need
    estimates for two quantities:

        m(w), defined as P[t(r1, r2) = w | (r1, r2) is a match], and
        u(w), defined as P[t(r1, r2) = w | (r1, r2) is an unmatch].

    Compute estimates for the former by iterating through all the pairs in matches, determining their tuples,
    and counting the frequency of each of the 27 possible tuples (combinations of the three similarity levels in
    the three different positions) during this process. The relative frequency of a tuple w among the matches is
    the estimate of m(w), the probability of w given a matching pair. Similarly, find an estimate of u(w), the
    probability of w given an unmatch, by iterating through the tuples for all pairs in unmatches.
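
    A minimal sketch of this counting step, assuming the matches and unmatches dataframes use the hypothetical z_/f_ column names from the earlier sketches, and that get_jw_category() is the utility provided in utils.py:

        import jellyfish
        from utils import get_jw_category

        def pair_tuple(z_name, z_city, z_address, f_name, f_city, f_address):
            # Categorize the Jaro-Winkler score of each field as low/medium/high.
            return (get_jw_category(jellyfish.jaro_winkler(z_name, f_name)),
                    get_jw_category(jellyfish.jaro_winkler(z_city, f_city)),
                    get_jw_category(jellyfish.jaro_winkler(z_address, f_address)))

        def tuple_probabilities(pairs):
            # Relative frequency of each of the 27 possible tuples in a
            # dataframe of pairs; tuples that never occur are simply absent.
            counts = {}
            for row in pairs.itertuples(index=False):
                w = pair_tuple(row.z_name, row.z_city, row.z_address,
                               row.f_name, row.f_city, row.f_address)
                counts[w] = counts.get(w, 0) + 1
            return {w: c / len(pairs) for w, c in counts.items()}

        m = tuple_probabilities(matches)    # estimates of m(w)
        u = tuple_probabilities(unmatches)  # estimates of u(w)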
    Partition Tuples into match, possible match, and unmatch sets
    How should we decide which tuples belong in which of the three different categories: match_tuples,
    possible_tuples, and unmatch_tuples? It depends on our tolerance for two kinds of errors:
    false positive rate, the probability that we classify an actual unmatch as a match, and
    false negative rate, the probability that we classify an actual match as an unmatch.
    Assume that μ is the maximum false positive rate and λ is the maximum false negative rate we are willing
    to accept, and, given these are satisfied, we want to maximize the number of our matches.
    In order to classify which tuples should be associated with matches, unmatches, and possible matches, we
    do the following (a code sketch of these steps appears after the discussion below):
  1. Assign all tuples w that have m(w) = 0 and u(w) = 0 to possible_tuples. Intuition: We have not
    seen either matches or unmatches with these similarity levels in our training data, so our algorithm
    cannot make any inferences about them, and human intervention will be required.
  2. Sort the rest of the 27 tuples such that the tuples with u(w) = 0 are at the front of the list, and the remaining tuples follow in decreasing (non-increasing) order of m(w)/u(w). Intuition: We want a ranked order of
    how likely a tuple is to be associated with a match. If we have never seen a tuple associated with an unmatch in our training data, then we will assume that any pair with the same tuple is a match. We prioritize these “especially good” tuples in our sorted order. After those, we rank the tuples by the ratio of
    how often they have been associated with matches to how often they have been associated with
    unmatches, and put the ones most likely to be associated with matches earlier.
  3. Starting at the first tuple in this sorted order, find the last tuple w1 such that the cumulative sum of the
    u(w) values (for w from the first tuple through w1) is at most μ. Add all tuples from the first through w1 (inclusive) to
    match_tuples. Intuition: We want to limit ourselves to no more than the specified false positive rate (μ).
    We have sorted the tuples in order of their being most likely to be associated with matches. We take as
    many as we can, starting with the best ones, and stop at the point where taking another one would
    likely push us past our false positive rate, because the likelihood that a pair is actually an
    unmatch has become too high.
  4. Starting at the last tuple (the first tuple in reverse order), find the furthest-from-last tuple w2 such
    that the cumulative sum of the m(w) values is at most λ. Add all tuples from the last through w2 (inclusive) to
    unmatch_tuples. Intuition: We choose the worst tuples, in terms of being likely to be associated with a
    match, and use them for unmatches. If we overdo this, we might exceed our false
    negative rate, so we stop just before the point where that would happen.
  5. Add any remaining tuples (among the ones that were sorted) to possible_tuples. Intuition: We are unwilling to let the algorithm make any decisions about pairs that have these tuples, because we have already used up our false positive and false negative budgets. Any such restaurant pairs will need to be
    looked at by a human.
    There is a subtlety in this description: the above only works when w1 happens to be ahead of w2. If they
    “cross over” instead, then this means that some of the tuples could sensibly be used to assign matches or
    unmatches, and either way, we would not exceed our targets for false positives or false negatives.
    What should we do if faced with this situation? It comes down to our priorities. If we want to prioritize
    having more matches (and we do), then we should count all of these ambiguous tuples as being associated
    with matches. Remember, in this scenario this is okay, since we still will not exceed our false positive rate
    in the process.
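
    Putting the five steps together, here is a minimal sketch of the partitioning, assuming m and u are the probability dictionaries from the earlier sketch and that crossover is resolved in favor of matches as just described:

        def partition_tuples(m, u, mu, lambda_):
            levels = ("low", "medium", "high")
            all_tuples = [(a, b, c) for a in levels for b in levels for c in levels]

            # Step 1: tuples never seen in training go straight to possible_tuples.
            possible_tuples = {w for w in all_tuples
                               if m.get(w, 0) == 0 and u.get(w, 0) == 0}
            seen = [w for w in all_tuples if w not in possible_tuples]

            # Step 2: u(w) == 0 tuples first, then decreasing m(w)/u(w).
            def sort_key(w):
                uw = u.get(w, 0)
                return (0, 0) if uw == 0 else (1, -m.get(w, 0) / uw)
            ranked = sorted(seen, key=sort_key)

            # Step 3: take tuples from the front while cumulative u(w) stays <= mu.
            match_tuples, cum_u, i = set(), 0.0, 0
            while i < len(ranked) and cum_u + u.get(ranked[i], 0) <= mu:
                cum_u += u.get(ranked[i], 0)
                match_tuples.add(ranked[i])
                i += 1

            # Step 4: take tuples from the back while cumulative m(w) stays <=
            # lambda_; the j >= i guard resolves crossover in favor of matches.
            unmatch_tuples, cum_m, j = set(), 0.0, len(ranked) - 1
            while j >= i and cum_m + m.get(ranked[j], 0) <= lambda_:
                cum_m += m.get(ranked[j], 0)
                unmatch_tuples.add(ranked[j])
                j -= 1

            # Step 5: whatever is left in the middle also becomes possible.
            possible_tuples |= set(ranked[i:j + 1])
            return match_tuples, possible_tuples, unmatch_tuples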
    Find matches, possible matches, and unmatches
    Once you have the three sets for given maximum false positive and false negative rates, simply iterate
    through every pair in zagat and fodors to create three DataFrames of matches, possible matches, and unmatches. All three DataFrames should have six columns: the name, city, and address from Zagat, followed
    by the name, city, and address from Fodor’s. Return these three DataFrames from the find_matches function, in the order above.
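
    A sketch of this classification pass, reusing pair_tuple and the other hypothetical names from the earlier sketches (the block_on_city flag anticipates the blocking variant described in the next section, and is a no-op here):

        import pandas as pd

        def classify_all(zagat, fodors, match_tuples, possible_tuples,
                         unmatch_tuples, block_on_city=False):
            rows = {"match": [], "possible match": [], "unmatch": []}
            for zrow in zagat.itertuples(index=False):
                for frow in fodors.itertuples(index=False):
                    # Skip pairs whose cities differ when blocking is enabled.
                    if block_on_city and zrow.city != frow.city:
                        continue
                    w = pair_tuple(zrow.restaurant_name, zrow.city, zrow.address,
                                   frow.restaurant_name, frow.city, frow.address)
                    if w in match_tuples:
                        kind = "match"
                    elif w in possible_tuples:
                        kind = "possible match"
                    else:
                        kind = "unmatch"
                    rows[kind].append((zrow.restaurant_name, zrow.city, zrow.address,
                                       frow.restaurant_name, frow.city, frow.address))
            cols = ["z_name", "z_city", "z_address", "f_name", "f_city", "f_address"]
            return tuple(pd.DataFrame(rows[k], columns=cols)
                         for k in ("match", "possible match", "unmatch"))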
    Block on City
    Iterating through all potential pairs is computationally prohibitive for large datasets. So the number of potential pairs is often reduced by applying a constraint, such as only considering pairs that have the same
    value for a blocking key. Implement a version of the record linkage algorithm that only considers the
    restaurants that have the exact same city in both datasets.
    This version will be extremely similar to the previous one; thus, you should have a single implementation
    that takes a parameter indicating whether to block on city, and proceeds accordingly.
    Note: Do not change how you assess the training data in the version of the algorithm that performs blocking. Compute the frequencies of different tuples, and the partitioning of the 27 tuples, exactly as before.
    The only change should be that, when you iterate through all pairs of restaurants, you disregard any pair
    where the two entries do not have identical cities.
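
    In the classify_all sketch above, this change is exactly the block_on_city guard at the top of the inner loop. A quick, hypothetical way to see why blocking pays off is to compare the number of candidate pairs with and without the constraint:

        import itertools

        total = len(zagat) * len(fodors)
        same_city = sum(1 for zc, fc in itertools.product(zagat["city"], fodors["city"])
                        if zc == fc)
        print(f"all pairs: {total}, same-city pairs: {same_city}")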
    Task checklist
    Write Python code, with appropriate auxiliary functions and documentation, that implements the following
    function:

        find_matches(mu, lambda_, block_on_city)

    which takes mu, the maximum false positive rate; lambda_, the maximum false negative rate; and
    block_on_city, a boolean that indicates whether to block on the city or not. It should return a 3-tuple containing DataFrames with the matches, possible matches, and unmatches it found, in that order.
    In particular, ensure that you:
  1. Create zagat, fodors, matches, and unmatches.
  2. Determine the relative frequencies of the 27 tuples for pairs from matches and from unmatches.
  3. Sort the tuples and create the sets match_tuples, possible_tuples, and unmatch_tuples.
  4. Generate matches, possible matches, and unmatches for all potential pairs from zagat and fodors,
    possibly after blocking on city.
  5. Return the generated matches, possible matches, and unmatches in DataFrames.
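
    Under the assumptions of the earlier sketches, the pieces could fit together roughly like this (not a prescribed design; the helper names are the hypothetical ones introduced above):

        def find_matches(mu, lambda_, block_on_city):
            m = tuple_probabilities(matches)
            u = tuple_probabilities(unmatches)
            match_t, possible_t, unmatch_t = partition_tuples(m, u, mu, lambda_)
            results = classify_all(zagat, fodors, match_t, possible_t, unmatch_t,
                                   block_on_city)
            # Print the number of each type found, then return all three.
            for label, df in zip(("matches", "possible matches", "unmatches"), results):
                print(label + ":", len(df))
            return results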
    Getting started
    We have seeded your repository with a directory for this assignment. To pick it up, change to your
    cmsc12200-win-21-username directory (where the string username should be replaced with your username), run git pull to make sure that your local copy of the repository is in sync with the server, and
    then run git pull upstream master to pick up the distribution.
    Submission
    Follow the usual submission instructions.
    Acknowledgement
    This assignment was originally designed and written by Amitabh Chaudhary.
