Alias Matching
This is an adaptation of the alias merging algorithm from Mining Email Social Networks. The method is straightforward: a complete weighted graph is built with contributors as nodes, where the weight of each edge represents the similarity between two users, computed with string comparison techniques. The graph is then split into clusters based on the distance between nodes.
In the Gerrit data, contributors can have 3 features: name, login, and e-mail.
Name normalization. We remove all punctuation, suffixes, and prefixes from the names. Titles and clarification terms (‘admin’, ‘support’) are also removed. The name is then converted to lowercase. Finally, the first and last names are taken as the first and last terms after splitting the name on spaces.
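The normalization steps above can be sketched as follows. This is a minimal illustration, not the library's implementation: the function name `normalize_name` and the `NOISE_TERMS` set are hypothetical, and only ‘admin’ and ‘support’ come from the text. Lowercasing is done first here so that the term lookup is case-insensitive; the result is the same as lowercasing last.

```python
import string

# Hypothetical term list: only 'admin' and 'support' are named in the text,
# the remaining entries are illustrative titles and suffixes.
NOISE_TERMS = {"admin", "support", "mr", "mrs", "ms", "dr", "jr", "sr"}

# Maps every ASCII punctuation character to a space.
_PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def normalize_name(raw):
    """Return (full, first, last) for a raw display name."""
    lowered = raw.lower()
    no_punct = lowered.translate(_PUNCT_TO_SPACE)
    # Drop titles, suffixes, and clarification terms.
    terms = [t for t in no_punct.split() if t not in NOISE_TERMS]
    if not terms:
        return "", "", ""
    # First and last names are the first and last remaining terms.
    return " ".join(terms), terms[0], terms[-1]


print(normalize_name("Dr. John A. Smith (Admin)"))
```

Note that `string.punctuation` covers ASCII punctuation only; names with other punctuation would need a broader rule.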
Name similarity. We measure the similarity between two names with the Levenshtein distance normalized to the [0, 1] interval. The distance is calculated separately between full names, first names, and last names. The resulting name distance is the minimum of two values: the distance between full names, and the average of the first-name and last-name distances.
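A small sketch of this measure, under stated assumptions: the text does not say how the Levenshtein distance is normalized, so dividing by the length of the longer string is an assumption, as is treating an empty value as maximally distant. The function names are hypothetical, and names are passed as `(full, first, last)` tuples of already normalized strings.

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    """Edit distance scaled to [0, 1] by the longer string (an assumption)."""
    if not a or not b:
        # Assumption: a missing value never counts as a match.
        return 1.0
    return levenshtein(a, b) / max(len(a), len(b))


def name_distance(name1, name2):
    """Minimum of the full-name distance and the averaged first/last distance."""
    full = normalized_distance(name1[0], name2[0])
    first = normalized_distance(name1[1], name2[1])
    last = normalized_distance(name1[2], name2[2])
    return min(full, (first + last) / 2)
```

With this reading, ‘john smith’ and ‘john a smith’ are at distance 0, because their first and last names match exactly even though the full names differ.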
Name-email similarity. The distance between a name and an e-mail is a binary (0 or 1) measure. The distance is 0 if and only if the e-mail consists of the full name or of the first and last name.
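The text leaves the exact matching rule open, so the sketch below shows one possible reading: separators are stripped from the part of the e-mail before @, and the distance is 0 when the result contains the full name without spaces, or the first name immediately followed by the last name. The function names and this containment rule are assumptions, not the library's confirmed behavior.

```python
import re


def email_base(email):
    """Lowercased part of the address before '@'."""
    return email.split("@", 1)[0].lower()


def name_email_distance(full, first, last, email):
    """Return 0 if the e-mail base matches the name, otherwise 1."""
    # Keep only letters and digits so 'john.smith' and 'john_smith' compare equal.
    base = re.sub(r"[^a-z0-9]", "", email_base(email))
    if not base or not full:
        return 1
    candidates = {full.replace(" ", ""), first + last}
    return 0 if any(c and c in base for c in candidates) else 1


print(name_email_distance("john a smith", "john", "smith", "John.Smith@example.com"))
```

A containment test is permissive for very short names; a stricter equality test is an equally valid reading of the description.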
E-mail similarity. We calculate the normalized Levenshtein distance between the two bases (the part of the e-mail preceding @).
Login similarity. Normalized Levenshtein distance between two logins.
Login-email similarity. Levenshtein distance between the base of the e-mail and the login.
Name-login similarity. Same as the name-email distance.
User identifier distance. The resulting distance between two users is the minimum of the metrics above.
After the distances are calculated, we split users into clusters via agglomerative clustering with complete linkage. All developers in one cluster are treated as the same person and mapped to a single resulting id.
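To illustrate what complete linkage means here, the following pure-Python sketch clusters items from a precomputed distance matrix. It is not the library's code: the function name is hypothetical, and stopping when the smallest linkage distance reaches the threshold (rather than strictly exceeds it) is an assumption.

```python
def complete_linkage_clusters(dist, threshold):
    """Cluster items 0..n-1 from a symmetric distance matrix.

    Repeatedly merges the two clusters whose complete-linkage distance
    (the largest pairwise distance between their members) is smallest,
    and stops once that distance is at or above `threshold`.
    Returns one cluster label per item.
    """
    clusters = [[i] for i in range(len(dist))]

    def linkage(c1, c2):
        return max(dist[i][j] for i in c1 for j in c2)

    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        if d >= threshold:
            break
        # Merge cluster b into cluster a (a < b, so index a stays valid).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(dist)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels


# Users 0 and 1 are close, 1 and 2 are close, but 0 and 2 are far apart.
chain = [[0.0, 0.05, 0.5],
         [0.05, 0.0, 0.05],
         [0.5, 0.05, 0.0]]
print(complete_linkage_clusters(chain, 0.1))
```

Because complete linkage uses the largest pairwise distance, user 2 is not pulled into the cluster of users 0 and 1 in this example: every member of a cluster must be close to every other member, which prevents chains of loosely similar aliases from merging.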
- class aliasmatching.BirdMatching(distance_threshold=0.1, name_coef=1, email_name_coef=1, email_coef=1, login_coef=1, login_email_coef=0, login_name_coef=0)
Adaptation of the approach from ‘Mining Email Social Networks’. The algorithm measures pairwise similarity between all participants and splits them into clusters with agglomerative clustering. All distances are values from 0 to 1. For each measure, <measure>_coef (which should be from 0 to 1) adjusts the influence of the measure: adjusted_measure = 1 - (1 - measure) * <measure>_coef
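The effect of the coefficients is easiest to see on a small example. The sketch below applies the adjustment formula to a few distances and then takes the minimum, as described for the user identifier distance. The helper names and dictionary keys are hypothetical, and combining the adjusted values exactly this way inside the class is an assumption.

```python
def adjust(distance, coef):
    """Apply adjusted_measure = 1 - (1 - measure) * coef."""
    return 1 - (1 - distance) * coef


def combined_distance(distances, coefs):
    """Minimum of the adjusted distances; keys are illustrative measure names."""
    return min(adjust(distances[key], coefs[key]) for key in distances)


distances = {"name": 0.3, "login_email": 0.0}
coefs = {"name": 1, "login_email": 0}

# coef = 1 leaves a distance unchanged; coef = 0 maps any distance to 1,
# so that measure can never be the minimum.
print(combined_distance(distances, coefs))
```

Under this formula a coefficient of 0 effectively switches a measure off, which suggests that the default values login_email_coef=0 and login_name_coef=0 leave those two measures out of the result.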
- Parameters:
users – dataframe that contains the columns ‘name’, ‘login’, and ‘email’ with the corresponding information. If a ‘full_id’ column is present, it is treated as the id of each user; otherwise a custom id is constructed and added to the DataFrame
distance_threshold – distance parameter for clustering
name_coef – weight for name similarity
email_name_coef – weight for name-email similarity
email_coef – weight for e-mail similarity
login_coef – weight for login similarity
login_email_coef – weight for login-email similarity
login_name_coef – weight for login-name similarity
- distance(u1, u2)
Computes the distance between two users
- Parameters:
u1 – features of user_1
u2 – features of user_2
- Returns:
distance between two users
- get_clusters(users)
Splits user ids into clusters, each of which represents one user with different aliases
- Parameters:
users – dataframe with users names, emails, and logins
distance_threshold – distance parameter for clustering
- Returns:
dict which provides cluster id for each user
- process(users)
Assigns a cluster number to each user_id. Ids with the same cluster number should be treated as the same user
- Parameters:
users – dataframe with users names, emails, and logins
- Returns:
users dataframe with an added ‘cluster’ column. Users that are deemed to be the same person have the same number in the ‘cluster’ column