Intelligent Systems

Inferring Missing User Attributes from User Generated Content

Users of social networks generate a great deal of information about themselves in a variety of ways. When creating an account, they share structured data such as birth date, gender, and geographical location. They also share unstructured data: text (free-text self-descriptions, blog posts, status updates, comments, etc.) and multimedia (uploaded photos and video clips). Furthermore, users form relationships with other users, explicitly as friends or followers, and/or implicitly through interactions such as commenting on each other's content. All this data provides a potentially very rich source of information for business intelligence applications that leverage it for personalisation, such as online marketing. This project focuses on the use of machine learning techniques to derive missing attribute values of users, such as age and gender, from their user generated content and activities in the social network. The project will be carried out in close cooperation with Golnoosh Farnadi of Ghent University. For more information please contact Martine De Cock (mdecock@u.washington.edu).

Author Name Disambiguation

We present a system called ALIAS that is designed to search for duplicate authors in the Microsoft Academic Search Engine dataset. Author ambiguity is a prevalent problem in this dataset, as many authors publish under several variations of their own name, and different authors may share similar or identical names. ALIAS takes an author name as input (the author may or may not exist in the corpus) and outputs a set of author names from the database that are determined to be duplicates of the input author, each with an associated confidence score. Additionally, ALIAS can return a top-k list of similar authors for a given input author name. The underlying techniques rely heavily on exhaustive feature engineering, supervised learning algorithms, partitioning, clustering, and efficient similarity search to enable fast responses for near-real-time user interaction.
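
To make the input/output behaviour described above concrete, here is a minimal Python sketch of a top-k name lookup with a score per result. It is not the ALIAS implementation: it assumes a plain character-level similarity from the standard library (difflib) as a stand-in for the learned confidence score, and it scans the whole list instead of using partitioning or an index. The function names, the threshold value, and the sample names are all hypothetical.

```python
from difflib import SequenceMatcher
import heapq


def normalize(name: str) -> str:
    """Lowercase, replace periods and commas with spaces, collapse whitespace."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return " ".join(cleaned.split())


def similarity(a: str, b: str) -> float:
    """Score in [0, 1]; 1.0 means the normalized names are identical.

    Assumption: a character-level ratio stands in for a learned score.
    """
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def top_k_similar(query: str, corpus: list[str], k: int = 3) -> list[tuple[str, float]]:
    """Return the k corpus names most similar to the query, best first."""
    scored = ((similarity(query, name), name) for name in corpus)
    best = heapq.nlargest(k, scored)
    return [(name, score) for score, name in best]


def find_duplicates(query: str, corpus: list[str], threshold: float = 0.85) -> list[tuple[str, float]]:
    """Return every corpus name whose score reaches the (hypothetical) threshold."""
    ranked = top_k_similar(query, corpus, k=len(corpus))
    return [(name, score) for name, score in ranked if score >= threshold]


if __name__ == "__main__":
    # Made-up sample names, for illustration only.
    authors = ["Jon Smith", "Alice Wong", "John Smith", "J. Smith"]
    for name, score in top_k_similar("John Smith", authors, k=3):
        print(f"{name}: {score:.2f}")
```

A linear scan like this only works for small lists; the partitioning, clustering, and similarity search mentioned above are what would make such a lookup fast enough for interactive use on a large corpus.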