Fuzzy rough set theory is a popular and powerful machine learning tool. It is especially suitable for dealing with information systems that exhibit inconsistencies, i.e. objects that have the same values for the conditional attributes but belong to different categories. A main computational task in fuzzy rough set theory is computing the upper and lower approximation sets. In very large information systems, this computation is very demanding in terms of both runtime and memory, and existing non-distributed implementations are limited by memory capacity. In this talk, we present a parallel and distributed solution for computing fuzzy rough approximations in very large information systems with millions of records. We also present a distributed prototype selection approach based on fuzzy rough set theory and couple it with our distributed implementation of the well-known k-nearest neighbors prediction technique to solve regression problems. In addition, we show how our distributed approaches can be applied to the State Inpatient Data Set (SID) to predict the total healthcare expenses of patients. All our algorithms have been implemented in Spark, and the talk will also briefly touch upon a comparison with MPI.
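For readers unfamiliar with the approximations mentioned in the abstract, the following is a minimal single-machine sketch, not the distributed implementation presented in the talk. It assumes numeric conditional attributes, a similarity relation averaged over attributes, and the Łukasiewicz implicator and t-norm; the abstract names none of these, and the function names are hypothetical. The toy data contains two objects with identical attribute values but different class membership, which is the kind of inconsistency described above.

```python
def similarity(x, y, ranges):
    # Assumed fuzzy relation: mean over attributes of 1 - |x_a - y_a| / range_a.
    return sum(1.0 - abs(a - b) / r for a, b, r in zip(x, y, ranges)) / len(x)


def lukasiewicz_implicator(a, b):
    return min(1.0, 1.0 - a + b)


def lukasiewicz_tnorm(a, b):
    return max(0.0, a + b - 1.0)


def approximations(objects, membership):
    """Return (lower, upper) membership degrees of a class for every object.

    lower(x) = min over y of I(R(x, y), A(y))
    upper(x) = max over y of T(R(x, y), A(y))
    """
    n_attrs = len(objects[0])
    # Attribute ranges, with 1.0 as a fallback for constant attributes.
    ranges = [
        (max(o[i] for o in objects) - min(o[i] for o in objects)) or 1.0
        for i in range(n_attrs)
    ]
    lower, upper = [], []
    for x in objects:
        # Every object is compared with every other object: O(n^2) work.
        sims = [similarity(x, y, ranges) for y in objects]
        lower.append(min(lukasiewicz_implicator(r, m)
                         for r, m in zip(sims, membership)))
        upper.append(max(lukasiewicz_tnorm(r, m)
                         for r, m in zip(sims, membership)))
    return lower, upper


# Toy data: the first two objects are identical but belong to different classes.
objects = [(0.0,), (0.0,), (1.0,)]
membership = [1.0, 0.0, 0.0]  # crisp membership in the class of interest
lower, upper = approximations(objects, membership)
print(lower, upper)
```

Because each object is compared against all others, the work in this naive form grows quadratically with the number of records, which illustrates why millions of records call for a parallel and distributed approach.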
Speaker: Hasan Asfoor
Hasan was born and raised in Saudi Arabia. He has a Bachelor's degree in Software Engineering from King Fahd University of Petroleum & Minerals. He has worked for Saudi Aramco since 2008, when he joined the Expec Computer Center, and works in the field of data management, databases and application development. Currently, he is a student in the Master's in Computer Science & Systems program at the University of Washington - Tacoma, focusing on topics in big data, parallel computing and machine learning, under the supervision of Prof. Martine De Cock, Prof. Ankur Teredesai, Prof. Matthew Tolentino and Prof. Chris Cornelis. His research focuses on fuzzy rough set computations in large information systems.
Hasan Asfoor (UW CDS) Fuzzy Rough Set Approximations in Large Information Systems with Spark