Mining Large Scale Datasets

Information

Teachers: Sérgio Matos

Duration: One semester

Work hours: 162

Contact hours: 45

ECTS: 6

Scientific area: Computer Science

Objectives

This curricular unit introduces students to the challenges of large-scale data analysis and the algorithmic strategies that address them, as well as to the computational tools and frameworks used to implement those algorithms.

Learning Outcomes

After completing this curricular unit, students should be able to identify and apply appropriate approaches to model and extract knowledge from large-scale data, data streams, and graphs. Students will also gain experience with data processing platforms such as Apache Hadoop, Spark, and Hive.

Requirements

Ability to program in Python (or Java) is essential.
Use of the Linux shell terminal is encouraged.

Grading

60% Practical assignments
40% Written exam

Methodology

The course combines theoretical presentations of the algorithms and strategies for each problem formulation with a set of mini-projects designed to help students consolidate theoretical concepts and gain practical experience. Although mathematical formulations are essential for an analytical understanding of the methods, the emphasis is on applying this knowledge to specific problems through practical implementations.

Syllabus

Introduction: Data mining; Challenges in large-scale mining; MapReduce, Hadoop, Spark
Finding frequent itemsets: 'Market-basket' model; Association rules; A-Priori algorithm
Similarity search: Shingling, Minhashing, Locality-sensitive hashing; Similarity metrics; Approximate nearest neighbors
Data streams: Sampling and filtering; Window
Machine learning on large datasets: Clustering methods; Classification methods; Parallelization
Link analysis: Efficient PageRank algorithms; Hubs and authorities
Network analysis and mining: Partitions and communities; Mining social networks; Biological network analysis