Mining Large Scale Datasets

Information

Teachers: Sérgio Matos

Duration: One semester

Work hours: 162

Contact hours: 45

ECTS: 6

Scientific area: Computer Science

Objectives

The objective of this curricular unit is to introduce students to the challenges of large-scale data analysis, to the algorithmic strategies that address them, and to the computational tools and frameworks used to implement such algorithms.

Learning Outcomes

After completing this curricular unit, students should be able to identify and apply the most suitable approaches to model and extract knowledge from large-scale data, data streams, and graphs. Students will also gain experience with data processing platforms such as Apache Hadoop, Spark, and Hive.

Requirements

  • Ability to program in Python (or Java) is essential.
  • Use of the Linux shell terminal is encouraged.

Grading

  • 60.00% Practical assignments
  • 40.00% Written exam

Methodology

The course combines theoretical presentations of the algorithms and strategies for each problem formulation with a set of mini-projects designed to help students consolidate theoretical notions and gain practical experience. Although mathematical formulations of the algorithms are essential for an analytical understanding of the methods, emphasis is placed on applying this knowledge to specific problems through practical implementations.

Syllabus

  • Introduction: Data mining; Challenges in large-scale mining; MapReduce, Hadoop, Spark
  • Finding frequent itemsets: 'Market-basket' model; Association rules; A-Priori algorithm
  • Similarity search: Shingling, minhashing, locality-sensitive hashing; Similarity metrics; Approximate nearest neighbors
  • Data Streams: Sampling and Filtering; Window
  • Machine Learning on Large Datasets: Clustering methods; Classification methods; Parallelization
  • Link analysis: Efficient PageRank algorithms; Hubs and Authorities
  • Analysis and Network Mining: Partitions and Communities; Mining social networks; Biological network analysis

Recommended reading

  • Leskovec, Rajaraman, Ullman. “Mining of Massive Datasets”. Cambridge University Press, 2014. Available at http://www.mmds.org/
  • Easley, Kleinberg. “Networks, Crowds, and Markets: Reasoning About a Highly Connected World”. Cambridge University Press, 2010. Available at http://www.cs.cornell.edu/home/kleinber/networks-book/
  • White. “Hadoop: The Definitive Guide”, Fourth Edition. O’Reilly Media, 2015.
  • Kerzner, Maniyam. “Hadoop Illuminated”. Online book. Available at http://hadoopilluminated.com/