Workshop Abstract: This workshop provides a hands-on introduction to the Big Data ecosystem, with Hadoop and Apache Spark in practice. Through practical activities in Python, you will learn how to apply Apache Spark to a range of datasets to process and analyse data at scale. After taking this workshop you will be able to:
– Understand the challenges in the Big Data ecosystem
– Describe the fundamentals of the Hadoop ecosystem
– Use the core Spark RDD APIs to express data processing queries
– Understand how to leverage cloud technologies such as Amazon EMR to process large datasets
* What to bring to the workshop: a laptop with Python 3.6 and pyspark installed *
Talk Abstract: Making Sense of Big Data File Formats * Modern applications generate and manipulate a lot of data, and the growth rate of that data is staggering. Unfortunately, large datasets can be expensive to store at scale and slow to process. In fact, memory and storage speeds have improved at a much slower rate than CPU performance. Thankfully, several file formats designed for big data systems can help. In this webinar, you will learn about popular file formats suitable for big data systems, with a focus on Parquet. Through live coded examples in Python, you will learn the good, the bad, the ugly, and how you can make use of Parquet in practice.
Bio: Raoul-Gabriel Urma is the director of Cambridge Spark, a leading learning community for data scientists and developers in the UK. He is also Chairman and co-founder of Cambridge Coding Academy, a growing community of young coders and pre-university students. Raoul is the author of the bestselling programming book “Java 8 in Action”, which has sold over 25,000 copies globally. He completed a PhD in Computer Science at the University of Cambridge and has delivered over 100 technical talks at international conferences. He has worked for Google, eBay, Oracle, and Goldman Sachs, and is a Fellow of the Royal Society of Arts.