DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 1
DATA2001 – Data Science,
Big Data, and Data Diversity
Week 12: Big Data
Presented by A/Prof Uwe Roehm
School of Computer Science
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 2
Learning Objectives
– Big Data
– The three V’s: Volume, Velocity and Variety
– Ethical challenges for Big Data Processing
– Scale-Agnostic Data Analytics Platforms
– Scale-Up vs. Scale-Out
– MapReduce principle; similarities to SQL
– Role of modern Big Data platforms
MapReduce, Apache Spark, Flink or Hive
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 3
Big Data
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 4
Big Data
the three Vs:
[cf. article by
Doug Laney, 2001]
[Barton Poulson “Techniques and Concepts of Big Data”, 2014]
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 5
Big Data: Volume
– very relative due to Moore’s Law
– What once was considered big data, is considered a main-memory
problem nowadays
– eg. Excel: In 2003 max 65000 rows, now max 1 million rows, still …
– Nowadays: Terabyte to Exabyte
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 6
Big Data: Velocity
– conventional scientific research:
– months to gather data from 100s cases, weeks
to analyze the data and years to publish.
– Example: Iris flower data set by Edgar Anderson
and Ronal Fisher from 1936
– on the other end of the scale: Twitter
– average 6000 tweets/sec, 500 million per day
or 200 billion per year
– Cf. life Twitter Usage Statistics
http://www.internetlivestats.com/twitter-statistics/
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 7
Big Data: Variety
– Structured Data, such as CSV or RDBMS
– Semi-structured Data, such as JSON or XML
– Unstructured Data, ie. text, e-mails, images, video
– an estimated 80% of enterprise data is unstructured
– study by Forester Research: variety biggest challenge in Big Data
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 8
Big Data Examples
Big Data for Consumers (examples)
– Siri, Yelp!, Spotify, Amazon, Netflix, Google Now
– Some Big Data Variety examples:
– “Neighborland” App [https://neighborland.com]
– “WalkScore.com” [https://www.walkscore.com]
Big Data for Businesses (examples)
– Google Ads Searches
– Predictive Marketing
– Example “EDITED.com”: predicting fashion trends
– Fraud Detection
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 9
Big Data Examples: Big Data for Research
– Astronomy: Sloan Digital Sky Survey (SDDS) SkyServer
– Cern’s Large Hadron Collider (LHC)
– The Human Brain Project
– Personalities in the United States
(cf Journal of American Psychological Association)
– Google Flu trends (only historic data; stopped publishing new trends)
– Apple COVID19 Mobility Trends (https://www.apple.com/covid19/mobility)
– Google Books project
– (eg. changes of word usage over time (eg. maths vs arithmetic vs algebra)
https://books.google.com/ngrams/graph?content=math,arithmetic,algebra&
case_insensitive=off&year_start=1800 )
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 10
Big Data Challenges beyond Technical Aspects
– Data Privacy
– Some data sources, such as “Internet-of-Things”, allow tracking anyone

Do you really need to know who was travelling a route in order to predict, e.g.,
traffic densities?
Personal data can be inferred sometimes => New York Taxi data set example
– Privacy laws
Always check: Are you allowed to use some data or process is anywhere?
Some personal data, especially regarding health or tax, is specially protected;
e.g., not allowed to leave a jurisdictive area
e.g. EU’s General Data Protection Regulation (GDPR) applies to any company
holding data about any European Union citizen
– Data Security
– Can your users trust you to keep their data safe?
– Big data can expose your organization to serious privacy and security attacks!
“[…] consider that great responsibility follows inseparably from great power” [French National Convention,1793]
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 11
Use Case: COVIDSafe App
– Tool to help contact tracing – who was in close contact to a known COVID-19 case?
– The app does not collect location data, but
just events of being in close proximity
of another COVIDSafe app user (via BT)
– Data is stored encrypted locally on the
phone for 21 days, then overwritten.
– Data only uploaded to cloud (AWS…)
on request after personal permission
– Benefit to society vs. Privacy concerns
– Which data collected and how stored?
Locally: anonymised close contacts (date, time, distance, duration, and other user’s refcode);
cloud: meta-data (refcode; phone#, nickname, age range, postcode)
– Where is data processed? => cloud, resp. by contact tracers
– Who has access to this data? => only contract tracers; protected by Biosecurity Privacy Laws
– Does it work? False positives/negatives are possible => risk of false sense of security…
https://www.health.gov.au/resources/apps-and-tools/covidsafe-app
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 12
Big Data Challenges beyond Technical Aspects (cont’d)
– Data Discrimination
– Is it acceptable to discriminate against people based on data on their lives?
– Credit card scoring? Health insurance?
– Cf. FTC: “Big Data – A Tool for Inclusion or Exclusion?”
[https://www.ftc.gov/system/files/documents/reports/big-data-tool-inclusion-orexclusion-understanding-issues/160106big-data-rpt.pdf]
– Check:
– Are you working on a representative sample of users/consumers?
– Do your algorithms prioritize fairness? Aware of the biases in the data?
– Check your Big Data outcomes against traditionally applied statistics practices
– Keep in mind other Vs of Big Data:
– Validity (data quality), Veracity (data accuracy / trustworthiness), …
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 13
Big Data Examples
Customer
– Twitter Life Statistics: http://www.internetlivestats.com/twitter-statistics/
– Walkscore: https://www.walkscore.com
– Neighborland: https://neighborland.com
Business
– Predictive Marketing: EDITED.com
Journalism
– TimesMachine: http://timesmachine.nytimes.com/browser
– Panama Papers: http://panamapapers.sueddeutsche.de/en/
Research
– Cern LHC open data access: http://opendata.cern.ch/?ln=en
– SDDS SkyServer: http://skyserver.sdss.org/dr12/
– Human Brain Project: https://www.humanbrainproject.eu
– Google Flu Trends: https://www.google.org/flutrends/about/
– Google Books nGrams: https://books.google.com/ngrams/
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 14
Analysing Big Data
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 15
Data Science Workflow
[Source: http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/]
Analysing ‘Big Data’ with “scripts”?
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 16
Case for Data Science Platforms
– Data is either
– too large (volume),
– too fast (velocity), or
– needs to be combined from diverse sources (variety)
for processing with scripts or on single server.
– Need for
– scalable platform
– processing abstractions
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 17
Jupyter Notebooks as Platform for Big Data?
Jupyter Server
(Python)
csv file read
psycopg,
…
Web
Browser
http
Database System
network
– This does not scale to petabyte of data
– Which approach for Amazon? Facebook? a, b , c
1, foo, 4.5
2, bar, 1.3
3, oho, 9.0
…
a, b , c
1, foo, 4.5
2, bar, 1.3
3, oho, 9.0
…
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 18
Case Study: LinkedIn
– Started in 2003
– 2700 members in first week
– Single database and web server
– for years experienced exponential growth…
– As of Jan 2018:
(https://www.omnicoreagency.com/linkedin-statistics/)
– 500 million members
– 250 million active users / month
– Many users with hundreds of connections => huge graph
– Fun Fact: Statistical Analysis and Data Mining are Top skills on Linkedin
– world’s 34th-most-popular website in terms of overall visitor traffic (Alexa, Dec-16)
(https://www.alexa.com/topsites)
For comparison: Microsoft is #37
Source: https://engineering.linkedin.com/architecture/brief-history-scaling-linkedin
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 19
Scale-Up
– The traditional approach:
– To scale with increasing load,
buy more powerful, larger
hardware
from single workstation
to dedicated db server
to large massive-parallel
database appliance
[source: Jim Gray, HPTS99]
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 20
The Alternative: Scale-Out
A single server has limits…
For real Big Data processing, need to
scale-out to a cluster of multiple servers (nodes):
shared-nothing architecture
State-of-the-Art:
[Source: Server.png from PinClipart.com]
[recall Wk5]
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 21
Kafka
Azkaban
Hadoop
Hive
Hadoop
Hive
Hadoop
Hive
Hadoop
Hive
Hadoop
Hive …
Case Study: LinkedIn Analytical Architecture
”We have multiple grids divided up based upon purpose.
Hardware:
~800 Westmere-based HP SL 170x, with 2×4 cores, 24GB RAM, 6x2TB SATA
~1900 Westmere-based SuperMicro X8DTT-H, with 2×6 cores, 24GB RAM, 6x2TB SATA
~1400 Sandy Bridge-based SuperMicro with 2×6 cores, 32GB RAM, 6x2TB SATA
…
We use these things for discovering People You May Know and other fun facts.”
LinkdIn via https://wiki.apache.org/hadoop/PoweredBy/
…
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 22
Challenges
n Scale-Agnostic Data Management
n sharding for performance
n replication for availability
n ideally such that applications are unaware of underlying complexities
n cf. Week 5
n Scale-Agnostic Data Processing
n Nowadays we collect massive amounts of data; how can we analyze it?
n Answer: use lots of machines… (hundreds/thousands of CPUs, can grow)
n Performance: parallel processing
n Availability: Ideally, the system never down; can handle failures transparent
=> Map/Reduce processing paradigm
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 23
Scale-Agnostic Data Analysis
The MapReduce Principle
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 24
Big Data Analytics Stack
– Layered stack of frameworks for distributed data management and processing
– Many choices of distributed data processing platforms
Application
Storage
Infrastructure
[slide by Ion Stoica, UCB, 2013]
Data Processing MapReduce
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 25
MapReduce Overview
– Scan large volumes of data
– Map: Extract some interesting information
– Shuffle and sort intermediate results
– Reduce: aggregate intermediate results
– Generate final output
– Key idea: provide an abstraction at the point of these two operations (map
and reduce)
– Higher-order functions
– Cf. map functions in functional programming languages such as Lisp or
Haskell
12-25
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 26
MapReduce Paradigm
– Functional Programming approach to data processing
– map() : applies a given function f to all elements of a collection;
returns a new collection
– map (f, originalList)
Diagram from Yahoo! Hadoop Tutorial
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 27
MapReduce Paradigm: Reduce()
Keys divide the reduce space
all of the output values are not usually reduced together. All of the
values with the same key are presented to a single reducer together
Diagram from Yahoo! Hadoop Tutorial
reduce(): applies a given function g to all elements of an input list;
produces, starting from a given initial value, a single (aggregate) output value
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 28
Similarities between SQL-Queries and MapReduce
– A standard map-reduce task is similar in its functionality to declarative
aggregation queries in SQL:
SELECT out_key, reduce(out_value)
FROM map(InputData)
GROUP BY out_key
New in MR-Paradigm: map and reduce() as higher-order functions
which take a user-defined function with the actual functionality.
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 29
Example: Word Count program
– Word Count programmed as standard linear program
– Two nested for loops
– Difficult to generalise or parallelise
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 30
Example: Word Count
– Input:
– List of documents that contain text
– Provided to MapReduce in the form of
(k: documentID, v: textcontent) pairs
– Goal:
– Determine which words occur in the documents and how often
– E.g. for text indexing…
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 31
MapReduce Approach
To solve the same problem using MapReduce, we need

map() function (aka mapper)
reduce() function (aka reducer)
Some control code that connects mapper and reducers
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 32
Example: Word Count in MapReduce
– Word Count programmed using Map/Reduce paradigm
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 33
MapReduce Generalised
– Previous example was hard-coded for word count
– We can generalise the pattern for the driver code even further
mapper and reducer are now also inputs
Call to function needs 3 arguments: data, mapper and reducer
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 34
Example Word Count with MapReduce
Decentralised
Structured
Storage System
Google File
System
Distributed
Storage System
Structured Data
doc1
doc2
doc3
(in_key, in_value)
map
(Google, 1)
(File, 1)
(System, 1)
(Decentralized, 1)
(Structured, 1)
(Storage, 1)
(System, 1)
(Distributed, 1)
(Storage, 1)
(System, 1)
(Structured,1)
(Data, 1)
(out_key, value)
(Google, {1})
(File, {1})
(System, {1,1,1})
(Decentralized,{1})
(Structured, {1,1})
(Storage,{1,1})
(Distributed, {1})
(Data,{1})
shuffle
(out_key,list(value))
(Google, 1)
(File, 1)
(System, 3)
(Decentralized,1)
(Structured,2)
(Storage,2)
(Distributed,1)
(Data,1)
reduce
(out_key, out_value)
Map Phase Reduce Phase
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 35
Why Scale-Agnostic?
– Note that the functions given to map() and reduce() only rely on local input
– functions without side-effects and independent of each other
– function invocation is agnostic to the scale (size) of the overall dataset
– Can hence be parallelized easily
– Partition the dataset over multiple nodes
– apply different instances of the same map/reduce functions to each
partition independently / in parallel
– Fits perfectly to a scale-out approach
– bigger data => more nodes and data partitions => more parallelism
=> same or faster speed
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 36
MapReduce Discussion
Pros:
– very flexible due to the
user-defined functions
– great scalability
because FP approach
– easy parallelism due to
stateless functions
– fault-tolerance
Cons:
– requires programming skills
and functional thinking
– relatively low-level, even
filtering to be coded manually
– complex frameworks
– batch-processing oriented
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 37
Distributed, Dataflow-Oriented Analytics Platforms
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 38
Challenge: Iterative Algorithms
– Many data mining and machine learning algorithms rely on global state and
iterations
– Examples:
– data clustering (eg. k-Means)
– frequent itemset mining (eg. Apriori algorithm)
– linear regression
– collaborative filtering
– PageRank
– … R
M
J
M
R R
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 39
Distributed Data Analytics Frameworks
– Apache Hadoop
– Open-source implementation of original MapReduce from Google; Apache top-level project
– Java framework, but also provides a Python interface nowadays
– Parts: own distributed file system (HDFS), job scheduler (YARN), MR framework (Hadoop)
– Apache Spark
– Distributed cluster computing framework on top of HDFS/YARN
– Concentrates on main-memory processing and more high-level data flow control
– Originates from research project from UC Berkeley
– Apache Flink
– Efficient data flow runtime on top of HDFS/YARN
– Similar to Spark, but more emphasize on build-in dataflow optimiser and pipelined processing
– Strong for data stream processing
– Origin: Stratosphere research project by TU Berlin, Humboldt University Berlin and HPI Potsdam
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 40
Distributed Data Analytics Frameworks (continued)
– Apache Hive
– Provides an SQL-like interface on top of Hadoop / HDFS
– Allows to define a relational schema on top of HDFS files, and to query and analyse data with
HiveQL (SQL dialect)
– Queries automatically translated to MR jobs and executed in parallel in cluster
– Example: WordCount in HIVE
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
LOAD DATA INPATH ‘input_file’ OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, ‘\s’)) AS word FROM docs) temp
GROUP BY word
ORDER BY word;
– Many more high-level frameworks for advanced data analytics.
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 41
Example Dataflow Execution Stack: Apache Spark
[Apache Spark Architecture, Apache website]
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 42
Example: WordCount in Apache Flink (Python)
from flink.plan.Environment import get_environment
from flink.functions.GroupReduceFunction import GroupReduceFunction
env = get_environment()
data= env.read_text(“hdfs://…”);
data.flat_map(lambda x,c: [(word,1) for word in x.lower().split()]) \
.group_by(0) \
.sum(1) \
.write_csv(“hdfs://…”)
env.execute()
[Cf.: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/batch/python.html]
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 43
Summary
– Big Data
– The three V’s: Volume, Velocity and Variety
– Ethical challenges for Big Data Processing
– Scale-Up versus Scale-Out
– Scale-Agnostic Computation
– Parallelisable higher-order functions map & reduce
– MapReduce principle; similarities to existing material and SQL
– Scale-Agnostic Data Analytics Platforms
– Data Scientists need more high-level tools and interfaces than MapReduce
– Examples: Apache Spark or Apache Flink or Apache Hive
– Componentized infrastructure: SQL querying, ML-Libraries, Streaming, etc.
DATA2001 “Data Science, Big Data, and Data Diversity” – 2022 (Roehm) 44
Learn More
– DATA3404 Data Science Platforms

https://www.sydney.edu.au/units/DATA2001

标签

# AU澳大利亚代写 # DATA2001 # Exam代考 # SQL代写 # USYD悉尼大学

DATA2001 Data Science, Big Data and Data Variety 2022 S1 | USYD悉尼大学 | exam代考 | sql代写

MSU密歇根州立大学 | PLS 202 Introduction to Data Analytics and the Social Sciences | R语言 Final Project代写 ggplot2

MSU密歇根州立大学 | PLS 202 Introduction to Data Analytics and the Social Sciences | R语言 Assignment5代写 linear regression

MSU密歇根州立大学 | PLS 202 Introduction to Data Analytics and the Social Sciences | R语言 Assignment4代写 ggplot2

MSU密歇根州立大学 | PLS 202 Introduction to Data Analytics and the Social Sciences | R语言 Assignment3代写

MSU密歇根州立大学 | PLS 202 Introduction to Data Analytics and the Social Sciences | R语言 Assignment2代写

相关文章