
DATA2001 – Data Science, Big Data, and Data Diversity
Week 12: Big Data
Presented by A/Prof Uwe Roehm
School of Computer Science
Learning Objectives
– Big Data
– The three V’s: Volume, Velocity and Variety
– Ethical challenges for Big Data Processing
– Scale-Agnostic Data Analytics Platforms
– Scale-Up vs. Scale-Out
– MapReduce principle; similarities to SQL
– Role of modern Big Data platforms
MapReduce, Apache Spark, Flink or Hive
Big Data
Big Data
– the three Vs: Volume, Velocity, and Variety
[cf. article by Doug Laney, 2001]
[Barton Poulson, “Techniques and Concepts of Big Data”, 2014]
Big Data: Volume
– very relative due to Moore’s Law
– what once was considered Big Data is a main-memory problem nowadays
– e.g. Excel: in 2003 max. 65,536 rows, now max. ~1 million rows, still …
– nowadays: terabytes to exabytes
Big Data: Velocity
– conventional scientific research:
– months to gather data from hundreds of cases, weeks to analyse the data, and years to publish
– Example: the Iris flower data set by Edgar Anderson and Ronald Fisher from 1936
– at the other end of the scale: Twitter
– on average 6,000 tweets/sec, 500 million per day or 200 billion per year
– cf. live Twitter usage statistics: http://www.internetlivestats.com/twitter-statistics/
Big Data: Variety
– Structured data, such as CSV files or RDBMS tables
– Semi-structured data, such as JSON or XML
– Unstructured data, i.e. text, e-mails, images, video
– an estimated 80% of enterprise data is unstructured
– study by Forrester Research: variety is the biggest challenge in Big Data
Big Data Examples
Big Data for Consumers (examples)
– Siri, Yelp!, Spotify, Amazon, Netflix, Google Now
– Some Big Data Variety examples:
– “Neighborland” App [https://neighborland.com]
– “WalkScore.com” [https://www.walkscore.com]
Big Data for Businesses (examples)
– Google Ads Searches
– Predictive Marketing
– Example “EDITED.com”: predicting fashion trends
– Fraud Detection
Big Data Examples: Big Data for Research
– Astronomy: Sloan Digital Sky Survey (SDSS) SkyServer
– CERN’s Large Hadron Collider (LHC)
– The Human Brain Project
– Personalities in the United States (cf. Journal of American Psychological Association)
– Google Flu Trends (only historic data; stopped publishing new trends)
– Apple COVID-19 Mobility Trends (https://www.apple.com/covid19/mobility)
– Google Books project
– e.g. changes of word usage over time (e.g. maths vs arithmetic vs algebra)
https://books.google.com/ngrams/graph?content=math,arithmetic,algebra&case_insensitive=off&year_start=1800
Big Data Challenges beyond Technical Aspects
– Data Privacy
– Some data sources, such as the “Internet-of-Things”, allow tracking anyone
  • Do you really need to know who was travelling a route in order to predict, e.g., traffic densities?
  • Personal data can sometimes be inferred => New York Taxi data set example
– Privacy laws
  • Always check: are you allowed to use some data or to process it anywhere?
  • Some personal data, especially regarding health or tax, is specially protected; e.g., not allowed to leave a jurisdiction
  • e.g. the EU’s General Data Protection Regulation (GDPR) applies to any company holding data about any European Union citizen
– Data Security
– Can your users trust you to keep their data safe?
– Big Data can expose your organisation to serious privacy and security attacks!
“[…] consider that great responsibility follows inseparably from great power” [French National Convention, 1793]
    Use Case: COVIDSafe App
– Tool to help contact tracing – who was in close contact with a known COVID-19 case?
– The app does not collect location data, but just events of being in close proximity to another COVIDSafe app user (via Bluetooth)
– Data is stored encrypted locally on the phone for 21 days, then overwritten
– Data is only uploaded to the cloud (AWS…) on request, after personal permission
– Benefit to society vs. privacy concerns
– Which data is collected and how is it stored?
  • Locally: anonymised close contacts (date, time, distance, duration, and the other user’s refcode); cloud: meta-data (refcode, phone#, nickname, age range, postcode)
– Where is the data processed? => in the cloud, resp. by contact tracers
– Who has access to this data? => only contact tracers; protected by biosecurity privacy laws
– Does it work? False positives/negatives are possible => risk of a false sense of security…
https://www.health.gov.au/resources/apps-and-tools/covidsafe-app
    Big Data Challenges beyond Technical Aspects (cont’d)
– Data Discrimination
– Is it acceptable to discriminate against people based on data about their lives?
– Credit card scoring? Health insurance?
– Cf. FTC: “Big Data – A Tool for Inclusion or Exclusion?”
[https://www.ftc.gov/system/files/documents/reports/big-data-tool-inclusion-or-exclusion-understanding-issues/160106big-data-rpt.pdf]
– Check:
– Are you working with a representative sample of users/consumers?
– Do your algorithms prioritise fairness? Are you aware of the biases in the data?
– Check your Big Data outcomes against traditionally applied statistics practices
– Keep in mind the other Vs of Big Data:
– Validity (data quality), Veracity (data accuracy / trustworthiness), …
    Big Data Examples
Consumer
– Twitter Live Statistics: http://www.internetlivestats.com/twitter-statistics/
– Walkscore: https://www.walkscore.com
– Neighborland: https://neighborland.com
Business
– Predictive Marketing: EDITED.com
Journalism
– TimesMachine: http://timesmachine.nytimes.com/browser
– Panama Papers: http://panamapapers.sueddeutsche.de/en/
Research
– CERN LHC open data access: http://opendata.cern.ch/?ln=en
– SDSS SkyServer: http://skyserver.sdss.org/dr12/
– Human Brain Project: https://www.humanbrainproject.eu
– Google Flu Trends: https://www.google.org/flutrends/about/
– Google Books Ngrams: https://books.google.com/ngrams/
    Analysing Big Data
    Data Science Workflow
    [Source: http://cacm.acm.org/blogs/blog-cacm/169199-data-science-workflow-overview-and-challenges/]
    Analysing ‘Big Data’ with “scripts”?
    Case for Data Science Platforms
    – Data is either
    – too large (volume),
    – too fast (velocity), or
    – needs to be combined from diverse sources (variety)
for processing with scripts or on a single server.
    – Need for
    – scalable platform
    – processing abstractions
    Jupyter Notebooks as Platform for Big Data?
[Figure: a Jupyter server (Python) reading a small CSV file and accessing a database system via psycopg over the network; the user connects from a web browser over HTTP]
– This does not scale to petabytes of data
– Which approach for Amazon? Facebook?
    Case Study: LinkedIn
– Started in 2003
– 2,700 members in the first week
– single database and web server
– for years experienced exponential growth…
– As of Jan 2018 (https://www.omnicoreagency.com/linkedin-statistics/):
– 500 million members
– 250 million active users / month
– many users with hundreds of connections => huge graph
– Fun fact: Statistical Analysis and Data Mining are top skills on LinkedIn
– world’s 34th-most-popular website in terms of overall visitor traffic (Alexa, Dec-16) (https://www.alexa.com/topsites)
  • For comparison: Microsoft is #37
Source: https://engineering.linkedin.com/architecture/brief-history-scaling-linkedin
    Scale-Up
– The traditional approach:
– To scale with increasing load, buy more powerful, larger hardware
  • from a single workstation
  • to a dedicated DB server
  • to a large massively parallel database appliance
[source: Jim Gray, HPTS99]
    The Alternative: Scale-Out
– A single server has limits…
– For real Big Data processing, we need to scale out to a cluster of multiple servers (nodes): a shared-nothing architecture [recall Wk5]
– State-of-the-art:
[Figure: server cluster; source: Server.png from PinClipart.com]
Case Study: LinkedIn Analytical Architecture
[Figure: multiple Hadoop/Hive analytics grids, fed via Kafka and orchestrated with the Azkaban job scheduler]
”We have multiple grids divided up based upon purpose.
Hardware:
~800 Westmere-based HP SL 170x, with 2×4 cores, 24GB RAM, 6x2TB SATA
~1900 Westmere-based SuperMicro X8DTT-H, with 2×6 cores, 24GB RAM, 6x2TB SATA
~1400 Sandy Bridge-based SuperMicro with 2×6 cores, 32GB RAM, 6x2TB SATA
We use these things for discovering People You May Know and other fun facts.”
LinkedIn via https://wiki.apache.org/hadoop/PoweredBy/

    Challenges
– Scale-Agnostic Data Management
– sharding for performance
– replication for availability
– ideally such that applications are unaware of the underlying complexities
– cf. Week 5
– Scale-Agnostic Data Processing
– Nowadays we collect massive amounts of data; how can we analyse it?
– Answer: use lots of machines… (hundreds/thousands of CPUs, can grow)
– Performance: parallel processing
– Availability: ideally, the system is never down and handles failures transparently
=> Map/Reduce processing paradigm
    Scale-Agnostic Data Analysis
    The MapReduce Principle
    Big Data Analytics Stack
    – Layered stack of frameworks for distributed data management and processing
    – Many choices of distributed data processing platforms
[Figure: layered Big Data analytics stack – infrastructure, storage, data processing (e.g. MapReduce), application; slide by Ion Stoica, UCB, 2013]
    MapReduce Overview
    – Scan large volumes of data
    – Map: Extract some interesting information
    – Shuffle and sort intermediate results
    – Reduce: aggregate intermediate results
    – Generate final output
    – Key idea: provide an abstraction at the point of these two operations (map
    and reduce)
    – Higher-order functions
    – Cf. map functions in functional programming languages such as Lisp or
    Haskell
    MapReduce Paradigm
– A functional-programming approach to data processing
– map(): applies a given function f to all elements of a collection; returns a new collection
– map(f, originalList)
    Diagram from Yahoo! Hadoop Tutorial
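For illustration, a minimal Python sketch of the same idea (the lambda plays the role of the user-supplied function f):

words = ["Decentralised", "Structured", "Storage", "System"]
pairs = list(map(lambda w: (w, 1), words))
# pairs == [("Decentralised", 1), ("Structured", 1), ("Storage", 1), ("System", 1)]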
    MapReduce Paradigm: Reduce()
– reduce(): applies a given function g to all elements of an input list; starting from a given initial value, it produces a single (aggregate) output value
– Keys divide the reduce space: not all output values are reduced together, but all values with the same key are presented to a single reducer together
Diagram from Yahoo! Hadoop Tutorial
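Again as a minimal Python sketch, using functools.reduce with an initial value to aggregate all values that were grouped under one key:

from functools import reduce

values = [1, 1, 1]                                  # e.g. the counts collected for the key "System"
total = reduce(lambda acc, v: acc + v, values, 0)   # g = addition, initial value 0
# total == 3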
    Similarities between SQL-Queries and MapReduce
    – A standard map-reduce task is similar in its functionality to declarative
    aggregation queries in SQL:
    SELECT out_key, reduce(out_value)
    FROM map(InputData)
    GROUP BY out_key
New in the MR paradigm: map() and reduce() are higher-order functions that take a user-defined function providing the actual functionality.
    Example: Word Count program
– Word Count programmed as a standard linear (sequential) program
– two nested for loops, as sketched below
– difficult to generalise or parallelise
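A minimal sketch of such a sequential version, assuming the documents are simply given as a list of strings (names are illustrative):

def word_count(documents):
    counts = {}
    for doc in documents:                  # outer loop: one document at a time
        for word in doc.lower().split():   # inner loop: each word of that document
            counts[word] = counts.get(word, 0) + 1
    return counts

print(word_count(["Google File System", "Distributed Storage System"]))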
    Example: Word Count
    – Input:
    – List of documents that contain text
    – Provided to MapReduce in the form of
    (k: documentID, v: textcontent) pairs
    – Goal:
    – Determine which words occur in the documents and how often
    – E.g. for text indexing…
    MapReduce Approach
    To solve the same problem using MapReduce, we need
  1. map() function (aka mapper)
  2. reduce() function (aka reducer)
  3. Some control code that connects mapper and reducers
    Example: Word Count in MapReduce
– Word Count programmed using the Map/Reduce paradigm, as sketched below
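A plain-Python sketch of the two user-defined functions (illustrative names and signatures, following the (documentID, textcontent) input format from the previous slide):

def mapper(doc_id, text):
    # emit one (word, 1) pair for every word occurrence in the document
    return [(word, 1) for word in text.lower().split()]

def reducer(word, values):
    # aggregate all counts that were collected for one word
    return (word, sum(values))

The control code that feeds input pairs to the mapper, groups the intermediate pairs by key, and calls the reducer once per key is shown in generalised form on the next slide.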
    MapReduce Generalised
– The previous example was hard-coded for word count
– We can generalise the pattern for the driver code even further:
– mapper and reducer are now also inputs
– the call to the function needs 3 arguments: data, mapper and reducer (see the sketch below)
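A minimal sketch of such a generic driver in Python, with mapper and reducer passed in as arguments and the shuffle step simulated with a dictionary (illustrative only, no distribution or fault tolerance):

from collections import defaultdict

def map_reduce(data, mapper, reducer):
    # map phase: apply the user-supplied mapper to every input (key, value) pair
    intermediate = []
    for in_key, in_value in data:
        intermediate.extend(mapper(in_key, in_value))
    # shuffle phase: group all intermediate values by their output key
    groups = defaultdict(list)
    for out_key, out_value in intermediate:
        groups[out_key].append(out_value)
    # reduce phase: apply the user-supplied reducer once per key
    return [reducer(out_key, values) for out_key, values in groups.items()]

# word count over (documentID, textcontent) pairs, using the mapper/reducer sketched earlier
docs = [("doc1", "Google File System"),
        ("doc2", "Decentralised Structured Storage System"),
        ("doc3", "Distributed Storage System Structured Data")]
print(map_reduce(docs, mapper, reducer))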
    Example Word Count with MapReduce
Input documents (stored in a distributed file system), provided as (in_key, in_value) pairs:
doc1: "Google File System"
doc2: "Decentralised Structured Storage System"
doc3: "Distributed Storage System Structured Data"

Map phase – each map call emits (out_key, value) pairs:
(Google, 1), (File, 1), (System, 1)
(Decentralised, 1), (Structured, 1), (Storage, 1), (System, 1)
(Distributed, 1), (Storage, 1), (System, 1), (Structured, 1), (Data, 1)

Shuffle – intermediate values are grouped by key into (out_key, list(value)) pairs:
(Google, {1}), (File, {1}), (System, {1,1,1}), (Decentralised, {1}), (Structured, {1,1}), (Storage, {1,1}), (Distributed, {1}), (Data, {1})

Reduce phase – each reduce call aggregates one group into an (out_key, out_value) pair:
(Google, 1), (File, 1), (System, 3), (Decentralised, 1), (Structured, 2), (Storage, 2), (Distributed, 1), (Data, 1)
    Why Scale-Agnostic?
    – Note that the functions given to map() and reduce() only rely on local input
    – functions without side-effects and independent of each other
    – function invocation is agnostic to the scale (size) of the overall dataset
    – Can hence be parallelized easily
    – Partition the dataset over multiple nodes
    – apply different instances of the same map/reduce functions to each
    partition independently / in parallel
    – Fits perfectly to a scale-out approach
    – bigger data => more nodes and data partitions => more parallelism
    => same or faster speed
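Because the mapper is free of side-effects, the same function can be applied to each partition independently. A minimal single-machine sketch of that idea with Python's multiprocessing (purely illustrative; real frameworks spread the partitions over cluster nodes and also parallelise the reduce side):

from multiprocessing import Pool

def count_partition(partition):
    # the same, side-effect-free mapper applied to one partition of the documents
    pairs = []
    for text in partition:
        pairs.extend((word, 1) for word in text.lower().split())
    return pairs

if __name__ == "__main__":
    docs = ["Google File System",
            "Decentralised Structured Storage System",
            "Distributed Storage System Structured Data"]
    partitions = [docs[0:1], docs[1:3]]                  # dataset split across two "nodes"
    with Pool(processes=2) as pool:
        mapped = pool.map(count_partition, partitions)   # partitions processed in parallel
    print(mapped)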
    MapReduce Discussion
Pros:
– very flexible due to the user-defined functions
– great scalability because of the functional-programming approach
– easy parallelism due to stateless functions
– fault tolerance
Cons:
– requires programming skills and functional thinking
– relatively low-level; even filtering has to be coded manually
– complex frameworks
– batch-processing oriented
    Distributed, Dataflow-Oriented Analytics Platforms
    Challenge: Iterative Algorithms
    – Many data mining and machine learning algorithms rely on global state and
    iterations
    – Examples:
    – data clustering (eg. k-Means)
    – frequent itemset mining (eg. Apriori algorithm)
    – linear regression
    – collaborative filtering
    – PageRank
– …
[Figure: an iterative computation expressed as a repeated chain of map (M), reduce (R) and join (J) stages]
    Distributed Data Analytics Frameworks
– Apache Hadoop
– Open-source implementation of Google’s original MapReduce; an Apache top-level project
– Java framework, but nowadays also provides a Python interface
– Parts: its own distributed file system (HDFS), job scheduler (YARN), and the MapReduce framework (Hadoop MapReduce)
– Apache Spark
– Distributed cluster-computing framework on top of HDFS/YARN
– Concentrates on main-memory processing and more high-level dataflow control
– Originates from a research project at UC Berkeley
– Apache Flink
– Efficient dataflow runtime on top of HDFS/YARN
– Similar to Spark, but with more emphasis on a built-in dataflow optimiser and pipelined processing
– Strong for data stream processing
– Origin: the Stratosphere research project by TU Berlin, Humboldt University Berlin and HPI Potsdam
    Distributed Data Analytics Frameworks (continued)
– Apache Hive
– Provides an SQL-like interface on top of Hadoop / HDFS
– Allows defining a relational schema on top of HDFS files, and querying and analysing the data with HiveQL (an SQL dialect)
– Queries are automatically translated to MR jobs and executed in parallel in the cluster
– Example: WordCount in Hive
DROP TABLE IF EXISTS docs;
CREATE TABLE docs (line STRING);
LOAD DATA INPATH 'input_file' OVERWRITE INTO TABLE docs;
CREATE TABLE word_counts AS
SELECT word, count(1) AS count
FROM (SELECT explode(split(line, '\s')) AS word FROM docs) temp
GROUP BY word
ORDER BY word;
– Many more high-level frameworks exist for advanced data analytics.
    Example Dataflow Execution Stack: Apache Spark
    [Apache Spark Architecture, Apache website]
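For comparison with the Flink program shown below, a word count in Apache Spark's Python API (PySpark) could look roughly like this (the HDFS paths are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName="WordCount")
counts = (sc.textFile("hdfs://...")                        # read the input lines from HDFS
            .flatMap(lambda line: line.lower().split())    # one word per element
            .map(lambda word: (word, 1))                   # map: emit (word, 1) pairs
            .reduceByKey(lambda a, b: a + b))              # reduce: sum the counts per word
counts.saveAsTextFile("hdfs://...")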
    Example: WordCount in Apache Flink (Python)
from flink.plan.Environment import get_environment
from flink.functions.GroupReduceFunction import GroupReduceFunction

env = get_environment()
data = env.read_text("hdfs://…")
data.flat_map(lambda x, c: [(word, 1) for word in x.lower().split()]) \
    .group_by(0) \
    .sum(1) \
    .write_csv("hdfs://…")
env.execute()
[Cf.: https://ci.apache.org/projects/flink/flink-docs-release-1.2/dev/batch/python.html]
    Summary
    – Big Data
    – The three V’s: Volume, Velocity and Variety
    – Ethical challenges for Big Data Processing
    – Scale-Up versus Scale-Out
    – Scale-Agnostic Computation
    – Parallelisable higher-order functions map & reduce
– MapReduce principle; similarities to functional programming and SQL
    – Scale-Agnostic Data Analytics Platforms
    – Data Scientists need more high-level tools and interfaces than MapReduce
    – Examples: Apache Spark or Apache Flink or Apache Hive
    – Componentized infrastructure: SQL querying, ML-Libraries, Streaming, etc.
    Learn More
    – DATA3404 Data Science Platforms

https://www.sydney.edu.au/units/DATA2001