Cardiff School of Computer Science and Informatics
Coursework Assessment Pro-forma
Module Code: CM3104
Module Title: Large Scale Databases
Lecturer: A.I. Abdelmoty
Assessment Title: NoSQL Coursework
Assessment Number: 1
Date Set: Week 3, Monday 16th October 2023
Submission Date and Time: by 9:30, Week 12, Thursday 11th January 2024
Feedback return date: Thursday 8th February 2024
If you have been granted an extension for Extenuating Circumstances, then the
submission deadline and return date will be later than that stated above. You will be
advised of your revised submission deadline when/if your extension is approved.
If you defer an Autumn or Spring semester assessment, you may fail a module and
have to resit the failed or deferred components.
If you have been granted a deferral for Extenuating Circumstances, then you will be
assessed in the next scheduled assessment period in which assessment for this
module is carried out.
If you have deferred an Autumn or Spring assessment and are eligible to undertake
summer resits, you will complete the deferred assessment in the summer resit
period.
If you are required to repeat the year or have deferred an assessment in the resit
period, you will complete the assessment in the next academic year.
As a general rule, students can only resit 60 failed credits in the summer assessment
period (see section 3.4 of the academic regulations). Those with more than 60 failed
credits (and no more than 100 credits for undergraduate programmes and 105
credits for postgraduate programmes) will be required to repeat the year. There are
some exceptions to this rule and they are applied on a case-by-case basis at the
exam board.
This assignment is worth 50% of the total marks available for this module. If coursework
is submitted late (and where there are no extenuating circumstances):
1. If the assessment is submitted no later than 24 hours after the
deadline, the mark for the assessment will be capped at the minimum
pass mark;
2. If the assessment is submitted more than 24 hours after the deadline, a
mark of 0 will be given for the assessment.
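As an illustration only, the late-submission rule can be written as a small Python function. The function name and the default pass mark of 40 (taken from the "Pass (40-49%)" band in the assessment criteria) are assumptions made for this sketch; the academic regulations remain the authoritative statement of the rule.

```python
def apply_late_penalty(raw_mark: float, hours_late: float, pass_mark: float = 40.0) -> float:
    """Return the recorded mark (out of 100) for a submission with no extenuating circumstances.

    hours_late is the time elapsed after the deadline; zero or negative means on time.
    pass_mark is an assumed minimum pass mark of 40.
    """
    if hours_late <= 0:
        return raw_mark                      # on time: mark is unchanged
    if hours_late <= 24:
        return min(raw_mark, pass_mark)      # up to 24 hours late: capped at the pass mark
    return 0.0                               # more than 24 hours late: mark of 0


on_time = apply_late_penalty(72, 0)
slightly_late = apply_late_penalty(72, 3)
very_late = apply_late_penalty(72, 30)
```

Note that the cap only lowers marks above the pass mark: under this reading, a submission worth 35 that is three hours late keeps its mark of 35.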
Extensions to the coursework submission date can only be requested using the
Extenuating Circumstances procedure. Only students with approved extenuating
circumstances may use the extenuating circumstances submission deadline. Any
coursework submitted after the initial submission deadline without approved
extenuating circumstances will be treated as late.
More information on the extenuating circumstances procedure and academic regulations
can be found on the Student Intranet:
https://intranet.cardiff.ac.uk/students/study/exams-and-assessment/extenuatingcircumstances
https://intranet.cardiff.ac.uk/students/study/your-rights-and-responsibilities/academicregulations
By submitting this assignment you are accepting the terms of the following declaration:
I hereby declare that my submission (or my contribution to it in the case of group
submissions) is all my own work, that it has not previously been submitted for
assessment and that I have not knowingly allowed it to be copied by another student.
I declare that I have not made unauthorised use of AI chatbots or tools to complete
this work, except where permitted. I understand that deceiving or attempting to
deceive examiners by passing off the work of another writer as one’s own is
plagiarism. I also understand that plagiarising another’s work or knowingly allowing
another student to plagiarise from my work is against the University regulations and
that doing so will result in loss of marks and possible disciplinary proceedings [1].
The use of the following AI assistance ONLY is permitted in this assessment:
Text generation
[1] https://intranet.cardiff.ac.uk/students/study/exams-and-assessment/academic-integrity/cheating-andacademic-misconduct
Assignment
Created in 2008, Stack Overflow is a question-and-answer website for programming
and other computer science topics (https://stackoverflow.com/).
Users are required to create an account to ask or answer questions. These accounts
include a chosen screen name and any details that a user wants to share, such as
experience, website, and favoured programming languages. To incentivise participation,
users are awarded points and badges for asking interesting questions or
providing useful answers. “Upvotes” and “downvotes” can be provided to questions and
answers alike to indicate usefulness. The person who posts a question is also able to
select a single answer as their accepted answer. These metrics are combined and
referred to as the user’s reputation.
You will have access to a sample data set from Stack Overflow. This is publicly available
data that has been collected from 2008. The sample contains 200 users selected at
random, as well as questions, answers, badges, and comments associated with the
selected users. Note that the data is not complete: for example, not all questions will
have answers and vice versa, unless they were posted by one of the users selected for
this sample.
A more detailed explanation of the data set is given as an appendix to this document.
Question 1: (25 marks)
In the appendix, you will find instructions to import the dataset provided and to create a
MongoDB database called “database_A”. The database you create can be considered
normalised, as each entity is stored in a separate collection. Please follow the
instructions to create this database.
- Design an alternative “embedded” data model for this dataset, as instructed
below, and create a new database (database_B) that implements this data model
and stores all the data provided.
Your embedded data model should contain the following collections.
- a database collection for questions that stores for each question, the full
record of its accepted answer, if it exists, and a null record if it does not exist.
- a database collection for answers, that stores for each answer, the full record
of its related question.
Note that these two collections are only part of the model you will need to store
the data set. You are free to design the rest of the model as you feel appropriate.
In the report describe the designed data model and include the queries you used
to create it and samples from the database to illustrate the created structures.
(7 marks)
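The embedding rule for the two required collections can be pictured with plain Python dictionaries before any MongoDB work is done. The sketch below is only an illustration. The field names (`Id`, `Title`, `AcceptedAnswerId`, `ParentId`, `Score`, `AcceptedAnswer`, `Question`) and the sample records are assumptions made for this example, so the real field names should be taken from the appendix. Storing `None` for an answer whose question is missing from the sample is likewise an assumed design choice.

```python
from copy import deepcopy

# Hypothetical normalised records, one list per collection (as in database_A).
questions = [
    {"Id": 1, "Title": "How do I index a field?", "AcceptedAnswerId": 11},
    {"Id": 2, "Title": "What is sharding?", "AcceptedAnswerId": None},
]
answers = [
    {"Id": 11, "ParentId": 1, "Score": 5},
    {"Id": 12, "ParentId": 1, "Score": 0},
    {"Id": 13, "ParentId": 99, "Score": 2},  # question 99 is not part of the sample
]


def embed(questions, answers):
    """Build the two embedded collections from the normalised records."""
    answers_by_id = {a["Id"]: a for a in answers}
    questions_by_id = {q["Id"]: q for q in questions}

    embedded_questions = []
    for q in questions:
        doc = deepcopy(q)
        # Full accepted-answer record, or None (stored as null) when there is none.
        doc["AcceptedAnswer"] = deepcopy(answers_by_id.get(q["AcceptedAnswerId"]))
        embedded_questions.append(doc)

    embedded_answers = []
    for a in answers:
        doc = deepcopy(a)
        # Full record of the related question; None when the sample lacks it.
        doc["Question"] = deepcopy(questions_by_id.get(a["ParentId"]))
        embedded_answers.append(doc)

    return embedded_questions, embedded_answers


embedded_questions, embedded_answers = embed(questions, answers)
```

The trade-off to note is duplication: every answer carries a copy of its question, so a change to a question has to be applied in more than one place.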
- Write the following queries once against database_A and again against
database_B.
In the report, include the query you used, the number of results you obtained and
a sample data set (e.g. limit 20 records) to illustrate the execution of the query.
i. For all users who achieved the “Nice Question” badge after 1st January
2020, find the “Questions” they posted and their corresponding “Accepted
Answer”.
(Project the user id, user display name, badge name, date the badge was
awarded, question id and question title, accepted answer id and answer
score).
(5 marks)
(2.5 marks for database_A and 2.5 marks for database_B)
ii. For the same users in part i, find the “Comments” related to the “Accepted
Answer” for the “Questions” they posted.
(Project the user id, question id, accepted answer id, comment id).
Note that an answer may have many comments associated with it.
(5 marks)
(2.5 marks for database_A and 2.5 marks for database_B)
- Demonstrate which of the databases, database_A or database_B, achieved better
performance when executing queries 2 i and 2 ii above.
In the report, include the queries you used to answer this question.
(3 marks)
- The dataset provided for this application is 34MB in size with 200 users, 3,717
questions, 15,665 answers, 22,278 comments, and 11,964 badges.
The size of the Stack Overflow database has grown from 10GB in 2010 to 401GB
now, with nearly 11m users, 18m questions, 27m answers and 75m comments.
Discuss, with reference to scalability, whether your embedded design
(database_B) is suitable for this application.
In your answer, refer to possible modelling options to support a scalable solution
and refer to how sharding can be used for this application.
(5 marks: 3 marks for modelling options and 2 marks for sharding consideration)
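To make the sharding part of the discussion concrete, the following Python sketch shows the general idea behind a hashed shard key: a stable hash of the key value decides which shard holds a document, so sequential ids are spread roughly evenly. It is not MongoDB's internal algorithm. The function name, the choice of SHA-256, the use of the question id as the key and the figure of four shards are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


def shard_for(key: int, num_shards: int) -> int:
    """Map a shard-key value to a shard number using a stable hash (illustrative only)."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# 3,717 sequential question ids, matching the size of the sample data set.
question_ids = range(1, 3718)
load = Counter(shard_for(qid, 4) for qid in question_ids)
```

Here `load` counts how many questions land on each of the four hypothetical shards. A hashed key spreads writes well but scatters range queries across shards, which is one of the trade-offs a scalability discussion might weigh against a range-based key.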
Question 2: (25 marks)
Design and create a Neo4J database to store the Stack Overflow dataset provided for
this coursework. Remember to create constraints to ensure your data records are unique
and create indexes for effective search and retrieval.
- Write the following queries against the Neo4J database.
In the report, include the query that you wrote to answer the question and sample
answers/graph to illustrate the answers you got.
i. Find all users who achieved the “Fanatic” badge.
(Project the user id, user display name, the badge name and the date the
badge was awarded.)
(2 marks)
ii. For all those users who achieved the “Nice Question” badge after 1st
January 2020, find the “Questions” they posted.
(Project the user id, user display name, the badge name and the date the
badge was awarded, question id and question title.)
(3 marks)
iii. For the same users in part ii, find the “Comments” related to the
“Questions” they posted.
(Project the user id, user display name, the badge name and the date the
badge was awarded, question id, question title and comment id.)
(3 marks)
iv. For all users who achieved the “Inquisitive” badge, find the questions they
posted that have an “Accepted Answer” record in the data set.
(Project the user id, user display name, the badge name, question id,
question title and accepted answer id.)
(5 marks)
- For any two of the queries above, produce a report on the performance of the
query execution. In the report, describe the commands/queries you used to
answer this question and the results you get.
Demonstrate how the use of an index can affect the performance of the query
execution in the database. You need only answer this for one of the queries.
(6 marks: 4 marks for the report on performance of the two queries, 2 marks for
demonstrating performance with an index creation.)
- COMSC has its own Stack Overflow: https://stackoverflowteams.com/c/comsc/
There you will find collections representing the different modules, to organise the
presentation of the questions and answers to users;
https://stackoverflowteams.com/c/comsc/collections
Consider how you would update the data model you implemented above to
represent such collections.
Produce an updated design in Neo4J to represent these collections. Provide the
queries you use to create the new model to represent the collections and how
they relate to questions and answers in the database.
(6 marks)
Learning Outcomes Assessed
- Demonstrate an appreciation of applications of large-scale databases in a variety of
commercial, scientific and professional contexts;
- Be able to choose and develop a non-relational database solution suitable for the
type of data and application considered;
Criteria for assessment
Credit will be awarded against the following criteria.
The maximum mark available is 50. The result will be scaled to give a mark out of
100.
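The scaling step is simple arithmetic, sketched below in Python. Linear scaling and the function names are assumptions made for this sketch. The classification thresholds follow the grade bands used in the criteria table (Pass 40-49%, 2.2 50-59%, 2.1 60-69%, 1st >= 70%).

```python
def scale_to_percentage(marks_awarded: float, max_marks: float = 50.0) -> float:
    """Scale a raw coursework mark (out of max_marks) to a mark out of 100."""
    if not 0 <= marks_awarded <= max_marks:
        raise ValueError("marks_awarded must lie between 0 and max_marks")
    return marks_awarded * 100.0 / max_marks


def classify(percentage: float) -> str:
    """Return the grade band for a mark out of 100."""
    if percentage >= 70:
        return "1st"
    if percentage >= 60:
        return "2.1"
    if percentage >= 50:
        return "2.2"
    if percentage >= 40:
        return "Pass"
    return "Fail"


scaled = scale_to_percentage(35)   # 35 out of 50
band = classify(scaled)
```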
Grade bands: Fail, Pass (40-49%), 2.2 (50-59%), 2.1 (60-69%), 1st (>= 70%). For each component, the mark ranges below run from the Fail band upwards.
Component & Contribution | Marks | Descriptor |
---|---|---|
Q1.1 (7 marks) | 1-2 | Attempt to create the database not successful. Poor or no explanation provided. |
Q1.1 (7 marks) | 3-4 | Database created; may be incomplete representation of the data set; may have used different design from suggested. Report may lack detail. |
Q1.1 (7 marks) | 5-7 | Database created successfully with the constraints suggested. Good report with a full set of queries to show the model design. |
Q1.2 (10 marks) | 1-3 | Some attempt at the solution is provided, but attempt is poor and shows lack of understanding and poor effort. |
Q1.2 (10 marks) | 4-5 | Some attempt at the questions, but answers may not be completely correct. Demonstrates poor to fair understanding of concepts and adequate level of effort. |
Q1.2 (10 marks) | 6-7 | Good attempt at all questions, with a few slips or incomplete solutions. Demonstrates very good understanding of concepts and good effort. |
Q1.2 (10 marks) | 8-10 | Excellent answer with correct solutions in all parts. Demonstrates excellent understanding of concepts and excellent effort. |
Q1.3 (3 marks) | 0 | No performance measurements are provided. |
Q1.3 (3 marks) | 1-2 | Answer is partly correct, possibly given for one query only. |
Q1.3 (3 marks) | 3 | Answer is complete and given for both queries. |
Q1.4 (5 marks) | 0-1 | Answer is either wrong or majorly incomplete. Demonstrates lack of awareness of taught material and poor effort. |
Q1.4 (5 marks) | 2 | Some reference to important points, but answer is sketchy and demonstrates poor awareness of main concepts taught. |
Q1.4 (5 marks) | 3 | Basic answer that references important points. Demonstrates awareness of main concepts taught but can be sketchy or incomplete or lack insight and critical reflection when considering the problem scenario. |
Q1.4 (5 marks) | 4-5 | Excellent justification that considers the application scenario. Demonstrates an excellent grasp of the taught material and an effort in background reading. |
Q2.1 (13 marks) | 1-4 | Some attempt at the solution is provided, but attempt is poor and shows lack of understanding and poor effort. |
Q2.1 (13 marks) | 5-6 | Some attempt at the questions, but answers may not be completely correct. Demonstrates poor to fair understanding of concepts and adequate level of effort. |
Q2.1 (13 marks) | 7-8 | Good attempt at all questions, with a few slips or incomplete solutions. Demonstrates very good understanding of concepts and good effort. |
Q2.1 (13 marks) | 9-13 | Excellent answer with correct solutions in all parts. Demonstrates excellent understanding of concepts and excellent effort. |
Q2.2 (6 marks) | 0-2 | No performance measurements are provided, or sketchy attempt at part of the question. |
Q2.2 (6 marks) | 3-4 | Answer is partly correct, possibly given for one query only, or missing part related to indexes. |
Q2.2 (6 marks) | 5-6 | Answer is complete and given for both queries and index performance. |
Q2.3 (6 marks) | 0-2 | Some attempt is made at the updated design, but the answer is either wrong or majorly incomplete. Demonstrates lack of awareness of taught material and poor effort. |
Q2.3 (6 marks) | 3-4 | Good solution that addresses the problem, but may not be fully justified. |
Q2.3 (6 marks) | 5-6 | Excellent design that addresses the problem fully. Demonstrates an excellent grasp of the taught material and an effort in background reading. |
Feedback and suggestion for future learning
Feedback on your coursework will address the above criteria. Feedback and marks will
be returned within 20 working days from submission deadline via Learning Central.
Feedback from this assignment will be useful to understand limitations of current
knowledge, technical and communication abilities related to the subject of the module.
Submission Instructions
Description | Compulsory? | Type | Name |
---|---|---|---|
Assessment_1 | Yes | One PDF (.pdf) file comprising your answer to all questions | [student number]-CM3104-report.pdf |
Assessment_2 | Yes | One text file (.txt) comprising the queries you used for all questions | [student number]-CM3104-queries.txt |
Any code submitted will be run on a system equivalent to those suggested for use with
the lab exercises and must be submitted as stipulated in the instructions above.
Any deviation from the submission instructions above (including the number and types of
files submitted) will result in a mark of zero for the assessment.
Staff reserve the right to invite students to a meeting to discuss coursework submissions.
Support for assessment
Questions about the assessment can be asked on the Discussion Forum on LC and in
the online lecture slots dedicated for coursework support, as will be explained by the
lecturer in due course.
Support for the programming elements of the assessment will be available in the lab
classes, offered weekly throughout the semester.
CM3104: Large-Scale Databases
School | Cardiff School of Computer Science and Informatics |
Department Code | COMSC |
Module Code | CM3104 |
External Subject Code | 100754 (databases) |
Number of Credits | 20 |
Level | Level 6 UG Degree (L6) |
Language of Delivery | English |
Module Leader | Dr ALIA ABDELMOTY |
Semester | Autumn Semester SEM1 |
Academic Year | 2023/4 |
Outline Description of Module
This module explores a range of database technologies that have been motivated by the demands of applications that create massive volumes of data with rapidly changing data types – structured, semi-structured and unstructured data. For example, management of location and geo-spatial information has resulted in extensions to conventional relational databases that can be supported by object-relational database systems. Access to massive quantities of social, scientific and commercial data on the web has resulted in more radical departures from the relational data model. The module introduces the modelling and management of large-scale datasets with a range of modern database technologies, including NoSQL document and graph databases.
On completion of the module a student should be able to
- Demonstrate an appreciation of applications of large-scale databases in a variety of commercial, scientific and professional contexts;
- Discuss how relational databases are extended with object-relational technologies to support management of spatial information;
- Understand the characteristics of and methods of processing geospatial information for purposes of storage and retrieval;
- Describe non-relational database approaches including document and graph databases to support access to large data sets;
- Be able to choose and develop a non-relational database solution suitable for the type of data and application considered;
How the module will be delivered
Modules will be delivered through blended learning. You will be guided through learning activities appropriate to your module, which may include:
- on-line resources that you work through at your own pace (e.g. videos, web resources, e-books, quizzes),
- on-line interactive sessions to work with other students and staff (e.g. discussions, live streaming of presentations, live-coding, team meetings)
- face to face small group sessions (e.g. help classes, feedback sessions)
Skills that will be practised and developed
Practice creating and querying NoSQL databases using MongoDB;
Model and query data in Graph databases using Neo4J;
Use of spatial SQL language to store and retrieve spatial/geographic data.
Basic geoparsing methods in Python for place name entity recognition and for geocoding place names.
How the module will be assessed
A blend of assessment types which may include coursework and portfolio assessments, class tests, and/or formal examinations.
Students will be provided with reassessment opportunities in line with University regulations.
Assessment Breakdown
Type | % | Title | Duration(hrs) |
---|---|---|---|
Written Assessment | 50 | Coursework | N/A |
Exam online – Autumn semester | 50 | Large-Scale Databases | 2 |
Syllabus content
Review of applications that require support for massive quantities of data, with reference to Cloud Computing and to Big Data.
Non-relational database management methods (NoSQL) for access to large distributed datasets.
Spatial databases for Geographical Information Systems (GIS), including spatial data models and spatial extensions of SQL.
Methods for indexing spatial data and textual data and methods for geo-referencing documents to support spatio-textual indexing.