
Text Mining for Economics and Finance
Lecture 5.2: Word Embeddings Estimation
Paul E. Soto, Ph.D.¹
Robert H. Smith School of Business
University of Maryland
¹ Opinions expressed in this presentation are those of the instructor and not necessarily those of the FDIC
Agenda
Difference between Skip Gram and CBOW models
Estimating Word2Vec with Gensim
Understanding the main parameters such as size, window, sg, and min_count
Further reducing the dimensions of the word embeddings using t-SNE
Creating relationships or analogies using the most_similar method
Recap of Word Embeddings
We can represent words with one-hot encoded vectors, but this is not useful!
Recap of Word Embeddings
So, we estimate word embeddings
Unsupervised model
Creates vector representations of words
Distances in the vector space represent syntactic and semantic similarities
Estimated by setting up prediction exercises
Skip Gram Model: use target word to predict context words
Continuous Bag of Words (CBOW): use context words to predict target
Let’s look at one training exercise…
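A minimal sketch of how one sentence becomes prediction exercises under each model; the sentence and window size here are made up for illustration:

# Illustrative only: a made-up sentence and a window of 2
sentence = ["fears", "and", "uncertainty", "weigh", "on", "markets"]
window = 2

skip_gram_pairs = []   # Skip Gram: (target word, one context word)
cbow_pairs = []        # CBOW: (all context words, target word)
for i, target in enumerate(sentence):
    lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
    context = [sentence[j] for j in range(lo, hi) if j != i]
    skip_gram_pairs.extend((target, c) for c in context)
    cbow_pairs.append((context, target))

print(skip_gram_pairs[:4])  # [('fears', 'and'), ('fears', 'uncertainty'), ('and', 'fears'), ('and', 'uncertainty')]
print(cbow_pairs[2])        # (['fears', 'and', 'weigh', 'on'], 'uncertainty')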
Recap of Word Embeddings
Estimation strategy
First, create random U and β matrices
For each word in every sentence, predict the target word from its context (CBOW) or the context words from the target (Skip Gram)
Using backpropagation, shift the U and β matrices to improve the predictions
Repeat until the predictions meet a threshold (a small numerical sketch follows)
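For intuition, here is a toy numpy sketch of a single Skip Gram update with a full softmax. The vocabulary, dimensions, and learning rate are made up, and real Word2Vec replaces the full softmax with negative sampling or a hierarchical softmax for speed:

import numpy as np

# Toy example: one Skip Gram gradient step with a full softmax
vocab = ["and", "uncertainty", "fears", "think"]
V, H = len(vocab), 3                        # vocabulary size, embedding dimension
rng = np.random.default_rng(0)
U = rng.normal(scale=0.1, size=(H, V))      # input embeddings, one column per word
B = rng.normal(scale=0.1, size=(V, H))      # output ("beta") matrix
lr = 0.05                                   # learning rate (arbitrary)

target, context = vocab.index("uncertainty"), vocab.index("fears")

h = U[:, target]                            # hidden layer = target word's embedding
scores = B @ h                              # one score per vocabulary word
p = np.exp(scores - scores.max())
p /= p.sum()                                # softmax: predicted context-word probabilities

err = p.copy()
err[context] -= 1.0                         # gradient of cross-entropy w.r.t. the scores
U[:, target] -= lr * (B.T @ err)            # backpropagate into the target word's embedding
B -= lr * np.outer(err, h)                  # ...and into the output matrix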
Recap of Word Embeddings
Both U and β represent the word embeddings
$$
U =
\begin{bmatrix}
u^{1}_{\text{and}} & u^{1}_{\text{uncertainty}} & u^{1}_{\text{fears}} & \cdots & u^{1}_{\text{think}} \\
u^{2}_{\text{and}} & u^{2}_{\text{uncertainty}} & u^{2}_{\text{fears}} & \cdots & u^{2}_{\text{think}} \\
u^{3}_{\text{and}} & u^{3}_{\text{uncertainty}} & u^{3}_{\text{fears}} & \cdots & u^{3}_{\text{think}}
\end{bmatrix}
=
\begin{bmatrix}
u_{\text{and}} & u_{\text{uncertainty}} & \cdots & u_{\text{think}}
\end{bmatrix}
$$
Let’s plot these word embeddings…
Application of Word Embeddings
Enron Corporation Accounting Scandal
American energy company based in Houston, Texas
Founded in 1985
Over the years, expanded into trading
Hid financial troubles using dubious accounting loopholes/practices
E.g. “mark-to-market accounting”
On December 2, 2001, Enron filed for bankruptcy (nearly $60 billion in assets)
The scandal led to the Sarbanes-Oxley Act in 2002
Dataset
500,000 emails generated by Enron employees
Collected by the Federal Energy Regulatory Commission during investigation
0.5% sample randomly drawn for this exercise
Complete dataset available at https://www.cs.cmu.edu/~enron/
Word Embeddings in Python
Enron Email Dataset (as published at https://www.cs.cmu.edu/~enron/)
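A hedged sketch of reading the email sample into tokenized sentences for gensim; the file name enron_sample.csv and the message column are assumptions, not the actual layout of the course files:

import pandas as pd
from gensim.utils import simple_preprocess

# Hypothetical layout: file name and "message" column are assumptions
emails = pd.read_csv("enron_sample.csv")                 # the 0.5% random sample
emails["tokens"] = emails["message"].astype(str).apply(simple_preprocess)

sentences = emails["tokens"].tolist()                    # list of token lists for Word2Vec
print(sentences[0][:10])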
Word Embeddings in Python
size: dimension of the word embeddings
sg: whether to estimate the Skip Gram model (sg=1) or CBOW (sg=0)
min_count: ignores all words with total frequency lower than this
window: the maximum distance between the target word and the context words used for prediction (an example call follows)
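A minimal training call showing how these parameters map onto gensim's Word2Vec; the specific values are illustrative, sentences comes from the preprocessing step above, and in gensim 4.0+ the size argument is named vector_size:

from gensim.models import Word2Vec

model = Word2Vec(
    sentences,
    size=100,       # dimension of the word embeddings (vector_size in gensim >= 4.0)
    window=5,       # maximum distance between target and context words
    sg=1,           # 1 = Skip Gram, 0 = CBOW
    min_count=10,   # ignore words that appear fewer than 10 times
)

print(model.wv["uncertainty"].shape)   # (100,) — the embedding for "uncertainty"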
Word Embeddings in Python
Dimension Reduction using t-SNE
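A sketch of projecting the trained embeddings down to two dimensions with scikit-learn's TSNE and plotting them; how many words to label is an arbitrary choice, and in gensim 4.0+ index2word is named index_to_key:

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

words = list(model.wv.index2word)      # vocabulary, sorted by frequency
vectors = model.wv[words]              # (vocabulary size, embedding dimension)

coords = TSNE(n_components=2, random_state=0).fit_transform(vectors)

plt.figure(figsize=(10, 8))
plt.scatter(coords[:, 0], coords[:, 1], s=2)
for word, (x, y) in zip(words[:50], coords[:50]):   # label only the 50 most frequent words
    plt.annotate(word, (x, y), fontsize=8)
plt.show()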
Word Embeddings in Python
Relationships and Analogies in H-dimensional space
vec(vegas) − vec(nv) + vec(phoenix) ≈ vec(az)
vec(az) − vec(phoenix) + vec(mo) ≈ vec(louis)
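These analogies are queried with the most_similar method, which returns (word, cosine similarity) pairs; a sketch, assuming the tokens survived the min_count filter:

# vec(vegas) − vec(nv) + vec(phoenix) ≈ ?
print(model.wv.most_similar(positive=["vegas", "phoenix"], negative=["nv"], topn=3))

# vec(az) − vec(phoenix) + vec(mo) ≈ ?
print(model.wv.most_similar(positive=["az", "mo"], negative=["phoenix"], topn=3))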
Now What?
Am I ever going to use this in “real” life?
Text Mining is Part of the Future
Today’s Gold = Data
Text Mining Jobs
What about the job that doesn’t exist yet?
Text Mining Jobs
Skills Learned
A basic understanding of Python
Handling dataframes, plotting, opening text files, basic operations
Normalizing text through pre-processing
Finding out which words are rare or common with a TF-IDF matrix
Automatically classifying text as Good/Bad, Hawkish/Dovish, etc.
Naive Bayes Classifier/Support Vector Machines
Discovering topics among a set of unannotated documents
Topic Modeling
Modeling words in a way that preserves semantic meaning
Word embeddings/Word2Vec
Thinking above and beyond project expectations…
That’s All Folks
Good Luck with the Final Project and Final Exams!

https://networth.rhsmith.umd.edu/courses/bufn-758a