ARCHITECTURE FOR AUTOMATIC POETRY GENERATION THROUGH PATTERN RECOGNITION by Kimberly Scofield B.Sc., University of Northern British Columbia, 2011 THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN INTERDISCIPLINARY STUDIES UNIVERSITY OF NORTHERN BRITISH COLUMBIA April 2017 © Kimberly Scofield, 2017 Abstract Document representation and topic modelling are important problems for artificial intelligence researchers, with applications ranging from education technology to bioinformatics. Many approaches have been proposed, the majority falling broadly into categories of Statistical Analysis and Natural Language Processing (NLP). This thesis proposes an architecture that optimizes a combination of statistical and linguistic analysis in an unsupervised machine learning environment. The proposed architecture is a design for agile, stable, document modelling. By clustering within the statistical inference algorithm, it reduces the computational cost of time and space associated with conventional classifying algorithms such as K-means, increasing the threshold for size and frequency of aggregate data analysis. This translates to an increased stability for evolution of learning. The architecture builds on the concept of socio-linguistic connections as an inherent combination of statistics and linguistics, and employs well-researched concepts of statistical and linguistic analysis, including embedded sub-manifold analysis. It optimizes both linguistic connections and computational cost. Trials are run with three sets of parameters, and results distributed for volunteer evaluation. Feedback from the evaluation indicates that the proposed architecture produces groups of sentences (poems) with a high degree of social acceptance and response. ii Contents Abstract ii List of Figures iv List of Tables vii Acknowledgement viii 1 Introduction 1 1.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3 2 Background and Related Work 5 2.1 Linguistic Integrity . . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.2 Statistical Processing Algorithms . . . . . . . . . . . . . . . . . . . . 9 2.2.1 VSM: Vector Space Model . . . . . . . . . . . . . . . . . . . . 9 2.2.2 LSI: Latent Semantic Indexing . . . . . . . . . . . . . . . . . . 11 2.2.3 PLSI: Probabilistic Latent Semantic Indexing . . . . . . . . . 12 2.2.4 LDA: Latent Dirichlet Allocation . . . . . . . . . . . . . . . . 15 2.2.5 LapPLSI: Laplacian Probabilistic Latent Semantic Indexing . 16 2.3 Similarity Measurements for Clustering . . . . . . . . . . . . . . . . 20 iii 2.3.1 Centroid-based Clustering . . . . . . . . . . . . . . . . . . . . 21 2.3.2 Distribution-based clustering . . . . . . . . . . . . . . . . . . . 23 2.4 Natural Language Processing Algorithms . . . . . . . . . . . . . . . 25 2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 3 Architecture of the Automatic Generator 36 3.1 The Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.2 The Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39 3.3 The Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42 4 Implementation Detail 48 4.1 Block 1: Create the dictionary and the data . . . . . . . . . . . . . . 48 4.2 Block 2: Create viable sentences . . . . . . . . . . . . . . . . . . . . 49 4.3 Block 3: Distribution-based Analysis . . . . . . . . . . . . 
. . . . . . 51 4.4 Block 4: Natural Language Processing (NLP) Analysis . . . . . . . . 53 4.5 Block 5: Evolve the HCI Cognitive Response . . . . . . . . . . . . . 54 5 Experimental results and analysis 55 5.1 Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 5.2 Results of LapPLSI analysis in Block 3 . . . . . . . . . . . . . . . . 58 5.3 The Poems: Validation and feedback . . . . . . . . . . . . . . . . . . 70 6 Conclusions and Future Work 73 Bibliography 77 A Glossary 81 B Poems: a mixed selection 83 iv List of Figures 2.1 Singular Value Decomposition (SVD) . . . . . . . . . . . . . . . . . 11 2.2 Affine Space example: Real numbers in 3 dimensions . . . . . . . . 17 2.3 Manifold geodisics: a visual representation of intrinsic geometry 2.4 K-Means clustering: Voroni Partitioning . . . . . . . . . . . . . . . . 22 2.5 K-Means clustering of the Iris Dataset . . . . . . . . . . . . . . . . . 23 2.6 Normal Distribution: LapPLSI versus Gaussian . . . . . . . . . . . . 24 2.7 Cluster: Gaussian example . . . . . . . . . . . . . . . . . . . . . . . 24 3.1 Block 1: Create dictionary. Generate text . . . . . . . . . . . . . . . 43 3.2 Block 2: Create Viable Sentences . . . . . . . . . . . . . . . . . . . . 44 3.3 Block 3: Inference Analysis . . . . . . . . . . . . . . . . . . . . . . . 45 3.4 Block 4: Natural Language Processing (NLP) Analysis . . . . . . . . 46 3.5 Block 5: Fine-tune sentence creation and selection. . . . . . . . . . . 47 5.1 5 clusters, 100 Aggregated files . . . . . . . . . . . . . . . . . . . . . 58 5.2 5 clusters, 300 Aggregated files . . . . . . . . . . . . . . . . . . . . . 59 5.3 5 clusters, 600 Aggregated files . . . . . . . . . . . . . . . . . . . . . 60 5.4 5 clusters, 1200 Aggregated files . . . . . . . . . . . . . . . . . . . . 61 5.5 7 clusters, 100 Aggregated files . . . . . . . . . . . . . . . . . . . . . 62 5.6 7 clusters, 300 Aggregated files . . . . . . . . . . . . . . . . . . . . . 63 v . . 19 5.7 7 clusters, 600 Aggregated files . . . . . . . . . . . . . . . . . . . . . 64 5.8 7 clusters, 1200 Aggregated files . . . . . . . . . . . . . . . . . . . . 65 5.9 10 clusters, 100 Aggregated files . . . . . . . . . . . . . . . . . . . . 66 5.10 10 clusters, 300 Aggregated files . . . . . . . . . . . . . . . . . . . . 67 5.11 10 clusters, 600 Aggregated files . . . . . . . . . . . . . . . . . . . . 68 5.12 10 clusters, 1200 Aggregated files . . . . . . . . . . . . . . . . . . . . 69 5.13 Results of poem feedback . . . . . . . . . . . . . . . . . . . . . . . . 72 vi List of Algorithms 2.1 Generalized EM Algorithm for PLSI . . . . . . . . . . . . . . . . . . 14 2.2 Generalized EM Algorithm: Modified Steps for LDA . . . . . . . . . 15 2.3 Manifold Assumption Overview . . . . . . . . . . . . . . . . . . . . . 20 4.1 Generalized EM Algorithm for LapPLSI . . . . . . . . . . . . . . . . 52 List of Tables 2.1 First Order Vectors of Unigrams . . . . . . . . . . . . . . . . . . . . 10 2.2 First Order Co-occurrence Matrix: vectors of bigrams . . . . . . . . 27 2.3 Second Order Co-occurrence Matrix, optimized with SVD . . . . . . 27 2.4 Second Order Co-occurrence: sentence one vectors . . . . . . . . . . 28 2.5 Second Order Representation: sentence one vectors . . . . . . . . . . 29 3.1 Document-Term Matrix, average stats . . . . . . . . . . . . . . . . . 37 3.2 Document-Term Matrix: 600 documents, average stats . . . . . . . . 
37 vii Acknowledgement First, and most important, I wish to acknowledge the support and encouragement I have received from my family while on this journey. Their belief in me and their unfailing support has been invaluable. I also want to thank my friends for their support and encouragement through everything, and for sticking around even when my time with them was limited. I couldn’t have done this without all of you - thank you for never letting me doubt myself! Next, I wish to acknowledge the guidance, persistence, and encouragement of my co-supervisor, Dr. Liang Chen. His belief in my project has been both unfailing and encouraging; his belief in me has been inspiring. Thank you also to my co-supervisor Dr Charles Brown, whose love of linguistics and belief in me was the start of my journey, and whose guidance showed me which parts of the field I love most. Thank you both for everything you have done for me over the last few years. viii Chapter 1 Introduction 1.1 Overview This chapter provides an overview of the thesis topic, as well as the proposed architecture and methodology. Document representation and topic modelling are key components in finding relevant, meaningful, linguistic relationships within a text. The purpose of this project is to look at these relationships from a perspective di↵erent than the conventional document-term joint-probability analysis of a large corpus, and focus instead on the advantages to be gained from analysis and modelling with smaller ”documents” (sentences, in this case), sociolinguistic feature sets, and statistical analysis that goes beyond Euclidean space. The result is an environment that optimizes the discovery of linguistic connections, while reducing the computational cost of space and time. Methodology • Linguistics and semantics are defined within a relaxed structure. There is no conventional part-of-speech-tagging, and only broad structural categories to 1 guide word placement and choice. • The native dictionary can be adapted to target specific baselines, such as reading level. • The statistical / Natural Language Processing (NLP1 ) order of analysis, and the specific methods used, have been carefully chosen to preserve and promote linguistic meaning while at the same time minimizing matrix size and sparsity. • Conventional clustering is not used: clustering is accomplished by the statistical algorithms in the embedded submanifold code, reducing computational cost (space and time) and allowing an increased threshold for stable evolution of learning. Architecture • Pluggable architecture: the customizable semantic template and dictionary can accommodate di↵erent languages and reading levels. • Block style components: each section, or block, executes one discrete part of the process, allowing changes to be made and the e↵ects tracked. • Adjustable parameters: a native toolkit to integrate block functionality allows coarse- and fine-grained optimization at many levels, to adjust output for di↵erent ages, abilities, and output goals. Why Poetry? Poetry o↵ers a viable model of the human creative process, linking thought process to emotive expression. This is the crucial area of information selection: identifying 1 Natural Language Processing 2 the data and parameters that most efficiently and accurately bridge a humanized extrapolation from data, to information, and finally to knowledge. 
Free form poetry does this in a landscape of (comparatively) relaxed syntactic structure, allowing expression, abstraction and extrapolation to be of higher priority than form. As Flores states in [21] Understanding Computers and Cognition: A New Foundation for Design: ”one of the most prominent illusions... is the belief that knowledge consists of formal theories that can be systematically used to make predictions.” The relaxed structure of free form poetry allows the algorithm to create a series of abstract associations, building on the inherent cultural and linguistic markers present in the sentence structure and the dictionary. The goal is to emulate not the process of human thought but the result. 1.2 Contribution The main contribution of this thesis is a design for unsupervised, agile, stable document modelling, simulated within an architecture for the automatic generation of poetry. This architecture combines well-researched methods of statistical and NLP analysis in a structure of independent blocks. Each block uses customized parameters specifically chosen to optimize extraction of information that is culturally and socially relevant, as well as optimizing computational cost. The goal of the project is to create a framework capable of generating poetry that will pass as human-created. To accomplish this goal, the proposed architecture builds on the concept of sociolinguistic connections as a natural combination of statistical probability, linguistics, and relevant vocabulary. The architecture creates simulation data using a pseudo-random generated set of sentences and a custom dictionary. Dimensionality reduction is done first, to 3 preserve inherent and/or hidden semantic integrity while at the same time reducing noise and sparsity. This is followed by statistical analysis within an embedded submanifold model [2]. Linguistic analysis techniques are then used to cluster groups of semantically related sentences. This architecture was validated in the context of a modified Turing Test: volunteers read and evaluated a a set of unlabelled poems, created by both human poets and the Automatic Generator. The results show that the architecture produces sentences and groups of sentences (poems) with a high degree of social acceptance, concept fluidity, and information extrapolation. Of the varied number of clusters examined, trials using 5 clusters were found to be the most e↵ective. These were run on files of 3000-3500 sentences, each sentence consisting of between 3 and 12 words. The organization of the remaining sections is as follows: Chapter 2 covers background and related work. Chapter 3 is a general overview of the proposed architecture and its methodology. Chapter 4 is a detailed look at each of the architecture’s components/blocks. Chapter 5 provides experimental results and analysis. Chapter 6 concludes the thesis with a summary of the concepts presented, and some thoughts and ideas for future work. 4 Chapter 2 Background and Related Work This chapter presents background information and an overview of previous work in document representation and topic modelling, statistical and NLP1 analysis, and sociocultural linguistics. Document representation and topic modelling encompass broad fields of application, and many di↵erent implementation strategies within these fields. 
For the purpose of this project we focus on document representation as a means of identifying relevant structures within a text document (punctuation, words, document begin/end points, etc) and topic modelling as the clustering of conceptually connected words, and groups of words, across all documents within a corpus. For researchers in this field the question has always been: how to create a robust, self-evolving model that can dynamically adapt to random user input while producing a relative, pragmatic output aggregated across all data, both historical and current. 1 Natural Language Processing 5 2.1 Linguistic Integrity It is important to note that this project has no interest in teaching computers to use language as a human would use it. Rather, this project is interested in a method to draw conclusions from human communication and to extrapolate these conclusions. Linguistic integrity addresses the creation of appropriate, pseudo-random, input data that is carefully structured to simulate a reasonable baseline learning environment. For this we have precedent from both the computational and linguistics community: [12] George Luger, Artificial Intelligence: “Digital computers are not merely a vehicle for testing theories of intelligence. Their architecture also suggests a specific paradigm for such theories: intelligence is a form of information processing. The notion of a problem-solving methodology, for example, owes more to the sequential nature of computer operation than it does to any biological model of intelligence.” (p 12) [21] Fernando Flores, Understanding Computers and Cognition: “computers will remain incapable of using language in the way human beings do, both in interpretation and in the generation of commitment that is central to language.” (p 12) [9] Dan Jurafsky and James Martin, Speech and Language Processing: “The meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities (p 13, from Zellig Harris, 1968)” In other words... “You shall know a word by the company it keeps.” [15] (Ted Pedersen’s paraphrase of Harris, 1954 and Firth, 1957) 6 Maximizing the retention and visibility of the underlying meaning that is inherent in the data means creating simulation data with a balance between a complete lack of syntactic structure (no word order, gibberish) and a basic word order, as could conceivably be acquired when someone learns a new language by listening to others speak. A computer could arguably learn these same basic word placements by analyzing word patterns in documents - in e↵ect “learning” rudimentary word placement. This new-language-learner semantic structure has been simulated with a custom dictionary and a modified Sci-Gen [20] document template. On these concepts both sides of the “nature versus nurture” crowd can play nicely together in some instances: “theories of cognition can deal purely with ‘competence’, characterizing the behavior of HCI cognitive systems while making no hypothesis concerning the generation of that behavior by mechanisms.” (Noam Chomsky, as cited in [21] Understanding Computers and Cognition: A New Foundation for Design). Linguistics is the scientific study of language and its structure, including the study of morphology, syntax, phonetics, and semantics. Computational linguistics is that branch of linguistics in which the techniques of computer science are applied to the analysis of language. 
A sociolinguistic approach, in combination with the new-language-learner vocabulary and semantic structure, allows the AutoGen architecture to see not only valid, but also culturally relevant, patterns (concept connections) in the text. As humans, each of us has access to a vast internal storehouse of subtle, subconscious, language-related knowledge: a synthesis of experience, social interaction, and personal reflection. Linguistics is a tool that we can use to tap into this constantly evolving knowledge base. In [21] Understanding Computers and Cognition Flores states: “language, and 7 therefore thought, is ultimately based on social interaction.” Social interaction is the vital link that transforms information into usable knowledge, knowledge that is both socially and culturally relevant. Supervised machine learning (training sets that determine weights) is one way to simulate a baseline for social interaction. This project uses unsupervised machine learning and looks at social interaction from a di↵erent perspective: the perspective of patterns. Patterns are the basis of computational linguistics: recurrence equates to patterns. Computers do well as “devices for facilitating human communication in language... [they] observe and describe regular recurrences” [21]. Therefore, linguistic meaning can be related to language structure as pattern and pattern recurrence. The proposed architecture does not link speech act and listener, it links a pattern of speech act recurrence, or a “pattern of acts through time.” [21] and by using the embedded submanifold protocol of [2] Deng Cai et al, we can bring visibility to 3+ dimensions of contextual meaning. New studies show there is a high likelihood that humans do not store knowledge, that we create, instead, well-used neural pathways that function as ‘memory recall’. By this theory, a ‘right decision’, such as recalling our route from home to work, is no more than the execution of a well-used neural path, or a neural path of least resistance [4]. Identifying conceptual patterns in text can also be seen to result from repetition (patterns) rather than accumulated knowledge. Echoing the earlier quote from both Harris and Firth, we can see that to know a word by the company it keeps (by its associations, placement, and frequency of occurrence) is equivalent to recognizing its pattern of recurrence. Shared assumptions and cultural anomalies also become less relevant when we add: “focus should be on the interactions within the system as a whole, not 8 on the structure of perturbations. The perturbations do not determine what happens in the nervous system, but merely trigger changes of state. “interactions and transformations continuously regenerate the network of processes (relations) that produced them”. [21] (Flores and Winograd, (quoted from: Maturana and Varela, Autopoiesis and Cognition, 1980, p 19)) And: We consciously register the results of our understanding and thinking ... but not the understanding and thinking processes themselves; and these symbolic abstractions, to the extent that they lack quantitative or probabilistic dimensions, can lead us to suppose that the underlying processing is nonquantitative as well. But the successes of statistical NLP2 , as well as recent developments in cognitive science (e.g., Fine et al. 2013; Tenenbaum et al. 2011; Chater and Oaksford 2008) suggest that language and thinking are not only symbolic, but deeply quantitative and in particular probabilistic. 
[19] (from The Stanford Encyclopedia of Philosophy)

In other words, successful simulation of human thought relies more on understanding what is happening than on why it is happening.

2.2 Statistical Processing Algorithms

2.2.1 VSM: Vector Space Model

VSM creates a document-term matrix with each document represented as a "bag of words", or sparse term-frequency vector, in which the order of the words is not important [17]:

The student runs an event. Students study running. The student's car runs fast.

Table 2.1: First Order Vectors of Unigrams

          stud   run   event   car   fast
sent-1      1     1      1      0     0
sent-2      2     1      0      0     0
sent-3      1     1      0      1     1

Table 2.1 shows a sparse first order co-occurrence matrix with each document (row) viewed as a term-frequency vector. A common method for representing the similarity of documents is the cosine similarity measurement: the cosine of the angle between two document vectors is calculated and similarity is expressed as the normalized inner product of this calculation:

    sim(d1, d2) = V(d1) · V(d2) / ( |V(d1)| |V(d2)| )        (2.1)

VSM has severe drawbacks, due mainly to the ambiguity of words (polysemy), individual differences in word usage (synonymy), and variations in personal writing style. VSM simply counts the number of occurrences of each word and therefore does not capture subtle differences in meaning based on context; for example, "a lean person" and "a leaning building".

2.2.2 LSI: Latent Semantic Indexing

Latent Semantic Indexing (LSI), also called Latent Semantic Analysis (LSA), was first proposed by Bellcore Laboratories in the late 1980s (G.W. Furnas, T.K. Landauer, L.M. Gomez, S.T. Dumais, The Vocabulary Problem in Human-System Communication: an Analysis and a Solution, Bell Communications Research, 1987). LSI was developed to improve VSM performance and uses Singular Value Decomposition (SVD) to reduce the dimensionality of the document-term matrix and increase visibility of the documents' latent semantic subspace [8]. This process of matrix reduction includes the amalgamation of similar words and word combinations, effectively reducing the ambiguity prevalent in the VSM model.

Figure 2.1: Singular Value Decomposition (SVD)

Statistically, the n highest singular values from matrix A can be used to produce the best rank-n representation of the data with the equation:

    A = U Σ V^T        (2.2)

SVD performs best with normally-distributed data, as would be represented by a square "Matrix A" but, in general, a natural language matrix is not square and its data is not normally distributed. The SVD algorithm used in LSI has been generalized to accept input data from a non-square matrix, but it comes with a caveat: the LSI model itself works best when used to identify semantic similarity between documents that do not appear, on the surface, to be similar. LSI performance is degraded if documents are homogeneous (similar language, dialect, subject matter, vocabulary), as is common in many natural language settings [14].

SenseClusters is an example of an LSI model that uses SVD for concept clustering. It is a graduate project created by Professor Tom Pedersen and his students at the University of Minnesota: "SenseClusters was implemented at the University of Minnesota, Duluth by Amruta Purandare (2002-2004) and Anagha Kulkarni (2004-2006), with support for Latent Semantic Analysis being added by Mahesh Joshi in the summer of 2006." [15]. SenseClusters uses native methods and LSI to cluster sentences in a semi-supervised machine learning environment.
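To make Equations 2.1 and 2.2 concrete, the following is a minimal NumPy sketch (illustrative only, not part of the thesis implementation) that computes cosine similarity over the Table 2.1 term-frequency vectors and then builds the rank-n LSI approximation of the same matrix via SVD.

```python
import numpy as np

# Term-frequency vectors from Table 2.1 (rows: sent-1..sent-3;
# columns: stud, run, event, car, fast).
A = np.array([[1, 1, 1, 0, 0],
              [2, 1, 0, 0, 0],
              [1, 1, 0, 1, 1]], dtype=float)

def cosine(u, v):
    """Equation 2.1: normalized inner product of two document vectors."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(round(cosine(A[0], A[1]), 3))   # sent-1 vs sent-2
print(round(cosine(A[0], A[2]), 3))   # sent-1 vs sent-3

# Equation 2.2: A = U Sigma V^T.  Keeping only the n largest singular values
# gives the best rank-n approximation of A, which is the reduction LSI uses.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
n = 2
A_n = U[:, :n] @ np.diag(s[:n]) @ Vt[:n, :]   # rank-2 LSI approximation
print(np.round(A_n, 2))
```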
SenseClusters uses WordNet (a tagged online dictionary: http://wordnet.princeton.edu/) definitions for weights and disambiguation, and a tagged benchmark corpus for testing.

2.2.3 PLSI: Probabilistic Latent Semantic Indexing

PLSI is a generative probabilistic model used to maximize the joint probability of documents and words in a corpus. PLSI does this by estimating the probability distribution for each document independently; therefore, the number of parameters increases linearly with the number of documents. This leads to overfitting: an excess of parameters relative to the number of documents. This noisy environment can exaggerate minor fluctuations in the data and obscure concept connections. Below is the PLSI algorithm pseudocode from the paper "Modeling Hidden Topics on Document Manifold" by Deng Cai et al [2]; the modification to step 2 that is the key to avoiding overfitting appears later, in Algorithm 2.2.

Algorithm 2.1 Generalized EM Algorithm for PLSI

Variables:
  N = number of documents
  K = number of latent topics
  M = number of terms (words) in the dictionary

Steps:
  1. select a document di with probability P(di)
  2. pick a latent topic zk with probability P(zk|di)
  3. generate a word wj with probability P(wj|zk)

Result: observation pair (di, wj) (the latent topic variable zk is discarded)

Joint probability representation:

    P(di, wj) = P(di) P(wj|di)                                      (2.3)
    P(wj|di)  = Σ_{k=1..K} P(wj|zk) P(zk|di)

Estimate the parameters: maximize the log-likelihood L (∝ denotes "proportional to"):

    L = Σ_{i=1..N} Σ_{j=1..M} n(di, wj) log P(di, wj)               (2.4)
      ∝ Σ_{i=1..N} Σ_{j=1..M} n(di, wj) log Σ_{k=1..K} P(wj|zk) P(zk|di)

This results in NK + MK parameters of the form P(wj|zk), P(zk|di), which are independently estimated in the PLSI model. The number of parameters in PLSI therefore grows linearly with the number of training documents (N), and leads to overfitting.

2.2.4 LDA: Latent Dirichlet Allocation

A Dirichlet distribution is a Probability Distribution Function (PD) formulated to be a 'measure of measures'. LDA avoids overfitting by modelling the probability distribution of each document, over topics, as a random variable over K topics, where K is the number of hidden topics. Each k in K has a discrete Probability Distribution Function with Dirichlet parameter αk, and Dir(α1, . . . , αK) is the master PD for K. Below is the LDA algorithm pseudocode from the paper "Modeling Hidden Topics on Document Manifold" by Deng Cai et al [2], showing the modification to step 2 of Algorithm 2.1 that is the key to avoiding overfitting.

Algorithm 2.2 Generalized EM Algorithm: Modified Steps for LDA

Steps:
  1. select a document di with probability P(di)
  2. pick a latent topic zk:
     2.1 generate θi ∼ Dir(α)   (∼: distributed according to)
     2.2 pick a latent topic zk with probability P(zk|θi)
  3. generate a word wj with probability P(wj|zk)

After step 3:
  Joint probability representation: Equation 2.3 (PLSI)
  Parameter estimation: maximize the log-likelihood, Equation 2.4 (PLSI)

This results in K + MK parameters of the form P(wj|zk). The number of parameters, therefore, does not grow linearly with the number of documents, and overfitting is not an issue in LDA.

Both PLSI and LDA discover the hidden topics in Euclidean space. However, Euclidean space is flat, and recent studies suggest that a non-linear, low-dimensional manifold, embedded in high-dimensional ambient space [10][23], is a more accurate visualization of a document space.
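Before moving to the manifold view, the shared EM machinery of Algorithms 2.1 and 2.2 can be made concrete. The sketch below runs a few generalized EM iterations for the plain PLSI model on random toy counts; the array names (n_dw, p_w_z, p_z_d) are illustrative, and the updates are the standard PLSI E-step and M-step rather than the thesis code.

```python
import numpy as np

rng = np.random.default_rng(0)
N, M, K = 6, 12, 2                      # documents, terms, latent topics
n_dw = rng.integers(0, 4, size=(N, M))  # toy term counts n(d_i, w_j)

# Random (normalized) initial parameters.
p_w_z = rng.random((K, M)); p_w_z /= p_w_z.sum(axis=1, keepdims=True)  # P(w_j|z_k)
p_z_d = rng.random((N, K)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)  # P(z_k|d_i)

for _ in range(50):
    # E-step: P(z_k | d_i, w_j) is proportional to P(w_j|z_k) P(z_k|d_i)
    post = p_z_d[:, :, None] * p_w_z[None, :, :]         # shape (N, K, M)
    post /= post.sum(axis=1, keepdims=True) + 1e-12
    # M-step: re-estimate both parameter sets from expected counts n(d,w) * P(z|d,w)
    expected = n_dw[:, None, :] * post
    p_w_z = expected.sum(axis=0)
    p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
    p_z_d = expected.sum(axis=2)
    p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12

# Equation 2.4 (proportional form): log-likelihood of the counts under the fit
L = (n_dw * np.log(p_z_d @ p_w_z + 1e-12)).sum()
print(round(L, 2))
```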
Exploitation of the local geometric structure is essential to reveal the hidden semantics in this high-dimensional space [2], and neither PLSI nor LDA exploits the geometric structure of the document representation. A manifold, then, is a logical next step.

2.2.5 LapPLSI: Laplacian Probabilistic Latent Semantic Indexing

"LapPLSI models the document space as a submanifold embedded in the ambient space and directly performs the topic modelling on this document manifold in question." [2]. This model is only relevant and useful because there exists an identifiable relation (a connection) between PD and the Conditional Probability Distribution P(z|d). This connection is explained below in "The Manifold Assumption", and allows LapPLSI to build on the LDA concept of creating a 'measure of measures', or a 'distribution of distributions', and thereby discover the intrinsic geometrical structure of the document space.

Embedded submanifold, a quick visual: simplistically, a manifold exists in an affine space, a space that generalizes those properties of Euclidean space that are independent of the measurement of distance and angles. In the graphic below, P2 is not a vector subspace of R3, but it is a linear substructure from which relative measurements of vectors a and b can be made.

Figure 2.2: Affine Space example: Real numbers in 3 dimensions (graphic by Jan Krieg, https://commons.wikimedia.org/w/index.php?curid=45756454, CC BY-SA 4.0)

To discover the intrinsic geometry, LapPLSI uses a "geometrically based regularizer" (a new variable created for the LapPLSI algorithm) and an assumption that there is a relevant, useful, and identifiable relation between the overarching probability distribution PD and the conditional probability distribution(s) P(z|d) (the probability of each latent topic z, given a document d). In other words, a relation that connects the ambient space and the submanifold. This is known as the Manifold Assumption. The regularizer and the Manifold Assumption are explained below.

The Geometric Regularizer is a variable that generalizes the Expectation-Maximization (EM) step of LDA (Algorithm 2.2) to 3+ dimensions. The maximum log-likelihood step (Equation 2.4) becomes a regularized log-likelihood ξ:

    ξ = L − λR
      ∝ Σ_{i=1..N} Σ_{j=1..M} n(di, wj) log Σ_{k=1..K} P(wj|zk) P(zk|di)
        − (λ/2) Σ_{k=1..K} Σ_{i,j=1..N} ( P(zk|di) − P(zk|dj) )² Wij        (2.5)

    where λ is the Geometric Regularization parameter and Wij is the edge weight
    between documents di and dj in the nearest-neighbour graph

This results in NK + MK parameters of the form P(wj|zk), P(zk|di).

The Manifold Assumption is a method of generalizing vector similarity measurements to 3+ dimensions. In the same vein as the LDA assumptions outlined in Section 2.2.4, in the manifold assumption we assume:

    IF:   two documents d1, d2 ∈ D are close in the intrinsic geometry of PD
    THEN: their conditional probability distributions P(z|d1) and P(z|d2) are similar

In other words, we assume that the conditional probability distribution P(z|d) (the probability of latent variable z, given document d) varies smoothly along the geodesics in the intrinsic geometry of PD.

Figure 2.3: Manifold geodesics: a visual representation of intrinsic geometry

This assumption is used extensively in dimensionality reduction algorithms and semi-supervised learning algorithms [5][7]. Deng Cai et al tested their LapPLSI model using the benchmark Reuters and TDT2 databases. Results were validated using similarity measurements that compared the topic labels and document classification of the LapPLSI results to the hand-tagged benchmark labels of each dataset.
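A minimal sketch of the penalty term in Equation 2.5 is shown below. The k-nearest-neighbour weight matrix W is a hypothetical stand-in (binary cosine-similarity neighbours over toy tf-idf rows); the sketch evaluates only the regularizer R, not Cai et al.'s full modified M-step.

```python
import numpy as np

def knn_weights(X, k=3):
    """Binary k-nearest-neighbour edge weights W_ij over the rows of X,
    using cosine similarity; a hypothetical stand-in for the graph in Eq. 2.5."""
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-12)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)                    # exclude self-edges
    W = np.zeros_like(sim)
    for i, nbrs in enumerate(np.argsort(-sim, axis=1)[:, :k]):
        W[i, nbrs] = 1.0
    return np.maximum(W, W.T)                         # symmetrize

def regularizer(p_z_d, W):
    """R = 1/2 * sum_k sum_{i,j} (P(z_k|d_i) - P(z_k|d_j))^2 * W_ij  (cf. Eq. 2.5)."""
    diff = p_z_d[:, None, :] - p_z_d[None, :, :]      # shape (N, N, K)
    return 0.5 * np.sum(W[:, :, None] * diff ** 2)

rng = np.random.default_rng(1)
X = rng.random((8, 20))                               # toy tf-idf matrix: 8 "documents"
p_z_d = rng.random((8, 5)); p_z_d /= p_z_d.sum(axis=1, keepdims=True)
W = knn_weights(X, k=3)
# A small value means P(z|d) varies smoothly across neighbouring documents.
print(round(regularizer(p_z_d, W), 4))
```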
(The Reuters corpora is a database of manually categorized newswire stories: http://www.daviddlewis.com/resources/testcollections/reuters21578/. TDT2, the Topic Detection and Tracking Evaluation, is a database of manually categorized "multiple sources of information in the form of both text and speech from newswire and radio and television news broadcast programs": http://www.itl.nist.gov/iad/mig/tests/tdt/1998/.)

The Manifold Assumption, using equations from Deng Cai et al [2]:

Algorithm 2.3 Manifold Assumption Overview

    Let:     fk(d) = P(zk|d) be the Conditional Probability Distribution Function (PD)
    And let: ||fk||²_M be used to measure the smoothness of fk along the geodesics
             of the intrinsic geometry of PD
    Then:    the support of PD is a compact submanifold M ⊂ R^M, and:

        ||fk||²_M = ∫_{d ∈ M} ||∇_M fk||² dPD(d)        (2.6)

        where ∇_M is the gradient of fk along manifold M

Simplistically, a compact manifold refers specifically to a manifold that is compact on a topological space, and generally implies that the manifold is without boundary. A circle is the only 1-dimensional compact manifold, and a sphere is an example of a 2-dimensional compact manifold. Using these examples it is easy to visualize the usefulness of the compact submanifold in the LapPLSI algorithm: it can be covered by a finite number of coordinate charts (it can be mapped), and any continuous real-valued function applied to it is bounded. Since the document manifold is generally not known, fk(d) can't actually be computed, but it can be discretely approximated using nearest neighbour models [2]. Which leads naturally to the topic of clustering...

2.3 Similarity Measurements for Clustering

Classification versus clustering: Classification is a procedure used in supervised learning; clustering is a procedure used in unsupervised learning. In classification, documents are 'classified' (assigned to a 'class') by hand. For example, each document in the Reuters corpora has been hand-tagged by assigning to it the topic (label) with which it is most closely aligned: 'news', 'entertainment', etc. Clustering does not compare each document to a pre-determined set of hand-tagged labels; instead it chooses the most likely cluster, or label, for each document, essentially creating clusters (groups of related documents) in which the documents in each group are more similar to each other than to the documents in other groups. What constitutes this 'similarity' is specific to each application. Computational cost (efficiency versus accuracy) must be considered in the choice of clustering method, as well as the general shape of the data (natural language generally produces elliptical clusters). Each clustering method is suited to specific data distribution patterns, or shapes, and carries trade-offs in accuracy as well as computational time and space requirements. This project uses unsupervised learning and therefore uses clustering.

Stable evolution of learning: Clustering an aggregate of all data, new and old, each time the algorithm is run, allows results to be created with no preconceived idea (no prior knowledge) of past cluster topics and past members of those clusters. Results obtained in this manner are more accurate than results obtained by running only new data and aggregating with previous results [18]. By running a fast algorithm like LapPLSI, data can be aggregated and re-analyzed more frequently.
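The aggregate-then-recluster policy just described can be stated as a short sketch; run_topic_model is a hypothetical stand-in for the Block 3 (LapPLSI) analysis, and the dummy model exists only so the example runs.

```python
from typing import Callable, List

def recluster_from_scratch(history: List[str], new_sentences: List[str],
                           run_topic_model: Callable[[List[str]], dict]) -> dict:
    """Stable evolution of learning: every run aggregates ALL sentences, old and
    new, and re-clusters them with no memory of earlier cluster assignments.
    run_topic_model is a hypothetical stand-in for the Block 3 LapPLSI analysis."""
    history.extend(new_sentences)           # aggregate new data with old
    return run_topic_model(list(history))   # fresh clustering over everything

# Toy usage with a dummy "topic model" that just buckets sentences by length.
def dummy_model(sentences: List[str]) -> dict:
    clusters: dict = {}
    for s in sentences:
        clusters.setdefault(len(s.split()) // 4, []).append(s)
    return clusters

corpus: List[str] = []
print(recluster_from_scratch(corpus, ["the big dog found a mud puddle"], dummy_model))
print(recluster_from_scratch(corpus, ["the orange cat loves raisin toast"], dummy_model))
```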
2.3.1 Centroid-based Clustering

In centroid-based clustering, clusters are represented by a central vector (an average of all vectors within the cluster). The K-means model is an example of centroid-based clustering in which data is partitioned into a Voronoi diagram (a visual representation of the borders between clusters), with fuzzy borders between clusters:

Figure 2.4: K-Means clustering: Voronoi Partitioning (graphic: Chire, own work, https://commons.wikimedia.org/w/index.php?curid=17085714)

In this model objects are assigned to the nearest cluster center, not the nearest cluster, and this results in the optimization of distance-to-cluster-centers, not distance-to-clusters. K-means is an NP-hard optimization problem that finds only a local optimum; it is also run with either random or pre-assigned cluster centres, and there is no guarantee that the number of clusters created, or the centres used in each run, are optimal. This means that each run of the K-means algorithm assigns each document to its most relevant cluster, but during multiple runs of the algorithm documents can shift between clusters. To combat sub-optimal partitioning, K-means is generally run multiple times (10x is common) and the results can be either filtered with fitness criteria or aggregated.

K-means is a fast, versatile clustering method that pulls non-spherical data into spherical clusters, as shown below with the Iris Dataset (Fisher, 1936: https://archive.ics.uci.edu/ml/datasets/Iris; the "data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other"). It does not create a probability distribution to allow granularity of clustering based on threshold cluster membership, or analysis based on strength of cluster membership, such as clustering all documents with a 0.90 probability of membership in a particular cluster.

Figure 2.5: K-Means clustering of the Iris Dataset (graphic: https://en.wikipedia.org/wiki/K-means_clustering)

2.3.2 Distribution-based clustering

Distribution-based clustering is used by statistical analysis algorithms such as LSA, LDA, and LapPLSI. As is shown in the LapPLSI model, distribution-based clustering is able to bring out complex correlations within the data, and this ability is optimized when over-fitting is under control. Distribution-based clustering produces a Conditional Probability Distribution Function PD consisting of calculations for each word:

1. a posterior probability P(zk|di, wj): the probability of latent variable z, given word w in document d;
2. an a-priori probability P(zk|di): the estimate (using the posterior probability) of the probability of latent variable z, given document d.

This allows for granular adjustment of such things as cluster shape.

Figure 2.6: Normal Distribution: LapPLSI versus Gaussian

Gaussian mixture models are a well-known example of distribution-based clustering. They are more stable and accurate than the K-means model but less efficient. They work best on data with a compact shape that clusters naturally into centroid-shaped clusters; natural language tends toward elliptical clusters.
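The difference matters in practice: a K-means fit yields only hard labels, while a distribution-based model yields per-cluster membership probabilities that can be thresholded (the 0.90 example above). The sketch below contrasts the two on elongated toy data; scikit-learn is assumed here purely for illustration and is not the clustering used by this architecture.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Toy 2-D data: two elongated ("elliptical") blobs, the shape natural language tends to produce.
a = rng.normal([0, 0], [3.0, 0.5], size=(100, 2))
b = rng.normal([0, 4], [3.0, 0.5], size=(100, 2))
X = np.vstack([a, b])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])                      # hard assignments only

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
probs = gm.predict_proba(X)                # soft, per-cluster membership probabilities
print(np.round(probs[:5], 3))
# A probability threshold (e.g. keep only points with >= 0.90 membership) is
# only possible with the distribution-based model.
print((probs.max(axis=1) >= 0.90).sum(), "points with >= 0.90 membership")
```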
To avoid overfitting, a Gaussian mixture is usually modelled with a random initialization of a fixed number of distributions [1]. (LapPLSI versus Gaussian Distribution graphic, Figure 2.6: http://www.europeanfinancialreview.com/?p=217#!prettyPhoto/0/)

Figure 2.7: Cluster: Gaussian example

LapPLSI uses the Laplacian model and customized parameters, as covered in Section 2.2.5. The hidden topics extracted by topic modelling are the clusters, and the estimated conditional Probability Distribution Function P(zk|di) can be used as the label of each cluster. This method produces clusters that are more compact and focused than those produced by more conventional clustering algorithms (for example, K-means models), better capturing the hidden geometric structure of the document space [2]. Pragmatically, this translates into the algorithm's ability to find conceptual connections within small documents, especially important in this project, where each 'document' is actually a 'sentence'.

2.4 Natural Language Processing Algorithms

Natural Language Processing (NLP) is a field that explores the interaction of computers with human (natural) languages. For our purposes we will describe NLP similarity measurements in terms of first and second order matrix representation (see the NLTK book, http://www.nltk.org/book/, an open-source text and tutorial for NLTK and Python). The NLP sections of the automatic generator architecture take advantage of the power of a-priori association as bigrams (n-grams: bigrams, trigrams, etc. are groups of words that appear together in a corpus) [17].

First-order representation

First order representation works with frequency: how many times a word, or group of words, appears in a specific context (sentence, document, etc.). If two sentences have a high percentage of words in common, first order methods will rate them as conceptually similar. Since it is likely that these sentences are, in fact, conceptually similar, these methods return fewer false positives than second order methods. Conversely, short sentences that are conceptually similar do not always share a high percentage of words in common, making false negatives an issue for these methods [15].

An example of a first order co-occurrence representation is shown below: a document-term matrix where each column is a term (in this case a bigram) and each row is a sentence. The matrix represents the number of occurrences of each bigram in each sentence:

Example sentences:

• "The happy child and the big dog found a mud puddle."
• "The big dog found the best mud puddle."
• "The orange cat loves raisin toast and sunny places."
• "The big dog chased the orange cat into the lake."

Table 2.2: First Order Co-occurrence Matrix: vectors of bigrams

          happy child   big dog   raisin toast   orange cat   mud puddle
sent-1         1            1           0              0            1
sent-2         0            1           0              0            1
sent-3         0            0           1              1            0
sent-4         0            1           0              1            0

Second order representation

Second order representation addresses the problem of false negatives.
It detects similarity in short sentences by weighting each word according to its importance across the entire corpus: Step 1: Using the example sentences from the previous section, we create a word by word matrix of bigrams and use Singular Value Decomposition (SVD) to reduce dimensionality:17 Table 2.3: Second Order Co-occurrence Matrix, optimized with SVD child dog toast cat puddle happy 325 128 0 0 0 big 0 145 0 153 0 sunny 0 0 76 23 0 mud 0 92 0 0 163 raisin 0 0 32 0 0 17 fictional word counts and weights, representing the analysis results of a fictional corpus, are used for the entire 2nd order example 27 In Table 2.3 each entry is now a feature weight (not a word count) and represents each word’s importance in the overall corpus. • the words “happy” and “child” are first order co-occurrences: they appear as a bigram in the corpus • the vectors for “happy” and “mud” are second order co-occurrences: “happy mud” does not appear as a bigram in the corpus, but their vectors exhibit an abstracted similarity (a second order similarity) because both words appear in a sentence with “dog” Step 2: Using sent-1 as an example (“The happy child and the big dog found a mud puddle.”), each word in sent-1 that has a corresponding matrix row in Table 2.3 is replaced by that row. This set of words (row) is the vector that represents that word’s conceptual connections within the corpus. Three of the bigrams in sent-1 are still present in the matrix after SVD optimization: Table 2.4: Second Order Co-occurrence: sentence one vectors child dog toast cat puddle happy 325 128 0 0 0 big 0 145 0 153 0 mud 0 92 0 0 163 and: “The HAPPY child and the BIG dog found a MUD puddle.” becomes: The [child, dog] child and the [dog, cat] dog found a [dog, puddle] puddle. 28 Step 3: The average is taken across all 3 vectors to transform this second order cooccurrence into a second order representation. The resulting weights now represent the conceptual connections of this sentence within the overall corpus. Table 2.5: Second Order Representation: sentence one vectors sent-1 child dog toast cat puddle 108.3 149 0 62.3 224 Frequency distribution (fdist) matrix: This matrix is a term frequency representation of the data; each row is a document (sentence, in this project) and each column is a word. N-grams and n-gram window techniques can be used in conjunction with an fdist matrix to determine which terms are most strongly connnected to other terms in the corpus. N-gram models are used to create a picture of which words we can expect to see next to the current word: a bigram model for ’n = term 1 ’ would show us which terms appear as ’n+1 = term 2 ’, the term that directly follows term 1. Collocations and filters are NLP tools that can be used to find out how important each bigram is to every other bigram, and to the overall document as well [14]. Term frequency-inverse distribution function (tfidf ): This function is an alternative to the fdist described above. The matrix created represents a weighting of conceptual strength across documents. Each term is weighted for its importance in the overall corpus. A word that occurs often or rarely will likely not contribute to the overall semantic meaning of the text; therefore, rare, overly-prolific, and non-informative words will have a low rating. For example, if a word such as “Klingon” appeared once in a collection of 600 documents, each 29 containing approximately 3000 sentences, it would be removed in the tfidf analysis. 
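A minimal sketch of this weighting, using toy counts, is shown below; the formula is one common tf-idf variant and the document-frequency cut-off is illustrative, not the exact scheme used in Block 3.

```python
import math
from collections import Counter

docs = [
    "the happy child and the big dog found a mud puddle",
    "the big dog found the best mud puddle",
    "the orange cat loves raisin toast and sunny places",
    "the big dog chased the orange cat into the lake",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf(term: str, doc: list) -> float:
    """Term frequency * inverse document frequency (one common variant;
    the exact weighting used in Block 3 may differ)."""
    tf = doc.count(term) / len(doc)
    idf = math.log(N / df[term])
    return tf * idf

print(round(tfidf("the", tokenized[0]), 3))   # prolific word: idf = log(4/4) = 0
print(round(tfidf("dog", tokenized[0]), 3))   # informative mid-frequency term
# Isolated, specialized terms (the "Klingon" case) are dropped by a
# document-frequency cut-off before the more expensive analysis steps.
vocab = [t for t, n in df.items() if n >= 2]
print("lake" in vocab)                        # False: seen in only one document
```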
Such an isolated, specialized word would be removed during later analysis anyway - there is simply nothing to connect this word to a useful number of other words - better to remove it before the more computationally expensive step of linguistic analysis. On the other hand, words such as “is” are usually prolific enough to be removed as well. Such words can appear so many times that they would fill the top levels of analysis and subvert more relevant concept connections. Tokenization and noise reduction tools The following NLP tools are described in more detail in Chapter 3: 1. Tokenization: break a stream of text symbols into words, phrases, symbols, or other meaningful elements; tokenize a document into sentence or tokenize sentences into words. 2. Stripping: remove punctuation and stopwords (common words such as “the” and “a”). 3. Stemming: reduce each word, where possible, to its root form, and remove capitalization. 4. Indexing: after all of the above, including the removal of many sentences, each remaining sentence must be tagged with its original document and sentence number. Software and model examples Software and models in this section use supervised or semi-supervised machine learning. They also use any or all of: part-of-speech (POS) tags, labelled text, tagged 30 dictionaries. The Stanford NLP Group focuses on probabilistic approaches that include both statistical and NLP models, utilizing both supervised and semi-supervised machine learning: part-of-speech (POS) tags, tagged text, and hand-tagged dictionaries. The group maintains many online resources, including a part-of-speech tagger and probabilistic parser.18 SenseClusters, as mentioned in Section 2.2.2, is a graduate project created by Professor Tom Pedersen and his graduate students at the University of Minnesota in 2002-2004. It is a semi-supervised machine learning model that utilizes a tagged dictionary (WordNet) for weights and disambiguation, as well as a tagged benchmark corpus for testing. SenseClusters is an LSI model that utilizes SVD19 for dimensionality reduction and focuses on second order matrix and feature selection for concept clustering [16]. SCIgen is a Graduate research project created at MIT CSAIL20 [20]. SCIgen generates random computer science research papers, which have been submitted to, and accepted by, conferences worldwide. The project uses unsupervised machine learning and a basic sentence structure to generate its research papers. The SCIgen template was modified and relaxed to create the linguistic template for the Automatic Generator. PyProse: Formerly MacProse, PyProse [6] uses a dictionary that is completely unstructured. The purpose of PyProse is to generate poetic metre, not to generate cognizant output; therefore, the output produced was too nonsensical to use in this project (words are not even loosely grouped into categories): To rise couldn’t repair the waste, and to emerge was the camp of paint 18 http://nlp.stanford.edu/software/tagger.shtml Singular Value Decomposition 20 https://pdos.csail.mit.edu/archive/scigen/ 19 31 between the vortex and a spot. Since any orchestra below a farmer would walk, why have they raced? Although the stone has competed, how have so greater a driver talked? The next two programs do not create original content, they use text that is culled from the web. The Apostrophe Engine: 1994 Bill Kennedy created a poem titled “Apostrophe (1994)” which consisted of a list of sentence fragments culled from the web, each beginning with the bigram “you are”. 
The poem became the homepage of the Apostrophe Engine. Originally, each fragment was a hyperlink that, when clicked, would prompt the site’s search engine to scan the web for new sentence fragments beginning with the words “you are” and containing default keywords from the fragment. Each sentence fragment was terminated when the engine encountered a period. Currently the apostrophe engine is o✏ine. In the words of Bill Kennedy: “We speculated from the outset that once sections of the book began to appear online, the engine would begin to cannibalize itself, returning its own results before other, less likely matches.”21 [10]. Output from the Apostrophe Engine, while it was still online: you are a soldier you are implying- Sherlock Holmes: I’m not implying anything you are an idiot you are the one who shot him you are choosing to ignore anything you see that doesn’t comply with it you are not serious The Apostrophe Engine was not suitable as a text generator in this project as the 21 http://www.apostropheengine.ca 32 output text was both culled from the web and had the added restriction of being delimited by “you are”. These two words can be removed from each sentence but doing so leaves the majority of sentences unusable, as shown below: a soldier implying- Sherlock Holmes: I’m not implying anything an idiot the one who shot him choosing to ignore anything you see that doesn’t comply with it not serious Poetry Machine A text generator created by David Link that uses word association and semantic relationships. Poetry Machine can be initialized using its own random word generator or using words entered from a keyboard. The generator uses text culled from the web.22 2.5 Summary • This project is based on three principles: 1. pseudo-random text generation that preserves linguistic integrity at a basic new-language-learner level 2. unsupervised machine learning for architecture portability and evolution 3. low computational cost • Statistical algorithms: 22 http://alpha60.de/art/poetry machine/ 33 – VSM, Vector Space Model: bag-of-words representation; document-term matrix, cosine similarity clustering – LSI, Latent Semantic Indexing: uses SVD and tfidf to reduce dimensionality and optimize concept visibility. – PLSI, Probabilistic Latent Semantic Indexing: a generative implementation of LSI that maximizes the joint probability of documents and terms in a corpus. PLSI estimates the probability distribution of each document on the hidden topics independently and the number of parameters in the model grows linearly with the size of the corpus. Problems with overfitting. – LDA, Latent Dirichlet Allocation: implements the EM algorithm of PLSI with a Dirichlet distribution. The probability distribution of each document, over topics, is treated as a hidden random variable. Overfitting is now under control. – PLSI and LDA both discover hidden topics, but only in Euclidean space. Recent studies suggest that the embedded sub-manifold is a more accurate representation of the document space [10][23]. – LapPLSI: Laplacian Probabilistic Latent Semantic Indexing: implements a regularizer variable and other custom parameters in the EM algorithm of PLSI. LapPLSI models the document space as a submanifold embedded in the ambient space, revealing hidden semantics and deeper concept connections beyond Euclidean space. 
• NLP tools: – basic document processing tools: tokenizing, stopword / punctuation removal, word stemming, capitalization normalization, matrix indexing 34 – word frequency and placement analysis: frequency distribution (fdist) matrix (term frequency representation), n-grams, collocations – term frequency-inverse distribution function (tfidf) matrix: each entry is now a feature weight (not a word count) that represents the importance of each word/term across the entire corpus • No part-of-speech (POS) tagging or established relationships between words. This project uses a modified SCIgen document template and a custom dictionary (word list only). To begin completely from scratch, with no word grouping, while intriguing, is a much larger project. • Document clustering (for this project sentence clustering) on the embedded submanifold. The hidden topics extracted by the topic modelling approaches can be regarded as clusters, and the conditional probability density of each cluster P (zk |di ) can be used as the cluster label (topic) of each document (sentence). • Dimensionality reduction: NLP tools process the document to reduce noise (punctuation, stopwords, etc) and either a tfidf or an fdist matrix is created. • Stable evolution of learning LapPLSI uses distribution-based clustering, reducing the computational cost of time and space. This allows a more frequent aggregation and re-processing of ALL current data, avoiding the instability associated with integration of new weights and similarity results with old. 35 Chapter 3 Architecture of the Automatic Generator The objective of this project is to design an unsupervised machine learning architecture that can gather meaningful concepts, at low computational cost, from randomly generated text and link these concepts together in a poem. To do this, an architecture was designed consisting of five discrete components (blocks) and a set of custom tools, parameters, and metrics. These are combined with proven methods of both statistical inference and NLP (linguistic) analysis. The metric for success is the automatic generation of poetry, that can pass as human-generated, as assessed by 20 volunteers. A varying number of documents are generated, concatenated, and pre-processed in Block 1, to create the matrix and sentence files for input to Block 2. In the experiments documented in this chapter, concatenated documents of 100, 300, 600, and 1200 original documents were created from each dataset, resulting, on average, in a matrix containing: 36 Table 3.1: Document-Term Matrix, average stats size (kb) rows (sentences) columns (words) cells 100 documents 4 2 300 1 800 1 400 000 300 documents 14 2 900 4 800 4 600 000 600 documents 30 3 400 8 700 10 000 000 1200 documents 71 4 400 15 900 23 500 000 In Block 3 each of these document sets was then run through a set of parameter options: 5, 7, and 10 latent topics/clusters, and multiple probability thresholds, producing optimized, reduced files. In Block 4, an fdist matrix was created for each reduced file and trials were run for bigram and trigram options, each with Pointwise Mutual Information (PMI) and Logistic Regression (LR) analysis, as outlined in Section 2.4. The following table shows the average number of words and sentences in the Block 4 fdist matrix, 5 cluster, 600 document trial. 
The average number of unique words falls within the optimum range (100 unique terms) for the NLP analysis performed in this block [11]: Table 3.2: Document-Term Matrix: 600 documents, average stats 5 clusters size (kb) rows (sentences) columns (words) cells unique words 288 424 647 287 809 200 The original data is generated using pseudo-random methods: pseudo in the sense that the text produced is neither grammatically correct, nor is it completely nonsensical. The text reflects a sociolinguistic balance: a basic level of language structure and word placement as would be normally assimilated by young children, or by travellers in a foreign country. An augmented SCIgen template [20] and 37 customized dictionary (terms only, no tags) was used. Basic patterns emerge from this structure that can be built on: Block 1: Generate dictionary. Generate text. Create and compile the dictionary. Create the data using the modified SciGen template and the custom dictionary. Block 2: File processing. Noise reduction. Create viable sentences. Process files using basic NLP analysis techniques such as stripping, stemming, and stopword removal. Create files of viable sentences for input to Block 3. Block 3: Statistical Analysis. Reduce noise even further by creating term frequency-inverse distribution function (tfidf) matrix. Analyze the files with the embedded submanifold techniques used in Deng Cai’s LapPLSI algorithm [2]. Adjust sentence choices with threshold probability parameters. Block 4: NLP Analysis. Analyze file with advanced NLP analysis techniques such as n-grams, collocation filters, pointwise mutual information (PMI) and logistic regression (LR). Block 5: Evolve concept visibility parameters. Polish the poems. Implement patches to polish the poems. Evolve the sociolinguistic concept connections. In this project we have implemented Blocks 1-4. With experiment we found that the best results were produced by using files of 600 concatenated documents, clustered over 5 hidden topics and analyzed with LapPLSI, bigrams and mutual information techniques. These files produced the final matrix of viable sentences for creation of the poems. We can expect even better results with larger raw input files. 38 3.1 The Challenges Linguistic integrity: To retain the semantic structure of the data requires the retention of not only the words in a sentence, but also the underlying meaning of the sentence itself. Optimal retention of semantic structure (linguistic meaning) must be balanced with the computational costs of space and time. Dimensionality reduction: The computational cost of space and time necessary for dimensionality reduction must be balanced with algorithmic options for optimal retention of semantic and syntactic integrity (optimal concept retrieval). Techniques such stemming, stopword removal, and tfidf matrix creation help to reduce dimensionality with minimal impact on text semantics [19][13]. Stable evolution of learning: Both dimensionality reduction and linguistic integrity contribute to a stable evolution of learning, defined in this context as the accuracy of analysis results over time. 3.2 The Approach To address semantic integrity we consider the order of stochastic analysis, and the linguistic structure of the input data. For the order of stochastic analysis the first question is how to approach dimensionality reduction. Either statistical or semantic (NLP) analysis must be used for this important step in processing. 
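Read end to end, the five blocks form a simple pipeline in which the statistical reduction happens before the costlier linguistic analysis. The sketch below is purely illustrative: every function is a hypothetical stand-in with a toy body, not the Chapter 4 implementation, and Block 5 is omitted because it is left as future work.

```python
from typing import List

# Hypothetical stand-ins for the blocks described above; the toy bodies only
# illustrate the ordering of the pipeline, not the real processing.
def block1_generate_text(n_docs: int) -> List[str]:
    return ["the big dog found a mud puddle"] * n_docs

def block2_viable_sentences(raw: List[str]) -> List[str]:
    return [s for s in raw if 3 <= len(s.split()) <= 12]     # 3-12 word sentences

def block3_statistical_reduction(sentences: List[str], k: int) -> List[str]:
    return sentences[:k]            # placeholder for LapPLSI clustering/thresholding

def block4_nlp_analysis(sentences: List[str]) -> List[str]:
    return sorted(set(sentences))   # placeholder for n-gram / collocation analysis

def generate_poem(n_docs: int = 600, k: int = 5) -> List[str]:
    raw = block1_generate_text(n_docs)
    viable = block2_viable_sentences(raw)
    reduced = block3_statistical_reduction(viable, k)
    return block4_nlp_analysis(reduced)      # Block 5 (polishing) is future work

print(generate_poem(n_docs=10, k=5))
```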
3.2 The Approach

To address semantic integrity we consider the order of stochastic analysis and the linguistic structure of the input data. For the order of stochastic analysis, the first question is how to approach dimensionality reduction: either statistical or semantic (NLP) analysis must be used for this important step in processing.

Using statistical inference as a first step, as we have done in this project, creates a tfidf (term frequency-inverse document frequency) matrix and a probability distribution PD, reduces noise, and increases concept visibility. Statistical inference also improves disambiguation, finding relationships between sets of information, or concepts (co-occurrences, for example), that are often eliminated when semantic methods are used as a first step [19]. NLP tools can then be used to find the ways in which these concepts are inter-related. Take, for example, the following text:

"blind venetian" and "venetian blind"

Lexical analysis would not connect the two sets of words, but statistical analysis preserves them in the data set. Statistically independent data is always linguistically unrelated, but linguistically unrelated data is not always statistically independent [15].

To preserve linguistic integrity we assess various facets of language structure and word placement. Maximizing retention and visibility of the text's inherent underlying meaning must be balanced against the over-use of language constructs. The goal is to strike a sociolinguistic balance between nonsensical text and rigid grammatical structure. For example: "dog big" may be a first attempt by a young child, but exposure to native speakers would organically evolve the description to "big dog".

To address dimensionality reduction we look at the computational time/space requirements and optimal ways of reducing to useful data. Document representation with unsupervised machine learning produces a naturally sparse matrix of high dimensionality (one dimension for each term in the corpus), making dimensionality reduction an important consideration. Statistical analysis as a first step is an effective optimization method, reducing data volume and increasing the strength of conceptual relationships.

Raw versus processed data is also a consideration. Raw data includes punctuation, stopwords, newline characters, all versions of words, duplicate sentences, etc. In our trials the sheer volume of data in each raw text file created an unmanageably large matrix when run on the available resources, severely limiting the size of the data samples that could be used, and making relevant comparisons between raw and processed data unrealistic. There is some debate in the document analysis community about whether stemming, stopword removal, and punctuation stripping improve output [15], but all three methods considerably reduce data noise, allowing larger datasets to be processed. There is also no evidence in the literature to conclude that running raw data through the architecture, even if the computing resources were available, would produce better results.

The process of dimensionality reduction naturally removes the majority of stopwords and punctuation from raw data, even if this information is not removed in pre-processing, but it also leaves behind some connector words, such as "at" or "therefore", in the raw data. These words incur a higher analysis cost at later stages. Adjusting n-gram windows in raw text can help, as the following example shows. Starting with the raw text sentence:

The dog has a toy.

we first run a dimensionality reduction algorithm to find the co-occurrences of "dog" and "toy" in the sentence. A trigram analysis with an n-gram window (analyzing groups of words, looking for two keywords separated by a non-keyword window of n words) is required. Processing a file using trigram windows is computationally expensive: A terms produce on the order of A³ trigrams, plus the n-window permutations (a short sketch follows).
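The cost difference can be seen in a small sketch; NLTK's collocation finder is assumed here for illustration, and this is not the project's own code. On raw text the finder must widen its window so that "dog" and "toy" can co-occur across the connector words, while on Block 2 output an ordinary bigram pass is enough.

    # Illustrative comparison of windowed vs. plain bigram co-occurrence (NLTK assumed).
    from nltk.collocations import BigramCollocationFinder

    raw = "the dog has a toy".split()    # raw text keeps the connector words
    processed = "dog toy".split()        # Block 2 output: stopwords stripped

    # Raw text: "dog" and "toy" only pair up if the window spans the connectors,
    # which multiplies the number of candidate pairs that must be scored.
    windowed = BigramCollocationFinder.from_words(raw, window_size=4)
    print(("dog", "toy") in windowed.ngram_fd)   # True, but at a higher cost

    # Processed text: a plain bigram pass already captures the co-occurrence.
    plain = BigramCollocationFinder.from_words(processed)
    print(("dog", "toy") in plain.ngram_fd)      # True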
Processed text, as opposed to raw text, requires only bigram analysis (A terms produce on the order of A² bigrams) to catch "dog toy", "dog has toy", "dog and toy", etc. Running trigrams on processed text did not produce better results, indicating that the Block 2 processing protocol creates bigram-optimized data.

To address stable evolution of learning we look at clustering options that maximize relevance, speed, and the stability of results. To create a stable evolution of learning, all data is processed on every run: new data is aggregated with old and fresh cluster results are produced, allowing analysis to start fresh every time the algorithm is run, with no pre-conceived idea of cluster topics and cluster members from past trials. If new data is always run discretely, and the results integrated with previous results, the overall outcome can become skewed [18]. The low cost of scaling the LapPLSI-optimized algorithm means data can be aggregated and re-clustered frequently.

3.3 The Architecture

The architecture consists of 5 blocks, each a discrete part of the Automatic Generator. In this section we outline, for each block:

• What the block does
• Why it is important
• How it fits into the big picture

Figure 3.1: Block 1: Create dictionary. Generate text

The purpose of Block 1 is to generate carefully structured pseudo-random text: create and compile the dictionary, then generate files of raw, pseudo-random text using the modified SCIgen template. This block regulates linguistic integrity and language level.

procedure Run dictionary program
    create custom dictionary
    overwrite SCIgen dictionary with custom dictionary
    modify and customize SCIgen template
end procedure

procedure Run modified SCIgen program
    use dictionary and template to generate files of raw, pseudo-random text
    generate the number of documents required; write each to file; concatenate
end procedure

Figure 3.2: Block 2: Create Viable Sentences

The purpose of Block 2 is to create a file of usable sentences from the pseudo-random text generated in Block 1. Custom NLP tools and techniques are used to reformat, clean, delimit, tokenize, and index each file. Matrix sparsity (noise) is also reduced through these techniques.

procedure Process file
    for each sentence in file do
        tokenize to sentences
        tokenize to words
        strip punctuation
    end for
end procedure

procedure Locate viable sentences
    for each sentence in file do
        keep viable sentences; write to file    ▷ 3-12 words
    end for
end procedure

procedure Remove syntactic markers
    for each sentence in file do
        normalize
        strip stopwords    ▷ augmented algorithm
        stem               ▷ augmented algorithm
        index              ▷ index words to sentences
    end for
end procedure

Figure 3.3: Block 3: Inference Analysis

Use statistical analysis tools to identify and extrapolate concepts. The purpose of Block 3 is to identify deeper structural patterns in the text through statistical analysis techniques. The output of Block 2 is analyzed using the LapPLSI algorithm for embedded sub-manifold analysis [2], outlined in Section 2.2.4.

procedure Analyze with LapPLSI
    for each file do
        create tfidf    ▷ weighting: how important is each sentence
        extract joint probabilities
        run LapPLSI
    end for
end procedure

procedure Cluster at Probability Threshold
    retain sentences above the designated Probability Threshold
end procedure

Figure 3.4: Block 4: Natural Language Processing (NLP) Analysis

Use semantic tools to identify and extrapolate concepts. The data from Block 3 is prepared for analysis (similar to the processing in Block 2), then analyzed with NLP tools to extract groups of conceptually-connected sentences for the poems.

procedure Process file
    for each sentence in file do
        tokenize to sentences
        tokenize to words
        strip punctuation
    end for
end procedure

procedure Remove syntactic markers
    for each sentence in file do
        normalize
        strip stopwords    ▷ augmented
        stem               ▷ augmented
        index              ▷ index words to sentences
    end for
end procedure

procedure Analyze the file semantically with NLP tools (cluster sentences with related semantic concepts)
    for each file of sentences do
        create fdist
        for each option: bigrams, trigrams do
            Pointwise Mutual Information
            Logistic Regression
        end for
    end for
end procedure

Figure 3.5: Block 5: Fine-tune sentence creation and selection

The purpose of Block 5 is to assemble the poem (lines, stanzas, author) and to evolve the HCI cognition-response by auto-generating fitness values for specific fine-tuning objectives.

procedure Auto-generate fitness values
    for each socio-linguistic goal do
        define contributing parameters
        combine in fitness formula
    end for
end procedure

procedure Analyze the output
    for each file do
        edit repetition
        run fitness formula
        cluster pre-determined number of conceptually related sentences
    end for
end procedure

procedure Assemble and polish poem (stanzas, formatting, author)
    for each file do
        cluster stanzas based on pre-determined parameters
        add blank line between stanzas
        add punctuation, capitalization
        generate author
    end for
end procedure

run analytics

Chapter 4
Implementation Detail

4.1 Block 1: Create the dictionary and the data

The purpose of Block 1 is to generate carefully structured pseudo-random text. A modified SCIgen template and a custom dictionary are used to generate the text files. The project is based on unsupervised machine learning: no hand-tagged training sets, no pre-tagged data. This means that the Stanford Part-of-Speech Tagger (https://nlp.stanford.edu/software/tagger.shtml), WordNet (https://wordnet.princeton.edu/), Visual Thesaurus (https://www.visualthesaurus.com/), and similar online tools were not suited to this project. For the dictionary this meant not using WordNet or similar tagged dictionary databases. The dictionary for this project is a word-list-only combination of three dictionaries: my own custom word list and the dictionaries from the SCIgen and PyProse programs. This block regulates the linguistic integrity (language level), semantic integrity (basic, cultural, and social vocabulary), and the syntactic integrity (language-learner word order). For example: "The red car raced down the road." is fine; "Raced road car red the." is not. The input text itself needed to be pseudo-random but not nonsensical: basic word order and social / cultural entries, in the spirit of a new-language-learner syntax, as documented in Section 3.3. The basis for this structure can be found in [3].
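The actual generator is the modified SCIgen template driven by the merged word list, which is too large to reproduce here. The following toy stand-in is purely illustrative: the word list and templates below are invented for this sketch, not taken from the project, and only show the underlying idea of untagged words dropped into learner-level sentence templates.

    # Toy, illustrative stand-in for Block 1. The real generator is the modified
    # SCIgen template plus the merged custom/SCIgen/PyProse word list; the word
    # list and templates below are invented for this sketch only.
    import random

    WORDS = ["experts", "hackers", "children", "memory", "religion",
             "sector", "social networks", "kittens", "dreamers"]

    # Learner-level templates: basic word order, no strict grammar.
    TEMPLATES = [
        "We added the {w1} to the {w2}.",
        "{w1} removed the {w2} from {w3}.",
        "Over time, we doubled the {w1} of the {w2}.",
        "The {w1} might not be the panacea that {w2} expected.",
    ]

    def generate_sentence(rng=random):
        """Fill one template with randomly chosen, untagged words."""
        template = rng.choice(TEMPLATES)
        slots = {f"w{i}": rng.choice(WORDS) for i in range(1, 4)}
        return template.format(**slots)

    def generate_document(n_sentences=40):
        """Concatenate pseudo-random sentences into one raw document."""
        return " ".join(generate_sentence() for _ in range(n_sentences))

    print(generate_document(5))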
4.2 Block 2: Create viable sentences

The purpose of Block 2 is to create a file of viable sentences from the text generated in Block 1. Custom NLP tools and techniques are used to reformat, clean, delimit, and concatenate each file (a minimal sketch of this pipeline is given at the end of this section).

• Remove syntactic markers: basic processing and reformatting of the data to a cluster-ready form: tokenize to sentences, tokenize to words, normalize to lower-case, strip punctuation and stopwords, stem to root words.
• Create an index: the words need to be linked back to their original sentence(s).
• Cluster viable sentences and write them to file: minimum 3 words, maximum 12 words. Sentences of 1-2 words increase data noise and rarely comprise a viable sentence ("I am" is one of very few), and more than 12 words is very long for a line of poetry and did not increase the quality of viable sentences in our results.

Note: stripping punctuation and stopwords, as well as stemming to root words, decreases matrix size at the cost of increasing ambiguity (for example, 'stocking' and 'stocks' both stem to 'stock'). The statistical analysis of Blocks 2 and 3 restores this disambiguation.

TOKENIZE to sentences: ['Experts regularly measure the place of agents.', ...]
TOKENIZE to words: [['Experts', 'regularly', 'measure', 'the', 'place', 'of', 'agents', '.'], ...]
STRIP punctuation and CULL sentences to desired min/max: [['Experts', 'regularly', 'measure', 'the', 'place', 'of', 'agents'], ...]
NORMALIZE file (lower-case): [['experts', 'regularly', 'measure', 'the', 'place', 'of', 'agents']]
STRIP stopwords: [['experts', 'regularly', 'measure', 'place', 'agents']]
STEM with Porter's Stemming Algorithm: [['expert', 'regularli', 'measur', 'place', 'agent']]
INDEX: [[(3, 'expert'), (3, 'regularli'), (3, 'measur'), (3, 'agent'), (3, 'place')]]

Locating viable sentences of suitable length was also a challenge. Online tools like the Stanford POS Tagger provide grammatical sentence structures such as phrasal chunking (dividing sentences into phrases) and clausal grouping (groups of phrases), but a phrase is, by definition, a conceptual unit (one component of a clause) and therefore requires part-of-speech tagging in order to be identified.
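The pipeline above can be sketched in a few lines. NLTK is assumed here for illustration (the worked example's Porter stems suggest an NLTK-style stemmer, but this is not the project's own code, and the NLTK punkt and stopwords data must be installed for it to run).

    # Minimal sketch of the Block 2 pipeline, assuming NLTK (illustrative only).
    import string
    from nltk import sent_tokenize, word_tokenize
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    STOP = set(stopwords.words("english"))
    stem = PorterStemmer().stem

    def viable_sentences(raw_text, min_len=3, max_len=12):
        """Tokenize, clean, cull by length, stem, and index each sentence."""
        for idx, sent in enumerate(sent_tokenize(raw_text)):
            words = [w for w in word_tokenize(sent) if w not in string.punctuation]
            if not (min_len <= len(words) <= max_len):
                continue                                  # cull 1-2 word and 13+ word sentences
            words = [w.lower() for w in words]            # normalize
            words = [w for w in words if w not in STOP]   # strip stopwords
            words = [stem(w) for w in words]              # Porter stemming
            yield [(idx, w) for w in words]               # index words to their sentence

    example = "Experts regularly measure the place of agents. I am."
    for indexed in viable_sentences(example):
        print(indexed)   # e.g. [(0, 'expert'), (0, 'regularli'), (0, 'measur'), ...]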
4.3 Block 3: Distribution-based Analysis

The purpose of Block 3 is document representation and topic modelling through statistical inference. This block uses the protocol outlined by Deng Cai et al. [2] to analyze each file of documents (sentences) and cluster the related concepts using the LapPLSI protocol. Functionality of this block includes:

• Create a tfidf matrix: the term frequency-inverse document frequency matrix represents a weighting of conceptual strength across documents. Each term is weighted for its importance in the overall corpus.
• Model the document space as a submanifold embedded in the ambient space.
• Reduce matrix dimensionality.
• Identify the topic models (related concepts) in non-Euclidean space.
• Analyze each file of documents (sentences) and cluster the related concepts using the LapPLSI protocol. LapPLSI parameters used in this project:
  – number of documents in each trial: 100, 300, 600, 1200
  – number of topics / clusters in each trial: 5, 7, 10
• Create a probability distribution of the concept connections.
• Cluster sentences with related concepts (topics).
• Probability bar: within each cluster (topic), keep the sentences that are most closely related. Each sentence that is kept has a probability of connection to that cluster that is greater than, or equal to, a pre-set threshold. This step determines the sentences that are used in poem creation (Block 4).

Algorithm 4.1 Generalized EM Algorithm for LapPLSI

Variables: N = number of documents, K = number of latent topics, M = number of terms (words) in the dictionary, p = number of nearest neighbours, λ = regularization parameter, γ = Newton step parameter, θ = termination condition value.
Output: P(zk | di), P(wj | zk), for i = 1, ..., N; j = 1, ..., M; k = 1, ..., K. Here φ(·) denotes the log-likelihood objective of [2].

Compute the graph matrix W ([2], p. 3)
Initialize the probability distributions (parameters) P(zk | di)^(0), P(wj | zk)^(0); n ← 0
while true do
    E-step: compute the posterior probability ([2], p. 3)
    M-step:
        compute P(wj | zk)_{n+1} using the re-estimation equation ([2], p. 3)
        compute P(zk | di)_{n+1} using the re-estimation equation ([2], p. 3)
        P(zk | di)^(1)_{n+1} ← P(zk | di)_{n+1}
        compute P(zk | di)^(2)_{n+1} using the geometric regularization (Equation 2.5)
        while φ(P(zk | di)^(2)_{n+1}) ≥ φ(P(zk | di)^(1)_{n+1}) do
            P(zk | di)^(1)_{n+1} ← P(zk | di)^(2)_{n+1}
            compute P(zk | di)^(2)_{n+1} using the geometric regularization (Equation 2.5)
        end while
        if φ(P(zk | di)^(1)_{n+1}) ≥ φ(P(zk | di)_n) then
            P(zk | di)_{n+1} ← P(zk | di)^(1)_{n+1}
        else
            P(zk | di)_{n+1} ← P(zk | di)_n
        end if
    if φ(P(zk | di)_{n+1}) − φ(P(zk | di)_n) ≤ θ then
        return P(zk | di)_{n+1}, P(wj | zk)_{n+1}
    end if
    n ← n + 1
end while

4.4 Block 4: Natural Language Processing (NLP) Analysis

This block identifies patterns of repetition in the text by analyzing factors such as how many times a specific word appears in a specific document, and by identifying co-occurrences of every corpus entry (word) with every other corpus entry, across all documents. The purpose of Block 4 is to identify pattern recurrence within patterns. A tfidf matrix is created to normalize the data and create weights that reflect the overall (global) impact of each word combination. The data from Block 3 is prepared for statistical analysis (similar to the processing in Block 2), then analyzed with NLP tools to extract groups of conceptually-connected sentences for the poems. Patterns of bigram and trigram recurrence / placement are analyzed using collocation filters, Pointwise Mutual Information (PMI), and Logistic Regression (LR).

• As in Block 3, remove syntactic markers: basic processing and reformatting of the data to an analysis-ready form.
• Create a Frequency Distribution (fdist) matrix for each document.
• Create topics (clusters) of conceptually related bigrams and trigrams, using collocation filters, Pointwise Mutual Information (PMI), and Logistic Regression (LR) analysis (a short sketch follows this list).
• Re-index words to sentences.
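A minimal sketch of the frequency-distribution and PMI-ranked collocation step, assuming NLTK's collocation tools (illustrative only; the project's own scripts and its LR scoring are not reproduced here).

    # Illustrative Block 4 sketch using NLTK's frequency distribution and
    # collocation tools (an assumption; not the project's own code).
    from nltk import FreqDist
    from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

    # Stemmed tokens from the Block 3 output, flattened into one stream.
    tokens = ("expert add memori sector crazi peopl add memori sector "
              "hacker remov memori religion expert add memori omnisci sector").split()

    fdist = FreqDist(tokens)              # word frequency distribution
    print(fdist.most_common(3))           # e.g. [('memori', 4), ('add', 3), ...]

    measures = BigramAssocMeasures()
    finder = BigramCollocationFinder.from_words(tokens)
    finder.apply_freq_filter(2)           # collocation filter: drop rare pairs
    print(finder.nbest(measures.pmi, 5))  # bigrams ranked by Pointwise Mutual Information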
4.5 Block 5: Evolve the HCI Cognitive Response

The purpose of Block 5 is to assemble the poem (lines, stanzas, author) and to evolve the HCI cognition-response by auto-generating fitness values for specific fine-tuning objectives. Fine-tune the HCI response for sentence creation and selection:

• Smooth: optimize the sociolinguistic connections with patches.
• Improve and fine-tune sociolinguistic parameters.
• Format: author, line length, stanza length, etc.
• Define advanced sociolinguistic goals.
• Auto-generate fitness functions for specific fine-tuning objectives.

Chapter 5
Experimental results and analysis

This chapter presents a comparison study using various selection parameters, as set out in the Chapter 3 description of the Automatic Generator model. Specifically, we conduct a series of experiments to compare the strength of the concept connections produced when varying the size of the input dataset, the number of hidden topics (clusters), and the choice of natural language analysis tools.

5.1 Environment

Traditionally, such results are evaluated based on their similarity to benchmark and/or hand-tagged datasets. In the case of the embedded manifold code used here, Deng Cai et al. [2] cluster two well-known datasets, Reuters and TDT2 (Section 2.2.4), comparing their resulting topic labels against the hand-tagged benchmark labels of each dataset. Reuters and TDT2 are known as single-label datasets: each document is hand-tagged with the one topic to which it most strongly belongs. This project uses unsupervised machine learning and therefore has no labels to compare to a benchmark dataset. For our purposes, we evaluate the resulting sentence files through reader feedback.

Probability-label versus single-label datasets

The goal of our model is to cluster conceptually similar sentences from dynamically created, pseudo-random text. We begin with the understanding that hidden conceptual patterns will vary, within a dataset, in relation to variance in parameters such as the number of topics, the size of the input file, the subjective cluster threshold, and the analysis tools used. Therefore, in this project, we do not aim to improve the recognition of a benchmark label (news, entertainment, etc.) by increasing the size of the corpus for each label. Instead we aim to improve the accuracy of concept discovery by allowing multiple labels to be attached to each sentence, and allowing the cluster probabilities (the strength of the concept connections between these labels) to be re-computed at each instance of new data. The final clustering represents not the ONLY label to which the sentence belongs but the label to which it most strongly belongs. The use of a probability distribution, rather than a single pre-tagged label, allows a range of clustering output goals to be addressed by both coarse- and fine-grained parameter adjustment. These goals include: threshold probability, n-grams, statistical analysis, and secondary algorithmic reduction. Another advantage is low computational cost: the use of the embedded submanifold model allows the hidden topics to be found, and their similarity within the probability distribution to be utilized as an inherent clustering mechanism.

Analysis of Stability

If the AutoGen analysis is to remain relevant to a specific user over time (that is, to evolve), the analytic results must evolve to reflect changes in user interest, knowledge, and understanding. This is a challenge for all NLP applications, as well as for every application that uses classifying/clustering. Results become unstable (skewed, unreliable) if historical probability values are simply integrated with new values. There are complex ways to increase stability, but such methods also increase complexity. As mentioned above, LapPLSI's embedded sub-manifold model has a considerably lower computational cost than traditional stand-alone clustering algorithms, such as K-means, allowing for increased stability without increased complexity. One of the main advantages of this functionality is an increase in the stability threshold. As the amount of aggregate data in each dataset varies, so does the depth of 'understanding', or the discovery of latent concepts. For example, a student could use their previous year's essays as the initial AutoGen input dataset. As the current year progresses, the student could quickly and easily recompile the AutoGen input dataset with new essays included. In this way the integrity of the AutoGen results is preserved, also creating a stable evolution of learning.

5.2 Results of LapPLSI analysis in Block 3

Results of Block 3: clustering results after LapPLSI analysis and threshold culling.
5 clusters
Figure 5.1: 5 clusters, 100 aggregated files
Figure 5.2: 5 clusters, 300 aggregated files
Figure 5.3: 5 clusters, 600 aggregated files
Figure 5.4: 5 clusters, 1200 aggregated files

7 clusters
Figure 5.5: 7 clusters, 100 aggregated files
Figure 5.6: 7 clusters, 300 aggregated files
Figure 5.7: 7 clusters, 600 aggregated files
Figure 5.8: 7 clusters, 1200 aggregated files

10 clusters
Figure 5.9: 10 clusters, 100 aggregated files
Figure 5.10: 10 clusters, 300 aggregated files
Figure 5.11: 10 clusters, 600 aggregated files
Figure 5.12: 10 clusters, 1200 aggregated files

5.3 The Poems: Validation and feedback

The relevant metric for these experiments is the emulation of human concept connections. Therefore, comparing clusters of sentences to hand-tagged labels is not useful, and the output (conceptually connected sentences arranged as poems) was instead analyzed by 20 people. Each volunteer received 20 poems: 10 AutoGen poems and 10 human-created poems. After reading each poem, a choice was made between "human created", "computer created", and "I don't know". It was not practical to send each volunteer six sets of poems; therefore the best output was chosen and used for reader-feedback testing. The results are shown in the graphs that follow.

An Architecture-Generated Poem

Chaos

In the end, we removed the memory. To begin... We added the memory to religion. quadrupled the effective speed. motivated the need for social networks. doubled the speed of the dreamers. Experts added the memory to our sector. Crazy people added the memory to every sector. Hackers removed the memory from religion. Mad scientists added the memory to our sector. tripled the effective speed. Social networks might not be the panacea that experts expected. Social networks might not be the panacea that crazy people expected. This might not be the panacea that experts expected. The question is, will it satisfy all of these questions? motivate the need for the child. Over time, we tripled the space of our empathic sector. Nevertheless, the child might not be the panacea the experts expected. Next, we removed the memory from the low-energy sector. added the memory to the omniscient sector. This might seem unexpected but fell in line with our expectations. quadrupled the popularity of a child. motivate the need for smalltalk.

Legend for Figure 5.13:
1. Chaos
2. I Remove the Memory
3. The Answer is Yes
4. The question is
5. Anonymized
6. humanity
7. Hypothesis Humanity
8. intentions
9. The rest of life is a maze
10. puppies and sunsets and boats and spaceships

Figure 5.13: Results of poem feedback. [Chart: number of responses ("Human", "Don't know", "Computer") for each poem, with each poem labelled by number of clusters and number of documents.]

Chapter 6
Conclusions and Future Work

To simulate human-computer interaction with the AutoGen, much thought was put into how to generate pseudo-random data for simulation, testing, and baseline creation. The input data is not tagged with topics, part of speech, or grammatical structure. Dictionaries like WordNet, containing pre-defined words, were not used.
Text generation tools based on syntactic or semantic tagging, or text sourced from internet searches, were not used. This meant that chunking (based on phrases) and part-of-speech tagging could not be used, nor could the many online text generation tools based on WordNet, part-of-speech tagging, and internet search / text retrieval. Effectively, pre-conception was minimized as a tool for clustering decisions. The syntactic structure of the SCIgen template was retained but modified, to reflect, in an informal way, a basic syntactic structure such as children might learn over time as they listen to others speak. A more in-depth study of the "language is learned" side of the "learned versus inherent" origins-of-language debate can be found in Chomsky's Reflections on Language [3] and similar texts. The ability to create baseline metrics (reading level, cognition, aptitude, etc.) in future work, and to re-evaluate these metrics in real time, will increase the relevance of results.

Using specific analysis tools (in a specific order), a relaxed semantic structure, and embedded sub-manifold analysis, we have shown good results for latent pattern recognition while maintaining granularity and parameter control, portability options, and the low computational costs associated with distribution-based clustering. This allows for future work and stable evolution of learning at a higher threshold of data input, portability options, and user-customizable baselines.

The output clusters well around topics / concepts that lend themselves to poetry: God, people, children, puppies and kittens, etc. The dictionary therefore appears well balanced. The order-of-analysis choices (using statistical analysis as a first step, lexical analysis as a second step) also appear to be well justified. Although a reversal of the statistical/lexical analysis order was not attempted, the chosen order of analysis has preserved and/or enhanced the visibility of the latent semantic structure sufficiently to produce favourable AutoGen results.

Trials were run for files that aggregated 100, 300, 600, and 1200 documents. Each aggregate file contained an average of 1800, 4800, 8700, and 15900 documents (in our case, sentences), respectively. Documents aggregated from a number of input files substantially larger than 1200 required computational time and space beyond the available resources. 600 dimensions (unique terms) is considered optimal [11] and, in our case, produced good results. In future work it would be interesting to see at what point an increase in documents results in a substantial increase in value, and how that correlates with an increase in the number of clusters run in the LapPLSI algorithm.

Each document (sentence) in the term-document matrix contains between 3 and 12 words. Sentences with a maximum of 20 words were also tested, but the results did not noticeably improve or degrade. A maximum of 12 words per 'line' works well for poetry. In future work it would be interesting to find a threshold for natural sentence chunking and analyze its effect on output.

This architecture was validated in the context of feedback from 20 people in a blind study. The results show that the architecture does not require the computationally expensive step of external clustering to produce groups of sentences (poems) with a favourable response. They also show that the architecture does not require tagged text, tagged dictionary entries, or phrasal chunking to create and discern related concepts.
These results support the hypothesis, put forward by Winograd and Flores [21], that unsupervised clustering methods bring visibility to patterns and clusters naturally existent within written text. The results also support the conclusions of Deng Cai et al. [2] that the embedded sub-manifold analysis of the LapPLSI algorithm showcases the value of statistical inference and distribution clustering as effective methods for low-cost clustering and topic modelling. The LapPLSI algorithm shows results that rival benchmark testing with the Reuters and TDT2 databases.

The goal is to identify concept connections in these sentences, where 'concept' is defined as a sociolinguistic connection. These connections must be strong enough to cluster into sets of sentences that express and extrapolate the 'topic' or 'concept' of the cluster, thus allowing the AutoGen model to create clusters of sentences (poems) that contain both variety and a reasonably cohesive theme. Research results may contribute to the development of agile, personalized, SMART technology in education and other areas of personal HCI, such as tutoring and query disambiguation.

Future work:
• Run trials to analyze higher-dimension latent concept connections.
• Seed the text generator with a weighted keyword list.
• Generate text with pre-clustered concepts.
• For each trial, conduct 50 test runs on randomly chosen clusters.

Bibliography

[1] D. Blei and J. Lafferty. Topic models. In Text Mining: Classification, Clustering, and Applications, Chapman & Hall/CRC Data Mining and Knowledge Discovery Series, pages 50–57. Taylor & Francis, 2009. ISBN 9781420059403. doi: 10.1145/312624.312649. URL https://books.google.ca/books?id=h4ommQEACAAJ.

[2] Deng Cai, Qiaozhu Mei, Jiawei Han, and Chengxiang Zhai. Modeling hidden topics on document manifold. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pages 911–920. ACM, 2008.

[3] Noam Chomsky. Reflections on Language. Pantheon, New York, 1975.

[4] Eddy J. Davelaar. Connectionist Models of Neurocognition and Emergent Behavior: From Theory to Applications (Progress in Neural Processing). World Scientific Publishing Co., Inc., River Edge, NJ, USA, 2011. ISBN 9789814340342, 9814340340.

[5] Mark A. Davenport, Chinmay Hegde, Marco F. Duarte, and Richard G. Baraniuk. Joint manifolds for data fusion. Trans. Img. Proc., 19(10):2580–2594, October 2010. ISSN 1057-7149. doi: 10.1109/TIP.2010.2052821. URL http://dx.doi.org/10.1109/TIP.2010.2052821.

[6] C.O. Hartman. Virtual Muse: Experiments in Computer Poetry. Wesleyan Poetry. University Press of New England, 1996. ISBN 9780819522399. URL https://books.google.ca/books?id=Io3lrOpdtasC.

[7] Xiaofei He, Deng Cai, Haifeng Liu, and Wei-Ying Ma. Locality preserving indexing for document representation. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '04, pages 96–103, New York, NY, USA, 2004. ACM. ISBN 1-58113-881-4. doi: 10.1145/1008992.1009012. URL http://doi.acm.org/10.1145/1008992.1009012.

[8] Xiaofei He, Deng Cai, Haifeng Liu, and Wei-Ying Ma. Locality preserving indexing for document representation. In Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '04, pages 96–103, New York, NY, USA, 2004. ACM. ISBN 1-58113-881-4. doi: 10.1145/1008992.1009012. URL http://doi.acm.org/10.1145/1008992.1009012.
[9] Daniel Jurafsky and James H. Martin. Speech and Language Processing, chapter 19: Vector Semantics, 2015. URL https://pdfs.semanticscholar.org/8bb2/04bf6fb0d564b6059868659320e1d71cc3fc.pdf?_ga=1.250525702.1076128881.1481165854.

[10] B. Kennedy and D.S. Wershler-Henry. Apostrophe. Misfits Series. ECW Press, 2006. ISBN 9781554902668. URL https://books.google.ca/books?id=GvxM1W_Ne9AC.

[11] Walter Kintsch. Predication. Cognitive Science, 25(2):173–202, 2001. ISSN 1551-6709. doi: 10.1207/s15516709cog2502_1. URL http://dx.doi.org/10.1207/s15516709cog2502_1.

[12] George F. Luger. Artificial Intelligence: Structures and Strategies for Complex Problem Solving. Addison-Wesley Publishing Company, USA, 6th edition, 2008. ISBN 0321545893, 9780321545893.

[13] Kavi Mahesh and Kurt P. Eiselt. Uniform representations for syntax-semantics arbitration. CoRR, abs/cmp-lg/9408018, 1994. URL http://arxiv.org/abs/cmp-lg/9408018.

[14] Christopher D. Manning and Hinrich Schutze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-13360-1.

[15] Ted Pedersen. Computational approaches to measuring the similarity of short contexts: A review of applications and methods. CoRR, abs/0806.3787, 2008. URL http://arxiv.org/abs/0806.3787.

[16] Ted Pedersen and Siddhart Patwardhan. Using WordNet based context vectors to estimate the semantic relatedness of concepts. In EACL '06: Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations, pages 1–8, Stroudsburg, PA, USA, 2006. Association for Computational Linguistics. ISBN 9781932432596.

[17] J.F. Quesada, Walter Kintsch, and Emilio Gomez. A computational theory of complex problem solving using the vector space model (part I): latent semantic analysis, through the path of thousands of ants. Cognitive Research with Microworlds, pages 43–84, 2001.

[18] M. S. Ryoo and J. K. Aggarwal. Robust human-computer interaction system guiding a user by providing feedback. In Proceedings of the 20th International Joint Conference on Artificial Intelligence, IJCAI '07, pages 2850–2855, San Francisco, CA, USA, 2007. Morgan Kaufmann Publishers Inc. URL http://dl.acm.org/citation.cfm?id=1625275.1625734.

[19] Lenhart Schubert. Computational linguistics. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, spring 2015 edition, 2015.

[20] J. Stribling, M. Krohn, and D. Aguayo. SCIgen. Online, 2005. URL https://pdos.csail.mit.edu/archive/scigen/.

[21] T. Winograd and F. Flores. Understanding Computers and Cognition: A New Foundation for Design. Language and Being. Ablex Publishing Corporation, 1986. ISBN 9780893910501. URL https://books.google.ca/books?id=2sRC8vcDYNEC.

Appendix A
Glossary

machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory. "In 1959, computer gaming pioneer Arthur Samuel defined machine learning as a 'field of study that gives computers the ability to learn without being explicitly programmed'" (Phil Simon, Too Big to Ignore: The Business Case for Big Data, 2013, p. 89, https://books.google.ca/books/about/Too_Big_to_Ignore.html?id=Mdb7jgEACAA&redir_esc=y).

supervised learning uses text that has been hand-tagged with semantic and/or syntactic data: parts of speech, dictionary definitions, sentence structure, etc. A hand-tagged dataset (corpus) can be created and used to run benchmark testing.
Supervised learning uses training datasets and classifying techniques (as opposed to clustering).

unsupervised learning uses clustering techniques and no hand-tagged information.

tokenize: in computational linguistics, to tokenize is to break a stream of text/symbols into words, phrases, symbols, or other meaningful and useful elements.

NLP: Natural Language Processing (NLP) is a field that combines computer science, artificial intelligence, and computational linguistics. It is concerned with the interactions between computers and human (natural) languages.

dimension of a submanifold: the dimension of the submanifold is the maximum number of linearly independent vectors in that subspace, i.e. the rank of the matrix.

computational linguistics: "Computational linguistics is the scientific and engineering discipline concerned with understanding written and spoken language from a computational perspective, and building artifacts that usefully process and produce language, either in bulk or in a dialogue setting. To the extent that language is a mirror of mind, a computational understanding of language also provides insight into thinking and intelligence." [19]

lexical analysis addresses the actual words, or vocabulary, of a language.

syntactic analysis addresses the structure (grammar) of a language.

semantic analysis addresses the meaning inherent in the words used in a language. Context, dialect, word placement, conjugation, synonyms, metaphor, etc. all play a part in the subtleties of meaning present both in discrete words and in their interconnections as concepts.

context free grammar (cfg): simplistically, a context free grammar is a set of recursive rules used to generate patterns of strings.

Appendix B
Poems: a mixed selection

A mix of poems created by both humans and the Automatic Generator.

Into the Day

Lack spreads like snow back by the path to the iron pipe flaking and not succeeding. And over this luck comes, the bird making shadows like fortune, like heat and light, on the wing. Lack warms, it is the conduit of starlight through the shut window, lack of love hot now, luck cool by turn, the bird it likes.

Chaos

In the end, we removed the memory. To begin... We added the memory to religion. quadrupled the effective speed. motivated the need for social networks. doubled the speed of the dreamers. Experts added the memory to our sector. Crazy people added the memory to every sector. Hackers removed the memory from religion. Mad scientists added the memory to our sector. tripled the effective speed. Social networks might not be the panacea that experts expected. Social networks might not be the panacea that crazy people expected. This might not be the panacea that experts expected. The question is, will it satisfy all of these questions? motivate the need for the child. Over time, we tripled the space of our empathic sector. Nevertheless, the child might not be the panacea the experts expected. Next, we removed the memory from the low-energy sector. added the memory to the omniscient sector. This might seem unexpected but fell in line with our expectations. quadrupled the popularity of a child. motivate the need for smalltalk.

humanity

enables runs in collectively lazily mutually exclusive complexity aside simplicity aside skip these dreams performance is a concern humanity has a clear advantage separated society

Sugar

A violent luck and a whole sample and even then quiet. Water is squeezing, water is almost squeezing on lard.
Water, water is a mountain and it is selected and it is so practical that there is no use in money. A mind under is exact and so it is necessary to have a mouth and eye glasses. A question of sudden rises and more time than awfulness is so easy and shady. There is precisely that noise. A peck a small piece not privately overseen, not at all not a slice, not at all crestfallen and open, not at all mounting and chaining and evenly surpassing, all the bidding comes to tea. A separation is not tightly in worsted and sauce, it is so kept well and sectionally. Put it in the stew, put it to shame. A little slight shadow and a solid fine furnace. The teasing is tender and trying and thoughtful. The line which sets sprinkling to be a remedy is beside the best cold. A puzzle, a monster puzzle, a heavy choking, a neglected Tuesday. Wet crossing and a likeness, any likeness, a likeness has blisters, it has that and teeth, it has the staggering blindly and a little green, any little green is ordinary. One, two and one, two, nine, second and five and that. A blaze, a search in between, a cow, only any wet place, only this tune. Cut a gas jet uglier and then pierce pierce in between the next and negligence. Choose the rate to pay and pet pet very much. A collection of all around, a signal poison, a lack of languor and more hurts at ease. A white bird, a coloured mine, a mixed orange, a dog. Cuddling comes in continuing a change. A piece of separate outstanding rushing is so blind with open delicacy. A canoe is orderly. A period is solemn. A cow is accepted. A nice old chain is widening, it is absent, it is laid by.

Anonymized

Our contributions are twofold. We skip these dreams for anonymity. The roadmap of life is as follows. Pets. Our evaluation strives to make these points clear. We are anonymized life simulation. This is an important point to understand. The need for families. We leave out these dreams for anonymity. Note that skies are jagged. Few children would disagree. We withhold these dreams for anonymity. Validate the evolutionary ideas. The understanding of the child has been widely studied. We withhold these dreams for now. Verify further study. Emulation of families. Patriarchal systems. Redundancy. The deployment of kittens...

intentions

we have intentionally neglected to enable god speed. reduced the creative space of religion. In the end... removed the memory from the dreamers continuing with this rationale we removed our memory we added the memory to religion. unlike others we have intentionally neglected to visualize god space. added the memory to the omniscients to consider alternatives. This approach is —finally, we removed the memory from religion. doubled the creative space consider technology. linked thoughts. person error

2 little whos

2 little whos (he and she) under are this wonderful tree smiling stand (all realms of where and when beyond) now and here (far from a grown -up & youful world of known) who and who (2 little ams and over them this aflame with dreams incredible is)

Lighthouse Keepers: A poem in two tweets

Well eat chops and tomata sauce, or shrimps with heads, unpeeled And watch others wrecked as they put off for us. The voice of society drowned in a greenwood of glancing waves. Tending the light and you, a monotony of two.

I Remove the Memory

First we removed the memory. Reduce personal space. Finally, we removed the memory from our religions. Over time, we added the memory to our adaptives... We doubled the effectives...
Wizards halved the creative space of the dreamers. Lastly, we added the memory back to our religions. In the end, we removed the memory from the dreamers. We added the memory to the religions. Crazy people removed the memory from Gods human test subjects. Experts removed the memory from the religions. Mad scientists tripled the effective space of our sectors. We doubled the expected power of the random... We halved the expected output of the human test subjects. Simplicity in humanity is not a quandary. In the end, we added the memory to the omniscients.

Hypothesis Humanity

Such a hypothesis might seem unexpected but is derived from known results. This seems to hold in most cases. Though it might seem unexpected, it is derived from known results. This outcome might seem perverse but has ample historical precedence. This is always a key aim but is derived from known results. Thus, our vision for the future certainly includes humanity. While such a claim might seem unexpected, it has ample historical precedence. Such a claim might seem unexpected but is derived from known results. Our analysis holds surprising results for the patient reader. As a result, the design that humanity uses is feasible. As a result, the methodology that humanity uses is not feasible. Clearly, the methodology that humanity uses is unfounded. Thusly, our vision for the future of theology certainly includes humanity. Although it is not a natural aim, it has ample historical precedence.

your homecoming will be my homecoming–

your homecoming will be my homecoming– my selves go with you,only i remain; a shadow phantom efigy of seeming (an almost someone always who's noone) a noone who,till their and your returning, spends the forever of his loneliness dreaming their eyes have opened to your morning feeling their stars have risen through your skies: so,in how merciful love's own name,linger no more than selfless i can quite endure the absence of that moment when a stranger takes in his arms my very life who's your –when all fears hopes beliefs doubts disappear. Everywhere and joy's perfect wholeness we're

Project for a Fainting

Oh, yes, the rain is sorry. Unfemale, of course, the rain is with her painted face still plain and with such pixel youd never see
it in the pure freckling, the lacquer of her. The world is lighter with her recklessness, a handkerchief so wet it is clear. To you. My withered place, this frumpy home (nearer to the body than to evening) miserable beloved. I lie tender and devout with insomnia, perfect on the center pillow past midnight, sick with the thought of another year of waking, solved and happy, it has never been this way! Believe strangers who say the end is close for what could be closer? You are my stranger and see how we have closed. On both ends. Night wets me all night, blind, carried. And watermarks. The plough of the rough on the slick, love, a tendency toward fever. To break. To soil. Would I dance with you? Both forever and rather die. It would be like dying, yes. Yes I would. I have loved the slaking of your forgetters, your indifferent hands on my loosening. Through a thousand panes of glass not all transparent, and the temperature. I felt that. What you say is not less than that.

though your sorrows not

though your sorrows not any tongue may name, three i'll give you sweet joys for each of them "But it must be your" whispers that flower murmurs eager this "i will give you five hopes for any fear, but it Must be your" perfectly alive blossom of a bliss "seven heavens for just one dying,i'll give you silently cries the (whom we call rose a)mystery "but it must be Your"

The Answer is Yes

The visualization of the child. The need for kittens. The answer is yes. Yes, but with low probability. This is arguably astute. Flip-flops bunnies spaceships robots applied to the development of robots Proves the need for a puppy.

The rest of life is a maze

The rest of life is a maze. The roadmap of life is as follows. We withhold these dreams for anonymity. Our rhythm is composed of a library, a library, and a library. The library and the library must run on the same track. To what extent can a kitten be simulated to fix this obstacle? We leave out these dreams for now. Here, we solved all of the challenges inherent in the prior work. Thus, comparisons to this work are astute. To what extent can activities be emulated to overcome this riddle?

Down where changed

If the day glow is mean and spoiled by recognition as a battery hen, you must know how the voice sways out of time into double image, neither one true a way not seen and not unseen within its bent retort we feed on flattery of the absent its epic fear of indifference all over again and then thats it, the whole procession reshuffles into line.

Eros at Temple Stream

The river in its abundance many-voiced all about us as we stood on a warm rock to wash slowly smoothing in long sliding strokes our soapy hands along each other's slippery cool bodies Quiet and slow in the midst of the quick of the sounding river our hands were flames stealing upon quickened flesh until no part of us but was sleek and on fire

puppies and sunsets and boats and spaceships

time machines puppies sunsets smalltalk waves red trees. boats spaceships When this began, we needed spaceships. When this began, we needed spaceships.

The Answer

Will we speak to each other making the grass bend as if a wind were before us, will our way be as graceful, as substantial as the movement of something moving so gently. We break things into pieces like walls we break ourselves into hearing them fall just to hear it.

The question is

The question is, will it satisfy all of these questions? We plan to make humanity available on the Web for public download. Those without this strength of character showed degraded. Our intent here is to set the record straight. This follows from the understanding of MMOPG games. The understanding of families. Unlike others. Intentionally neglected to enable power. The understanding of sunsets would amazingly improve. Our goal here is to set the record straight. Humanity is impossible. Unnecessary complexity. Randomized society. The simulation. Exploring response time. Humanity is impossible. Without all the unnecessary complexity. This is an element of humanity. The exploration of games would chaotically degrade. Those without this strength of character. The refinement of replication. Profoundly amplify the family. Humanity is... possible. A noisy, separated, wired society.

The humans who wrote the poems...
01 Into the Day, J.H. Prynne, from the book "Into the Day", 1972. https://www.poetryfoundation.org/poems-and-poets/poets/detail/jhprynne#poet
04 Sugar, Gertrude Stein. https://www.poetryfoundation.org/poems-and-poets/poems/detail/51212
07 2 little whos, e.e. cummings. https://www.poetryfoundation.org/poetrymagazine/browse
08 Lighthouse Keepers: A poem in two tweets, Holly Furneaux. http://doi.org/10.16995/ntn.751
11 your homecoming will be my homecoming–, e.e. cummings. https://www.poetryfoundation.org/poetrymagazine/browse?volume=97&issue=5&page=8
12 Project for a Fainting, Brenda Shaughnessy. https://www.poetryfoundation.org/poems-and-poets/poems/detail/52807
13 though your sorrows not, e.e. cummings. https://www.poetryfoundation.org/poetrymagazine/
16 Down where changed, J.H. Prynne, from the book "Down where changed", 1979. https://www.poetryfoundation.org/poems-and-poets/poets/detail/jhprynne#poet
17 Eros at Temple Stream, Denise Levertov. https://www.poetryfoundation.org/poetrymagazine/browse?contentId=29437
19 The Answer, Robert Creeley. https://www.poetryfoundation.org/poetrymagazine/browse?volume=106&issue=1&page=40