An Exploration in Natural Language Processing
By Gene Wilburn
Just as I began learning basic techniques for natural language processing (sometimes called “computational linguistics”) in the Python programming language, I read that F. Scott Fitzgerald’s The Great Gatsby had been released into the public domain. As an English major (B.A., M.A.) who had pivoted into a career in IT, I was drawn to this like a folksinger to a new acoustic guitar. I knew I had to try out my new licks on Gatsby, so I downloaded the plain text version of the novel from Project Gutenberg.
As is the case with most lit majors, I’m a word addict, so I thought I’d pull out some of Fitzgerald’s vocabulary to see if it was in any way exceptional — meaning the words he used, not the deft way he put them together in his classic American novel. This was not intended as a form of literary criticism. It’s more of a word-watcher’s curiosity about how a gifted writer used his word hoard.
By extracting all the individual words from the plain text form of the novel, then putting them through a stop list of words like a, an, and, or, but, the, etc., I had a working list of relevant words. Using the resources of the excellent NLTK (natural language toolkit) module (NLTK Book), I was able to prepare a frequency distribution list that could be used to highlight the most frequently used words as well as those used only once or a few times.
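A minimal sketch of this kind of word count, assuming NLTK is installed. The tiny stop list, the tokenizing pattern, the helper name word_frequencies, and the sample sentence are all invented for illustration; a real run would read the full text of the novel and use a much fuller stop list.

```python
import re

from nltk import FreqDist

# Hypothetical, deliberately tiny stop list (assumption, not the one used above).
STOP_WORDS = {"a", "an", "and", "or", "but", "the", "of", "in", "to", "was"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Lower-case the text, split it into words, drop stop words, and count."""
    # Runs of letters, optionally joined by an apostrophe or hyphen,
    # so that forms like "dog-days" and "don't" stay in one piece.
    words = re.findall(r"[^\W\d_]+(?:['’-][^\W\d_]+)*", text.lower())
    return FreqDist(w for w in words if w not in stop_words)


# Made-up sample sentence, not a quotation from the novel.
sample = "The car stopped and the car door opened, but the house was dark."
freq = word_frequencies(sample)
print(freq.most_common(3))
```

FreqDist behaves like a dictionary of word counts, so freq["car"] gives the count for a single word and freq.most_common(n) gives the n most frequent words.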
It turned out that words Fitzgerald used most were not particularly interesting or enlightening. Examining the words used 50 or more times one sees common words like I, she, he, said, Gatsby, Tom, Daisy, you, house, car, get, and something. Nothing particularly inspiring.
It then occurred to me that it might be much more interesting to look at Fitzgerald’s least used words, hoping to find some fancier words he used only occasionally. NLTK’s FreqDist class, which builds the frequency distribution, also has a method called .hapaxes(). The name derives from the Greek expression hapax legomenon, meaning, literally, “something said only once.” The method, appropriately, flags all words used only once in the novel.
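As a small illustration of .hapaxes(), assuming NLTK is installed, the token list below is invented and stands in for the filtered words of the novel:

```python
from nltk import FreqDist

# Invented token list standing in for the novel's words.
tokens = ["gatsby", "daisy", "amorphous", "gatsby", "daisy", "aquaplanes", "gatsby"]

freq = FreqDist(tokens)

# hapaxes() returns every word whose count is exactly 1.
once_only = sorted(freq.hapaxes())
print(once_only)  # ['amorphous', 'aquaplanes']
```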
This immediately produced more interesting results, as varied as adventitious, amorphous, aquaplanes, vestibules, and wall-scaling, along with common words used only once. By experimenting with selecting various degrees of frequency, I found that the most interesting all-around list was obtained by including all words used three times or fewer.
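Selecting by degree of frequency could look something like the sketch below. The helper name rare_words and the counts are hypothetical; the function accepts any mapping of word to count, which includes an NLTK FreqDist as well as the plain Counter used here.

```python
from collections import Counter


def rare_words(freq, max_count=3):
    """Return, alphabetized, the words that occur max_count times or fewer."""
    return sorted(word for word, count in freq.items() if count <= max_count)


# Invented counts for illustration only.
counts = Counter({"gatsby": 5, "vestibule": 3, "amorphous": 1, "house": 4, "postern": 2})

print(rare_words(counts))               # ['amorphous', 'postern', 'vestibule']
print(rare_words(counts, max_count=1))  # ['amorphous']
```

Raising or lowering max_count is the experiment described above: 1 reproduces the hapaxes, while 3 gives the broader list.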
Although this was interesting, it seemed to me that it would be doubly interesting to a word hound to compare The Great Gatsby with another novel of the same period. I thought of Hemingway, since Fitzgerald and Hemingway were friends, but Hemingway’s use of simple vocabulary makes him less interesting in terms of the actual words he employed.
Gatsby was published in 1925. By coincidence I had just finished reading a 1926 murder mystery novel called The Benson Murder Case, by S.S. Van Dine, the pseudonym for American art critic and writer Willard Huntington Wright, an erudite writer whose detective, Philo Vance, was at one time highly popular with readers and who was featured in several Hollywood films. Amateur detective Vance, a kind of American version of the British sleuth Lord Peter Wimsey, was played in films by actors William Powell (before his Nick Charles period), Basil Rathbone, and Edmund Lowe (Wikipedia, “S.S. Van Dine”). Nick Carraway rubbed shoulders with the rich. Vance was a member of rich New York society and an art collector and, as such, had a remarkably sophisticated, at times foppish, vocabulary. The novel sent me to the dictionary several times to look up new words.
I purchased an Epub edition of S.S. VAN DINE Premier Collection: Thriller Classics, Murder Mysteries, Detective Tales & More, extracted the text of The Benson Murder Case, and put it through the same lexical procedures as Gatsby, likewise limiting the word list to words used three times or fewer. I then converted all the words to lower case, alphabetized both lists, and filtered the two lists together using a Unix/Linux utility called comm, which compares two sorted files line by line. It put the results in three columns: words used only by Fitzgerald, words used only by Van Dine, and words used by both. The full list is here.
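The same three-way split can be approximated in Python with sets. This is a sketch rather than a reimplementation of comm: it assumes each list holds one entry per word, and the function name compare_word_lists and the sample words are invented for illustration.

```python
def compare_word_lists(words_a, words_b):
    """Split two word lists into (only in a, only in b, in both), alphabetized."""
    a = {word.lower() for word in words_a}
    b = {word.lower() for word in words_b}
    return sorted(a - b), sorted(b - a), sorted(a & b)


# Invented samples standing in for the two rare-word lists.
gatsby_sample = ["Vestibule", "amorphous", "teutonic"]
benson_sample = ["badinage", "vestibule", "Teutonic"]

only_gatsby, only_benson, shared = compare_word_lists(gatsby_sample, benson_sample)
print(only_gatsby)  # ['amorphous']
print(only_benson)  # ['badinage']
print(shared)       # ['teutonic', 'vestibule']
```

Lower-casing before the comparison matters for the same reason it does with comm: without it, "Teutonic" and "teutonic" would be treated as different words and land in separate columns.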
I then imported the list into Google Docs and exported it as an Epub file that I loaded into Apple Books on my iPad. This allowed me to do a leisurely read through the list and highlight words from each author that struck me as being “interesting” and at least slightly out of the ordinary. When I had finished scanning, I manually copied the results for each author into the listings below:
Great Gatsby (1925)
abortive, adventitious, aluminium, amorphous, aquaplanes, araby, asunder, beluga, cahoots, caravansary, caterwauling, chartreuse, coney, convivial, crêpe-de-chine, debauchee, demoniac, dilatory, distraught, divot, dog-days, duckweed, echolalia, ectoplasm, euphemisms, expostulation, fishguards, flounced, foxtrot, fractiousness, grail, harlequin, holocaust, hornbeams, hors-d’oeuvre, humidor, inconsequence, inessential, jonquils, juxtaposition, knickerbockers, lustreless, meretricious, nonolfactory, obstetrical, pasquinade, petrol-pumps, plagiaristic, platonic, pneumatic, portentous, postern, prig, probity, rot-gut, rotogravure, sea-change, sheik, somnambulatory, staid, substantiality, subterfuges, teutonic, vestibule, wall-scaling, whitebait
Benson Murder Case (1926)
a-flutter, a-kimbo, acerbities, adipose, amasis, animadversions, approbation, aquiline, argot, arrentine, astigmatic, badinage, ballyrag, bezique, bisonic, brachycephalic, bunjinga, burglarious, casuistic, champêtre, chef-d’œuvre, cinquecento, cloisonné, confab, contretemps, craniological, darwinian, davenport, deltoids, derring-do, déshabillé, diatonic, discommode, disharmonious, dissolution, dolichocephalic, dulcet, dyspnœa, ebullition, embayed, emulsification, endocrines, factitious, factotum, flâneur, flummery, forensic, garrulous, gewgaws, halcyon, hauteur, hedonist, helixometer, hirsute, impecunious, imputation, inamorato, infinitesimal, ingress, inspissated, joss-sticks, juxtaposition, lambrequin, leptorhine, lèse-majesté, lineaments, loquacious, lugubriously, mandragora, mêlée, mellifluously, mock-turtle, modish, moue, myrmidons, obduracy, orthognathous, oubliettes, palaver, peccadilloes, perfeccionados, perspicacious, phrenologist, platitudinarian, plebeian, polychrome, popinjay, prognathous, protasis, puerility, quavering, quixotic, rapprochement, ratiocination, redolent, remonstrances, repine, reproche, rubicund, sabreur, sardonic, sententiously, sequester, smouldering, sobriquet, soirée, somnolently, soupçon, stertorous, suave, sybarite, sycophant, syllogism, tenter-hooks, tête-à-tête, teutonic, tonneau, totemistic, triptych, truculent, tutelary, twitted, ventral, vestibule, viscid, vitiated, vituperation, vortices, what-for, whirlin’-dervish
There are a couple of things one might conclude from comparing the lists. The first is that F. Scott Fitzgerald did not use an especially challenging vocabulary for The Great Gatsby. This makes the novel suitable for readers of younger ages, say high school or first-year university students. The second is that you don’t need fancy words to create a masterpiece. Gatsby has stood the test of time.
S.S. Van Dine, though once highly popular, has faded into relative obscurity. Part of that may be attributed to his more challenging vocabulary and part to the writing itself, which is slow-paced for a detective novel.
Those of us who are addicted to detective fiction are used to authors with large vocabularies. In fact, a work of detective fiction that doesn’t offer some word challenges is a disappointment.
The bottom line: S.S. Van Dine walks away with the prize for most interesting vocabulary, while F. Scott Fitzgerald walks away with a literary masterpiece. A generous reader can enjoy both.