And this weekend, I came up with the perfect solution: a Markov chain random text generator. A Markov chain is a random process in which the next state depends only on the current state. In the case of a Markov chain random text generator, the current state is the last n characters (or words) generated. My program works with characters, but the concept applies just as well to words.
A Markov chain text generator first reads some sample text. From this it builds a kind of probability table of which letters follow which other letters more or less frequently. Then it generates letters one at a time (considering only the last n letters generated), weighting the choices so that the more common letters and letter combinations occur more frequently.
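The table-building step can be sketched in a few lines. This is a minimal Python sketch, not the post's actual code; the function and variable names are my own:

```python
from collections import defaultdict

def build_table(names, order):
    # Map each context (the last `order` characters) to a count of
    # the characters that follow it in the sample names.
    table = defaultdict(lambda: defaultdict(int))
    for name in names:
        for i in range(len(name)):
            context = name[max(0, i - order):i]
            table[context][name[i]] += 1
    return table

table = build_table(["San Francisco", "Cincinatti", "Seattle"], order=3)
# table[""] holds the first-letter counts: {"S": 2, "C": 1}
```

Contexts shorter than `order` (at the start of a name) fall out naturally from the slice, so the same table covers the first few letters too.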
My particular generator is slightly different from most others because it knows how to begin and end. Most Markov text generators are used to make large blocks of text based on entire books. Mine is meant for short names, where the beginning and end matter more: it's good to have a name that ends in a logical way, such as "ton" or "ville", instead of just being cut off in the middle. So I treat the end of the name as a character like any other. It goes into the table the same as the rest, and it gets generated the same as the rest, except that when it's generated, the name is done and generation stops.
Here's an example of how it works, using a name it actually generated: San Francinattle. For early testing, I used a list of the 100 largest cities in the US, including San Francisco, Cincinatti and Seattle. I'll only consider those three for simplicity. More names affect the probabilities involved, but not the principles. In this example, I'm using order = 3.
For the first letter, two names start with "S" and one with "C", so it has a 66.7% chance of generating "S". With just "S", it has a 50-50 shot of producing "a" or "e" next. With our limited dataset, once "Sa" is generated, the only next option is "n". That continues until we've generated "San Franci". The order is 3, so it only considers "nci" when generating the next letter. That letter combination appears in Cincinatti as well as in San Francisco. It doesn't care what came before that, so it has a 50-50 chance of generating an "s" or an "n", and in this case it went with "n". With that combination of letters, Cincinatti is the only game in town, so it has to stick with it for the next few letters, but once the last generated letters are "att", it also matches Seattle, at which point it finishes up the city name.
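The generation loop described above, including the end-of-name character, might look like the following. Again a sketch rather than the post's real code: the "$" sentinel and the function names are my own, and the table is rebuilt inline so the snippet runs on its own:

```python
import random
from collections import defaultdict

END = "$"  # hypothetical sentinel marking the end of a name

def build_end_table(names, order):
    # Context -> {next char: count}, with END appended to each name
    # so "stop here" is learned like any other character.
    table = defaultdict(lambda: defaultdict(int))
    for name in names:
        padded = name + END
        for i in range(len(padded)):
            table[padded[max(0, i - order):i]][padded[i]] += 1
    return table

def generate(table, order):
    name = ""
    while True:
        choices = table[name[-order:]]
        chars = list(choices)
        # Weighted pick: frequent follow-ups are proportionally more likely.
        nxt = random.choices(chars, weights=[choices[c] for c in chars])[0]
        if nxt == END:  # the end marker was generated, so the name is done
            return name
        name += nxt

table = build_end_table(["San Francisco", "Cincinatti", "Seattle"], order=3)
print(generate(table, order=3))
```

Every context the loop reaches was seen in training (each step slides the window along an observed transition), so the weighted pick always has at least one option and the loop always terminates at an END.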
And so: a brand-new city name, inspired by real city names and following English spelling conventions.
After some successful test runs, I compiled a list of over 19,000 US city names (data provided by your friendly neighborhood census bureau, plus a little bit of programming to strip out unnecessary parts).
Here are some of the names generated with order = 1:
Orrdak Lanson

Order = 2:
Pachelug
Mior Juengsperes Beleckbumonty
Qubrt Blle
Nesoala Isvid
Splllay
War Hilahe Crgeyrtosn
Rionnengh Fr
Ibbon
Sove
St. Cimandsonisond

Order = 3:
Rippnevillsbortham
Mase
Vinge
Easton
Blue Haveroe
Peenvilley
Mintond
Forwing
Willinbold
Wod
Bly
New Citeradermon
Tonkeen

Order = 4:
Chur
Big Vall
Pleadis
Summitol
McRae
Roscomb
McBainfieldersvillenwood
Meside
Uphalittles
Glens
South Burleve Porth
Runa
Hess Hillage Placket
Belph
Warway
Washton

It's amazing how narrow the range of orders is between unpronounceable gobbledygook and reproducing sample names exactly. Orders 2 and 3 seem pretty good to me, though. They produce names that are easy to imagine could be real, but probably aren't.
Brook
Iona
Denver
Graham
Buncombes
Dewey
Mario
Dolanagan
Onster
Hampton
Magner
Greenview
Bee City
Saling Gretna
I also got a list of about 2,000 German cities. Here are some names generated using those (order = 3):
Alzenbach
Coswintern
Völklingen
Ludwig
Kötheim
Güstralsrode
Löwentha
Pappenhüttenburg
Kalbernburg
Wolgau
Geislingen