Disaster Area

Letter Frequencies in Names of Virginia Cities

Recently I’ve noticed a couple of trivia posts on Facebook asking you to name a city in Virginia with a certain letter in it. As background, these posts are coming from my friends who live in Virginia. You also have to be at least somewhat familiar with Virginia’s municipal structure. Like other states, Virginia is divided into counties. In most other states, the next step down is cities and/or towns. In Virginia, cities are independent and sit at the same level as counties. So the bar for being a city is pretty high, and many people can name a decent number of them if they put their minds to it. I wanted to find out whether there were any letters not used in any city name. And because (like many programmers) I’m lazy and didn’t want to look through all of the cities manually, I turned to Python to help me.

Start with some basic setup. Counter is a counted set: it keeps a tally of how many times each item has been added. pprint provides a nicer way to print lists and other containers by adding line breaks and such. BeautifulSoup is a really easy-to-use HTML parser and navigator. Requests makes it easy to download things from the web.

from collections import Counter
import string
from pprint import PrettyPrinter

from bs4 import BeautifulSoup
import requests

# A prettier way to print
pprint = PrettyPrinter().pprint
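To make the "counted set" idea concrete, here is a minimal sketch that runs without any network access. The two names are a hand-picked sample, not the scraped list, and `sample_names` and `sample_count` are names made up for this illustration.

```python
from collections import Counter

# Hand-picked sample input (an assumption for illustration only)
sample_names = ["Salem", "Galax"]

sample_count = Counter()
for name in sample_names:
    # A string is an iterable of characters, so update() tallies each one
    sample_count.update(name.lower())

print(sample_count["a"])            # 3
print(sample_count["z"])            # 0 (missing keys count as zero)
print(sample_count.most_common(2))  # [('a', 3), ('l', 2)]
```

Looking up a letter that was never seen returns 0 instead of raising a KeyError, which is handy when checking for letters that don't appear at all.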

Wikipedia provides a nice list of the cities in Virginia. I find all the tables (<table> elements) on the page, and the second one is the one I want. I then get a list of the text of all links (<a> elements) in that table.

city_response = requests.get('http://en.wikipedia.org/wiki/Cities_in_virginia')
# Name the parser explicitly so BeautifulSoup doesn't have to guess one
city_soup = BeautifulSoup(city_response.content, 'html.parser')

tables = city_soup.find_all('table')
# Python starts counting at 0 like many other programming languages
table = tables[1]
# There are some blank links, so drop those
cities = [link.string for link in table.find_all('a') if link.string]
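If you want to try the "second table, all link text" logic without hitting Wikipedia or installing anything, here is a rough sketch that swaps in the standard library's `html.parser` for BeautifulSoup. The HTML snippet and the `TableLinkText` class are made up for this illustration, and the sketch assumes the tables are not nested.

```python
from html.parser import HTMLParser


class TableLinkText(HTMLParser):
    """Collect the text of each <a> inside one <table>, chosen by 0-based index."""

    def __init__(self, table_index):
        super().__init__()
        self.table_index = table_index
        self.tables_seen = 0
        self.in_target = False
        self.in_link = False
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            # Assumes tables are not nested inside each other
            self.in_target = self.tables_seen == self.table_index
            self.tables_seen += 1
        elif tag == "a" and self.in_target:
            self.in_link = True
            self.links.append("")

    def handle_endtag(self, tag):
        if tag == "table":
            self.in_target = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_link:
            self.links[-1] += data


# Made-up stand-in for the Wikipedia page
SAMPLE_HTML = """
<table><tr><td><a href="#">Legend</a></td></tr></table>
<table>
  <tr><td><a href="/wiki/Norfolk">Norfolk</a></td></tr>
  <tr><td><a href="/wiki/Newport_News">Newport News</a></td></tr>
  <tr><td><a href="#"></a></td></tr>
</table>
"""

parser = TableLinkText(table_index=1)
parser.feed(SAMPLE_HTML)
# Drop the blank links, as in the BeautifulSoup version
sample_cities = [text for text in parser.links if text]
print(sample_cities)  # ['Norfolk', 'Newport News']
```

This is a lot more code than the BeautifulSoup version for the same result, which is a fair argument for using BeautifulSoup.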

Now run through the list of cities, and add each one to a Counter. Then I print out how many cities there are and a list of the letters used, sorted by how often they were used.

city_count = Counter()
for city in cities:
    # Add each letter of the lowercase city name to the Counter
    city_count.update(city.lower())

print("{} cities".format(len(cities)))
pprint(city_count.most_common())
39 cities
[('a', 35),
 ('r', 31),
 ('o', 30),
 ('n', 29),
 ('e', 27),
 ('s', 25),
 ('l', 24),
 ('i', 23),
 ('t', 17),
 ('h', 14),
 ('u', 11),
 ('c', 11),
 ('b', 10),
 ('f', 10),
 ('g', 10),
 ('p', 9),
 ('m', 9),
 ('d', 8),
 ('k', 7),
 ('v', 6),
 ('w', 6),
 (' ', 6),
 ('x', 4),
 ('y', 2),
 ('q', 1)]

I guess space (‘ ’) counts as a letter for those cities whose names are two words. Now I want to see which letters aren’t used. To do this, I make a set of the letters that were found and subtract it from a set of all lowercase letters.

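Sidenote: Confused because I said earlier that Counter was a counted set? Conceptually it is, but it’s actually a dict subclass rather than a set, so I can’t subtract it from a set directly. Calling set() on it gives a set of its keys, which is what I want here.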

not_in_cities = set(string.ascii_lowercase) - set(city_count)
pprint(not_in_cities)
{'z', 'j'}

Bet you can’t name a city that contains the letter ‘J’!
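The set subtraction can be tried on a small scale, too. This is a minimal sketch using a single hand-picked name instead of the scraped list; `sample_count`, `seen`, `unused`, and `direct_subtraction_works` are names made up for the illustration.

```python
import string
from collections import Counter

# Hand-picked sample input (an assumption for illustration only)
sample_count = Counter("newport news")

# set(counter) keeps only the keys: the distinct characters that were seen
seen = set(sample_count)
unused = set(string.ascii_lowercase) - seen

# The space is a key too, but it can't show up in the result because
# ascii_lowercase contains only the 26 letters
print(" " in seen)         # True
print(len(unused))         # 18
print(sorted(unused)[:3])  # ['a', 'b', 'c']

# Subtracting the Counter itself from a set is not supported
try:
    set(string.ascii_lowercase) - sample_count
    direct_subtraction_works = True
except TypeError:
    direct_subtraction_works = False
print(direct_subtraction_works)  # False
```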

And because they’re pretty big, let’s look at incorporated towns. Again, Wikipedia has a list of them that I’ll use. This list isn’t in a table, so it’s a bit harder to pull out.

town_response = requests.get('http://en.wikipedia.org/wiki/List_of_towns_in_Virginia')
# Name the parser explicitly so BeautifulSoup doesn't have to guess one
town_soup = BeautifulSoup(town_response.content, 'html.parser')

First I grab the main content area. Wikipedia displays the actual town names as list items (<li> elements) within unordered lists (<ul> elements), one list for each letter of the alphabet. Because there are some other unordered lists in the content, I find all unordered lists that are only one level down (this is starting to get into the specifics of HTML, but it’ll be over soon). Then I go through each list and save the link text for the first link I find in each list item. After that, the process is the same as for the cities.

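town_content = town_soup.find('div', id='mw-content-text')
# Keep only the lists that are direct children of the content area
lists = [element for element in town_content.find_all('ul') if element.parent == town_content]
towns = []
for ul in lists:
    for element in ul.find_all('li'):
        # element.a is the first link in the list item, or None if there isn't one
        town_link = element.a
        if town_link is None or not town_link.string:
            continue
        # The references list has rel="nofollow", and we don't want the references
        if 'rel' not in town_link.attrs:
            towns.append(town_link.string)

# Create the Counter before updating it
town_count = Counter()
for town in towns:
    town_count.update(town.lower())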

print("{} towns".format(len(towns)))
pprint(town_count.most_common())
190 towns
[('e', 310),
 ('a', 278),
 ('l', 272),
 ('n', 258),
 ('o', 248),
 ('r', 232),
 ('i', 216),
 ('t', 200),
 ('s', 174),
 ('c', 154),
 ('h', 102),
 ('d', 100),
 ('u', 94),
 ('b', 88),
 (' ', 82),
 ('g', 82),
 ('m', 70),
 ('p', 66),
 ('w', 66),
 ('v', 64),
 ('y', 64),
 ('k', 60),
 ('f', 38),
 ('x', 16),
 ('j', 6),
 ('.', 4),
 ('q', 4),
 ('z', 2)]

not_in_towns = set(string.ascii_lowercase) - set(town_count)
pprint(not_in_towns)
set()

set() means that every letter is used (the ‘.’s are from St. Charles and St. Paul). Now I want to see what the counts are for the cities and towns combined. I create a new Counter from the cities’ counts and then add the towns’.

city_town_count = Counter(city_count)
city_town_count.update(town_count)
pprint(city_town_count.most_common())
[('e', 337),
 ('a', 313),
 ('l', 296),
 ('n', 287),
 ('o', 278),
 ('r', 263),
 ('i', 239),
 ('t', 217),
 ('s', 199),
 ('c', 165),
 ('h', 116),
 ('d', 108),
 ('u', 105),
 ('b', 98),
 ('g', 92),
 (' ', 88),
 ('m', 79),
 ('p', 75),
 ('w', 72),
 ('v', 70),
 ('k', 67),
 ('y', 66),
 ('f', 48),
 ('x', 20),
 ('j', 6),
 ('q', 5),
 ('.', 4),
 ('z', 2)]
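The copy-then-update pattern used for the combined counts can be checked on a small scale. This is a minimal sketch with two hand-picked names standing in for the real data; `city_sample`, `town_sample`, and `combined` are names made up for the illustration.

```python
from collections import Counter

# Hand-picked sample inputs (assumptions for illustration only)
city_sample = Counter("salem")
town_sample = Counter("st. paul")

# Counter(other) makes a copy, so the original city counts stay untouched
combined = Counter(city_sample)
combined.update(town_sample)  # update() adds counts instead of replacing them

print(combined["s"])     # 2
print(city_sample["s"])  # 1

# With only positive counts, the + operator gives the same result in one step
print(combined == city_sample + town_sample)  # True
```

The `+` operator returns a new Counter and drops any entries whose count is zero or negative, which makes no difference for letter tallies like these.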