Namepy step 6 – Load the data into the database

(This is part of the namepy project. Start at Namepy – on the shoulders of giants)

We will need the following data in the database:

  • Name frequencies – baby names by year
  • Scrabble™ letter values, by (country) Scrabble™ set

Name frequencies

  1. Download the data from https://www.ssa.gov/oact/babynames/names.zip
  2. Unzip it in <project_root>/raw_data/yob1880.txt, etc
  3. You may want to add raw_data to .gitignore, so it doesn’t get stored in your git repo
  4. Create some code to read files and store in PostgreSQL – read_name_frequencies.py
    def read_frequencies_from_file(filename, names):
        print(filename)
        year = int(filename[3:7])
    
        year_frequencies = {}
        for name in names:
            year_frequencies[name] = {'F': 0, 'M': 0}
    
        with open('raw_data/%s' % filename) as file:
            for line in file.readlines():
                try:
                    name_text, sex, count = line.split(",")
                except:
                    print("Couldn't parse line")
                    print(line)
                    print
                    continue
    
                if name_text not in names:
                    name = Name(name=name_text)
                    db.session.add(name)
                    db.session.commit()
                    names[name_text] = name.id
                    year_frequencies[name_text] = {'F': 0, 'M': 0}
    
                year_frequencies[name_text][sex] = int(count)
    
            for name, name_frequency in year_frequencies.iteritems():
                if name_frequency['F'] + name_frequency['M']:
                    name_id = names[name]
                    frequency_record = NameFrequency(name_id=name_id,
                        year=year,
                        boys_count=name_frequency['M'],
                        girls_count=name_frequency['F'])
                    db.session.add(frequency_record)
                    db.session.commit()
    
    def read_name_frequencies():
        db.create_all()
    
        # Start with an empty list
        print("Deleting any previous data")
        db.session.query(NameFrequency).delete()
        db.session.query(Name).delete()
        db.session.commit()
        print("Done")
    
        names = {}
    
        # Get file list
        for filename in listdir('raw_data'):
            if filename[:3] == 'yob':
                read_frequencies_from_file(filename, names)
    
  5. Run the code. Note that this may take a while to run. On my development machine it took about 8 minutes
  6. Check this in the database, for instance with phpPgAdmin or pgAdmin

Scrabble™ letter values

For a list of Scrabble™ letter values by Scrabble™ set see this Wikipedia entry. The following code will grab this page, extract the letter values and save this in the database

  1. Add the new tables to models.py
    class Set(db.Model):
        id = db.Column(db.Integer, primary_key=True)
        name = db.Column(db.String())
        scores = db.relationship('LetterScore', backref='set', lazy='dynamic')
    
    class LetterScore(db.Model):
        id = db.Column(db.Integer, primary_key=True)
        set_id = db.Column(db.Integer, db.ForeignKey('set.id'))
        score = db.Column(db.Integer)
        letter = db.Column(db.String(1))
    
  2. Write some code to parse this page and store the results in the database. See the source code in my GitHub repo
  3. Run the code
  4. Check the results in the database. See above (end of name frequencies section) for some suggested tools

Done

Next

Time to pull it all together and show some real charts

Continue to Step 7 – Bringing it all together