Category Archives: Databases

Data Analysis with Python

Python is a very popular tool for data extraction, clean-up, analysis and visualisation. I’ve recently done some work in this area, and would love to do some more. I particularly enjoy using my maths background and creating pretty, clear and helpful visualisations.

  • Short client project, analysing sensor data. I took readings from two accelerometers and rotated the readings to get the relative movement between them (a sketch of the rotation step appears after this list). Using NumPy, Pandas and Matplotlib, I created a number of different charts, looking for a correlation between the equipment’s setting and the movement. Unfortunately the sensors weren’t sensitive enough to return usable information. Whilst not the outcome they were hoping for, the client told me “You’ve been really helpful and I’ve learned a lot”
  • At PyCon UK (Cardiff, September 2018) I attended 14 data analysis sessions. It was fascinating to see the range of tools and applications in Python data analytics. At a Bristol PyData Meetup I summarised the sessions in a 5-minute lightning talk. This made me pay extra attention and keep useful notes during the conference
  • Short client project, researching the best way to import a large data set, followed by implementation. The client regularly accesses large datasets using a folder hierarchy to structure that data. They were looking to replace this with a professional database, i.e. PostgreSQL. I analysed their requirements, researched the different storage methods in PostgreSQL, reported my findings and created an import script.
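
A sketch of the rotation step from the first project, assuming two accelerometers mounted at a known angle (the angle and readings below are invented for illustration):

    import numpy as np

    # Invented sample data: x/y readings from two accelerometers, one row per sample
    sensor_a = np.array([[0.10, 0.02], [0.12, 0.01], [0.11, 0.03]])
    sensor_b = np.array([[0.08, 0.05], [0.09, 0.04], [0.07, 0.06]])

    # Rotate B's readings into A's frame of reference (assumed 30 degree mounting angle)
    theta = np.radians(30)
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta),  np.cos(theta)]])
    aligned_b = sensor_b @ rotation.T

    # The difference is the relative movement between the two sensors
    relative_movement = sensor_a - aligned_b
    print(relative_movement)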

Grafana, InfluxDB and Python, simple sample

I recently came across an interesting contract position which uses Grafana and InfluxDB. I’d had a play with ElasticSearch before, and done some work with KairosDB, so I was already familiar with time series and JSON-based database connections. I had previously created a dashboard by hand, so Grafana looked rather interesting. So I thought I’d do a quick trial: generate some random data, store it in InfluxDB and show it with Grafana.

Starting with a clean virtual machine:

InfluxDB

  1. Set up InfluxDB
    1. I followed InfluxDB’s installation instructions, which worked first time without any problems
    2. Start it
      sudo /etc/init.d/influxdb start
      
  2. Test InfluxDB
    influx
    > create database mydb
    > show databases
    name: databases
    ---------------
    name
    _internal
    mydb
    
    > use mydb
    > INSERT cpu,host=serverA,region=us_west value=0.64
    > SELECT host, region, value FROM cpu
    name: cpu
    ---------
    time            host    region  value
    1466603916401121705 serverA us_west 0.64
    
  3. Set up and test influxdb-python, so we can access InfluxDB using Python
    sudo apt-get install python-pip
    pip install influxdb
    python
    >>> import influxdb
    >>>
    
  4. Run through this example of writing and reading some InfluxDB data using Python
    >>> from influxdb import InfluxDBClient
    >>> json_body = [
    ...     {
    ...         "measurement": "cpu_load_short",
    ...         "tags": {
    ...             "host": "server01",
    ...             "region": "us-west"
    ...         },
    ...         "time": "2009-11-10T23:00:00Z",
    ...         "fields": {
    ...             "value": 0.64
    ...         }
    ...     }
    ... ]
    >>> client = InfluxDBClient('localhost', 8086, 'root', 'root', 'example')
    >>> client.switch_database('mydb')
    >>> client.write_points(json_body)
    True
    >>> print client.query('select value from cpu_load_short;')
    ResultSet({'(u'cpu_load_short', None)': [{u'value': 0.64, u'time': u'2009-11-10T23:00:00Z'}]})
    
  5. Create some more data, using a slimmed down version of this tutorial script
    import argparse
    
    from influxdb import InfluxDBClient
    from influxdb.client import InfluxDBClientError
    import datetime
    import random
    import time
    
    
    USER = 'root'
    PASSWORD = 'root'
    DBNAME = 'mydb'
    
    
    def main():
        host='localhost'
        port=8086
    
        nb_day = 15  # number of days of time series to generate
        timeinterval_min = 5  # create an event every x minutes
        total_minutes = 1440 * nb_day
        total_records = int(total_minutes / timeinterval_min)
        now = datetime.datetime.today()
        metric = "server_data.cpu_idle"
        series = []
    
        for i in range(0, total_records):
            past_date = now - datetime.timedelta(minutes=i * timeinterval_min)
            value = random.randint(0, 200)
            hostName = "server-{d34bf16ac7b745ad0d2811187511ec8954163ba9b5dbe9639d7e21cc4b3adbdb}d" {d34bf16ac7b745ad0d2811187511ec8954163ba9b5dbe9639d7e21cc4b3adbdb} random.randint(1, 5)
            # pointValues = [int(past_date.strftime('{d34bf16ac7b745ad0d2811187511ec8954163ba9b5dbe9639d7e21cc4b3adbdb}s')), value, hostName]
            pointValues = {
                    "time": past_date.strftime ("{d34bf16ac7b745ad0d2811187511ec8954163ba9b5dbe9639d7e21cc4b3adbdb}Y-{d34bf16ac7b745ad0d2811187511ec8954163ba9b5dbe9639d7e21cc4b3adbdb}m-{d34bf16ac7b745ad0d2811187511ec8954163ba9b5dbe9639d7e21cc4b3adbdb}d {d34bf16ac7b745ad0d2811187511ec8954163ba9b5dbe9639d7e21cc4b3adbdb}H:{d34bf16ac7b745ad0d2811187511ec8954163ba9b5dbe9639d7e21cc4b3adbdb}M:{d34bf16ac7b745ad0d2811187511ec8954163ba9b5dbe9639d7e21cc4b3adbdb}S"),
                    # "time": int(past_date.strftime('{d34bf16ac7b745ad0d2811187511ec8954163ba9b5dbe9639d7e21cc4b3adbdb}s')),
                    "measurement": metric,
                    'fields':  {
                        'value': value,
                    },
                    'tags': {
                        "hostName": hostName,
                    },
                }
            series.append(pointValues)
        print(series)
    
        client = InfluxDBClient(host, port, USER, PASSWORD, DBNAME)
    
        print("Create a retention policy")
        retention_policy = 'awesome_policy'
        client.create_retention_policy(retention_policy, '3d', 3, default=True)
    
        print("Write points #: {0}".format(total_records))
        client.write_points(series, retention_policy=retention_policy)
    
        time.sleep(2)
    
        query = 'SELECT MEAN(value) FROM "%s" WHERE time > now() - 10d GROUP BY time(500m);' % metric
        result = client.query(query, database=DBNAME)
        print("Result: {0}".format(result))
    
    if __name__ == '__main__':
        main()
    
  6. Save as create_sample_data.py, run and test it
    python create_sample_data.py
    ......
    influx
    Visit https://enterprise.influxdata.com to register for updates, InfluxDB server management, and monitoring.
    Connected to http://localhost:8086 version 0.13.0
    InfluxDB shell version: 0.13.0
    > use mydb
    > SELECT MEAN(value) FROM "server_data.cpu_idle" WHERE time > now() - 10d GROUP BY time(500m)
    time			mean
    1466280000000000000	94.03846153846153
    1466310000000000000	98.47
    1466340000000000000	95.43
    1466370000000000000	104.3
    1466400000000000000	104.01
    1466430000000000000	114.18
    1466460000000000000	106.19
    1466490000000000000	96.67
    1466520000000000000	107.77
    1466550000000000000	103.08
    1466580000000000000	100.53
    1466610000000000000	94
    

Grafana

  1. Install Grafana using the installation instructions:
    $ wget https://grafanarel.s3.amazonaws.com/builds/grafana_3.0.4-1464167696_amd64.deb
    $ sudo apt-get install -y adduser libfontconfig
    $ sudo dpkg -i grafana_3.0.4-1464167696_amd64.deb
    
  2. Start the server, and set it to start automatically on boot up
    sudo service grafana-server start
    sudo systemctl enable grafana-server.service
    
  3. Test
    1. In your browser, go to localhost:3000
    2. Log in as (user) admin, (password) admin
  4. Connect to the InfluxDB database
    1. I followed the Instructions at http://docs.grafana.org/datasources/influxdb/
    2. Click on the Grafana icon
    3. Select “Data Sources”
    4. Click on “+ Add data source”
      1. Name: demo data
      2. Type: InfluxDB
      3. URL: http://localhost:8086
      4. Database: mydb
      5. User: root
      6. Password: root
      7. Click on “Save and Test”
    5. Create a new Dashboard
      1. Click on the Grafana icon
      2. Select “Dashboards”
      3. Click on “New”
    6. Define a metric (graph)
      1. Click on the row menu, i.e. the green icon (vertical bar) to the left of the row
      2. Select “Add Panel”
      3. Select “Graph”
      4. On the Metrics tab (selected by default)
        1. Click on the row just below the tab, starting with “> A”
        2. Click on “select measurement” and select “server_data.cpu_idle”
          1. You should now see a chart
        3. Close this, by clicking on the cross, top right hand corner of the Metrics panel
    7. Save the dashboard
      1. Click on the save icon (top of the screen)
      2. Click on the yellow star, next to the dashboard name (“New dashboard”)
    8. Test it
      1. In a new browser tab or window, go to http://localhost:3000/
      2. Log in (admin, admin)
      3. The “New dashboard” will now show up in the list of starred dashboards (and probably also under “Recently viewed dashboards”)
      4. Click on “New dashboard” to see the chart

You should now see something like this:

[Screenshot: Grafana dashboard charting the InfluxDB sample data]
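
If the chart stays empty, it is worth confirming the data really made it into InfluxDB before digging through Grafana’s settings. A quick check from Python, assuming the same database and credentials as above:

    from influxdb import InfluxDBClient

    # Count the sample points; a non-zero count means the data is there
    client = InfluxDBClient('localhost', 8086, 'root', 'root', 'mydb')
    print(client.query('SELECT COUNT(value) FROM "server_data.cpu_idle"'))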

Namepy – on the shoulders of giants

Whilst my core skill/tool is Python, I’m always learning new things, either inside or outside the Python ecosystem. I recently had the pleasure of working with Angular and Python/Flask. Here is a playful application based on these, plus Highcharts.

Some of the examples in “Python for Data Analysis” use a dataset of the frequency of (US) baby names since 1880. I thought I’d combine this with a bit of Scrabble™.
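
To give a flavour of the Scrabble half of the idea, here is a minimal sketch using the standard English letter values (the function below is my own illustration, not namepy code):

    # Standard English Scrabble letter values
    LETTER_VALUES = {
        'a': 1, 'b': 3, 'c': 3, 'd': 2, 'e': 1, 'f': 4, 'g': 2, 'h': 4,
        'i': 1, 'j': 8, 'k': 5, 'l': 1, 'm': 3, 'n': 1, 'o': 1, 'p': 3,
        'q': 10, 'r': 1, 's': 1, 't': 1, 'u': 1, 'v': 4, 'w': 4, 'x': 8,
        'y': 4, 'z': 10,
    }

    def scrabble_score(name):
        # Sum the letter values, ignoring case
        return sum(LETTER_VALUES[letter] for letter in name.lower())

    print(scrabble_score('Jacqueline'))  # 28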

In the Python world it’s common to add “py” to a word when making up names, so I’m calling this project “namepy”.

Since I’ll be using various frameworks and libraries, all created by others, I’ve subtitled this “On the shoulders of giants”.

Taking small steps often results in faster progress, so that’s what I’ll be doing here.

Technical set up

The source code is at https://github.com/CoachCoen/namepy, with one branch per step.

Many production sites use Content Delivery Networks (CDNs) to serve JavaScript frameworks and libraries, usually minified, which helps take the load off the server and may speed up the first page load. To keep things simple and stable over time, I’m using full-sized, downloaded copies.

I’m using WebFaction (affiliate link) as the host, since they make it easy to create Flask, Django and similar projects. And, since it is a popular host among developers, you’ll find lots of helpful documentation online.

Getting started

Create a project folder

mkdir namepy
cd namepy

At the start of each of the steps

cd (my folder for personal projects)
cd namepy
git clone https://github.com/CoachCoen/namepy.git -b step1 step1

Note: “-b step1” specifies the name of the branch to clone. The second “step1” is the target folder, i.e. namepy/step1.
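
To see which step branches exist before cloning, you can list them straight from the remote:

    git ls-remote --heads https://github.com/CoachCoen/namepy.git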

Next

Continue to Step 1 – Angular “Hello World”

Investment Tracking System – Django/Python

My client, a start-up with a lot of experience in their field, had identified an important gap in the market. Large sums of money were being invested, with very long payback periods, without access to effective performance tracking tools.

They designed a tool to cover the gap and asked me to create a demonstration system in preparation for generating interest and raising capital.

I developed the system in Django, Python, PostgreSQL and JavaScript. The front end uses a dashboard template based on Bootstrap and jQuery. Graphs are created using the excellent Highcharts charting library.

The resulting system imports the base data and generates monthly cost and revenue forecasts, taking into account seasonal variations, tax allowances and more.
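
The real calculations are considerably more involved, but the heart of a seasonal forecast looks something like this (the figures and weights below are invented for illustration):

    # Spread an annual revenue estimate over 12 months using seasonal weights
    annual_revenue = 120000.0
    seasonal_weights = [0.06, 0.06, 0.08, 0.09, 0.10, 0.11,
                        0.11, 0.10, 0.09, 0.08, 0.06, 0.06]  # sums to 1.0

    monthly_forecast = [annual_revenue * weight for weight in seasonal_weights]
    for month, revenue in enumerate(monthly_forecast, start=1):
        print("Month {0}: {1:.2f}".format(month, revenue))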


The main management screen gives quick access to some key performance indicators.


Constraints can be defined, and potential investments can be checked against them.


Actual results can be compared against the projections.


Different heat maps show absolute or relative performance by state or county.


This was an eight month intensive project, resulting in a demo site which generated a lot of interest in the industry and allowed the client to achieve their first round of funding.

Flask and Angular on Heroku

I am working my way through this excellent tutorial, covering Python 3, Flask, Angular, Heroku, SQLAlchemy, Alembic, requests, Beautiful Soup, NLTK, Redis and D3. Here are some extra notes:

  • To stop me from blindly copying/pasting the code, I printed off the tutorial and worked from the paper version.
  • I had some problems installing Virtualenvwrapper (on Linux Mint 17.2), until I followed these instructions
  • I had some clashes with Anaconda
    • Virtualenvwrapper’s deactivate clashed with Anaconda’s deactivate. Prompted by these instructions I renamed ~/anaconda/bin/activate and ~/anaconda/bin/deactivate
    • “pip install psycopg2” resulted in:
      Error: “setuptools must be installed to install from a source distribution”
      After much experimentation I guessed that this might be due to Anaconda. I created a new virtual machine (without Anaconda) and re-started the tutorial. This fixed the psycopg2 problem

Part 1, set up Heroku

  • I used a free Heroku account. Between a dedicated server, a WebFaction account and a HotDrupal account I’m already paying enough for hosting
  • “heroku create wordcounts-pro” gave me an error “Name is already taken”. According to this Heroku page, app names are in the global namespace, so I guess I’m not the first one to follow this tutorial. To work around this, I prepended the app name with my initials, i.e. “heroku create cdg-wordcounts-pro”, etc
  • So that I could push the changes to Heroku, I set up public key access
  • Before running “git push stage/pro master”, make sure to check in the changes to git (git add, git commit)

Part 2, set up databases

  • To create the Postgres database:
    • sudo su - postgres
    • psql
      • # CREATE DATABASE wordcount_dev;
      • # CREATE USER <your user name>;
      • # GRANT ALL PRIVILEGES ON DATABASE wordcount_dev TO <your user name>;
  • After running “heroku run python manage.py db upgrade …” I got the error message:
    No such file or directory: ‘/app/migrations/versions’

    • Locally I had an empty directory <app folder>/migrations/versions. However, git ignores empty directories. This is why I could run “.. manage.py db upgrade” locally but not on heroku
    • Oops, I’d forgotten to run
      python manage.py db migrate
      Now it worked fine
    • If you make the same mistake, remember to propagate the changes to Heroku and then re-run db upgrade on Heroku (the full sequence is below)
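
For reference, the full sequence looks like this (using the stage remote and app name from Part 1):

    python manage.py db migrate
    git add migrations/
    git commit -m "Add migrations"
    git push stage master
    heroku run python manage.py db upgrade --app <heroku app name>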

Part 3, requests, Beautiful Soup and NLTK

  • At one stage I got a server error. To sort this I looked at the heroku log:
    heroku logs --app <heroku app name>
  • When I ran the nltk downloader I didn’t get the usual gui but a “tui” (text user interface). It was fairly simple to navigate, but I didn’t bother to specify the location of the tokenizers. Instead I used the default (~/nltk_data) and then moved nltk_data into my app folder
  • The links to Bootstrap and jQuery didn’t work, either because I mistyped them or because they are out of date. The Bootstrap and jQuery websites give you up-to-date CDN links, so use those instead

Part 4, Redis task queue

  • I used these instructions to install Redis on Linux Mint
  • Apart from the inevitable few typing mistakes, everything worked remarkably smoothly. Nothing else to add

Part 5, Adding in Angular

  • It all went well, until I added the getWordCount function to the controller. I’d put the function inside the main.js file, but outside of the controller. When poller got called, none of the dependencies were included, so it couldn’t find $http (first line of poller function)
    • The error was: $http not defined
    • Despite comparing my version with the author’s GitHub one, I couldn’t see the difference. In the end I used the author’s version (of main.js) instead of mine. That worked fine. It took another line by line comparison to find the problem
  • The word/frequency list is no longer sorted: jsonify loses the order (one possible workaround is sketched below)
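
One possible workaround is to send the counts as a list of pairs, which JSON does keep in order, and let the Angular side display them as-is. A sketch (my own, not the tutorial’s code):

    from flask import jsonify

    def counts_as_json(counts):
        # counts is a dict of word -> frequency; a JSON object loses the
        # ordering, so send an ordered JSON array of [word, frequency] pairs
        pairs = sorted(counts.items(), key=lambda item: item[1], reverse=True)
        return jsonify(results=pairs)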

Part 6, Staging the changes, including Redis

  • So far I’ve been using a free account. When I tried to add on Redis, heroku tells me: Please verify your account to install this add-on plan (please enter a credit card)
    • If I understand it correctly, it is still free (but don’t take my word for it – and don’t come back to me if you end up getting charged)
    • I entered my credit card details for my Heroku account. Now I could add Redis
  • “heroku addons:add redistogo --app” gave a warning to say that “addons:add” has been deprecated.
    • I used “addons:create” instead


Administrate Project – Technical Project Manager

The client, a highly respected training company, was using an in-house developed system to manage the courses, bookings, delegates, and sales and marketing processes. When the business outgrew the system they asked me to evaluate an online solution (Administrate).

After presenting my in-depth analysis, the Board of Directors decided to proceed, and hired me as the Technical Project Manager.

I created Python scripts to clean the old database and restructure the data for the new system. Under my guidance, the internal developer integrated the client’s website with Administrate. Administrate staff and I provided the training.

Overall the transition went very smoothly. The only snag was with some of the more advanced marketing lists (used for email marketing campaigns). I had already expressed my concerns to Administrate that the existing system wouldn’t be able to handle these queries, but was reassured by them that it would. Fortunately, when my concerns were confirmed, Administrate agreed to make the necessary changes. They were very good about it and did this without any protest, although this may have been different without my careful preparations. Administrate spent a good few developer months creating a completely new reporting system, in consultation with me.

Meticulous preparations meant a smooth transition and saved the client months of bespoke work.

Django Migrations

Django has a great system for maintaining your database structure, for keeping your RDBMS (typically MySQL, PostgreSQL or SQLite) in line with your Django-defined Models, and within your version control system.

I recently hit a snag with a live client site, which already has hundreds of registered users, a significant number in a small market. Something to handle with extreme caution. Time to dive a bit deeper into Django’s migrations.

Note: migrations were introduced in Django 1.7

Models

Instead of working directly on the database, Django expects you to define the database structure in Python code. This powers many of Django’s features. For instance, the admin system gives administrators an easy way to maintain the data, and forms can be quickly created.

Once you’ve defined the models and configured the database connection (username, password, etc), Django creates the table structures for you.
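
For example, a minimal models.py, in the style of the Django tutorial’s books app:

    from django.db import models

    class Author(models.Model):
        first_name = models.CharField(max_length=30)
        last_name = models.CharField(max_length=40)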

Generate the initial database structure by running
>> python manage.py makemigrations

This generates the ‘migration’ path (see below) from an empty database to the initial database

Then apply this to the actual database
>> python manage.py migrate

If you’ve used Django 1.6 or earlier, note that syncdb has been deprecated and replaced by migrate

Migrations

But what happens when you change a table, e.g. when you delete or create a field? And how do you track the changes on your development machine, and make sure the same changes are applied to your test and live systems?

Every time you run
>> python manage.py makemigrations
Django compares the last known Model structure (by running through all past migrations) to the current Model structure and generates a new migrations file, one for each Django app with a changed Model structure. These files are stored in <app>/migrations and are numbered (starting with 0001).
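
A generated migration file looks roughly like this (a sketch for a single added field; the exact contents depend on your models and your Django version):

    # books/migrations/0002_author_email_address.py
    from django.db import migrations, models

    class Migration(migrations.Migration):

        dependencies = [
            ('books', '0001_initial'),
        ]

        operations = [
            migrations.AddField(
                model_name='author',
                name='email_address',
                field=models.EmailField(default='', max_length=254),
            ),
        ]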

Running
>> python manage.py migrate
applies any migrations which haven’t yet been applied. If the database is empty, then Django will start from 0001_initial.py. Otherwise, it checks the migrations table to see, for each app, which migration is the latest one which has been applied to the database, and then runs any later migrations

Migrations and version control

Because the migration files are ordinary Python (text) files, stored in an app’s sub directory, your version control system will manage them for you. When you update the test or live system with the latest code changes, the latest migrations will also be copied across.

Just make sure to run
>> python manage.py migrate
after updating the source code through your version control system

What went wrong

Normally Django’s migration system works beautifully, making sure the models and database structure stay in sync, between each other and across the development, test and live systems.

Due to some complicated reasons involving tight deadlines and budgets, holidays, and more, something went wrong with the migrations.

The client requested some extra database fields. I added the fields to the model and ran makemigrations.

At first nothing happened. Just to be sure I ran
>> python manage.py makemigrations <app name>
and Django told me it had created migration 0001_initial, instead of an incremental change. Checking the database, I found that no changes had been made.

Diving into git, I could see a series of past migration files. I could probably have rolled back the changes which deleted them, but decided to try something a bit simpler first.

And how I fixed it

I deleted all existing migrations from <app name>/migrations and commented out the new database fields in the model.

Running
>> python manage.py makemigrations
created a new 0001_initial migration file, which matched the original model and the current database structure

I removed the comments, to re-instate the new database fields, and re-ran
>> python manage.py makemigrations
to create a migrations file which adds the new fields to the database

Finally
>> python manage.py migrate
applied the changes to the database, adding the new fields

So far so good

Using phpMyAdmin I can now see the new database fields in the MySQL database

However, I have yet to apply the changes to the live system. I’ll do this once the changes are made and tested. But first I’ll check the revision history on the server and back up the database, just in case
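
For the backup itself, something along these lines (the user and database name are placeholders):

    mysqldump -u <user> -p <database name> > backup_$(date +%Y%m%d).sql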

Using PostgreSQL with Django

Here is how to set up Django to work with PostgreSQL

  1. Install necessary libraries, etc
    1. sudo apt-get install libpq-dev
    2. sudo apt-get install python-dev
    3. sudo apt-get install postgresql-contrib
  2. Create a new database and user
    1. sudo su - postgres
    2. createdb djangodev
    3. createuser -P djangodev
      1. Enter password, twice
    4. psql
      1. postgres=#  GRANT ALL PRIVILEGES ON DATABASE djangodev TO djangodev;
      2. \q
  3. Make sure you have your virtual environment activated
  4. pip install psycopg2
  5. Open up the project’s settings.py and change the DATABASES to:
    DATABASES = {
        'default': {
            'ENGINE': 'django.db.backends.postgresql_psycopg2',  # Or 'mysql', 'sqlite3', 'oracle'
            'NAME': 'djangodev',  # Or path to database file if using sqlite3
            'USER': 'djangodev',
            'PASSWORD': '904ojioe_=3D',
            'HOST': 'localhost',  # Empty for localhost through domain sockets, or '127.0.0.1' for localhost through TCP
            'PORT': '',  # Set to empty string for default
        }
    }
  6. python manage.py syncdb

Installing PostgreSQL

PostgreSQL seems to be the most popular DBMS (database management system) with Django developers, although MySQL is also used a lot

To install PostgreSQL, I used Linux Mint’s Software Manager (search for “postgresql”)

I also installed pgAdmin III, “a database design and management application for use with PostgreSQL” using the same method

I used to work on a system which used PostgreSQL, but that was a long time ago, so I had to ask the Internet to remind me how to get it going. Here is how to get it started

  1. Using the Software Manager, install PostgreSQL and pgadmin3
  2. Set the PostgreSQL password for the postgres user:
    1. sudo -u postgres psql
    2. postgres=# \password postgres
    3. (set the password)
    4. \q
  3. Start Programming -> pgAdmin III
    1. Click on “Server Groups”
    2. Click on the plug icon (top left hand corner)
    3. Name: Local DBMS (or whatever you want to call it)
    4. Host: localhost
    5. Port, Service, Maintenance DB: leave as is
    6. Username: postgres
    7. Password: the password you set in the step above
    8. Click on “Ok”

You should now be able to view your PostgreSQL server in pgAdmin, and use it to manage users, databases, etc
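
If you prefer the command line, a quick way to confirm the server is up is to list the databases with psql:

    sudo -u postgres psql -l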

Django – making changes to a model

Unless you get your model/database structure spot on first time, you’ll want to change it. Here is how to keep the database in line with your Django models

  1. Make a change to your model. For instance, in the books app, to the Author class, add an extra field:
    email_address = models.EmailField()
  2. Create the migrations file: python manage.py makemigrations
  3. Run the migrations file: python manage.py migrate

That’s all. Have a look in the admin panel to see the new field
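
For context, the Author model now ends up looking something like this (a sketch; your other fields will vary):

    from django.db import models

    class Author(models.Model):
        first_name = models.CharField(max_length=30)
        last_name = models.CharField(max_length=40)
        email_address = models.EmailField()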