Category Archives: Data Analysis

Sample chart

Retrieve and display a data set

(First part of the “Practical Python in 10 lines or less” series)

Python is a simple but powerful language, and comes with a wealth of libraries. The chart above took just 10 lines of Python. All the hard work is done by the Pandas and MatPlotLib libraries.

The code

import pandas, matplotlib
data = pandas.read_csv('http://www.compassmentis.com/wp-content/uploads/2019/04/cereal.csv')
data = data.set_index('name')
data = data.calories.sort_values()[-10:]
ax = data.plot(kind='barh')
ax.set_xlabel('Calories per serving')
ax.set_ylabel('Cereal')
ax.set_title('Top 10 cereals by calories')
matplotlib.pyplot.subplots_adjust(left=0.45)
matplotlib.pyplot.show()

How it works

You will need Python and the Pandas and MatPlotLib libraries. See the installation instructions

Get started

1. import pandas, matplotlib
Grab the libraries we need to load, clean up and display the data.
The recommended approach (PEP 8) is to have two import statements on separate lines. To leave enough lines to make the chart look good, in this example I have combined them.

2. data = pandas.read_csv(‘http://www.compassmentis.com/wp-content/uploads/2019/04/cereal.csv’)
Load the csv data from a website. This gives us a pandas DataFrame, a two dimensional datastructure similar to a page in a spreadsheet.
I downloaded the data from https://www.kaggle.com/crawford/80-cereals/version/2, under Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0) [https://creativecommons.org/licenses/by-sa/3.0/]

3. data = data.set_index(‘name’)
Set the row names (index) to the ‘name’ column. When we plot the data this becomes the data labels.

4. data = data.calories.sort_values()[-10:]
Take the ‘calories’ column, sort it and limit to the last 10 values. This gives us the 10 cereals with the highest calories per serving

5. ax = data.plot(kind=’barh’)
Plot the data as a horizontal bar chartax.set_xlabel(‘Calories per serving’)

6. ax.set_ylabel(‘Cereal’)
7. ax.set_title(‘Top 10 cereals by calories’)
8. ax.set_xlabel(‘Area in millions square kilometers’)

Set the label for the x and y axes and the title

9. matplotlib.pyplot.subplots_adjust(left=0.45)
Set the left margin (from the left of the image to the left of the chart area) to 45% to give enough space for the cereal names.

10. matplotlib.pyplot.show()
Show the chart

Data Analysis with Python

Python is a very popular tool for data extraction, clean up, analysis and visualisation. I’ve recently done some work in this area, and would love to do some more. I particularly enjoy using my maths background and creating pretty, clear and helpful visualisations

  • Short client project, analysing sensor data. I took readings from two accelerometers and rotated the readings to get the relative movement between them. Using NumPy, Pandas and MatplotLib, I created a number of different charts, looking for a correlation between the equipment’s setting and the movement. Unfortunately the sensors aren’t sensitive enough to return usable information. Whilst not the outcome they were hoping for, the client told me “You’ve been really helpful and I’ve learned a lot”
  • At PyCon UK (Cardiff, September 2018) I attended 14 data analysis sessions. It was fascinating to see the range of tools and applications in Python data analytics. At a Bristol PyData MeetUp I summarised the sessions in a 5 minute lightening talk. This made me pay extra attention and keep useful notes during the conference
  • Short client project, researching best way to import a large data set, followed by implementation. The client regularly accesses large datasets using a folder hierarchy\to structure that data. They were looking to replace this with a professional database, i.e. PostgreSQL. I analysed their requirements, researched the different storage methods in PostgreSQL, reported my findings and created an import script.

Python for Data Analysis and Natural Language Processing

As I’m making my way through Natural Language Processing with Python and Data Science from Scratch: First Principles with Python, the first step is to set up the development environment.

My first attempt was to install numpy, python, nltk, matplotlib, IPython, etc, one at a time. However, I hit a few clashes between Python versions, so switched to Anaconda instead:

  1. Download Anaconda
  2. From the download folder, execute
    1. sh <name of downloaded file>
    2. Accept the defaults, but answer yes to preprend to PATH
  3. To check it works, start Python
    1. import numpy
    2. import pandas
    3. import matplotlib
  4. Download the nltk assets. Start Python and enter:
    1. import nltk
    2. nltk.download()
    3. use GUI to download “all”