Python for Data Science

Joe McCarthy, Data Scientist, Indeed

In [1]:
from IPython.display import display, Image, HTML

1. Introduction

This short primer on Python is designed to provide a rapid "on-ramp" that enables computer programmers who are already familiar with concepts and constructs in other programming languages to learn enough about Python to make effective use of open-source and proprietary Python-based machine learning and data science tools.

The primer is motivated, in part, by the approach taken in the Natural Language Toolkit (NLTK) book, which provides a rapid on-ramp for using Python and the open-source NLTK library to develop programs using natural language processing techniques (many of which involve machine learning).

The Python Tutorial offers a more comprehensive primer, and opens with an excellent - if biased - overview of some of the general strengths of the Python programming language:

Python is an easy to learn, powerful programming language. It has efficient high-level data structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid application development in many areas on most platforms.

Hans Petter Langtangen, author of Python Scripting for Computational Science, emphasizes the utility of Python for many of the common tasks in all areas of computational science:

Very often programming is about shuffling data in and out of different tools, converting one data format to another, extracting numerical data from a text, and administering numerical experiments involving a large number of data files and directories. Such tasks are much faster to accomplish in a language like Python than in Fortran, C, C++, C#, or Java.

Foster Provost, co-author of Data Science for Business, explains in Python: A Practical Tool for Data Science why Python is such a useful programming language for practical data science:

The practice of data science involves many interrelated but different activities, including accessing data, manipulating data, computing statistics about data, plotting/graphing/visualizing data, building predictive and explanatory models from data, evaluating those models on yet more data, integrating models into production systems, etc. One option for the data scientist is to learn several different software packages that each specialize in one or two of these things, but don’t do them all well, plus learn a programming language to tie them together. (Or do a lot of manual work.)

An alternative is to use a general-purpose, high-level programming language that provides libraries to do all these things. Python is an excellent choice for this. It has a diverse range of open source libraries for just about everything the data scientist will do. It is available everywhere; high performance python interpreters exist for running your code on almost any operating system or architecture. Python and most of its libraries are both open source and free. Contrast this with common software packages that are available in a course via an academic license, yet are extremely expensive to license and use in industry.

The goal of this primer is to provide efficient and sufficient scaffolding for software engineers with no prior knowledge of Python to be able to effectively use Python-based tools for data science research and development, such as the open-source library scikit-learn. There is another, more comprehensive tutorial for scikit-learn, Python Scientific Lecture Notes, that includes coverage of a number of other useful Python open-source libraries used by scikit-learn (numpy, scipy and matplotlib) - all highly recommended ... and, to keep things simple, all beyond the scope of this primer.

Using an IPython Notebook as a delivery vehicle for this primer was motivated by Brian Granger's inspiring tutorial, The IPython Notebook: Get Close to Your Data with Python and JavaScript, one of the highlights from my Strata 2014 conference experience. You can run this notebook locally in a browser once you install ipython notebook.

One final note on external resources: the Python Style Guide (PEP-0008) offers helpful tips on how best to format Python code. Code like a Pythonista offers a number of additional tips on Python programming style and philosophy, several of which are incorporated into this primer.

We will focus entirely on using Python within the interpreter environment (as supported within an IPython Notebook). Python scripts - files containing definitions of functions and variables, and typically including code invoking some of those functions - can also be run from a command line. Using Python scripts from the command line may be the subject of a future primer.
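As a rough illustration of what such a script might look like, here is a minimal sketch. The file name word_count.py and the function count_words are hypothetical, chosen only for this example; they are not part of the primer's notebooks.

```python
# word_count.py (hypothetical file name, for illustration only)
from collections import Counter


def count_words(text):
    """Return a Counter mapping each lowercased word in text to its frequency."""
    return Counter(text.lower().split())


# The block below runs only when the file is executed as a script
# (e.g., `python word_count.py`), not when it is imported as a module.
if __name__ == "__main__":
    counts = count_words("the quick brown fox jumps over the lazy dog")
    print(counts.most_common(1))
```

The same function definition could be pasted into a notebook cell and called interactively, which is the mode of use this primer focuses on.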

To help motivate the data science-oriented Python programming examples provided in this primer, we will start off with a brief overview of basic concepts and terminology in data science.

Notebooks in this primer: