Data can often be stored in a multiple of file formats:

- Excel spreadsheets (
`.xlsx`

); - Comma seperated files (
`.csv`

); - Json (
`.json`

); - ...

Simlarly you might want to store data in any of the above data formats. This is where the Pandas library can be useful:

"... easy-to-use data structures and data analysis tools for the Python programming language."

In this section we will see how to:

- Read in data files;
- Query those data files;
- Write to data files.

Consider the file goldbach.xlsx which contains rows of data confirming the Goldbach conjecture:

Every even integer greater than 2 can be expressed as the sum of two primes.

The data is made up of 3 columns: $N$, $a$ and $b$ and there is a row for every possible expression of $N = a + b$ for $a \leq b$ prime. Note that this data was made using Appendix A.

Let us start by importing Pandas and reading in the data file. To do this you will need to know the path to the file on your computer:

In [1]:

```
import pandas as pd
```

In [2]:

```
df = pd.read_excel("data/goldbach.xlsx")
```

This reads in the excel file as a Pandas data frame. Let us take a look at the first few rows of the data frame (the `head`

):

In [3]:

```
df.head()
```

Out[3]:

Let us look at the last few rows (the `tail`

):

In [4]:

```
df.tail()
```

Out[4]:

A quick way of getting a summary of the data we have is with `.describe()`

:

In [5]:

```
df.describe()
```

Out[5]:

Although in this particular data set that is not terribly useful, that's more useful in the case of statistical analysis.

Let us take a closer look at a specific number ($N=322$) and the ways it can be written as the sum of two primes.

In [6]:

```
df[df['N'] == 322]
```

Out[6]:

To briefly explain what is happening there: `df['N'] == 322`

is returning a series of `True`

and `False`

, identify the positions in our data where $N=322$. We directly pass that series to the data frame to get out those numbers `df[df['N'] == 322]`

.

Let us find out how many decompositions exist for each number in our data frame. We will do this using the `value_counts()`

statement:

In [7]:

```
s = df['N'].value_counts()
s
```

Out[7]:

The above is a series (ordered by count), let us rename the count variable and create a new dataframe:

In [8]:

```
df = pd.DataFrame(s.rename('counts'))
df.head()
```

Out[8]:

We have created a dataframe with an index given by `N`

. Let us create a new variable which is the normalised count:

In [9]:

```
df["normalised_counts"] = df['counts'] / df.index
```

In [10]:

```
df.head()
```

Out[10]:

We could now look a bit closer at this using `describe`

:

In [11]:

```
df.describe()
```

Out[11]:

We can also directly plot our data, but just as when we used Sympy we need to tell Jupyter to display the plots in the notebook:

In [12]:

```
%matplotlib inline
```

Here is a quick histogram of the counts:

In [13]:

```
df['counts'].hist();
```

The normalised counts:

In [14]:

```
df['normalised_counts'].hist();
```

Finally let us write this data to a 'comma seperated value' (`.csv`

) in case we wanted to look at it later:

In [15]:

```
df.to_csv("data/goldbach_counts.csv")
```

This section briefly showed us how to use pandas to read, manipulate and write data. We also took a brief look at plotting with pandas but this should only be used in an exploratary way. For higher quality plots matplotlib is recommended.