Ska Analysis Tutorial

Overview

This document gives a tutorial introduction to analysis using the Ska analysis environment, including basic configuration and a number of examples.

The Ska analysis environment consists of a full-featured Python installation with interactive analysis, plotting and numeric capability along with the Ska engineering telemetry archive.

The telemetry archive consists of:

  • Tools to ingest and compress telemetry from the CXC Chandra archive products.

  • Compressed telemetry files in HDF5 format. Each MSID has three associated products:

    • Full time-resolution data: time, value, quality

    • 5-minute statistics: min, max, mean, sampled value, number of samples

    • Daily statistics: min, max, mean, sampled value, standard deviation, percentiles (1, 5, 16, 50, 84, 95, 99), number of samples.

Configure

To set up for using the Ska3 environment, visit the Ska3 configuration hints page.

Basic Functionality Test

Once you have done the configuration setup that was just described, open a new xterm window and get into the Ska3 environment by use the ska3 alias:

% ska3

To test the basic functionality of your setup, try the following at the linux shell prompt. In this tutorial all shell commands are shown with a “% ” to indicate the shell prompt. Your prompt may be different and you should not include the “% “ when copying the examples.:

% ipython --matplotlib

You should see something that looks like:

Python 3.8.3 (default, Jul  2 2020, 11:26:31)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.16.1 -- An enhanced Interactive Python. Type '?' for help.

In [1]:

Now read some data and make a simple plot by copying the following lines in ipython:

import cheta.fetch as fetch
import matplotlib.pyplot as plt
tephin = fetch.MSID('tephin', '2009:001', '2009:002')
plt.plot(tephin.times, tephin.vals)

You should get a figure like:

_images/basic_func.png

Tools overview

There are four key elements that are the basis for doing plotting and analysis with the engineering archive.

  • cheta.fetch: module to read and manipulate telemetry data

  • IPython: interactive python interpreter

  • matplotlib: python plotting package with an interface similar to Matlab

  • NumPy: python numerical package for fast vector and array math

cheta.fetch

The tools to access and manipulate telemetry with the Ska engineering archive are described in the Fetch Tutorial.

IPython

IPython is a command-line tool that provides a python prompt that is the basis for interactive analysis. At the core it provides an interpreter for python language commands but with the addition of external packages like matplotlib and numpy it becomes a full-featured data analysis environment. IPython is similar in many ways to the command-line interface in Matlab or IDL.

See the IPython documentation page.

For interactive data analysis IPython has a special --matplotlib command line option which enables interactive plotting from the command line:

% ipython --matplotlib

import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10, 0.2)
y = np.sin(x)
print(x)
plt.plot(x, y)

Keyboard navigation and history

One of the most useful features of IPython is the ability to edit and navigate you command line history. This lets you quickly re-do commands, perhaps with a slight variation based on seeing the last result. Try cut-n-pasting the above lines in an IPython session. This should bring up a plot of a sine wave.

Now hit up-arrow once and get back the plt.plot(x, y) line. Hit the left-arrow key (not backspace) once and type **2 so that the line reads plt.plot(x, y**2). Now you can hit Return to see the new curve overlayed within the same plot window. It is not necessary to forward-space to the end of the line, you can hit Return with the cursor anywhere in the line.

Now say you want to change the x values slightly. One option is to just hit the up-arrow 5 times, but a much faster way is to remember that the line started with x, so type x and then start hitting up-arrow. Only lines that start with x will be displayed and you are immediately at the x = np.arange(0, 10, 0.2) line. Now use the right-arrow and backspace to change 10 to 15 and hit Return. Of course y needs to be recalculated, so hit y then up-arrow, then Return. Finally pl up-arrow and Return. Nice and fast!

Bonus points: to speed up by another factor of several, use Ctrl-p (prev) instead of up-arrow, Ctrl-n (next) instead of down-arrow, Ctrl-f (forward) instead of right-arrow and Ctrl-b (back) instead of left-arrow. That way your fingers never leave the keyboard home keys. Ctrl-a gets you to the beginning of the line and Ctrl-e gets you to the end of the line. Triple bonus: on a Mac or Windows machine re-map the Caps-lock key to be Control so it’s right next to your left pinky. How often do you need Caps-lock?

Your command history is saved between sessions (assuming that you exit IPython gracefully) so that when you start a new IPython you can use up-arrow to re-do old commands. You can view your history within the current session by entering history.

Linux and shell commands

A select set of useful linux commands are available from the IPython prompt. These include ls (list directory), pwd (print working directory), cd (change directory), and rm (remove file). Any shell command can be executed by preceding it with an exclamation point “!”.

Tab completion

IPython has a very useful tab completion feature that can be used both to complete file names and to inspect Python objects. As a first example do:

ls /proj/sot/ska/<TAB>

This will list everything that matches /proj/sot/ska. You can continue this way searching through files or hit Return to complete the command.

Tab completion also works to inspect object attributes. Every variable or constant in python is actually a object with a type and associated attributes and methods. For instance try to create a list of 3 numbers:

a = [3, 1, 2, 4]
print(a)
a.<TAB>

This will show the available methods for a:

In [17]: a.<TAB>
a.append   a.extend   a.insert   a.remove   a.sort
a.count    a.index    a.pop      a.reverse

Here you see useful looking functions like append or sort which you can get help for and use:

help a.sort
a.sort()
print(a)

For a more concrete example, say you want to fetch some daily telemetry values but forgot exactly how to do the query and what are the available columns. Use help and TAB completion to remind yourself:

import cheta.fetch as fetch
help fetch
tephin = fetch.MSID('tephin', '2009:001', '2009:002', stat='daily')
tephin.<TAB>
 tephin.bads               tephin.midvals            tephin.samples
 tephin.colnames           tephin.mins               tephin.stat
 tephin.content            tephin.msid               tephin.state_codes
 tephin.datestart          tephin.MSID               tephin.state_intervals
 tephin.datestop           tephin.p01s               tephin.stds
 tephin.dt                 tephin.p05s               tephin.tdb
 tephin.filter_bad         tephin.p16s               tephin.times
 tephin.filter_bad_times   tephin.p50s               tephin.tstart
 tephin.indexes            tephin.p84s               tephin.tstop
 tephin.iplot              tephin.p95s               tephin.unit
 tephin.logical_intervals  tephin.p99s               tephin.vals
 tephin.maxes              tephin.plot               tephin.write_zip
 tephin.means              tephin.raw_vals

OK, now you remember you wanted times and maxes. But look, there are other tidbits there for free that look interesting. So go ahead and print a few:

print(tephin.colnames)
print(tephin.dt)
print(tephin.MSID)

To make it even easier you don’t usually have to use print. Try the following:

tephin.colnames
tephin.dt
tephin.MSID

Don’t be scared to try printing an array value (e.g. tephin.vals) even if it is a billion elements long. Numpy will only print an abbreviated version if it is too long. But beware that this applies to Numpy arrays which as we’ll see are a special version of generic python lists. If you print a billion-element python list you’ll be waiting for a while.

NumPy

Even though it was not explicit we have already been using NumPy arrays in the examples so far. NumPy is a Python library for working with multidimensional arrays. The main data type is an array. An array is a set of elements, all of the same type, indexed by a vector of nonnegative integers.

NumPy has an excellent quickstart tutorial available.

Here we just show the very basics. In these examples the python prompt is shown as “>>>” in order to distinguish the input from the outputs.

Arrays can be created in different ways:

>>> import numpy as np
>>> a = np.array( [ 10, 20, 30, 40 ] )   # create an array out of a list
>>> a
array([10, 20, 30, 40])
>>> b = np.arange( 4 )                   # create an array of 4 integers, from 0 to 3
>>> b
array([0, 1, 2, 3])
>>> c = np.linspace(-pi,pi,3)            # create an array of 3 evenly spaced samples from -pi to pi
>>> c
array([-3.14159265,  0.        ,  3.14159265])

New arrays can be obtained by operating with existing arrays:

>>> d = a+b**2                        # elementwise operations
>>> d
array([10, 21, 34, 49])

Arrays may have more than one dimension:

>>> x = np.ones( (3,4) )
>>> x
array([[1., 1., 1., 1.],
       [1., 1., 1., 1.],
       [1., 1., 1., 1.]])
>>> x.shape                            # a tuple with the dimensions
(3, 4)

and you can change the dimensions of existing arrays:

>>> y = np.arange(12)
>>> y
array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])
>>> y.shape = 3,4              # does not modify the total number of elements
>>> y
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
     [ 8,  9, 10, 11]])

It is possible to operate with arrays of different dimensions as long as they fit well (broadcasting):

>>> 3*a                                # multiply each element of a by 3
array([ 30,  60,  90, 120])
>>> a+y                                # sum a to each row of y
array([[10, 21, 32, 43],
       [14, 25, 36, 47],
       [18, 29, 40, 51]])

Similar to Python lists, arrays can be indexed, sliced and iterated over:

>>> a[2:4] = -7,-3                     # modify last two elements of a
>>> for i in a:                        # iterate over a
...     print(i)
...
10
20
-7
-3

When indexing more than one dimension, indices are separated by commas:

>>> x[1,2] = 20
>>> x[1,:]                             # x's second row
array([ 1,  1, 20,  1])
>>> x[0] = a                           # change first row of x
>>> x
array([[10, 20, -7, -3],
       [ 1,  1, 20,  1],
       [ 1,  1,  1,  1]])

Matplotlib

The pyplot mode of matplotlib is a collection of command style functions that make matplotlib work like matlab. Each pyplot function makes some change to a figure: eg, create a figure, create a plotting area in a figure, plot some lines in a plotting area, decorate the plot with labels, etc…. Pyplot is stateful, in that it keeps track of the current figure and plotting area, and the plotting functions are directed to the current axes. On the matplotlib FAQ page there is a very good discussion on Matplotlib, pylab, and pyplot: how are they related?.

This tutorial has been copied and adapted from the matplotlib pyplot tutorial.

To see matplotlib in action and make a simple plot do:

import matplotlib.pyplot as plt
plt.clf()
plt.plot([1,2,3])
plt.ylabel('some numbers')
_images/pyplot_simple.png

You may be wondering why the x-axis ranges from 0-2 and the y-axis from 1-3. If you provide a single list or array to the plot() command, matplotlib assumes it is a sequence of y values, and automatically generates the x values for you. Since python ranges start with 0, the default x vector has the same length as y but starts with 0. Hence the x data are [0,1,2].

plot() is a versatile command, and will take an arbitrary number of arguments. For example, to plot x versus y, you can issue the command:

plt.clf()
plt.plot([1,2,3,4], [1,4,9,16])

For every x, y pair of arguments, there is a optional third argument which is the format string that indicates the color and line type of the plot. The letters and symbols of the format string are from matlab, and you concatenate a color string with a line style string. The default format string is ‘b-’, which is a solid blue line. For example, to plot the above with red circles, you would issue:

plt.clf()
plt.plot([1,2,3,4], [1,4,9,16], 'ro')
plt.axis([0, 6, 0, 20])
_images/pyplot_formatstr.png

See the plot() documentation for a complete list of line styles and format strings. The axis() command in the example above takes a list of [xmin, xmax, ymin, ymax] and specifies the viewport of the axes.

If matplotlib were limited to working with lists, it would be fairly useless for numeric processing. Generally, you will use numpy arrays. In fact, all sequences are converted to numpy arrays internally. The example below illustrates a plotting several lines with different format styles in one command using arrays.

# evenly sampled time at 200ms intervals
t = np.arange(0., 5., 0.2)

# red dashes, blue squares and green triangles
plt.clf()
plt.plot(t, t, 'r--', t, t**2, 'bs', t, t**3, 'g^')
_images/pyplot_three.png

Working with multiple figures and axes

Both Matlab and Pylab have the concept of the current figure and the current axes. All plotting commands apply to the current axes. The function gca() returns the current axes and gcf() returns the current figure. Normally, you don’t have to worry about this, because it is all taken care of behind the scenes. Below is a script to create two subplots.

def f(t):
    return np.exp(-t) * np.cos(2*np.pi*t)

t1 = np.arange(0.0, 5.0, 0.1)
t2 = np.arange(0.0, 5.0, 0.02)

plt.figure(1)
plt.clf()
plt.subplot(2, 1, 1)  # (nrows, ncols, fignum)
plt.plot(t1, f(t1), 'bo', t2, f(t2), 'k')

plt.subplot(2, 1, 2)
plt.plot(t2, cos(2*pi*t2), 'r--')
_images/pyplot_two_subplots.png

The figure() command here is optional because figure(1) will be created by default, just as a subplot(1, 1, 1) will be created by default if you don’t manually specify an axes. The subplot() command specifies numrows, numcols, fignum where fignum ranges from 1 to numrows*numcols. The commas in the subplot command are optional if numrows*numcols<10. You can create an arbitrary number of subplots and axes. If you want to place an axes manually, ie, not on a rectangular grid, use the axes() command, which allows you to specify the location as axes([left, bottom, width, height]) where all values are in fractional (0 to 1) coordinates.

You can create multiple figures by using multiple figure() calls with an increasing figure number. Of course, each figure can contain as many axes and subplots as your heart desires:

plt.figure(1)                # the first figure
plt.clf()
plt.subplot(2, 1, 1)         # the first subplot in the first figure
plt.plot([1,2,3])
plt.subplot(2, 1, 2)         # the second subplot in the first figure
plt.plot([4,5,6])


plt.figure(2)                # a second figure
plt.clf()
plt.plot([4,5,6])            # creates a subplot(111) by default

plt.figure(1)                # figure 1 current; subplot(2, 1, 2) still current
plt.subplot(2, 1, 1)         # make subplot(2, 1, 1) in figure1 current
plt.title('Easy as 1,2,3')   # subplot 2, 1, 1 title

You can clear the current figure with clf() and the current axes with cla(). If you find this statefulness, annoying, don’t despair, this is just a thin stateful wrapper around an object oriented API, which you can use instead (see artist-tutorial)

Working with text

The text() command can be used to add text in an arbitrary location, and the xlabel(), ylabel() and title() are used to add text in the indicated locations (see text-intro for a more detailed example):

mu, sigma = 100, 15
x = mu + sigma * np.random.normal(size=10000)

# the histogram of the data
plt.clf()
n, bins, patches = plt.hist(x, 50, normed=1, facecolor='g', alpha=0.75)

plt.xlabel('Smarts')
plt.ylabel('Probability')
plt.title('Histogram of IQ')
plt.text(60, .025, r'$\mu=100,\ \sigma=15$')
plt.axis([40, 160, 0, 0.03])
plt.grid(True)
_images/pyplot_text.png

All of the text() commands return an matplotlib.text.Text instance. Just as with with lines above, you can customize the properties by passing keyword arguments into the text functions or using setp():

t = plt.xlabel('my data', fontsize=14, color='red')

These properties are covered in more detail in text-properties <http://matplotlib.org/users/text_props.html#text-properties>.

Using mathematical expressions in text

matplotlib accepts TeX equation expressions in any text expression. For example to write the expression \sigma_i=15 in the title, you can write a TeX expression surrounded by dollar signs:

plt.title(r'$\sigma_i=15$')

The r preceeding the title string is important – it signifies that the string is a raw string and not to treate backslashes and python escapes. matplotlib has a built-in TeX expression parser and layout engine, and ships its own math fonts – for details see mathtext-tutorial. Thus you can use mathematical text across platforms without requiring a TeX installation. For those who have LaTeX and dvipng installed, you can also use LaTeX to format your text and incorporate the output directly into your display figures or saved postscript – see usetex-tutorial.

Annotating text

The uses of the basic text() command above place text at an arbitrary position on the Axes. A common use case of text is to annotate some feature of the plot, and the annotate() method provides helper functionality to make annotations easy. In an annotation, there are two points to consider: the location being annotated represented by the argument xy and the location of the text xytext. Both of these arguments are (x,y) tuples.

_images/pyplot_annotate.png

In this basic example, both the xy (arrow tip) and xytext locations (text location) are in data coordinates.