Long exercise: Data handling and presentation
=============================================

.. highlight:: console

.. role:: bash(code)
   :language: bash
   :class: highlight

This page is intended as an exercise in data handling, processing and
presentation. Of course, as with everything else in this course, we'll only be
touching the surface! Before we begin, you should always keep in mind that the
best tool for data generation or handling may not be the best for presentation:
a good example of this is Mathematica, which is wonderful for maths work, but
leaves something to be desired when it comes to producing pretty plots. Be
open-minded! We also won't be dealing at all with the aesthetics of efficient
data presentation: that's left to your taste, intuition and experience (and to
Edward Tufte's superb book, *The Visual Display of Quantitative Information*).

The task is to combine several software packages to produce parameter scans for
sparticle masses in BSM models. We will use Softsusy to produce mass spectra,
and visualise these with pySLHA and self-written contour plots. Before we
begin, let's have a quick look at Makefiles.

Makefiles and :command:`make`
-----------------------------

Recompiling large code projects after each change can take a long time, but
often it is not necessary to rebuild all of it. Changes in :file:`.cc` files,
for example, only require recompilation of the corresponding :file:`.o` file
and re-linking. Changes in header files affect all code where they are
included. Makefiles provide a way of tracking these dependencies. Each entry
consists of a target, followed by the files it depends on. On the following
lines, you can optionally write *rules* (which must be preceded by :kbd:`Tab`
characters, not spaces! [#fn1]_), specifying how to make the target from the
dependencies. A rule can be any shell command, and you can have several rules
attached to a target, which will be executed in sequence.

Prepare two text files :file:`bar` and :file:`baz` and write this example
:file:`Makefile`, where :kbd:`[--tab--]` stands for a literal Tab character::

    foo : bar baz
    [--tab--]cat bar baz > foo

Running the command :command:`make` in the directory with the Makefile will now
run :bash:`cat bar baz > foo`, producing a :file:`foo` file. Re-running
:command:`make` will do nothing, since none of the dependencies of :file:`foo`
have changed. Edit :file:`bar` and re-run :command:`make`!

:command:`make` has several built-in rules for common situations. It already
knows how to build object files from :file:`.cc` files, so it's not necessary
to specify a rule here::

    foo.o : foo.cc foo.h bar.h
    bar.o : bar.cc bar.h

:command:`make` is most often used for software compilation, but it can be very
helpful in any situation where a sequence of interdependent actions has to be
performed in a certain order whenever a file changes.
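To make this a little more concrete, here is a slightly larger hand-written
Makefile sketch for a small C++ program. The file and program names are
invented for the example (they are not part of the exercise); as before,
:kbd:`[--tab--]` stands for a literal Tab character::

    CXX      = g++
    CXXFLAGS = -O2 -Wall

    # Link the two objects into an executable; the objects themselves are
    # built by make's built-in .cc -> .o rule, which uses $(CXX) and $(CXXFLAGS).
    myprog : foo.o bar.o
    [--tab--]$(CXX) $(CXXFLAGS) -o myprog foo.o bar.o

    foo.o : foo.cc foo.h bar.h
    bar.o : bar.cc bar.h

    # A "phony" target: not a real file, just a convenient name for a clean-up rule
    .PHONY : clean
    clean :
    [--tab--]rm -f myprog foo.o bar.o

Running :command:`make` rebuilds only what is out of date with respect to its
dependencies, and :bash:`make clean` removes the generated files again.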
.. rubric:: Footnotes

.. [#fn1] One for the top-ten list of bad design choices in software.

Let's move on to the actual project.

Installation of a source code package
-------------------------------------

Softsusy's source code is available from http://softsusy.hepforge.org/. Like
most Fortran, C or C++ codes, it is distributed as a tarball with a
standardised installation procedure using the *GNU autotools*. After unpacking
the tarball and entering the :file:`softsusy-nn.nn.nn` directory, the first
step is to run::

    $ ./configure --prefix=$HOME/local F77=gfortran

The :option:`--prefix` argument determines where the compiled code will be
installed. The default is :file:`/usr/local/`, which is usually not writeable
by regular users, so we need to supply our own location. The :envvar:`F77`
variable can be used to override the default Fortran compiler choice; similar
variables are :envvar:`CC` for the C compiler and :envvar:`CXX` for the C++
compiler. The main output of :command:`configure` is a set of :dfn:`Makefiles`,
which build the package with all the specified settings. The Makefiles produced
automatically by :command:`configure` are not intended to be human-readable,
but they work in exactly the same way as our simple example above.

Practically all ready-made HEP packages come as tarballs with included
Makefiles. As we've seen, some require an extra configuration step to prepare
the Makefiles, e.g. to provide an installation location or other setup
information. The usual sequence of commands is::

    $ tar xzvf somepackage-1.2.3.tar.gz
    $ cd somepackage-1.2.3
    $ ./configure --prefix=$HOME/local/
    $ make
    $ make install

Running Softsusy
----------------

Let's get back to our exercise. After installing Softsusy, we're ready for the
first test run. Softsusy takes an input parameter card with
supersymmetry-breaking parameters in the *SUSY Les Houches Accord* format, and
produces the MSSM mass spectrum as output. An example input card for the CMSSM
point *sps1a* is available at http://softsusy.hepforge.org/code/lesHouchesInput.
Save it as :file:`lesHouchesInput.slha`::

    # Example input in Les Houches format, and suitable for input to
    # SOFTSUSY (v1.8 or higher): SPS1a input
    Block MODSEL                 # Select model
        1    1                   # sugra
    Block SMINPUTS               # Standard Model inputs
        1    1.279340000e+02     # alpha^(-1) SM MSbar(MZ)
        2    1.166370000e-05     # G_Fermi
        3    1.172000000e-01     # alpha_s(MZ) SM MSbar
        4    9.118760000e+01     # MZ(pole)
        5    4.250000000e+00     # mb(mb) SM MSbar
        6    1.743000000e+02     # mtop(pole)
        7    1.777000000e+00     # mtau(pole)
    Block MINPAR                 # Input parameters
        1    1.000000000e+02     # m0
        2    2.500000000e+02     # m12
        3    1.000000000e+01     # tanb
        4    1.000000000e+00     # sign(mu)
        5   -1.000000000e+02     # A0
    Block SOFTSUSY               # Optional SOFTSUSY-specific parameters
        1    1.000000000e-03     # Numerical precision: suggested range 10^(-3...-6)
        2    0.000000000e+00     # Quark mixing parameter: see manual
        5    1.000000000e+00     # Include 2-loop scalar mass squared/trilinear RGEs

The interesting block is :kbd:`MINPAR`, which contains the five free high-scale
parameters of the CMSSM. To get the output spectrum, run::

    $ export PATH=$PATH:$HOME/local/bin   # (why?)
    $ softpoint.x leshouches < lesHouchesInput.slha > sps1a.slha

For us, the interesting part of the output is the list of sparticle masses in
:kbd:`BLOCK MASS`::

    # PDG code     mass             particle
            24     8.03907025e+01   # MW
            25     1.26926792e+02   # h0
            35     1.67333827e+03   # H0
            36     1.67326915e+03   # A0
            37     1.67527975e+03   # H+
       1000001     1.14434683e+03   # ~d_L
       1000002     1.10479594e+03   # ~u_L
    [...]

Try to extract only the lines with the masses from the SLHA file, using some of
the command-line tools we used previously. A summary can be found in the
section :ref:`simple-data-handling`.
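If you get stuck (and only then!), here is one possible sketch. It assumes the
Softsusy output file is called :file:`sps1a.slha` as above, that the block
header is written as ``Block MASS``, and that, as is usual in SLHA files, data
lines are indented while comment-only lines start with ``#``::

    $ sed -n '/^Block MASS/,/^Block/p' sps1a.slha | grep -v '^Block' | grep -v '^#'

The :command:`sed` range cuts out everything from the :kbd:`MASS` block header
up to the next block header; the two :command:`grep` calls then drop the header
and comment lines, leaving only the PDG code, the mass and the particle-name
comment. Piping the result through e.g. ``awk '{print $1, $2}'`` reduces it to
two clean numeric columns.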
Installation of a Python package
--------------------------------

We can display SUSY masses conveniently with pySLHA, from
http://www.insectnation.org/projects/pyslha. Like many Python modules, it can
be installed very easily from the Python Package Index,
http://pypi.python.org/, with the :command:`easy_install` tool::

    $ easy_install --prefix=$HOME/local pyslha

Again, we need :option:`--prefix` to choose an installation location. Most
likely, :command:`easy_install` will fail the first time you try it, with a
long error message about an unknown directory ending in
:file:`python2.x/site-packages`. This is the local directory where Python will
look for extra packages that you have installed. Since this is your first
additional package, you'll have to create the directory first, and tell Python
to look there::

    $ mkdir -p $HOME/local/lib/python2.6/site-packages
    $ export PYTHONPATH=$HOME/local/lib/python2.6/site-packages

Re-run the :command:`easy_install` command, which should go through now. In
:file:`$HOME/local/bin`, you will find a program called :command:`slhaplot`.

pySLHA test
-----------

We're now ready to plot the *sps1a* SUSY spectrum.

.. To work around a bug in vega's LaTeX installation, export this additional
   LaTeX setup directory first:
   export TEXMFHOME=/home/hudson/staff/dph1dg1/texmf

The command is now simply::

    $ slhaplot sps1a.slha

Have a look at the PDF file it produces.

Repeated runs
-------------

Now, let's go to the first part of the main task. Running one SLHA point is
easy enough, but what about 500 points? Try to create a set of scripts in Bash
(or Python), and maybe Makefiles, that make the task easy for you. Given a
directory full of input SLHA files, run them all through Softsusy and keep
track of the outputs in a systematic way.
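If you would like a starting point, here is one possible Python sketch for the
"run everything in a directory" part. The directory names :file:`input/` and
:file:`output/` are assumptions made purely for the example (choose your own
layout), and :command:`softpoint.x` is assumed to be on your :envvar:`PATH` as
set up above::

    import glob
    import os
    import subprocess

    indir, outdir = "input", "output"    # assumed layout: adapt to your own scheme
    if not os.path.isdir(outdir):
        os.makedirs(outdir)

    for incard in sorted(glob.glob(os.path.join(indir, "*.slha"))):
        outcard = os.path.join(outdir, os.path.basename(incard))
        # Only re-run a point if its input card is newer than the existing output
        # (a poor man's version of what make does with its dependencies).
        if os.path.exists(outcard) and os.path.getmtime(outcard) > os.path.getmtime(incard):
            continue
        fin = open(incard)
        fout = open(outcard, "w")
        # Equivalent to the shell command:  softpoint.x leshouches < incard > outcard
        subprocess.call(["softpoint.x", "leshouches"], stdin=fin, stdout=fout)
        fin.close()
        fout.close()

Generating the input cards themselves can be done in the same spirit: read the
*sps1a* card once as a template and write out one copy per scan point with the
:kbd:`MINPAR` values replaced. Equally well, the bookkeeping could be expressed
as a Makefile with one target per output file.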
Mass contour plots
------------------

You can already run Softsusy for many different parameter points. Now make a
scan over two of the parameters (:math:`m_0` and :math:`m_{1/2}`), and from the
results generate a contour plot for different sparticle masses, like the
gluino, neutralino, stau or stop. Then generate a single PDF document with
these contour plots. Ideally, keep your scripts flexible enough to make it easy
to plot other masses, scan over different parameters, etc.

There are many different options for creating contour plots. You could
certainly use :command:`gnuplot`, which we introduced earlier. NCL (the NCAR
Command Language) is an example of an interpreted language designed
specifically for scientific data processing and visualisation. For today, we
suggest that you use Python's *matplotlib*.
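In case you want a starting point for the plotting step, here is one possible
matplotlib sketch. It assumes, purely for illustration, that you have already
condensed your scan into a whitespace-separated text file :file:`gluino.dat`
with three columns (:math:`m_0`, :math:`m_{1/2}`, gluino mass), covering a
complete rectangular grid written out with :math:`m_0` as the outer loop::

    import numpy as np
    import pylab as pl

    # Assumed file layout: three columns m0, m12, mass; one line per scan point
    m0, m12, mass = np.loadtxt("gluino.dat", unpack=True)

    # Rebuild the regular scan grid from the flat columns
    # (assumes the scan looped over m0 outermost and m12 innermost)
    m0vals, m12vals = np.unique(m0), np.unique(m12)
    grid = mass.reshape(len(m0vals), len(m12vals)).T

    cs = pl.contour(m0vals, m12vals, grid, 10)
    pl.clabel(cs, inline=True, fmt="%.0f")   # label each contour with the mass value
    pl.xlabel(r"$m_0$ [GeV]")
    pl.ylabel(r"$m_{1/2}$ [GeV]")
    pl.title("Gluino mass [GeV]")
    pl.savefig("gluino_contours.pdf")

The same few lines work for any other sparticle once the corresponding column
has been extracted, which is a good reason to keep the extraction and the
plotting in separate, reusable scripts.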
For merging the final plots into a single PDF document you can also choose from
a variety of options. :command:`pdftk` is a very convenient tool for
manipulating PDF documents [#fn2]_, and is installed on most modern Linux
systems. In principle you could use :command:`convert` from the ImageMagick
package, but it will probably turn your vector graphics into bitmaps and
thereby inflate the file size by orders of magnitude (see the section on
:ref:`image-manipulation`). Finally, you could generate a :math:`\LaTeX`
document and include the plots with :kbd:`\\includegraphics`.

.. rubric:: Footnotes

.. [#fn2] From their website: "If PDF is electronic paper, then pdftk is an
   electronic staple-remover, hole-punch, binder, secret-decoder-ring, and
   X-Ray-glasses."

Summary of useful tools
=======================

.. _simple-data-handling:

Simple data handling
--------------------

Using shell tools to extract and store data as plain text can be a useful
approach, especially if the numbers you're looking for only exist in the arcane
output of some pheno code you've downloaded! For the record, the tools of most
use are :command:`grep`, :command:`sed`, :command:`awk`, :command:`cut`,
:command:`paste`, :command:`sort` and :command:`column`, as well as the other
standard shell idioms and tools like *regular expressions*. Using these, you
can extract numbers from almost arbitrarily structured text and place them in
some nice, simple format like columned plain text.

Python matplotlib
-----------------

.. highlight:: python

A simple interface for producing plots is the Python matplotlib package.
Matplotlib is made all the more powerful by its integration with the NumPy and
SciPy packages, a powerful set of numerical/scientific data-handling objects
written (for speed) in Fortran/C/C++ and accessed through Python. NumPy is a
*very* efficient numerical processing framework.

The best way to learn your way around NumPy and matplotlib (most easily
accessed in Python via the module called :kbd:`pylab`) is to go to the
corresponding websites (see the References) and follow some of the tutorials
and "cookbook" entries. However, here are a couple of examples that should give
you some idea of the ease of using these modules and the quality of their
output. Specifically, we're going to generate a colour map and contour plot of
a squashed, rotated 2D Gaussian. We'll start by importing the modules and
making a 2D grid of points from :math:`-1` to :math:`+1` with point spacings of
:math:`0.01` in each direction::

    from scipy import stats
    import numpy as np
    import pylab as pl

    x, y = np.ogrid[-1:1:0.01, -1:1:0.01]

Now we'll fetch a Gaussian distribution and sample from it in two dimensions
(note how the functions implicitly handle vectors without needing to iterate
over the elements), then use the matplotlib :kbd:`imshow` function to make a
pretty colour map of the 2D Gaussian::

    gauss = stats.norm(loc=0, scale=0.5)
    z = gauss.pdf(np.sqrt(x**2 + 4*y**2))
    pl.imshow(z, origin="lower", extent=[-1,1,-1,1])
    pl.show()

Now we'll rotate the axes. There is certainly a neater way than this (I've just
implemented the rotation matrix by hand), but it is simple::

    theta = (np.pi / 180) * 30
    xprime = np.cos(theta) * x - np.sin(theta) * y
    yprime = np.sin(theta) * x + np.cos(theta) * y
    zprime = gauss.pdf(np.sqrt(xprime**2 + 4*yprime**2))
    pl.imshow(zprime, origin="lower", extent=[-1,1,-1,1])
    pl.hold(True)
    pl.contour(zprime, origin="lower", extent=[-1,1,-1,1])

where the :samp:`pl.hold(True)` was used to make sure that the colour plot
didn't get erased when we drew the contours. Now we'll save the result in
various formats::

    pl.savefig("rotgauss")      # saves as PNG by default
    pl.savefig("rotgauss.png")  # as above
    pl.savefig("rotgauss.eps")  # save as encapsulated PostScript
    pl.savefig("rotgauss.svg")  # save as Scalable Vector Graphics

If you are making a matplotlib plot for publication, the most reliable colour
scheme is still the greyscale one. If you go for colour, keep in mind that only
two-tone palettes have a reliable psychological ordering. You can change the
palette with functions like::

    pl.gray()

An example output is shown below.

.. figure:: rotgauss.png

   The rotated 2D Gaussian as a colour plot with contours, generated with SciPy
   and matplotlib.

.. _image-manipulation:

A bit of image manipulation
---------------------------

We haven't got time to do a proper lecture on image formats and the like, which
are always a bit of a mystery to first-time users. Roughly speaking, there are
two distinct classes of image file: :dfn:`bitmap` files, which describe the
colour of individual pixels in the image, and :dfn:`vector` formats, which
store parametric descriptions of drawing objects. You would probably guess that
making bitmaps from vector formats is easy and that the reverse is rather
harder. For what it's worth, the :program:`Inkscape` program includes a very
impressive bitmap-to-vector input filter.

The big operational difference between vectors and bitmaps is that vector
formats can be arbitrarily scaled without losing smoothness, whereas bitmaps
are intrinsically pixelated. For graphs and diagrams in papers and theses,
vector graphics, usually in encapsulated PostScript or PDF format, are the best
choice. Only use bitmaps (like JPEG, PNG, GIF) if you have no option, such as
with a photograph or a bitmap image that you got from e.g. the Web. For data
graphics, you almost certainly want to use PNG rather than JPEG: the latter is
designed for photos and will produce nasty compression artefacts, such as
blotchy areas and fuzziness around lines, on plots with areas of uniform block
colour.

.. highlight:: console

Let's do a bit of format conversion on the command line (a short worked example
combining these steps follows the list):

- To convert a "normal" PostScript file, which assumes that it occupies a full
  page, to an :dfn:`encapsulated PostScript` (EPS) file that you can use in,
  say, a :math:`\LaTeX` document, use the :command:`ps2epsi` program. It writes
  files with a :file:`.epsi` extension; a line like::

      for i in *.epsi; do mv $i ${i%i}; done

  renames them to the standard :file:`.eps` file extension.

- From EPS, if you're writing a document (such as :math:`\LaTeX` Beamer slides
  for a conference talk) using :command:`pdflatex`, you'll want to have your
  vector graphics in PDF format. You can convert EPS files to PDF using the
  :command:`epstopdf` tool.

- A useful set of bitmap manipulation utilities is the *ImageMagick* set of
  command-line programs. They will also do some limited processing of vector
  formats like EPS and SVG. The programs include :command:`identify` for
  getting information about an image file, :command:`convert` for changing
  between bitmap formats and :command:`mogrify` for generally messing with an
  image (cropping, resizing, blurring, sharpening, etc.). They can be handy for
  many things, including sampling (E)PS and PDF files into a bitmap format,
  e.g. for putting on the Web.
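Putting the steps from the list above together, a typical conversion chain for
a single plot might look like this (:file:`plot.ps` is just a placeholder
name)::

    $ ps2epsi plot.ps plot.epsi                 # full-page PS -> EPS with a tight bounding box
    $ mv plot.epsi plot.eps                     # give it the standard extension
    $ epstopdf plot.eps                         # EPS -> plot.pdf, ready for pdflatex
    $ convert -density 150 plot.pdf plot.png    # sample the PDF into a bitmap for the Web

Note that only the last step throws away the vector information: keep the PDF
(or EPS) as the master copy and regenerate bitmaps from it as needed.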
Other data analysis packages
----------------------------

There are of course many other data analysis packages. Here are a few choice
ones you might want to look at:

R (http://www.r-project.org/)
    A very well-established statistics and data analysis package used, for
    example, in a lot of biometrics research (it's a Free implementation of a
    commercial stats language called S). It has a clever list/tuple-based
    interaction language and masses of statistical functions. I don't think the
    plot output is of publication quality, but as ever there's no harm in
    exporting to plain text and using a better presentation package. Even
    better, use the :kbd:`rpy` Python module and access all of R's
    functionality from within SciPy!

Octave (https://www.gnu.org/software/octave/)
    Matlab is widely used among engineers and other researchers but never seems
    to have taken off in HEP. It provides a lot of powerful array-processing
    functionality and good plotting. A Free implementation, called Octave, is
    available, but it uses gnuplot for its plotting, which hence looks a bit
    rubbish. Matplotlib implements much of Matlab's plotting functionality, so
    maybe Octave plus SciPy/matplotlib is a good combination.

Sage (http://www.sagemath.org/)
    A fairly new Python-based integration of lots of scientific tools. To quote
    their website: "Sage is a free open-source mathematics software system
    [...] It combines the power of many existing open-source packages into a
    common Python-based interface. Mission: *Creating a viable free open source
    alternative to Magma, Maple, Mathematica and Matlab*."

Drawings
--------

We haven't the time to talk about graphics other than plotting, though we
managed to say a bit about vector and bitmap graphics formats earlier. Here are
a few suggestions for how to produce other graphics for your talks, papers and
theses:

PostScript
    PS is actually a human-readable, Turing-complete language in its own right.
    It is stack-based, which is probably a rather unfamiliar style of
    programming to most people, but it's strangely satisfying. You can do
    surprisingly complicated things in it by hand, or generate *very* complex
    things programmatically if you're ingenious enough (the program will
    execute on the printer's CPU!). Bill Casselman's *Mathematical
    Illustrations* book and the various PostScript manuals (all available
    online) are excellent references for this.

Inkscape
    A relatively recent development in the Linux world, and a truly excellent
    interactive vector drawing package with a very intuitive interface and bags
    of functionality. Marvellous.

.. JaxoDraw and Feyndiagram
   For producing Feynman diagrams there are several possible approaches: you
   can take the graphics directly from FeynArts, write them directly in your
   TeX source using the feynmf or axodraw LaTeX packages... or you can use one
   of these programs. JaxoDraw is a nice Java GUI on top of axodraw, and
   Feyndiagram (http://www.feyndiagram.com/) is a C++ class-based way to write
   your diagrams.

Visual Python
    A great interactive interface for 3D data, including zoom and rotation
    functionality.

Python OpenGL/Cairo
    OpenGL is a 3D graphics standard and there are now some nice Python
    interfaces to it. Cairo is becoming a Free Software standard for
    lower-level 2D graphics, and *pycairo* provides the Python interface to it.

Graphviz
    A language for the graphical representation of node-edge networks. Or, more
    simply, if you tell it about various nodes connected by edges, it'll
    produce an optimally (for some definition of optimal!) laid-out
    node-and-line diagram. Very cool.

xfig/Jfig
    xfig is an old-but-good vector drawing package, largely superseded by the
    likes of Inkscape these days. It has a rather odd interface, which is
    actually a lot cleverer than it looks and can be used to produce very nice
    output. Jfig is a more recent implementation in Java, which can also be
    used as a Java graphics library.

Gimp
    The GNU Image Manipulation Program is an excellent Free "Photoshop-alike".
    It only manipulates 2D bitmap graphics, but is very good at it.

ImageMagick
    We've already mentioned this swiss-army-knife of bitmap manipulation.

pdftk
    Another swiss-army-knife: anything you'd ever want to do to PDF documents.
    Arbitrarily reshuffle pages, select from different sources to produce one
    output, rotate, crop, ...

Have fun experimenting!