Recap: SciPy 2015 - Day 3

Jake VanderPlas, a core contributor to projects ranging from NumPy and SciPy to scikit-learn and mpld3, gave the keynote speech on the third day of the conference, talking about the state of scientific computing in Python. As a side note, Jake is a senior scientist and Director of Research at the eScience Institute at the University of Washington. I hear that he is currently involved in developing the data analysis pipeline for the LSST as well.

According to him, the Python ecosystem consists of several layers. At the core there is Python itself, with IPython, NumPy, Cython, and Jupyter forming the core of scientific computing. On the next layer, we have modules like SciPy, matplotlib, pandas, and SymPy that are foundational to more domain-specific modules, among which are statsmodels, scikit-image, scikit-learn, PyTables, NetworkX, PyMC, and many, many more.

Jake was excited about all the promising projects in the Python ecosystem right now. On the performance front, tools like Numba, Weave, numexpr, and Theano have given Python performance boosts not seen before.

Jake noted that most important packages have moved to Python 3. Because of that, the time to move to Python 3 is now, though just two years ago he would not have said so.

On the visualization front, matplotlib is evolving into a more modern package; he specifically mentioned the style sheets that will be available with v1.5. He praised Seaborn's beautiful default styles and the way it builds on matplotlib and pandas, and he highlighted the progress of Bokeh for interactive graphs.

He mentioned xray and its labeled, multi-dimensional arrays. He likes Dask's task graphs, which enable multi-processing of things like pandas operations. He also mentioned Numba, a JIT compiler that compiles Python through LLVM to run at near C/Fortran speed, often a 20x speed increase.
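The kind of speedup Numba targets can be sketched with a toy example of my own (not from the talk): a tight numeric loop decorated with `@jit`. The fallback shim means the sketch still runs as plain Python where Numba is not installed.

```python
try:
    from numba import jit  # JIT-compiles the decorated function via LLVM
except ImportError:
    # Fallback so the sketch runs without Numba, just without the speedup.
    def jit(func=None, **kwargs):
        return func if func is not None else (lambda f: f)

@jit
def half_sum(n):
    # A tight numeric loop: exactly the kind of workload Numba accelerates.
    total = 0.0
    for i in range(n):
        total += i * 0.5
    return total

print(half_sum(10))  # 0.0 + 0.5 + ... + 4.5 = 22.5
```

The first call pays a one-time compilation cost; subsequent calls run the compiled machine code.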

He likes the conda distribution and packaging system because it is like pip but also manages non-Python dependencies, and like virtualenv but also allows for different compiled versions.

He also gave cheers to IPython/Jupyter. This tool has been so successful that it has branched out to other languages like R and Julia. In fact, the name Jupyter is short for Julia, Python, and R.

Next, Jake dove into the history of the SciPy ecosystem.

I learned that Python was originally developed as a teaching language by Guido van Rossum in the late 1980s. How did a toy language develop into the preferred language for scientific computing? Pre-Python, a hodgepodge of different tools (Fortran/C libraries) was in use, and many people adopted Python as glue to tie them together. What drew people to Python was not speed of execution (that was handled by C/Fortran) but speed of development.

But that efficiency depends on the scientific stack, which started with the development of Numeric in 1995. In 1998, Travis Oliphant built wrappers on top of Fortran libraries. In 2002, Perry Greenfield developed Numarray to address issues found in Numeric for larger data sets. Having two numeric array packages threatened to split the nascent community in two, but Travis managed to unify them into NumPy. Meanwhile, SciPy was developed in 2000 as a Matlab replacement. In parallel, Fernando Perez started developing IPython in 2001, and in 2002 John Hunter developed matplotlib to bring the graphical abilities of Matlab to Python. In 2009, pandas development started, which has helped tremendously with Python's popularity.

The scikits and conda followed in 2012. In short, the ecosystem was developed in a federated way over the last 10 years, yet it took deliberate, coherent effort to keep it complementary.

Lessons learned:

1.) There is no centralized leadership. What is core in the ecosystem is up to the community. As an example, he wondered whether Numba (not a Continuum project, though it was started by them) should be pushed into the core despite many features that still need development.
2.) The most useful ecosystem must be willing to adapt. As an example, Jake mused whether pandas, or pandas plus xray, should be part of the core; if so, parts of seaborn could be moved into matplotlib. SciPy was designed as a monolithic package to get started fast, but he wonders whether, with conda available, SciPy should be broken down into its components.
3.) Interoperability with core pieces of other languages matters. If graphs can be serialized, this enables cross-language collaboration; Jupyter is a prime example. Jake also speculated about the importance and feasibility of a universal data frame that could be understood by Julia, R, and Python.
4.) Innovation comes through both continuous change (e.g., NumPy) and disruptive change (e.g., pandas, matplotlib), and both are important.

As an example, he considered the future of matplotlib. Some of its issues are already well resolved by existing packages: Seaborn addresses the non-optimal stylistic defaults, and the non-optimal API is addressed by ggplot2. Serialization is being addressed by evolving matplotlib itself. The only issue that apparently remains unresolved is handling large data, and Jake was not sure how this could be resolved. Evolution? Replacement?

Jake concluded the talk by saying that the future is up to us.

This was THE best keynote because it provided a wide perspective and a lot of reflection on the SciPy ecosystem as a whole.

You can read his slide deck here.

Carlos Córdoba, the current maintainer of Spyder, is from Continuum Analytics and gave a talk on better documentation for scientific Python. According to him, scientific computing is exploratory computing, so transparent access to documentation during analysis should be easy. IPython/Jupyter has lots of metadata, but it's poorly formatted; Spyder is better formatted but has poor metadata. The question is how to improve the situation. His answer is docrepr, which combines IPython's metadata with Spyder's formatting: it receives docstrings and metadata as a dictionary, uses Sphinx to parse the docstrings into HTML, and adds the contents to an HTML Jinja template.
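The shape of that pipeline can be sketched with only the standard library; this is my own simplification, with the Sphinx + Jinja rendering step replaced by a plain string template, and the function names are mine, not docrepr's actual API.

```python
import inspect

def doc_payload(obj):
    """Collect a docstring plus metadata as a dictionary,
    roughly the input shape described in the talk (simplified)."""
    return {
        "name": getattr(obj, "__name__", repr(obj)),
        "docstring": inspect.getdoc(obj) or "",
    }

def render_html(payload):
    # Stand-in for the Sphinx + Jinja template step: wrap the
    # docstring in minimal HTML for display in a rich pane.
    return "<h1>{name}</h1>\n<pre>{docstring}</pre>".format(**payload)

print(render_html(doc_payload(len)))
```

The real tool parses reStructuredText in the docstrings (hence Sphinx) rather than dumping them verbatim.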

Damián Avila, core developer of Jupyter/IPython, Bokeh, and Nikola, talked about automatic releases based on his work with Bokeh, which is a bit complex because it has both a Python and a JS side. He described the old way of releasing a version with conda, which was mostly done manually and took hours. His solution, of course, consisted of automating the process, for which he used Travis CI.
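For context, a tag-triggered release on Travis CI might look like the configuration below. This is a hypothetical minimal sketch, not Bokeh's actual setup, and `scripts/release.sh` is an invented placeholder for whatever builds and uploads the packages.

```yaml
# Hypothetical .travis.yml: test every push, release only tagged commits.
language: python
python:
  - "2.7"
install:
  - pip install -e .
script:
  - python -m pytest
deploy:
  provider: script
  script: ./scripts/release.sh   # hypothetical build-and-upload script
  on:
    tags: true
```

The point is that pushing a version tag becomes the entire manual part of the release.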

Stanley Seibert talked about acceleration with the Numba JIT compiler. He was such a fast speaker that I need to review the talk again, but it is a really interesting one worth watching.

Chris Fonnesbeck from Vanderbilt talked about statistical thinking in data science. It was a great talk going back to statistical and sampling issues. His main point: big data does not solve self-selection bias. With increased sample size, you just get more precise but still inaccurate answers back. His advice is to use more model-based inference: essentially, build data-generating models first to estimate the quantities we actually care about. Highly recommended to non-statisticians. I will rewatch this talk myself just to let the material sink in properly.
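Chris's point about sample size not curing self-selection bias can be illustrated with a toy simulation of my own (not from the talk): if people who have a trait are three times as likely to answer a survey, the estimated share of the trait converges to a wrong value, and a bigger sample only makes the wrong value more precise.

```python
import random

def biased_survey_estimate(n, p_trait=0.3, seed=0):
    """Estimate the population share of a trait when carriers
    respond with probability 0.9 and non-carriers with 0.3."""
    rng = random.Random(seed)
    responses = []
    while len(responses) < n:
        has_trait = rng.random() < p_trait
        respond_prob = 0.9 if has_trait else 0.3
        if rng.random() < respond_prob:       # only responders are observed
            responses.append(has_trait)
    return sum(responses) / len(responses)

# True share is 0.30, but the estimate converges to
# 0.27 / (0.27 + 0.21) = 0.5625 regardless of sample size:
for n in (100, 10_000):
    print(n, round(biased_survey_estimate(n), 3))
```

A model-based approach would instead model the response mechanism itself, which is the kind of data-generating model Chris advocated.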

John Readey talked about RESTful HDF5. I didn't know what HDF5 was, and some of the other talks I attended did not really clarify this for me either. This talk did it best by saying that it can be considered several things, from a C API, to a file format, to a data model. John's motivation for developing a RESTful HDF5 was to enable users to access resources within an HDF5 file with just one simple URL ("Imagine, instead of downloading a file and then importing it into an IPython notebook, just getting the data with the link."). Other benefits include network transparency as well as the ability for multiple users to read/write HDF5 resources.
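The idea can be sketched by the shape of such a request; the server name, resource path, and slice syntax below are all hypothetical illustrations, not the actual HDF Server API.

```python
# Hypothetical RESTful HDF5 access: address one slice of one dataset
# inside a large remote file by URL, instead of downloading the file.
base = "https://hdf.example.org"           # hypothetical server
dataset = "/datasets/temperature/value"    # hypothetical resource path
selection = "?select=[0:100]"              # hypothetical slice selection
url = base + dataset + selection
print(url)
# A client would then fetch the URL (e.g. with urllib.request.urlopen)
# and receive just those 100 values, typically as JSON.
```

That per-resource addressing is also what makes concurrent multi-user reads and writes feasible.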

I did not attend the remaining talks as I was heading to the airport but all remaining talks can be accessed here:

SciPy 2015 Talks

If you want to read summaries from other days go here:

Day 1 Summary
Day 2 Summary
