
Recap: SciPy 2015 - Day 2

Wes McKinney gave the keynote on the second day. The talk was more of a personal retrospective, with a view toward the future of Python and the greater data science community. Interestingly, some of the tools seem to have started a "long time ago": 2008. Wes talked about 2011 being the year when pandas development took off again. Thinking about my own history, I joined Amyris in 2011 as part of the Enzymology department, which doesn't feel that long ago. Pandas bug and design fixes, along with data wrangling capabilities, were implemented from June 2011 to July 2012, which is just 3 months before I joined the software engineering department, and that feels really recent.

Phillip Cloud gave a talk on Blaze and Odo. Blaze is an interface for data-centric computation. It consists of expressions and compute recipes and follows design principles similar to those found in dplyr or SQLAlchemy. Blaze expressions describe tables. Compute recipes contain the correct implementation for each backend. Odo is a library for data set conversion.
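To make the split between expressions and compute recipes a bit more concrete, here is a minimal sketch of the idea using only the Python standard library. It is not Blaze's actual API: the `SumWhere` class, the `compute` function and the `accounts` data are all hypothetical names I made up for illustration. One expression is written once, and a separate recipe is registered per backend type.

```python
import sqlite3
from dataclasses import dataclass
from functools import singledispatch


@dataclass(frozen=True)
class SumWhere:
    """Hypothetical expression: sum `column` over rows where `key == value`."""
    table: str
    column: str
    key: str
    value: object


@singledispatch
def compute(backend, expr):
    """Pick a recipe based on the type of the backend (illustrative only)."""
    raise NotImplementedError(f"no recipe for {type(backend).__name__}")


@compute.register(dict)
def _compute_in_memory(backend, expr):
    # Recipe for plain Python data: a dict mapping table name -> list of row dicts.
    rows = backend[expr.table]
    return sum(r[expr.column] for r in rows if r[expr.key] == expr.value)


@compute.register(sqlite3.Connection)
def _compute_sqlite(backend, expr):
    # Recipe for SQLite: translate the same expression into a SQL query.
    # Identifiers come from trusted code here; the value is bound as a parameter.
    sql = (
        f'SELECT COALESCE(SUM("{expr.column}"), 0) '
        f'FROM "{expr.table}" WHERE "{expr.key}" = ?'
    )
    return backend.execute(sql, (expr.value,)).fetchone()[0]


# Made-up sample data, loaded into both backends.
rows = [
    {"owner": "alice", "amount": 100},
    {"owner": "bob", "amount": 200},
    {"owner": "alice", "amount": 50},
]
in_memory = {"accounts": rows}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (owner TEXT, amount INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (:owner, :amount)", rows)

# The expression is defined once and evaluated against either backend.
expr = SumWhere(table="accounts", column="amount", key="owner", value="alice")
result_memory = compute(in_memory, expr)
result_sql = compute(conn, expr)
print(result_memory, result_sql)
```

The point of the design is that the expression knows nothing about where the data lives, so supporting a new backend means registering one more recipe rather than rewriting the analysis.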

Stephan Hoyer talked about xray. Xray was motivated by the need in climate science to work on large, multi-dimensional scientific data with lots of labels and structure. It is pandas-like, but for multiple dimensions. The main data structures are DataArray (like pandas.Series) and Dataset (like pandas.DataFrame). Xray works with Dask.
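Here is a small sketch of what labeled, multi-dimensional data looks like in practice. It assumes the `xarray` package, which is the name xray was later released under, and the city/day temperature values are made up for illustration.

```python
import numpy as np
import xarray as xr

# A DataArray is an N-dimensional array with named dimensions and labeled coordinates.
temps = xr.DataArray(
    np.array([[10.0, 12.0, 14.0], [20.0, 22.0, 24.0]]),
    dims=("city", "day"),
    coords={"city": ["Oslo", "Cairo"], "day": [1, 2, 3]},
    name="temperature",
)

# Select by label instead of by integer position.
oslo_day2 = temps.sel(city="Oslo", day=2).item()

# Reduce over a named dimension instead of an axis number.
mean_per_city = temps.mean(dim="day")

# A Dataset groups several variables that share dimensions and coordinates.
ds = xr.Dataset(
    {
        "temperature": temps,
        "rain": (("city", "day"), np.zeros((2, 3))),
    }
)
print(oslo_day2, mean_per_city.values, list(ds.data_vars))
```

Selecting with `sel` and reducing with `dim="day"` is the part that feels like pandas: the code refers to labels and dimension names, so it keeps working if the axis order of the underlying array changes.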

In the afternoon, I attended 2 Computational Biology talks.

Jai Ram Rideout and Evan Bolyen introduced scikit-bio, a new bioinformatics library currently in beta development. Being a "scikit" package, it is designed to work with other modules in the current Python ecosystem like numpy, scipy, pandas, and scikit-learn. Note that Biopython is not designed to work with the numpy/scipy ecosystem. I found it interesting that, while describing their release cycle scheme, the core developers specifically stated that they wanted to create a bioinformatics package that is coded to higher standards than the usual bioinformatics package. Because of this, I think this is a package to keep an eye on.

Alex Rubinsteyn talked about PyEnsembl and Varcode. PyEnsembl is a Python wrapper for different Ensembl genome annotations. Varcode compares collections of variants between WT and mutant. This currently only works for human genomes, but the plan is to generalize to other organisms. There seem to be performance issues with large numbers of sequences.

Zubin Dowlaty gave an energetic talk about leveraging design thinking for building scalable enterprise intelligent systems. Due to the data explosion, there is a need to develop scalable predictive applications to provide an edge. He notes that the data warehouse model is a failed model. Dashboards are boring. One of his points was that all common robust methods should be used for any problem. Lots of words, enthusiasm and big ideas... but I am not sure I understood what he was trying to say :-)

Thomas Caswell gave the state of the library talk for matplotlib. In short, the project is still active and alive. The 1.5 release, which contains a new default color map, will happen at the end of the month. A 2.0 release is targeted for September. A 2.1 release is planned for March 2016. Matplotlib started shipping an interactive IPython backend with v1.4.3. 3D interactive graphs looked a lot choppier than in Bokeh, though this might be a system difference I am seeing. Seaborn style will come by default with v1.5. A graph now does not need to be redrawn if some of its properties, like line thickness and style, are updated. Matplotlib 1.5 and higher will support Python 3. This talk is worth checking out for all matplotlib users because there really are a lot of new features now available.

Stephen Hoover works on a cloud-based data science platform that handles everything from data import, to data query, to predictive analytics, to automation of the analysis pipeline. The web interface is written with Ruby and JS, but the predictive modeling is done in Python 3, not 2.7! The lessons learned are probably useful for people in the Data Science/Scientific Computing group to watch. An interesting remark is that historically R had a lot more data analysis packages available, but the difference is rapidly disappearing.

Jaime Huerta-Cepas talked about his package called ETE, which is a comprehensive environment for handling and visualizing tree structures. The package contains built-in functions to traverse, annotate, and modify trees, calculate distances, perform tree comparisons, and visualize trees (by generating PNG, PDF or SVG images). Interactive tree images can also be generated. Currently, browser view and IPython support are not in production, but the author has been thinking about them and experimenting with them. Perhaps worth checking out if you have to come up with phylogenetic trees based on large alignment data.

If you want to read summaries of other days, you can read them here:

Day 1 Summary
Day 3 Summary
