
Recap: SciPy 2015 - Day 2

Wes McKinney gave the keynote on the second day. The talk was more of a personal retrospective, with a view on the future of Python and the greater data science community. Interestingly, some of the tools seem to have started a "long time ago": 2008. Wes talked about 2011 being the year when pandas development took off again. Thinking about my own history, I joined Amyris in 2011 as part of the Enzymology department, which doesn't feel that long ago. The pandas bug/design fixes and data wrangling capabilities were implemented from June 2011 to July 2012, which is just 3 months before I joined the software engineering department, and that feels really recent.

Phillip Cloud gave a talk on Blaze and Odo. Blaze is an interface for data-centric computation. It consists of expressions and compute recipes and follows design principles similar to those of dplyr or SQLAlchemy. Blaze expressions describe tables. Compute recipes contain the correct implementation for each backend. Odo is a library for converting data sets from one format to another.
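To make the split between expressions and compute recipes more concrete, here is a minimal sketch in plain Python. It is not Blaze's API: the `Column` and `Sum` classes and the `compute` function are hypothetical names I made up for illustration, and the dispatch uses the standard library's `functools.singledispatch`. The idea is only that one expression can be evaluated against different backends, each with its own recipe.

```python
from dataclasses import dataclass
from functools import singledispatch


# Hypothetical mini expression tree; these classes are NOT Blaze's API.
@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Sum:
    child: Column


@singledispatch
def compute(data, expr):
    """Pick a compute recipe based on the type of the backend data."""
    raise NotImplementedError(f"no recipe for backend {type(data).__name__}")


@compute.register(list)
def _compute_rows(data, expr):
    # Recipe for a row-oriented backend: a list of dict rows.
    return sum(row[expr.child.name] for row in data)


@compute.register(dict)
def _compute_columns(data, expr):
    # Recipe for a column-oriented backend: a dict of column lists.
    return sum(data[expr.child.name])


# The expression is written once and knows nothing about the backend.
expr = Sum(Column("amount"))
rows = [{"amount": 10}, {"amount": 32}]
columns = {"amount": [10, 32]}

result_rows = compute(rows, expr)
result_columns = compute(columns, expr)
```

Both calls evaluate the same expression; only the recipe that runs differs. A real system would presumably cover many more expression types and backends such as SQL databases.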

Stephan Hoyer talked about xray. Xray was motivated by the need to work on large, multi-dimensional scientific data with lots of labels and structure, as found in climate science. It is pandas-like but for multiple dimensions. The main data structures are DataArray (like pandas.Series) and Dataset (like pandas.DataFrame). Xray works with Dask.

In the afternoon, I attended 2 Computational Biology talks.

Jai Ram Rideout and Evan Bolyen introduced scikit-bio, a new bioinformatics library currently in beta development. Being a "scikit" package, it is designed to work with other modules in the current Python ecosystem like numpy, scipy, pandas, and scikit-learn. Note that Biopython is not designed to work with the numpy/scipy ecosystem. I found it interesting that, while describing their release cycle, the core developers specifically stated that they wanted to create a bioinformatics package that is coded to higher standards than the usual bioinformatics package. Because of this, I think this is a package to keep an eye on.

Alex Rubinsteyn talked about PyEnsembl and Varcode. PyEnsembl is a Python wrapper for different Ensembl genome annotations. Varcode compares collections of variants between wild type (WT) and mutant. This currently only works for human genomes, but the plan is to generalize to other organisms. There seem to be performance issues with large numbers of sequences.

Zubin Dowlaty gave an energetic talk about leveraging design thinking for building scalable enterprise intelligent systems. Due to the data explosion, there is a need to develop scalable predictive applications to provide an edge. He noted that the data warehouse model is a failed model. Dashboards are boring. One of his points was that all common robust methods should be used for any problem. Lots of words, enthusiasm and big ideas... but I am not sure I understood what he was trying to say :-)

Thomas Caswell gave the state of the library talk for matplotlib. In short, the project is still active and alive. The 1.5 release will happen at the end of the month and contains a new default color map. A 2.0 release is targeted for September. A 2.1 release is planned for March 2016. Matplotlib started shipping an interactive IPython backend with V1.4.3. 3D interactive graphs looked a lot more choppy compared to Bokeh, though this might be a system difference I am seeing. Seaborn style will come by default with V1.5. A graph now does not need to be redrawn if some of its properties, like line thickness and style, are updated. Matplotlib 1.5 and higher will support Python 3. This talk is worth checking out for all matplotlib users because there really are a lot of new features now available.

Stephen Hoover works on a cloud-based data science platform that handles everything from data import, to data query, to predictive analytics, to automation of the analysis pipeline. The web interface is written in Ruby and JS, but the predictive modeling is done in Python 3, not 2.7! The lessons learned are probably useful for people in the Data Science/Scientific Computing group to watch. An interesting remark was that historically R had a lot more data analysis packages available, but the difference is rapidly disappearing.

Jaime Huerta-Cepas talked about his package called ETE, which is a comprehensive environment for handling and visualizing tree structures. The package contains built-in functions to traverse, annotate, and modify trees, calculate distances, perform tree comparisons, and visualize trees (by generating PNG, PDF or SVG images). Interactive tree images can also be generated. Currently, the browser and IPython views are not in production, but the author has been thinking about them and experimenting with them. Perhaps worth checking out if you have to come up with phylogenetic trees based on large alignment data.
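For readers who have not worked with tree libraries, here is a rough sketch in plain Python of what "traverse" and "calculate distances" mean on a tree with branch lengths. This is not ETE's API: the `Node` class and the `path_to_root` and `distance` helpers are hypothetical names of my own, and the tiny tree is made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


# Hypothetical minimal tree node; this is NOT ETE's API.
@dataclass(eq=False)
class Node:
    name: str
    dist: float = 0.0  # branch length to the parent
    children: List["Node"] = field(default_factory=list)
    parent: Optional["Node"] = field(default=None, repr=False)

    def add_child(self, child: "Node") -> "Node":
        child.parent = self
        self.children.append(child)
        return child

    def traverse(self) -> Iterator["Node"]:
        """Yield this node and all its descendants in preorder."""
        yield self
        for child in self.children:
            yield from child.traverse()

    def leaves(self) -> List["Node"]:
        return [node for node in self.traverse() if not node.children]


def path_to_root(node: Node) -> List[Node]:
    """Return the nodes from `node` up to and including the root."""
    path = []
    current: Optional[Node] = node
    while current is not None:
        path.append(current)
        current = current.parent
    return path


def distance(a: Node, b: Node) -> float:
    """Sum the branch lengths on the path between two nodes."""
    ancestors_of_a = {id(node) for node in path_to_root(a)}
    # The first ancestor of b that is also an ancestor of a is their
    # lowest common ancestor.
    lca = next(node for node in path_to_root(b) if id(node) in ancestors_of_a)
    total = 0.0
    for start in (a, b):
        node = start
        while node is not lca:
            total += node.dist
            node = node.parent
    return total


# Made-up example tree: ((A:1.0, B:2.0)AB:0.5, C:3.0)root
root = Node("root")
ab = root.add_child(Node("AB", 0.5))
a = ab.add_child(Node("A", 1.0))
b = ab.add_child(Node("B", 2.0))
c = root.add_child(Node("C", 3.0))

leaf_names = [leaf.name for leaf in root.leaves()]
```

Here `distance(a, b)` walks from each leaf up to the shared ancestor `AB`, while `distance(a, c)` has to go all the way through the root. ETE presumably does this far more efficiently and adds the annotation, comparison and rendering features on top.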

If you want to read summaries of other days, you can read them here:

Day 1 Summary
Day 3 Summary
