Recap: SciPy 2015

Recap: SciPy 2015 - Day 2

Wes McKinney gave the keynote on the second day. The talk was a personal retrospective with a look toward the future of Python and the wider data science community. Interestingly, some of the tools seem to have started a "long time ago": 2008. Wes described 2011 as the year when pandas development took off again. Thinking about my own history, I joined Amyris in 2011 as part of the Enzymology department, which doesn't feel that long ago. The pandas bug and design fixes and the data wrangling capabilities were implemented from June 2011 to July 2012, which ended just 3 months before I joined the software engineering department, and that feels really recent.

Phillip Cloud gave a talk on Blaze and Odo. Blaze is an interface for data-centric computation. It consists of expressions and compute recipes and follows design principles similar to those found in dplyr or SQLAlchemy. Blaze expressions describe tables and the computations on them. Compute recipes contain the matching implementation of an expression for each backend. Odo is a library for converting data sets from one format to another.
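To make the split between expressions and compute recipes more concrete, here is a minimal sketch in plain Python. It does not use Blaze's actual API: the names `Column`, `Sum`, and `compute` are hypothetical, and the two "backends" (a list of row dicts and a dict of column lists) are stand-ins I made up for illustration.

```python
from dataclasses import dataclass
from functools import singledispatch


# --- Expressions: describe WHAT to compute, independent of any backend ---
@dataclass(frozen=True)
class Column:
    name: str


@dataclass(frozen=True)
class Sum:
    child: Column


# --- Compute recipes: one implementation per backend type ---
@singledispatch
def compute(data, expr):
    # Fallback when no recipe is registered for this backend type.
    raise NotImplementedError(f"no recipe for backend {type(data).__name__}")


@compute.register(list)
def _compute_rows(data, expr):
    # Row-oriented backend: a list of dict records.
    return sum(row[expr.child.name] for row in data)


@compute.register(dict)
def _compute_columns(data, expr):
    # Column-oriented backend: a dict mapping column name to a list of values.
    return sum(data[expr.child.name])


# The same expression runs against both backends.
expr = Sum(Column("amount"))
rows = [{"amount": 10}, {"amount": 32}]
columns = {"amount": [10, 32]}

row_total = compute(rows, expr)
column_total = compute(columns, expr)
print(row_total, column_total)
```

The point of the design is that the expression is written once, and supporting a new backend only means registering another recipe.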

Stephan Hoyer talked about xray. Xray was motivated by the need in climate science to work with large, multi-dimensional scientific data that carries lots of labels and structure. It is pandas-like, but for multi-dimensional data. The main data structures are DataArray (like pandas.Series) and Dataset (like pandas.DataFrame). Xray works with Dask.
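As a rough illustration of the two data structures, here is a small sketch. It assumes the `xarray` package, which is the name xray was later renamed to, and the temperature and rainfall values are made up for the example rather than taken from the talk.

```python
import numpy as np
import pandas as pd
import xarray as xr  # xray was later renamed to xarray

# A DataArray is a labeled N-dimensional array (here: time x city).
times = pd.date_range("2015-07-01", periods=3)
temps = xr.DataArray(
    np.array([[15.0, 17.0], [16.0, 18.0], [14.0, 19.0]]),  # made-up values
    dims=("time", "city"),
    coords={"time": times, "city": ["Austin", "Boston"]},
    name="temperature",
)

# A Dataset groups several DataArrays that share dimensions.
ds = xr.Dataset(
    {
        "temperature": temps,
        "rainfall": (("time", "city"), np.zeros((3, 2))),  # made-up values
    }
)

# Operations refer to dimensions by name instead of by axis number.
mean_by_city = temps.mean(dim="time")

# Label-based selection, similar in spirit to pandas .loc.
austin = temps.sel(city="Austin")
print(mean_by_city.values, austin.values)
```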

In the afternoon, I attended two computational biology talks.

Jai Ram Rideout and Evan Bolyen introduced scikit-bio, a new bioinformatics library currently in beta development. Being a "scikit" package, it is designed to work with other modules in the current Python ecosystem such as NumPy, SciPy, pandas, and scikit-learn. Note that Biopython is not designed to work with the NumPy/SciPy ecosystem. I found it interesting that, while describing their release cycle, the core developers specifically stated that they wanted to create a bioinformatics package that is coded to higher standards than the usual bioinformatics package. Because of this, I think this is a package to keep an eye on.

Alex Rubinsteyn talked about PyEnsembl and Varcode. PyEnsembl is a Python wrapper for different Ensembl genome annotations. Varcode compares collections of variants between wild type (WT) and mutant. It currently only works for human genomes, but the plan is to generalize it to other organisms. There seem to be performance issues with large numbers of sequences.

Zubin Dowlaty gave an energetic talk about leveraging design thinking for building scalable enterprise intelligent systems. Due to the data explosion, there is a need to develop scalable predictive applications to provide an edge. He argued that the data warehouse model is a failed model and that dashboards are boring. One of his points was that all common robust methods should be used for any problem. Lots of words, enthusiasm and big ideas... but I am not sure I understood what he was trying to say :-)

Thomas Caswell gave the state of the library talk for matplotlib. In short, the project is still active and alive. The 1.5 release will happen at the end of the month and contains a new default color map. A 2.0 release is targeted for September, and a 2.1 release is planned for March 2016. Matplotlib started shipping an interactive IPython backend with version 1.4.3. 3D interactive graphs looked a lot choppier than in Bokeh, though this might be a system difference I am seeing. Seaborn style will come by default with version 1.5. A graph no longer needs to be redrawn explicitly when properties such as line thickness and style are updated. Matplotlib 1.5 and higher will support Python 3. This talk is worth checking out for all matplotlib users because there really are a lot of new features now available.
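The point about property updates can be sketched as follows. This is my own minimal example, not code from the talk. It uses the non-interactive `Agg` backend so it runs without a display, and it reads the artist's `stale` flag, which, as far as I understand, is how matplotlib tracks that something needs redrawing.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
(line,) = ax.plot([0, 1, 2], [0, 1, 4])

# Render once; after this the line is up to date.
fig.canvas.draw()

# Update properties on the existing artist instead of re-plotting.
line.set_linewidth(3.0)
line.set_linestyle("--")

# The artist is now flagged as out of date. In an interactive session
# this flag is what lets the figure refresh itself without an explicit redraw.
needs_redraw = line.stale
print(line.get_linewidth(), line.get_linestyle(), needs_redraw)
```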

Stephen Hoover works on a cloud-based data science platform that handles everything from data import, to data query, to predictive analytics, to automation of the analysis pipeline. The web interface is written in Ruby and JS, but the predictive modeling is done in Python 3, not 2.7! The lessons learned are probably useful for people in the Data Science/Scientific Computing group to watch. An interesting remark was that R historically had a lot more data analysis packages available, but the difference is rapidly disappearing.

Jaime Huerta-Cepas talked about his package called ETE, which is a comprehensive environment for handling and visualizing tree structures. The package contains built-in functions to traverse, annotate, modify, and compare trees, calculate distances, and visualize trees (by generating PNG, PDF or SVG images). Interactive tree images can also be generated. Currently, the browser view and IPython integration are not in production, but the author has been thinking about them and experimenting with them. Perhaps worth checking out if you have to come up with phylogenetic trees based on large alignment data.

If you want to read summaries of other days, you can read them here:

Day 1 Summary
Day 3 Summary
