Tuesday, July 14, 2015

Freely Speaking: What does SciPy have to do with bio-based ideas??!

I was recently asked the above question. And it's a totally valid question, as SciPy is somewhat outside of what I usually write about (biological and sustainability topics). There is a logical connection though, and it has to do with what I do at work.

Building biological entities is difficult because, unlike cars, biological entities like to "misbehave". I say misbehave, but it really just means that we don't understand microorganisms well enough to model them perfectly.

This is where data science comes in, of course. Through the collection of large data sets, data science and related fields can help us uncover patterns not seen before. These patterns then help make a better yeast model. Better models = faster product development. Faster product development = faster route to a sustainable business = more products with a positive impact.

So there is a link. Simple, right?

Monday, July 13, 2015

Recap: SciPy 2015 - Synopsis

SciPy 2015 has come and gone. If I step back, what are some of the lessons learned?

There were certain themes that recurred from talk to talk:

  1. Speed. One of the perceived limitations of Python is speed of execution, which matters when processing very large datasets. Many talks dealt with this topic in various ways. Approaches included process parallelization (Dask, DistArray), GPU acceleration (VisPy), and acceleration via some means of compilation - sometimes just-in-time (Numba). With these tools, Python is no longer slow. It's impressive that the combination of these approaches has enabled data scientists to process 60+ GB data sets as if they were loaded into memory on one small laptop that actually has only 8-16 GB of memory.
  2. Visualization was a theme. Many talks dealt with making complex data sets visible, and they did so to address different issues: serialization to enable interactivity (Bokeh, matplotlib), visualizing large dynamic datasets with low latency (VisPy), and making it easier for everyday scientists to view and share plots (HoloViews).
  3. Reproducibility and transparency of data analysis in science, which includes the concept of literate programming, was another theme. Again, different solutions were presented. They ranged from ensuring that data analysis runs on the same configuration at different levels (Docker, Dexter), to striking the right balance between capturing enough metadata and keeping beautiful formatting, to reducing error rates by automating the execution of different simulation scenarios.
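The out-of-core idea behind tools like Dask can be illustrated in plain Python: process data in fixed-size chunks so that memory use stays bounded no matter how large the total data set is. This is only a sketch of the principle, not Dask's actual API; the function names here are my own.

```python
# A minimal pure-Python sketch of out-of-core processing, the idea
# behind tools like Dask. Function names are illustrative, not from
# any library.

def chunked(seq, size):
    """Yield successive chunks of `seq` with at most `size` items each."""
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

def out_of_core_sum(seq, chunk_size=4):
    # Each partial sum touches only one chunk at a time, so peak memory
    # stays proportional to chunk_size, not len(seq). A library like
    # Dask additionally schedules such partials across cores or machines.
    return sum(sum(chunk) for chunk in chunked(seq, chunk_size))

data = list(range(10))
print(out_of_core_sum(data))  # 45
```

In a real setting the chunks would come lazily from disk (e.g. reading a large file piece by piece) rather than from a list already in memory - that is what lets an 8-16 GB laptop work through a 60+ GB data set.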

Sunday, July 12, 2015

Recap: SciPy 2015 - Day 3

Jake VanderPlas, a major contributor to projects ranging from NumPy and SciPy to scikit-learn and mpld3, gave the keynote speech on the third day of the conference, talking about the state of scientific computing in Python. As a side note, Jake is a senior scientist and Director of Research at the eScience Institute at the University of Washington. I hear that he is currently involved in developing the data analysis pipeline for the LSST as well.

Thursday, July 9, 2015

Recap: SciPy 2015 - Day 2

Wes McKinney held the keynote speech on the second day. This talk was more of a retrospective, personal journey with a view on the future for Python and the greater data science community. Interestingly, some of the tools started a "long time ago" - in 2008. Wes talked about 2011 being the year when pandas development took off again. Thinking about my own history, I joined Amyris in 2011 as part of the Enzymology department, which doesn't feel that long ago. Pandas bug/design fixes and data wrangling capabilities were implemented from June 2011 to July 2012 - just 3 months before I joined the software engineering department - and that feels really recent.