Jake VanderPlas, one of the main contributors to anything from NumPy to SciPy, scikit-learn, mpld3, etc., gave the keynote speech on the third day of the conference, talking about the state of scientific computing in Python. As a side note, Jake is a senior scientist and Director of Research at the eScience Institute at the University of Washington. I hear that he is currently involved in developing the data analysis pipeline for the LSST as well.
According to him, the Python ecosystem consists of several layers. At the base, there is Python itself, of course. IPython, NumPy, Cython, and Jupyter can be considered the core of scientific computing. On the third layer, we have modules like SciPy, matplotlib, pandas, and SymPy that are foundational to other, more domain-specific modules, among which are statsmodels, scikit-image, scikit-learn, PyTables, NetworkX, PyMC, and many, many more.
Jake was excited about all the projects in the Python ecosystem right now. On the performance front, tools like Numba, Weave, Numexpr, and Theano have given Python performance boosts not seen before.
Jake noted that most important packages have moved to Python 3. Because of that, he said, the time to move to Python 3 is now, though he admitted that just 2 years ago he would not have said so.
On the visualization front, matplotlib is evolving into a more modern package. He specifically mentioned the style sheets that will be available with version 1.5. He mentioned Seaborn's unique abilities (matplotlib + seaborn + pandas) and beautiful default styles. He also highlighted the progress of Bokeh for interactive graphs.
He mentioned Xray and likes its system for multi-dimensional data. He likes Dask's task graphs, which enable multi-processing of other things like pandas operations. He also mentioned Numba, a JIT compiler that compiles Python into LLVM code running at near C/Fortran speed, for a 20X speed increase.
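To make the task-graph idea a bit more concrete, here is a minimal sketch in plain Python. It borrows the dictionary layout that Dask uses for its graphs (a key maps either to a value or to a tuple of a callable followed by its arguments), but the `evaluate` helper is a hypothetical toy written for this post, not Dask's scheduler. It runs everything serially, whereas a real scheduler can hand independent tasks to different workers.

```python
from operator import add, mul

# A task graph: each key maps to a literal value or to a task tuple
# of the form (callable, arg1, arg2, ...). String arguments that match
# a key in the graph refer to the result of that other task.
graph = {
    "x": 1,
    "y": 2,
    "sum": (add, "x", "y"),     # depends on x and y
    "prod": (mul, "sum", 10),   # depends on sum
}


def evaluate(graph, key, cache=None):
    """Hypothetical toy evaluator: resolve `key` by walking its dependencies."""
    if cache is None:
        cache = {}
    if key in cache:
        return cache[key]
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # Replace references to other keys with their computed results.
        resolved = [
            evaluate(graph, a, cache) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        result = func(*resolved)
    else:
        result = task  # plain data, nothing to compute
    cache[key] = result
    return result


result = evaluate(graph, "prod")
print(result)
```

Because the graph is just data, a scheduler is free to inspect it, work out which tasks do not depend on each other, and run those at the same time.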
He likes the conda distribution and packaging system because it is like pip but also manages non-Python dependencies. It is like virtualenv but also allows for different compiled versions.
He also gave cheers to IPython/Jupyter. This tool has been so successful that it has branched out to other languages as well, like R and Julia. In fact, Jupyter is short for Julia, Python and R.
Next, Jake dove into the history of the SciPy ecosystem.
I learned that Python was originally developed as a teaching language by Guido van Rossum in the late 1980s. How did a toy language develop into the preferred language for scientific computing? Pre-Python, a hodgepodge of tools was used. Many people used Python as glue to tie together that hodgepodge of different tools (Fortran/C libraries). What drew people to Python was not speed of execution (that was handled by C/Fortran) but speed of development.
But that efficiency depends on the scientific stack, which started with the development of Numeric in 1995. In 1998, Travis Oliphant built wrappers on top of Fortran. In 2002, Perry Greenfield developed Numarray to address issues found in Numeric with larger data sets. Having two array packages threatened to split the nascent community in two, but Travis managed to unify the two into NumPy. Meanwhile, SciPy was developed in 2000 as a Matlab replacement. In parallel, Fernando Perez started developing IPython in 2001. And in 2002, John Hunter developed matplotlib to port the graphical abilities of Matlab to Python. In 2009, pandas development started, which has helped tremendously with Python's popularity.
Scikit and conda followed in 2012. In short, the ecosystem was developed in a federated way over the last 10 years. It took a deliberate, coherent effort to develop a complementary ecosystem.
Lessons learned:
1.) There is no centralized leadership. What is core in the ecosystem is up to the community. As an example, he wondered if Numba (not a Continuum project, though it was started by them) should be pushed into the core despite many features that still need development.
2.) The most useful ecosystem must be willing to adapt. As an example, Jake mused whether pandas, or pandas/Xray, should be part of the core. If that were the case, then parts of seaborn could be moved into matplotlib. SciPy was supposed to be a monolithic package to get started fast. But he wonders whether, with conda, SciPy could be broken down into its components.
3.) Interoperability with core pieces of other languages. Wow: if graphs can be serialized, this enables cross-language collaborations. Jupyter is a prime example. Jake also speculated about the importance and feasibility of having a universal data frame that could be understood by Julia, R and Python.
4.) Innovation comes through continuous (e.g.: numpy) and disruptive changes (e.g.: pandas, matplotlib). And both are important.
As an example, he considered the future of matplotlib. Some of its issues are already well resolved by existing packages. Seaborn addresses the non-optimal stylistic defaults. The non-optimal API is addressed by ggplot2. The serialization issue has recently been addressed by evolving matplotlib itself to support serialization. The only issue that apparently remains unresolved is how to handle large data, and Jake was not sure how this could be resolved. Evolution? Replacement?
Jake concluded the talk by saying that the future is up to us.
This was THE best keynote because it offered a wide perspective and a lot of reflection on the SciPy ecosystem as a whole.
You can read his slide deck here.
Carlos Córdoba, current maintainer of Spyder, is from Continuum Analytics and gave a talk on better documentation for scientific Python. According to him, scientific computing is exploratory computing, so transparent access to documentation and analysis should be easy. IPython/Jupyter has lots of metadata, but it is poorly formatted. Spyder's documentation is better formatted but has poor metadata. The question is how to improve the situation. His answer is Docrepr, which combines IPython's metadata with Spyder's formatting. It receives docstrings and metadata as a dictionary, uses Sphinx to parse the docstrings into HTML, and adds the contents to an HTML Jinja template.
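The pipeline described above can be sketched roughly as follows. This is not Docrepr's actual code or API: `render_doc`, the dictionary keys, and the template layout are all hypothetical, and the standard library's `html.escape` and `string.Template` stand in for Sphinx and Jinja so that the sketch runs without extra dependencies.

```python
import html
from string import Template

# Stand-in for a Jinja HTML template (hypothetical layout).
PAGE = Template(
    "<html><body>\n"
    "<h1>$name</h1>\n"
    "<p><code>$signature</code></p>\n"
    "<pre>$body</pre>\n"
    "</body></html>"
)


def render_doc(info):
    """Turn a dict holding a docstring plus metadata into an HTML page."""
    return PAGE.substitute(
        name=html.escape(info["name"]),
        signature=html.escape(info["signature"]),
        # A real pipeline would have Sphinx convert the docstring to HTML here;
        # this sketch only escapes it and shows it as preformatted text.
        body=html.escape(info["docstring"]),
    )


# Hypothetical input: the docstring and its metadata arrive as one dictionary.
info = {
    "name": "add",
    "signature": "add(a, b)",
    "docstring": "Return a + b.\n\nWorks for any types that support <+>.",
}
page = render_doc(info)
print(page)
```

The point of the design is the separation: the metadata travels as plain data, and the formatting lives in one template that any frontend can reuse.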
Damián Avila, core developer of Jupyter/IPython, Bokeh, and Nikola, talked about automatic releases based on his work with Bokeh, which is a bit complex because it has both a Python and a JS side. He described the old way of releasing a version, which used conda, was mostly done manually, and took hours. His solution, of course, consisted of automating the process. For this he used Travis CI.
Stanley Seibert talked about acceleration with the Numba JIT compiler. I need to review this one again: he was such a fast speaker, but this is a really interesting talk worth watching.
Chris Fonnesbeck from Vanderbilt talked about statistical thinking in data science. It was a great talk that went back to statistical and sampling issues. His main point: big data does not solve self-selection bias. With increased sample size, you just get a more precise but still inaccurate answer back. His advice is to use more model-based inference. Essentially, build models first in order to estimate the things we care about. Specifically, we want models of the process that generated the data. Highly recommended to non-statisticians. I will rewatch this talk myself as well, just to let the material sink in properly.
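The "more precise but still inaccurate" point is easy to see in a small simulation. The sketch below is my own illustration, not something from the talk: the population, the response model in `self_selected_sample`, and all the numbers are made-up assumptions, chosen so that people with higher values are more likely to end up in the sample.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: some trait with a true mean of about 50.
population = [random.gauss(50, 10) for _ in range(100_000)]
true_mean = statistics.fmean(population)


def self_selected_sample(pop, n):
    """Collect n respondents; higher values are more likely to respond."""
    sample = []
    while len(sample) < n:
        value = random.choice(pop)
        # Assumed response model: response probability grows with the value.
        p_respond = min(max((value - 20) / 60, 0.0), 1.0)
        if random.random() < p_respond:
            sample.append(value)
    return sample


def mean_and_stderr(sample):
    """Return the sample mean and its standard error."""
    return (
        statistics.fmean(sample),
        statistics.stdev(sample) / len(sample) ** 0.5,
    )


small_mean, small_se = mean_and_stderr(self_selected_sample(population, 100))
large_mean, large_se = mean_and_stderr(self_selected_sample(population, 20_000))

print(f"true mean:         {true_mean:.2f}")
print(f"small sample mean: {small_mean:.2f} +/- {small_se:.2f}")
print(f"large sample mean: {large_mean:.2f} +/- {large_se:.2f}")
```

The larger sample shrinks the error bar, but the estimate stays off from the true mean because the selection mechanism has not changed. Correcting for that requires a model of how the data came to be collected, which is where the model-based inference comes in.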
John Readey talked about RESTful HDF5. I didn't know what HDF5 was, and some of the other talks I attended did not really clarify this for me either. This talk did it best by saying that it can be considered several things, from a C API to a file format to a data model. John's motivation for developing a RESTful HDF5 was to enable users to access resources within an HDF5 file with just one simple URL ("Imagine, instead of downloading a file and then importing it into an IPython notebook, just getting the data with the link."). Other benefits include network transparency as well as the ability for multiple users to read/write HDF5 resources.
I did not attend the remaining talks as I was heading to the airport but all remaining talks can be accessed here:
SciPy 2015 Talks
If you want to read summaries from other days go here:
Day 1 Summary
Day 2 Summary