Tuesday, May 9, 2017

PyData London 2017, write up

This is a post about my experience at PyData London 2017: what I liked, what I learnt... Note that with 4 tracks and so many people, my opinions are very biased. If you want to know how your experience would be: it'll be amazing, but different from mine. :)

On the organization side, I think it was excellent. Everything worked as expected, and when I had a problem with the wifi, the organizers fixed it in literally a couple of minutes. It was also great to have sushi and burritos instead of last year's sandwiches. The Slack channels were quite useful and well organized. I think the organizers deserve a 10, and that's very challenging when organizing a conference.

On the content side, I used to attend conferences mainly for the talks, but this year I decided to try the other things a conference can offer (networking, sprints, unconference sessions...). Some random notes:

Bayesian stuff

I think probabilistic models are the area of data science with the highest entry barrier. This is a personal opinion, but one shared by many others, including book authors:

The Bayesian method is the natural approach to inference, yet it is hidden from readers behind chapters of slow, mathematical analysis. The typical text on Bayesian inference involves two to three chapters on probability theory, then enters what Bayesian inference is. Unfortunately, due to mathematical intractability of most Bayesian models, the reader is only shown simple, artificial examples. This can leave the user with a so-what feeling about Bayesian inference. In fact, this was the author's own prior opinion.

It looks like there is even terminology to distinguish whether the approach used is mathematical (formulae and proofs, quite cryptic to me) or computational (more focused on the implementation).

It was a luxury to have Vincent Warmerdam, from the PyData Amsterdam organization, at PyData once more. He has been one step ahead of most of us, who are more focused on machine learning (I haven't met any frequentist so far at PyData conferences). He already gave a talk last year on the topic, The Duct Tape of Heroes: Bayes Rule, which was quite inspiring and made probabilistic models feel more approachable, and this year we got another amazing talk, SaaaS: Sampling as an Algorithm Service.

After that, we managed to have an unconference session with him, where we went through the examples from the talk in more detail. While Markov chain Monte Carlo and Gibbs sampling aren't straightforward to learn, I think we all learnt a lot, enough to finish working out the details by ourselves.
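The core idea behind MCMC samplers like the ones discussed in the session fits in surprisingly few lines. This is my own toy sketch of a random-walk Metropolis sampler (not the code from the talk), drawing from a standard normal posterior given only its unnormalized log-density:

```python
import math
import random

def metropolis(log_target, n_samples, x0=0.0, step=1.0, seed=42):
    """Minimal random-walk Metropolis sampler: propose a Gaussian
    step, accept with probability min(1, target(x') / target(x))."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.gauss(0.0, step)
        # Compare in log space to avoid numerical underflow.
        if math.log(rng.random()) < log_target(proposal) - log_target(x):
            x = proposal  # accept; otherwise keep the current state
        samples.append(x)
    return samples

# Unnormalized log-density of a standard normal: log p(x) = -x^2/2 + const
log_posterior = lambda x: -0.5 * x * x

draws = metropolis(log_posterior, 20_000)
mean = sum(draws) / len(draws)
```

The sample mean and variance should approach 0 and 1 as the chain grows; tools like PyMC3 automate exactly this kind of loop with far better samplers.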

There were other sessions about Bayesian stuff too:

And probably some others that I'm missing, so it looks like interest in the area is growing, and PyMC3 seems to be the preferred option for most people.

I got good recommendations for books on probabilistic models and Bayesian statistics which don't take the tough mathematical approach:

There is a Meetup in London, which is the place to be to meet other Bayesians:

Frequentist stuff

<This space is for sale, contact the administrator of the page>

Topic modeling and Gensim

Another topic that seems to be trending is topic modeling, using vector spaces for NLP, and Gensim in particular, including Latent Dirichlet allocation, one of the most amazing algorithms I've seen in action.

We also had a Gensim sprint during the conference, where we could learn not only what Gensim does, but also why it is a great open source project. In the past I've seen Gensim return the most similar documents almost instantly, on a dataset of more than a million samples. While the documentation gives many hints on how Gensim was designed with performance in mind, it was a pleasure to participate in the sprint and see the code, and the people who make this happen, in action.
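The vector-space idea behind that similarity lookup is simple to illustrate. This is a stdlib-only toy (Gensim's real implementation uses TF-IDF/LSI vectors and optimized matrix operations, not this): represent each document as a bag-of-words vector and rank by cosine similarity.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector: token -> count."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = [
    "bayesian inference with markov chain monte carlo",
    "topic modeling with latent dirichlet allocation",
    "monte carlo sampling for bayesian models",
]
query = bow("bayesian monte carlo methods")

# Index of the document most similar to the query.
best = max(range(len(docs)), key=lambda i: cosine(query, bow(docs[i])))
```

Gensim essentially does this over millions of documents, streaming them from disk so the corpus never has to fit in memory.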

It was also amazing to see how Lev Konstantinovskiy managed to run a tutorial, a talk, a sprint and a lightning talk during the conference.

From theory to practice

It may be just my impression, but I'd say there have been more talks on applications of data science, and more diverse ones. While I remember talks on common applications like recommender systems in previous editions, I think there has been an increase in talks applying all these techniques across different areas.

To name a few:
Also, the astronomy/aerospace communities seem to be quite active within the PyData community.

Data activism

Another area that I'd say is growing is data activism, or how to use data in a social or political way. We got a keynote on fact checking, and another on analyzing data for good, using government information to help prevent money laundering.

DataKind UK looks like the place to be to participate in these efforts.

Pub Quiz

That awkward moment when you thought you knew Python, but James Powell is your interviewer...

Ok, it wasn't an interview, it was a pub quiz, but the feeling was somehow similar. After 10 years working in Python, having passed challenging technical interviews at companies such as Bank of America and Google, at some point you start to think you know what you're doing.

Then, when you're relaxed in a pub, after an amazing but exhausting day, James Powell starts running the pub quiz, and you feel like you don't know anything about Python. Some new Python 3 syntax, all-time namespace tricks, and so many atypical cases...

Luckily, all the dots started to connect, and I realized that a few hours before, I had been discussing the new edition of his book Python in a Nutshell with Steve Holden. It sounded like an introduction to me, but it looks like it covers all the Python internals.

Going back to the pub quiz, I think it was one of the most memorable moments of the conference. Great people, loads of laughs, and an amazing set of questions, perfectly executed.

Big Data becoming smaller

As I mentioned before, my experience at the conference is very biased, and very influenced by the talks I attended, the people I met... But my impression is that the boom in big data (large deep networks, Spark...) is not a boom anymore.

Of course there are a lot of people working with Spark, and researching deep neural networks, but instead of growing, I felt like these things are losing momentum, and people are focusing on other technologies and topics.

Meetup groups

One of the things I was interested in was finding interesting new meetups. I think these are among the most popular ones in data science:

But I met many organizers of other very interesting meetups at the conference:

To conclude, there are a couple of tools/packages I discovered that it seemed everybody else was already aware of.

It looks like at some point the instant messaging of most free software projects moved from IRC to Gitter. There you can find data science communities, like pandas and scikit-learn, as well as non data science ones, like Django.

A package that many people seem to be using is tqdm. You can wrap it around an iterator (like enumerate), and it shows a progress bar while the iteration is running. Funny that, besides being an abbreviation of "progress" in Arabic, it's also an abbreviation for "I want/love you too much" in Spanish.
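In practice the real thing is just `for item in tqdm(iterable): ...`. To show what that wrapping actually does, here is a stdlib-only toy stand-in (my own sketch, not tqdm's implementation): it yields the items unchanged while redrawing a bar on stderr.

```python
import sys

def progress(iterable, total=None, width=30):
    """Toy tqdm-style wrapper: prints a progress bar to stderr
    while passing every item of the iterable through unchanged."""
    total = total if total is not None else len(iterable)
    for i, item in enumerate(iterable, 1):
        filled = width * i // total
        bar = "#" * filled + "-" * (width - filled)
        sys.stderr.write(f"\r[{bar}] {i}/{total}")  # \r redraws in place
        sys.stderr.flush()
        yield item
    sys.stderr.write("\n")

# The wrapped loop behaves exactly like the bare one.
squares = [n * n for n in progress(range(5))]
```

Since the wrapper is a transparent generator, the surrounding code doesn't change at all, which is exactly why tqdm is so pleasant to drop into existing loops.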

What's next?

Good news. If you couldn't attend PyData London 2017, or you didn't have enough of it, there are some things you can do:
  • Attend PyData Barcelona 2017, which will be as amazing as PyData London, also in English, and with top speakers like Travis Oliphant (author of SciPy and NumPy) or Francesc Alted (author of PyTables, Blosc, bcolz and numexpr).
  • Wait until the videos are published in the PyData channel (or watch the ones from other PyData conferences)
  • Join one of the 55 PyData meetups around the world, or start yours (check this document to see how, NumFOCUS will support you).
  • Join one of the other conferences happening later this year in Paris, Berlin, EuroPython in Italy, Warsaw... You can find them all at https://pydata.org/
