The term “data scientist” has only been around for a few years: it was apparently coined in 2008 by either D.J. Patil or Jeff Hammerbacher, then the respective leads of data and analytics at LinkedIn and Facebook. Since then, data scientists have swiftly gained influence.
Only four years later, the Harvard Business Review called data science the sexiest job of the 21st century. A recent report by IBM (The Quant Crunch) now predicts that the demand for data science skills in the US will grow by 28% by 2020.
But despite the growing need for data scientists, there is no direct path to follow to become one. It is also not that easy to understand what exactly they do in different companies. What is the difference between business intelligence, data analysis and data science, for example? What is the unique contribution of a data science team in a company? What do companies look for when hiring data scientists?
To answer these questions, we talked to Jaco du Toit - an accidental data scientist who was introduced to machine learning while pursuing his Master’s Degree in Computer Science. Earlier this year, he joined Curately as Lead Data Scientist and is also busy with a PhD focused on Probabilistic Graphical Models.
In addition, data scientists from takealot.com (Luyolo Magangane and Michel Halmes), Zoona (Morne van der Westhuizen) and JUMO (Paul Kotze and Liam Furman) contributed to this article by giving an in-depth explanation of the work they do at these companies.
The need for data scientists
A data scientist’s role falls somewhere between that of a developer and that of a statistician. A definition by Josh Wills, Director of Data Engineering at Slack, encapsulates this and has become well-known. He describes a data scientist as a “person who is better at statistics than any software engineer and better at software engineering than any statistician”.
Information is becoming available in volumes and varieties like never before, giving rise to a competitive landscape: business success will largely depend on how well companies can turn data into insights and actually act on them. This is where data scientists come in: they have the skills to bring structure to large quantities of unstructured data, which makes deep analysis possible. This helps decision makers have open-ended discussions about the data at their disposal.
Data scientists are often required to create user-facing products from data and integrate them with existing platforms. “In such cases, it will either be self-developed, handed over to developers, or jointly built with a developer,” Jaco says.
Skills and background to become a data scientist
Despite the clear need for data scientists, becoming one is not straightforward: “Apart from short courses presented by a few companies and universities in South Africa, there is not one clear qualification path that will enable you to become a data scientist,” Jaco du Toit explains. (The good news is that Stellenbosch University is said to be working on a degree program.)
Despite these constraints, potential data scientists can be found in a variety of disciplines with a strong data and computational focus: computer science, actuarial sciences, engineering, statistical sciences, mathematics and applied mathematics, operations research, astronomy and quantum physics.
“An advanced degree in a quantitative field is an advantage and a minimum requirement for many well known international companies hiring data scientists,” Jaco states.
According to him, data scientists can either have a wide range of generic skills or a more specialised skill set - in certain algorithms or tools, for example. A data scientist can also become an industry specific expert, which means they would have experience with certain types of data sets.
Here are some of the skills that companies are looking for when hiring data scientists:
- Proficiency in R, Python (or other high level languages like Java, C, C++, C#), and SQL
- Knowledge of a useful data stack like Apache Kafka
- Experience with MySQL or Postgres
- Background in scientific, mathematical or quantitative fields such as physics, statistics or engineering
- Ability to solve complex problems
- Ability to effectively infer and describe insights from data sets
- Ability to articulate oneself in a clear, concise manner
- Solid understanding of some linear and nonlinear machine learning paradigms, for example neural networks, evolutionary computing or probabilistic reasoning.
How data science is applied in specific companies
takealot.com
Takealot.com is South Africa’s largest e-commerce platform. Since its inception in June 2011, the company has acquired Mr Delivery and Superbalist and merged with kalahari.com.
Takealot’s data science team is an integrated part of the engineering team. They are responsible for designing, implementing and maintaining the system in production. Projects often involve other engineering and business teams.
The kinds of data problems and opportunities they are facing
Many of takealot.com’s challenges are too complex to solve using traditional computing algorithms (or humans). A good example of this is the problem of defining a user’s intent when searching an e-commerce platform. There are various mapping techniques that can translate words into a list of products, but they fall short of understanding the meaning behind a query. This is where learning algorithms become useful. They can build relationships between words that, in turn, infer the intention behind a user’s request. This functionality is crucial in any e-commerce platform, helping users find what they’re looking for.
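One simple way to picture how learning algorithms relate query terms to products, even when the exact words never appear in a listing, is latent semantic analysis. The sketch below is illustrative only (it assumes scikit-learn and uses a toy catalogue), not takealot.com’s actual search implementation:

```python
# Illustrative sketch: relating a query to products beyond exact word matches,
# using latent semantic analysis (TF-IDF + truncated SVD). Data is hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

products = [
    "samsung galaxy smartphone mobile phone",
    "apple iphone smartphone",
    "harry potter paperback book",
    "lord of the rings paperback novel",
]

# Build a latent "concept" space from the product descriptions.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(products)
svd = TruncatedSVD(n_components=2, random_state=0)
product_vectors = svd.fit_transform(tfidf)

def rank_products(query):
    """Rank products by similarity to the query in the latent concept space."""
    query_vec = svd.transform(vectorizer.transform([query]))
    scores = cosine_similarity(query_vec, product_vectors)[0]
    return sorted(zip(products, scores), key=lambda pair: -pair[1])

# The iphone listing can score well even though it shares no words with the query,
# because "smartphone" links it to the listing that does mention "mobile phone".
print(rank_products("mobile phone"))
```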
In an environment where data points are moving with high velocity (such as product stock levels), high variety (such as the diversity in takealot.com’s product catalogue) and in high volume (millions of user visits per month), it is advantageous to make use of self-adapting algorithms that can answer difficult questions such as:
- Based on a user's order history, what other products may they be interested in?
- How long should a product be kept in the warehouse?
- How should discounts be applied to a customer’s order based on what is in that order?
- Are users loyal to takealot.com?
In the takealot.com team’s experience, people who want to be data scientists are often interested in the modelling part of the work. But in a commercial environment, it is crucial that data scientists can also engineer their solutions into a working system. That means that data scientists could easily spend up to 70% of their time implementing systems.
Examples of solutions they are implementing
Takealot.com’s search function allows customers to find the products they’re looking for via search queries. This does not necessarily require machine learning, as it is an implementation of document retrieval (full-text search). The team does, however, use machine learning to rank the relevant products: a single query can match hundreds of products in the catalogue, and those matches can vary a lot. A search may match a book title, a perfume and a DVD, all with similar words in the title. The team’s challenge is to figure out which ones the customer is most likely to be interested in. They use algorithms to infer these preferences from past user behaviour.
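A minimal sketch of what such a ranking step can look like, assuming scikit-learn and invented features (the feature set and data below are illustrative, not takealot.com’s actual ranker):

```python
# Hypothetical pointwise learning-to-rank sketch: a model trained on past click
# behaviour re-orders the candidates that full-text search returned.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Each row: [text match score, historical click-through rate, price rank]
X_train = np.array([
    [0.9, 0.30, 1],
    [0.8, 0.02, 3],
    [0.4, 0.25, 2],
    [0.2, 0.01, 5],
])
y_train = np.array([1, 0, 1, 0])  # 1 = the user clicked this result in the past

ranker = GradientBoostingClassifier().fit(X_train, y_train)

# At query time: score the candidates retrieved by full-text search and sort.
candidates = np.array([[0.7, 0.20, 2], [0.9, 0.01, 4]])
scores = ranker.predict_proba(candidates)[:, 1]
ranking = np.argsort(-scores)  # candidate indices, most relevant first
```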
Recommendations suggest products to the customer in a specific context. This context can be created in several ways: a product or a category the customer is currently viewing or even several products that are currently in their shopping cart. The recommender systems are based on algorithms that fall into the category “collaborative filtering”.
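A minimal item-item collaborative filtering sketch, with made-up interaction data, shows the underlying idea: recommend products whose purchase patterns resemble what is already in the customer’s context.

```python
# Item-item collaborative filtering sketch (illustrative data only).
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = users, columns = products; 1 means the user bought the product.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 0, 1],
])

# Similarity between products, based on which users bought them together.
item_similarity = cosine_similarity(interactions.T)

def recommend(cart_items, top_n=2):
    """Score all products by their similarity to the items already in the cart."""
    scores = item_similarity[cart_items].sum(axis=0)
    scores[cart_items] = -np.inf          # don't recommend what's already there
    return np.argsort(-scores)[:top_n]

print(recommend([0]))  # products most often bought together with product 0
```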
Both the search function and the recommendation engine work with discoverability algorithms that require a lot of customer behaviour data. They need a way of collecting and storing this data, and this too is the responsibility of the data science team: setting up infrastructure for data capturing, pipelining and warehousing. They use a real-time system where events are produced on a messaging queue by various components of the takealot.com stack: the mobile and web apps, the warehouse, etc.
The events are then consumed, processed, stored and eventually used by the Machine Learning team for their algorithms, but also by other teams such as Product and Logistics. In fact, the entire organisation should have access to all this data, so setting up supporting services around the infrastructure, such as visualisation dashboards, data querying interfaces and reporting mechanisms, is important.
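In outline, such an event pipeline can look like the sketch below. It assumes the kafka-python client; the topic name and event fields are purely illustrative, not takealot.com’s actual schema.

```python
# Sketch of an event pipeline over Kafka (kafka-python is an assumption here).
import json
from kafka import KafkaProducer, KafkaConsumer

# A front-end component publishes a user event onto a Kafka topic...
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)
producer.send("user-events", {"user_id": 42, "action": "view", "product_id": 1001})
producer.flush()

# ...and a downstream consumer processes the stream, e.g. to build training data
# or load events into a warehouse.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)
for message in consumer:
    event = message.value
    # store or aggregate the event here
    print(event["user_id"], event["action"])
```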
The tools they are using
Takealot.com has a fluid stack that changes according to requirements of the company and the data science team. These are the tools they’re currently using:
- Python is one of the most data-science-friendly languages. There are a number of libraries and services that help with experimentation and development. Some of the libraries they use extensively include Numpy and Scipy, as well as other numerical and statistical processing libraries.
- Traditional relational databases are also prominent in their tool set. These encompass MySQL, PostgreSQL and SQLite. They use these in conjunction with Python scripts to test new adaptive models, and then potentially scale these solutions using distributed datastore clusters. Some of those include SparkSQL and Redshift.
- Data collection is probably the most important aspect of building machine learning models. Apache Kafka has shown incredible versatility for data capturing. It can be used for user activity tracking, operational performance logging, inter-service communication and stream processing. They then use Amazon S3 and Amazon Redshift as long-term data storage. It can also be useful to store data in fast read access document stores. This is primarily applicable when all processing for a machine learning model has been completed in an offline batch processing task, for example a Spark job. Their tool of choice in this arena is Redis - a fast, distributed, in-memory data store.
- Arguably the most important part of an e-commerce platform is improving users’ ability to find things on the website. Elasticsearch, along with the Elastic Stack, provides a mechanism for relating user search queries to products within a catalogue. It’s not only useful for indexing and querying product data, but can also be used for system monitoring through its Kibana interface.
- A tool they use for ad-hoc analysis and prototyping is Jupyter Notebook. It allows interactive development of algorithms and data visualisations. Finally, one can easily generate reports in a variety of file formats following development.
Zoona
Zoona is a mobile payments company partnering with local entrepreneurs to provide a variety of over-the-counter (OTC) solutions: person-to-person money transfers, salary payouts, bill payments and supplier payments. Their customers are mainly unbanked consumers and small businesses in Zambia, Malawi and Mozambique.
Like their developer team, the data science team is product focused and customer-centric: they add value not by creating reports or presentations for senior management but by directly giving input to improve customer-facing products and processes. Their solutions are focussed on helping Zoona’s communities thrive in the relevant markets.
The kinds of data problems and opportunities they are facing
Roughly 70% of consumers in Zambia are unbanked, so designing financial products is particularly difficult due to the lack of traditional reference documents for credit scoring. Data sources are very sparse and usually outdated. Zoona therefore has to be really creative in its use of alternative sources.
Network analysis can be a useful tool in this endeavour. It helps Zoona to estimate how trustworthy a user is based on who they know - specifically if they interact or transact with people Zoona knows more about.
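One possible formulation of this idea, sketched below with made-up data, is to propagate trust over the transaction graph with personalised PageRank (NetworkX is in Zoona’s toolset, but this specific model is illustrative, not necessarily theirs):

```python
# Trust propagation sketch: a user who transacts with well-known users inherits
# some of their trust via personalised PageRank on the transaction graph.
import networkx as nx

# Edges are transactions between users; weights could be transaction counts.
G = nx.Graph()
G.add_weighted_edges_from([
    ("known_agent_1", "user_a", 5),
    ("known_agent_2", "user_a", 2),
    ("known_agent_1", "known_agent_2", 8),
    ("user_b", "user_c", 1),   # a pair with no link to known users
])

# Seed the walk at users Zoona already knows a lot about.
trusted_seeds = {"known_agent_1": 0.5, "known_agent_2": 0.5}
trust_scores = nx.pagerank(G, personalization=trusted_seeds, weight="weight")

print(trust_scores["user_a"])  # connected to trusted users -> higher score
print(trust_scores["user_b"])  # isolated from them -> lower score
```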
Geospatial data is particularly valuable to Zoona’s data scientists. They use the data to determine financial inclusion and market penetration. Nevertheless, given the markets in which they operate, there is one significant problem with the geospatial data available to them: it is not granular enough. To address this problem, they send out an advance team to scope the landscape whenever they move into a new area. This team logs physical landmarks such as markets and bus stops. Because money flows traditionally follow bus or taxi routes, it can be helpful to get into a local taxi and travel on key routes.
Examples of solutions they are implementing
Geospatial modelling: The team uses geospatial points, NASA night lights, population statistics and access to financial points to determine the best possible locations for expansion, and to aid in financial inclusion in previously unbanked communities.
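A rough sketch of this kind of geospatial scoring, assuming Geopandas (which is in their toolset); the file names and the scoring rule below are hypothetical, not Zoona’s actual model:

```python
# Illustrative expansion-site scoring: join candidate locations to population
# polygons and favour areas far from existing financial access points.
import geopandas as gpd

candidates = gpd.read_file("candidate_sites.geojson")      # points (hypothetical file)
population = gpd.read_file("population_grid.geojson")      # polygons with a 'pop' column
existing = gpd.read_file("existing_outlets.geojson")       # points

# Attach the population of the grid cell each candidate falls into.
candidates = gpd.sjoin(candidates, population[["pop", "geometry"]],
                       how="left", predicate="within")

# Distance (in the CRS units) to the nearest existing outlet.
candidates["dist_to_outlet"] = candidates.geometry.apply(
    lambda pt: existing.distance(pt).min()
)

# Simple score: many people nearby, far from existing financial access points.
candidates["score"] = candidates["pop"] * candidates["dist_to_outlet"]
print(candidates.sort_values("score", ascending=False).head())
```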
Revenue forecasting: Machine learning is used to predict the company's revenue 6 months in advance. An ensemble of gradient-boosted decision trees is trained on each Zoona-agent's historical performance, geospatial location and seasonal effects. This is then fed into a series of Monte Carlo simulations, which try different scenarios of rollout plans and other business effects to get the most likely revenue.
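A hedged sketch of that two-stage setup, using XGBoost (which is in their toolset) with invented features and numbers: a per-agent model forecasts revenue, and Monte Carlo simulation then aggregates many possible rollout scenarios.

```python
# Stage 1: gradient-boosted forecast per agent. Stage 2: Monte Carlo over rollout
# scenarios. All features, values and parameters here are illustrative.
import numpy as np
from xgboost import XGBRegressor

# Per-agent rows: [months active, avg monthly transactions, lat, lon, month-of-year]
X = np.array([[12, 300, -15.4, 28.3, 1],
              [3, 80, -13.0, 28.6, 1],
              [24, 550, -15.4, 28.2, 1],
              [6, 120, -14.9, 27.9, 1]])
y = np.array([1500.0, 350.0, 2800.0, 600.0])   # past monthly revenue per agent

model = XGBRegressor(n_estimators=50).fit(X, y)

rng = np.random.default_rng(0)
simulated_totals = []
for _ in range(1000):
    # Each scenario rolls out a random number of new agents with noisy features.
    n_new = rng.integers(5, 20)
    new_agents = np.column_stack([
        np.ones(n_new),                      # months active
        rng.normal(150, 50, n_new),          # expected monthly transactions
        rng.uniform(-16, -13, n_new),        # latitude
        rng.uniform(27, 29, n_new),          # longitude
        np.full(n_new, 7),                   # launch month
    ])
    simulated_totals.append(model.predict(new_agents).sum())

print(np.percentile(simulated_totals, [5, 50, 95]))  # likely revenue range
```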
The team also focuses on credit line calculation and credit scoring, including algorithms that can predict the probability of a consumer defaulting. To predict optimal expansion in existing markets, they use Monte Carlo simulations and a predator-prey model to cater for saturation. For consumer churn analyses, they use social network analysis and LSTMs.
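For readers unfamiliar with predator-prey models, the generic Lotka-Volterra sketch below (made-up parameters, not Zoona’s calibration) shows the family of dynamics being referred to: the “prey” can be read as the pool of unserved customers and the “predators” as active agents competing for them.

```python
# Generic Lotka-Volterra dynamics, a common way to model saturation effects.
import numpy as np
from scipy.integrate import odeint

def dynamics(state, t, a, b, c, d):
    prey, predators = state
    d_prey = a * prey - b * prey * predators             # unserved customers
    d_predators = c * prey * predators - d * predators   # agents sustained by them
    return [d_prey, d_predators]

t = np.linspace(0, 50, 500)
trajectory = odeint(dynamics, y0=[1000, 10], t=t, args=(0.1, 0.002, 0.0005, 0.2))
# trajectory[:, 0]: unserved market over time; trajectory[:, 1]: active agents.
```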
The tools they are using
- Machine Learning: Keras, Scikit-Learn, XGBoost, Statsmodels, PyMC3, SpaCy, NLTK, Dask
- Network analysis: NetworkX, Neo4J
- Geospatial: Geopandas, Iris, Fiona, OSMnx, Graphhopper, Proj4
- Web scraping/text analysis: Scrapy, Selenium, BeautifulSoup, Fuzzywuzzy
- Image processing: Scikit-Image, OpenCV
- Visualisation: Matplotlib, D3, Bokeh, Cartopy
JUMO
JUMO is a financial services marketplace currently operating in six African countries. Using behavioural data from mobile usage, their predictive technology creates highly accurate credit scores and enables customers to access responsive savings as well as working capital products on their mobile phones. JUMO targets emerging markets, and 80% of its customers interact with financial services for the first time through the platform.
The data science team works closely with three other teams within JUMO. It collaborates with:
- the Product team, to provide valuable insights to design products that better fit customers and their needs,
- the Portfolio Management team, to provide impact measures of their products on customers, and
- the Data Architecture team, to design and implement big data architecture and tools to scale their big data capability.
In general, the team’s goal is to accurately estimate their customers’ affordability based on their mobile phone usage, regional characteristics and other features. This is impossible without intimately understanding their financial context.
The kinds of data problems and opportunities they are facing
A common request from JUMO’s partners is to measure the effect of JUMO-products on their businesses. When they offer loans to their partners’ customers, for example, they need to understand the effect it has on ARPU (Average Revenue Per User) - an important metric for MNOs (mobile network operators). Ideally, JUMO would estimate the effect of their products through randomised controlled trials. However, this isn’t always possible when commercial considerations come into play. Instead, the data science team has developed a framework for statistical matching.
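One common form of statistical matching is propensity-score matching; the sketch below is a hedged illustration of that general approach with hypothetical features, not necessarily JUMO’s framework.

```python
# Propensity-score matching sketch: match each treated customer (received a loan)
# to the untreated customer with the closest propensity score, then compare ARPU.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 3))            # e.g. airtime spend, tenure, region index
treated = rng.integers(0, 2, n)        # 1 = customer took a loan (simulated)
arpu = 5 + X[:, 0] + 2 * treated + rng.normal(size=n)   # simulated outcome

# Propensity: probability of being treated, given the covariates.
propensity = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

treated_idx = np.where(treated == 1)[0]
control_idx = np.where(treated == 0)[0]

effects = []
for i in treated_idx:
    # Nearest-neighbour match on the propensity score.
    j = control_idx[np.argmin(np.abs(propensity[control_idx] - propensity[i]))]
    effects.append(arpu[i] - arpu[j])

print(np.mean(effects))  # estimated uplift in ARPU from the loan product
```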
Examples of solutions they are implementing
Data science is involved in several deep-dive analyses to investigate JUMO’s effect on the mobile money ecosystem. They are able to quantify their financial impact by calculating additional revenue generated using empirical statistical models and Markov chains. They also analyse their partners’ data to help them optimise their agent networks. To accommodate the demand for mobile money, for example, they identify regions with high mobile money penetration and low agent density. They then make recommendations about where their partners should deploy more agents. They also investigate the agents’ cash-flows to build products that allow their partners’ agents to serve customers better.
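The Markov-chain part of this can be pictured with a toy model: customers move between activity states each month, and the long-run distribution of states translates into expected revenue. The states, transition probabilities and revenue values below are invented for illustration.

```python
# Toy Markov chain over customer activity states.
import numpy as np

states = ["inactive", "occasional", "frequent"]

# Row i, column j: probability of moving from state i to state j in a month.
P = np.array([
    [0.80, 0.15, 0.05],
    [0.20, 0.60, 0.20],
    [0.05, 0.25, 0.70],
])
revenue_per_state = np.array([0.0, 1.5, 6.0])   # monthly revenue per customer

# Long-run state distribution: the stationary distribution of the chain.
eigvals, eigvecs = np.linalg.eig(P.T)
stationary = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
stationary = stationary / stationary.sum()

expected_revenue = stationary @ revenue_per_state
print(dict(zip(states, stationary)), expected_revenue)
```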
Using predictive modeling, the JUMO team identifies informal merchants by looking at mobile money transactional data. The accuracy of these predictions has been confirmed through face-to-face, qualitative field work by their customer intelligence team. Moreover, the data science team builds features that predict customers’ ability to settle their loans, and these are used in credit scoring models.
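In its simplest form, such a merchant classifier might look like the sketch below; the transaction-derived features and labels are invented, and the model choice is generic rather than JUMO’s actual one.

```python
# Hedged sketch: classify informal merchants vs ordinary consumers from
# mobile money transaction patterns (all values are illustrative).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Per-customer features: [incoming txns/month, distinct counterparties,
#                         median txn value, share of txns received vs sent]
X = np.array([
    [120, 80, 3.0, 0.90],   # looks like a merchant
    [4, 3, 20.0, 0.30],     # looks like a consumer
    [90, 60, 2.5, 0.85],
    [6, 5, 15.0, 0.40],
])
y = np.array([1, 0, 1, 0])  # 1 = confirmed merchant (e.g. from field work)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
print(clf.predict_proba([[100, 70, 2.8, 0.88]])[:, 1])  # merchant probability
```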
The team is also involved in building internal and external tools. They are developing several data portal products for internal consumption. One of these is a portfolio portal that provides a cohorted view of financial data on which they have built forecasts. This allows JUMO’s portfolio teams to monitor changes in product performance over time and estimate future performance. In addition, they are building tools that offer external partners better transparency and controls over their ecosystems.
The tools they are using
- Basic data science tech stack: A combination of Jupyter Notebooks, Python, R and SQL for many of their analyses. Spark plays an increasingly important role for analysis of big data.
- Cloud infrastructure: AWS, EC2 and EMR.
Curately (by Piccing)
Curately is a visually driven social platform that enables users to explore an image: They can buy products tagged in an image, watch videos about tagged items or read content about tagged items. Its aim is to create new visual ways for people to experience information by facilitating interactions between consumers, brands, and publishers.
Curately is still in the early stages of building a data science team. For now, the members mainly collaborate with two other teams: the team creating and managing the content on the platform to ensure that the data gets standardised, and the developer team to help productionise the algorithms.
Examples of solutions they are working on
One example relates to computer vision: the data science team is currently building a unified object localisation and recognition framework. In a given image, it has to detect all relevant objects and higher-level themes, and be able to localise them. This information is used to drive user engagement, suggest relevant content and streamline some of their back-end processes.
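As a crude, illustrative stand-in for the recognition half of such a framework (not Curately’s actual model), one can run a pretrained ImageNet classifier over crops of an image so that each recognised label also comes with a rough location; the sketch assumes TensorFlow/Keras and Pillow, both of which are in their toolset, and a hypothetical input image.

```python
# Sliding-crop recognition sketch: classify fixed-size crops of an image so
# each label is tied to an approximate region.
import numpy as np
from PIL import Image
from tensorflow.keras.applications.mobilenet_v2 import (
    MobileNetV2, preprocess_input, decode_predictions)

model = MobileNetV2(weights="imagenet")

image = Image.open("scene.jpg").convert("RGB")   # hypothetical input image
width, height = image.size

detections = []
step = 224  # non-overlapping 224x224 crops, the model's input size
for top in range(0, height - step + 1, step):
    for left in range(0, width - step + 1, step):
        crop = image.crop((left, top, left + step, top + step))
        x = preprocess_input(np.array(crop, dtype=np.float32)[np.newaxis])
        label = decode_predictions(model.predict(x), top=1)[0][0]
        detections.append({"box": (left, top, step, step),
                           "label": label[1], "score": float(label[2])})

# Keep only confident detections as candidate tagged objects.
print([d for d in detections if d["score"] > 0.5])
```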
The aim of Curately’s recommendations feature is to map relevant products to objects detected in the images using probabilistic methods. The team is also investigating methods that incorporate latent behavioural factors, which should allow them to explore more complex assumptions rather than generalising over aggregated cohorts. They envision this producing a more natural and unbiased personalisation experience.
They also use anomaly detection to ensure the user experience is not affected by inappropriate content. For general platform optimisation, they rely on feedback from users’ interactions with the models to improve model performance.
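A minimal sketch of the anomaly-detection idea, assuming scikit-learn and hypothetical content features (this is one generic approach, not necessarily Curately’s): a model trained on normal content flags items whose features look unusual, so a human can review them.

```python
# Isolation-forest anomaly detection on content features (illustrative only).
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# e.g. image embedding statistics, text scores, engagement features, ...
normal_content = rng.normal(loc=0.0, scale=1.0, size=(500, 4))

detector = IsolationForest(contamination=0.01, random_state=0).fit(normal_content)

new_items = np.array([[0.1, -0.2, 0.3, 0.0],    # looks like normal content
                      [6.0, 5.5, -4.0, 7.0]])   # far outside the training data
print(detector.predict(new_items))  # +1 = looks normal, -1 = flag for review
```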
The tools they are using
The Curately team uses and experiments with traditional frequentist machine learning techniques and practices (e.g. deep neural networks, convolutional neural networks, recurrent neural networks and deep generative models), with a focus on linking them with probabilistic graphical methods. In general, they welcome any approach in accordance with the “no free lunch” theorem, as long as it adheres to the principle of parsimony and generalises well over the true underlying data-generating distributions. The tools they use include:
- Python, C, SQL
- Pandas, Matplotlib, Numpy, Scikit-learn, Pillow, Tensorflow, Pgmpy
- CUDA , and GPU-accelerated libraries such as cuDNN
Recommended data science resources
Books
“There has recently been an explosion in books published by O’Reilly Media,” Jaco says. “These books make for an easy read and some include easy-to-follow practical examples and coding exercises.”
Apart from these books, Jaco also recommends the following:
- Hands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron provides a great overview of the field.
- Bayesian Reasoning and Machine Learning by David Barber
- Probabilistic Graphical Models: Principles and Techniques by Daphne Koller and Nir Friedman
- Pattern Recognition and Machine Learning by Christopher Bishop
- Causality: Models, Reasoning and Inference by Judea Pearl
- Deep Learning by Ian Goodfellow, Yoshua Bengio and Aaron Courville.
- The Personal MBA by Josh Kaufman provides a clear and comprehensive overview of the most important business concepts that a good data scientist should know. “I found many of the mental models discussed in the book useful in asking more relevant business questions, and being more aware of typical cognitive biases,” Jaco says.
- Analyzing the Analyzers: An Introspective Survey of Data Scientists and Their Work by Harlan Harris, Sean Murphy and Marck Vaisman. The book is the result of a survey of several hundred data scientists in mid-2012 about how they view their skills, careers, and experiences with prospective employers.
Online courses and videos
- Machine Learning by Stanford University via Coursera
- Machine Learning by Columbia University via edX
- Probabilistic Graphical Models by Stanford University via Coursera
- Neural Networks for Machine Learning by the University of Toronto via Coursera
- Building a Data Science Team by Johns Hopkins University via Coursera
- Bill Howe’s Data Science Courses by University of Washington via Coursera
- Videolectures.net covers many topics in more detail
- Nando de Freitas also shares his classes on YouTube, covering a wide range of topics in machine learning with detailed mathematical explanations
- Artificial Intelligence Nanodegree via Udacity
- Kaggle is fun if you want to practice machine learning
- Deep learning via Coursera