Module 2 Business Problems and Data Science Solutions

2.1 Opening Business Case

In the following video you’ll hear an expert’s thoughts on the relation between business problems and data challenges.

Assignments:

  • Write down all associations you can discover between the study material (Provost Ch.2) and the video.
  • What is the relation between business problems and data challenges?
  • What questions could be asked during data understanding?

2.2 Supervised, Unsupervised and Reinforcement Methods

2.2.1 Supervised vs Unsupervised Methods

2.2.2 Reinforcement Methods

2.3 The Data Mining Process

2.3.1 Formulating a business problem

Business Case: Recognizing handwritten digits

Table 2.1: ECLIPSE questions taken from Wildridge and Bell (2002)
Mnemonic Question Answer
E Expectation: Why does the user want the information? A utility company wants to scan the electricity meter readings it receives from clients through paper forms and save them to a database
C Client Group: For whom is the service / product intended? The Billing Department
L Location: Where is the service / product sited? This is an internal process
I Impact: What would represent success? How is this measured? To reach at least human accuracy
P Professionals: Who provides or improves the service / product? At the moment this is done by hand
SE Service: What type of service / product is under consideration? The readings are used to prepare invoices

2.3.2 Data Understanding

We are going to train a model on data from the MNIST database (LeCun 1998). The original dataset contains 60000 images of handwritten digits. The size of each images is 28x28 pixel and each pixel has a grayscale value between 0 (white) en 255 (black).

Example taken from @robinson_exploring_2018

Figure 2.2: Example taken from Robinson (2018)

2.3.3 Data Preparation

How much gray is there in the set of images?

Example Mnist Robinson (2018)
Example taken from @robinson_exploring_2018

Figure 2.3: Example taken from Robinson (2018)

Solution

Before checking, analyze the chart and draw your own conclusions.


Most pixels in the dataset are completely white, along with another set of pixels that are completely dark, with relatively few in between.

Think: Let’s assume the images were black and whit photographs of landscapes or faces. Would you expect the same color distribution as above?


How much variability is there within each digit label? Do all 3s look like each other, and what is the “most typical” example of a 6?

To answer this, we can find the mean value for each position within each label and visualize these average digits as ten separate facets. These averaged images are called centroids. We’re treating each image as a 784-dimensional point (28 by 28), and then taking the average of all points in each dimension individually.

Example taken from @robinson_exploring_2018

Figure 2.4: Example taken from Robinson (2018)

Solution

Before checking, analyze the chart and draw your own conclusions.


Already we have some suspicions about which digits might be easier to separate. Distinguishing 0 and 1 looks pretty straightforward: you could pick a few pixels at the center (always dark in 1 but not 0), or at the left and right edges (often dark in 0 but not 1), and you’d have a pretty great classifier. Pairs like 4/9, or 3/8, have a lot more overlap and will be a more challenging problem.

Which digits have more variability on average?

So far, this machine learning problem might seem a bit easy: we have some very “typical” versions of each digit. But one of the reasons classification can be challenging is that some digits will fall widely outside the norm. It’s useful to explore atypical cases, since it could help us understand why the method fails and help us choose a method and engineer features.

In this case, we could consider the Euclidean distance (square root of the sum of squares) of each image to its label’s centroid.

Example taken from @robinson_exploring_2018

Figure 2.5: Example taken from Robinson (2018)

Solution

Before checking, analyze the chart and draw your own conclusions.


It looks like 1s have especially low distances to their centroid: for the most part there’s not a ton of variability in how people draw that digit. It looks like the most variability by this measure are in 0s and 2s. But every digit has at least a few cases with an unusually large distance from their centroid. I wonder what those look like?

To discover this, we can visualize the six digit instances that had the least resemblance to their central digit.

Example taken from @robinson_exploring_2018

Figure 2.6: Example taken from Robinson (2018)

This is a useful way to understand what kinds of problems the data could have. For instance, while most 1s looked the same, they could be drawn diagonally, or with a flat line and a flag on top. A 7 could be drawn with a bar in the middle. (And what is up with that 9 on the lower left?)

This also gives us a realistic sense of how accurate our classifier can get. Even humans would have a hard time classifying some of these sloppy digits, so we can’t expect a 100% success rate. Conversely, if one of our classifiers does get a 100% success rate, we should examine whether we’re overfitting!

How easy it is to tell pairs of digits apart?

To examine this, we could try overlapping pairs of our centroid digits, and taking the difference between them. If two centroids have very little overlap, this means they’ll probably be easy to distinguish.

Example taken from @robinson_exploring_2018

Figure 2.7: Example taken from Robinson (2018)

Solution

Before checking, analyze the chart and draw your own conclusions.


Pairs with very red or very blue regions will be easy to classify, since they describe features that divide the datasets neatly. This confirms our suspicion about 0/1 being easy to classify: it has substantial regions than are deeply red or blue.

Comparisons that are largely white may be more difficult. We can see 4/9 looks pretty challenging, which makes sense (a handwritten 4 and 9 really differ only by a small region at the top). 7/9 shows a similar challenge.

2.3.4 Modeling and Evaluation

For modeling we can use a neural network. Follow this link and let thee model run. When finished you’ll see a table resembling the following:

Example Machine Learning output

Figure 2.8: Example Machine Learning output

How well does the model perform?

2.3.5 Deployment

MNIST Handwriting

2.4 Data Formats

2.4.1 SQL vs NoSQL


If you work with Excel or store your data in MySQL, you are used to storing data in rows and columns. For example, if you store customer data:

Table 2.2: A data table
First Last
Witold ten Hove
Caroline Tensen

Each line is a customer, each column a variable, and each cell contains one value.

However, suppose you want to store all of your customers’ phone numbers and some customers have multiple phone numbers. One option is to store all telephone numbers in one cell:

Table 2.3: A data-table with mutiple values in one cell
First Last Telephone
Witek ten Hove 06111111; 06222222
Caroline Tensen 06333333

However, this can cause a problem when exporting or searching the data. As soon as your data seems to become asymmetric, you should consider a NoSQL format. The most important formats in terms of popularity are XML and JSON.

2.4.1.1 XML

By annotating data, meaning can be added to individual data elements. One data format that makes use of this is Extensible Markup Language (XML). For example, in our example, a tag can be added to the data relating to the first name and each telephone number can be individually identified as such.

## 
## <customers>
##   <customer>
##     <First>Witek</First>
##     <Last>ten Hove</Last>
##     <Telephone>06111111</Telephone>
##     <Telephone>06222222</Telephone>
##   </customer>
##   <customer>
##     <First>Caroline</First>
##     <Last>Tensen</Last>
##     <Telephone>06333333</Telephone>
##   </customer>
## </customers>

Using this method we can easily add data. For instance a second name.

## 
## <customers>
##   <customer>
##     <First>Witold</First>
##     <First>Frederik</First>
##     <Last>ten Hove</Last>
##     <Telephone>06111111</Telephone>
##     <Telephone>06222222</Telephone>
##   </customer>
##   <customer>
##     <First>Caroline</First>
##     <First>Egbertine</First>
##     <Last>Tensen</Last>
##     <Telephone>06333333</Telephone>
##   </customer>
## </customers>

2.4.1.2 JSON

For a long time XML was the leading data format for web applications. However is has currently been overtaken by Javascript Object Notation (JSON).

Source: <a href="https://insights.stackoverflow.com/trends?tags=json%2Cxml" target="_blank">Stack Overflow</a>

Figure 2.9: Source: Stack Overflow


Figure 2.10: Percentage web searches JSON vs XML
Source: Google Trends

The main reason for this change has been the increase in popularity of so-called Application Programming Interfaces (APIs). Many online service providers offer also the possibility to access selected data. Most modern web applications are written using Javascript and can process JSON natively.

This concept simplifies the integration of various data sources dramatically. For instance one application could combine tweets from Twitter with weather forecasts and with news from CNN.

JSON works through so-called key: value sets:

## [
##   {
##     "First": "Witold",
##     "Last": "ten Hove",
##     "Telephone": "06111111"
##   },
##   {
##     "First": "Caroline",
##     "Last": "Tefsen",
##     "Telephone": "06333333"
##   }
## ]

If a value consists of many elements these are stored in arrays.

## [
##   {
##     "First": ["Witold", "Frederik"],
##     "Last": "ten Hove",
##     "Telephone": ["06111111", "06222222"]
##   },
##   {
##     "First": ["Caroline", "Egbertine"],
##     "Last": "Tefsen",
##     "Telephone": "06333333"
##   }
## ]

The arrays are indexed, so each element can be easily retrieved.

2.4.2 Graph

2.5 Business Case

Big Data is not reserved for Big Business. In the following video you’ll see a case of a small retailer using data to improve his business decisions.

Answer the following questions. Use the information from the course literature as a starting point for your argumentation. Find other reliable sources that support your analysis.

  • What was the business problem?
  • How was it translated into a data mining problem?
  • What is the realized business value?

References

LeCun, Yann. 1998. “The MNIST Database of Handwritten Digits.” Http://Yann. Lecun. Com/Exdb/Mnist/.
Robinson, David. 2018. “Exploring Handwritten Digit Classification: A Tidy Analysis of the MNIST Dataset.” R-Bloggers. https://www.r-bloggers.com/2018/01/exploring-handwritten-digit-classification-a-tidy-analysis-of-the-mnist-dataset/.
Wildridge, Valerie, and Lucy Bell. 2002. “How CLIP Became ECLIPSE: A Mnemonic to Assist in Searching for Health Policy/Management Information.” Health Information & Libraries Journal 19 (2): 113–15. https://doi.org/https://doi.org/10.1046/j.1471-1842.2002.00378.x.