2  Data Understanding

Links:

  1. www.kaggle.com/
  2. datasetsearch.research.google…
  3. data.fivethirtyeight.com/
  4. data.gov/
  5. github.com/search?q=dataset
  6. data.nasa.gov/
  7. selected datasets
Lesson outline

| Topic | Tasks | Activities | Student | Teacher |
| --- | --- | --- | --- | --- |
| 1. Finding Data | Explore the different sources of data that can be used in data mining, and how to access and extract this data. | Think-Pair-Share: students individually brainstorm potential sources of data, pair up with a partner to discuss, and then share with the class. | ‘We learned about various data sources and the perspectives of different students during the brainstorming activity.’ | ‘Our objective here is to generate a list of possible data sources we can use for data mining. As a teacher, I want you to participate actively in brainstorming and support each other’s thoughts. As students, you will be able to collaborate and gain insights from your peers.’ |
| 2. Descriptive Statistics | Calculate the basic descriptive statistics commonly used in data mining, and understand how they summarize datasets. | Jigsaw: students are grouped into teams and tasked with gathering data from various sources, computing descriptive statistics, and reporting their findings to the rest of the class. | ‘We learned the importance of teamwork, critical thinking, and communication skills by working together to compute descriptive statistics on our assigned dataset.’ | ‘The goal here is to give every student a chance to delve deeper into specific aspects of data mining. As a teacher, my role is to facilitate the groups and ensure everyone participates. As students, you are expected to synthesize, analyze, and present your findings through a collaborative effort.’ |

ETL stands for Extract, Transform, and Load: the process used in data integration and data warehousing to move data from diverse sources into a single target system. The “Extract” phase gathers data from multiple sources, such as databases, spreadsheets, or APIs. In the “Transform” phase, the data is cleaned, validated, and standardized to ensure consistency and quality. In the “Load” phase, the transformed data is written into a target system, such as a data warehouse or a business intelligence tool, where it becomes accessible for analysis and reporting. ETL is a crucial step in data management because it ensures that decisions rest on accurate, usable data.
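To make the three phases concrete, here is a minimal pandas sketch. The file name sales_raw.csv, the order_id column, and the SQLite target warehouse.db are hypothetical stand-ins, not part of the course material:

import sqlite3
import pandas as pd

# Extract: gather raw records from a source file
# ("sales_raw.csv" is a hypothetical example path)
raw = pd.read_csv("sales_raw.csv")

# Transform: clean and standardize so downstream tools
# receive consistent, validated data
clean = (
    raw
    .rename(columns=str.lower)      # uniform column names
    .drop_duplicates()              # remove exact duplicate rows
    .dropna(subset=["order_id"])    # hypothetical key column must be present
)

# Load: write the transformed data into a target store
# (a local SQLite database standing in for a warehouse)
with sqlite3.connect("warehouse.db") as conn:
    clean.to_sql("sales", conn, if_exists="replace", index=False)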

Once you have accessed your dataset, you’ll want to become familiar with its content and gain insight into its quality and structure. Data analysts or data scientists collect and examine the data to understand its relevance to the project’s goals. They explore it using techniques such as descriptive statistics, data visualization, and data profiling. The goal is to identify patterns, relationships, and potential issues within the dataset, which helps in formulating initial hypotheses and refining the project’s objectives.
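For a first look, a profiling pass along these lines can help. This is a minimal sketch, assuming the ai.csv dataset that is also used in the ETL example below (the column names, such as Domain, come from that file):

import pandas as pd

# Same course dataset as in the ETL example below
url = "https://raw.githubusercontent.com/businessdatasolutions/courses/main/datamining-n/datasets/ai.csv"
df = pd.read_csv(url)

# Structure: column names, dtypes, and non-null counts
df.info()

# Descriptive statistics for the numeric columns
print(df.describe())

# Profiling: missing values per column and record counts per domain
print(df.isna().sum())
print(df["Domain"].value_counts())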

Below is another deliberately trivial example, this time running through the three ETL steps end to end.

import pandas as pd
import plotly.express as px

# Extract: read the raw CSV from the course repository
url = "https://raw.githubusercontent.com/businessdatasolutions/courses/main/datamining-n/datasets/ai.csv"
rawDF = pd.read_csv(url)

# Transform: order the records by year and domain
rawDF = rawDF.sort_values(by=['year', 'Domain'])

# Load: feed the prepared data into a visualization tool
fig = px.scatter(
    rawDF,
    x="year",
    y="Training_computation_petaflop",
    color="Domain",
    log_y=True,
    hover_name="Entity",
    title="Computation used to train notable artificial intelligence systems",
)
fig.update_traces(marker={'size': 12})
fig.show()
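The logarithmic y-axis (log_y=True) is chosen because the computation used to train these systems spans many orders of magnitude; fig.show() renders the interactive chart.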