NVIDIA Deep Learning Institute Releases Accelerated Data Science Teaching Kit

The NVIDIA Deep Learning Institute (DLI) recently released the Accelerated Data Science Teaching Kit, co-developed with Professor Polo Chau from Georgia Institute of Technology and Professor Xishuang Dong from Prairie View A&M University. 

The comprehensive teaching materials cover fundamental and advanced topics in data collection and preprocessing, accelerated data science with RAPIDS, scalable and distributed computing, GPU-accelerated machine learning, and data visualization and graph analytics. They address the growing need to teach data science skills to students in higher education and research institutions.

The Accelerated Data Science Teaching Kit includes focused modules covering:

  • Introduction to Data Science and RAPIDS
  • Data Collection and Preprocessing (ETL)
  • Data Ethics and Bias in Data Sets
  • Data Integration and Analytics
  • Data Visualization
  • Scalable Computing with Hadoop, Hive, Spark, HBase and RAPIDS
  • Scalable Computing with Dask and UCX
  • Machine Learning: Classification
  • Machine Learning: Clustering and Dimensionality Reduction
  • Neural Networks
  • Graph Analytics 
  • Streaming Data 
  • Genomics 
  • Text Analytics 
  • CPU vs GPU-Accelerated Data Science 
  • Data Science Teams, Code Back-up and Version Control 
  • Team Project (Fake News Detection)

The kit also covers culturally responsive topics such as fairness and data bias, as well as challenges faced by, and important figures from, underrepresented groups.

The kit includes lecture slides and notes, hands-on labs, Jupyter notebooks, solutions (held in a private repo), sample data sets, quiz and exam questions with answers, GPU compute resources via free AWS cloud credits, and free DLI online courses and certificates. Lecture videos are planned for the next release.

The RAPIDS data science framework is a collection of GPU-accelerated libraries for executing end-to-end data science pipelines entirely on the GPU. The primary objective of using RAPIDS is to accelerate the individual parts of a typical data science workflow, and thereby the complete end-to-end workflow across data preparation and machine learning.

One of the first Jupyter notebook-based labs has students dive right into RAPIDS using pandas and cuDF. pandas is a data analysis and manipulation tool built on top of the Python programming language for tasks such as loading, joining, aggregating, and filtering data. cuDF is the RAPIDS GPU DataFrame library, which offers similar functionality with GPU acceleration.

Students are first tasked with understanding how to create DataFrame objects in cuDF, assigning values to those objects, and then calling methods and applying user-defined functions on the values. Once students grasp working with cuDF DataFrames, they are tasked with creating one from a Netflix movie dataset from Kaggle. 
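The steps described above can be sketched roughly as follows. This is a minimal illustration and not the lab's actual notebook: the column names, the values, and the `is_feature_length` helper are made up for the example, and the `xdf` alias with a pandas fallback is an assumption added so the snippet also runs on a machine without a GPU.

```python
# Minimal sketch: create a DataFrame, assign values, call methods, apply a UDF.
# cuDF mirrors much of the pandas API, so the same calls work with either one.
try:
    import cudf as xdf  # GPU DataFrame library (needs a CUDA-capable GPU)
except Exception:
    import pandas as xdf  # CPU fallback for machines without RAPIDS

# Create a DataFrame object from a dictionary of columns (hypothetical data).
df = xdf.DataFrame(
    {"title_id": [1, 2, 3, 4], "duration_min": [90, 120, 45, 150]}
)

# Assign values: add a new column derived from an existing one.
df["duration_hr"] = df["duration_min"] / 60.0


def is_feature_length(minutes):
    """Hypothetical user-defined function: True for 60 minutes or longer."""
    return minutes >= 60


# Apply the user-defined function to every value in a column.
df["feature_length"] = df["duration_min"].apply(is_feature_length)

# Call built-in methods on the columns.
mean_duration = float(df["duration_min"].mean())
n_feature = int(df["feature_length"].sum())

print(df)
print("mean duration (min):", mean_duration)
print("feature-length titles:", n_feature)
```

Because only the import line differs, the same cell can be timed once with pandas and once with cuDF to compare CPU and GPU execution.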

Figure 1. Snapshot of Teaching Kit Module 1: Intro to RAPIDS Lab.

From there, students learn how to manipulate and interrogate the data: dropping columns and missing values, querying, finding unique values, sorting, counting, and grouping. Students get a feel for how fast and easy these operations are with RAPIDS and GPUs compared with the traditional methods also covered in the Teaching Kit. As a bonus task, students use cuDF one-hot encoding to convert the data set’s movie and TV show titles into vectors of 0s and 1s to improve the accuracy of analyzing the data.
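A short sketch of these operations is shown here. It uses a tiny made-up table rather than the Kaggle data set, and all column names and values are assumptions for illustration. The lab encodes titles; this example one-hot encodes a small `type` column instead so the output stays readable, and it filters with a boolean mask as one simple form of querying. As before, the pandas fallback is an addition so the snippet runs without a GPU.

```python
# Minimal sketch of common DataFrame operations plus one-hot encoding.
try:
    import cudf as xdf  # GPU DataFrame library (needs a CUDA-capable GPU)
except Exception:
    import pandas as xdf  # CPU fallback for machines without RAPIDS

# Hypothetical stand-in for the movie data set; one row has a missing type.
df = xdf.DataFrame(
    {
        "title": ["A", "B", "C", "D", "E"],
        "type": ["Movie", "TV Show", "Movie", None, "Movie"],
        "release_year": [2019, 2020, 2018, 2020, 2020],
    }
)

# Drop rows that contain missing values.
clean = df.dropna()

# Query: keep only rows released in 2019 or later (boolean-mask filter).
recent = clean[clean["release_year"] >= 2019]

# Find the unique values in a column.
kinds = clean["type"].unique()

# Sort by release year, newest first.
by_year = clean.sort_values("release_year", ascending=False)

# Count and group: number of titles per type.
counts = clean.groupby("type")["title"].count()
n_movies = int((clean["type"] == "Movie").sum())

# One-hot encode the categorical column into indicator (0/1) columns.
encoded = xdf.get_dummies(clean, columns=["type"])

print(recent)
print(counts)
print(encoded)
```

Each indicator column produced by `get_dummies` corresponds to one category, which is the numeric form most machine learning algorithms expect.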

“Data Science unlocks the immense potential of data in solving societal challenges and large-scale complex problems across virtually every domain, from business, technology, science, engineering, healthcare, to government, and many more,” Professor Chau said. “As data continues to grow in volume, velocity and complexity, there is an ever-increasing demand for data science talent and skill sets to help design the best solutions.”

This is the fourth Teaching Kit in the existing program, which serves more than 8,000 qualified educators.

To learn more about the Data Science Teaching Kit, listen to the NVIDIA On-Demand session where the co-developers, Professors Polo Chau and Xishuang Dong, share how they leverage the content and GPU resources for their classes at Georgia Tech and Prairie View A&M University.

Get started with NVIDIA Teaching Kits >> 
