Dominic Nabiga

How to Use ChatGPT As A Data Scientist

Introduction

The whole world has heard of ChatGPT, yet according to a recent study, only 2% of people use it daily. For data scientists, integrating ChatGPT into their daily workflow can significantly enhance productivity. This article explores various ways data scientists can use ChatGPT to improve their output.

Studying & Learning

One of the most valuable features of ChatGPT is its ability to simplify complex concepts. Using the prompt “explain like I’m 5” (ELI5), data scientists can grasp complicated topics in an easy-to-understand manner. For instance, asking ChatGPT to explain recurrent neural networks (RNNs) in simple terms helps build intuition.

Example

Prompt: Explain to me recurrent neural networks like I’m 5 years old.

Response:

By providing such intuitive explanations, ChatGPT acts like a personal tutor, saving time and helping data scientists understand concepts deeply.
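
To connect that intuition to code, the recurrence at the heart of an RNN can be sketched in a few lines. This is a minimal scalar sketch, not ChatGPT's response and not a trainable model: the function name `rnn_forward` and the weights `w_x`, `w_h`, and `b` are hypothetical values chosen purely for illustration.

```python
import math


def rnn_forward(inputs, w_x=0.5, w_h=0.8, b=0.0):
    """Run a one-unit recurrent cell over a sequence of numbers.

    The weights are illustrative constants, not learned values.
    """
    hidden = 0.0  # the cell's "memory" starts empty
    states = []
    for x in inputs:
        # The new memory mixes the current input with the previous memory.
        hidden = math.tanh(w_x * x + w_h * hidden + b)
        states.append(hidden)
    return states


# The same numbers in a different order leave a different final memory,
# which is what lets a recurrent network model sequences.
forward = rnn_forward([1.0, 0.0, 0.0])
backward = rnn_forward([0.0, 0.0, 1.0])
```

In `forward` the early input has faded by the last step, while in `backward` it has only just arrived, so the two final states differ.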

Unit Tests

Writing unit tests can be tedious, but they are essential for robust code. ChatGPT can speed this up by generating a test template that you then extend with your own edge cases.

Example

Function:

def fibonacci_sequence(n):
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]

    sequence = [0, 1]

    for i in range(2, n):
        next_fibonacci_number = sequence[-1] + sequence[-2]
        sequence.append(next_fibonacci_number)

    return sequence

Prompt: Write unit tests for the Fibonacci sequence function.

Generated Unit Tests:

import unittest

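# Function under test, repeated from above so this file runs on its own.
def fibonacci_sequence(n):
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]

    sequence = [0, 1]

    for i in range(2, n):
        next_fibonacci_number = sequence[-1] + sequence[-2]
        sequence.append(next_fibonacci_number)

    return sequence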

class TestFibonacciSequence(unittest.TestCase):
    def test_negative_input(self):
        self.assertEqual(fibonacci_sequence(-1), [])

    def test_zero_input(self):
        self.assertEqual(fibonacci_sequence(0), [])

    def test_one_input(self):
        self.assertEqual(fibonacci_sequence(1), [0])

    def test_two_input(self):
        self.assertEqual(fibonacci_sequence(2), [0, 1])

    def test_five_input(self):
        self.assertEqual(fibonacci_sequence(5), [0, 1, 1, 2, 3])

    def test_ten_input(self):
        self.assertEqual(fibonacci_sequence(10), [0, 1, 1, 2, 3, 5, 8, 13, 21, 34])

if __name__ == '__main__':
    unittest.main()

All six tests pass against the function above. Generating the scaffold this way saves considerable setup time and leaves more room for thinking about the edge cases that matter.
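
As one possible way to extend a generated template, the sketch below adds property-style checks and an invalid-input case. The class name `TestFibonacciProperties` and the choice of checks are assumptions made for illustration, not part of ChatGPT's output.

```python
import unittest


def fibonacci_sequence(n):
    """Function under test, copied from the example above."""
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]

    sequence = [0, 1]
    for i in range(2, n):
        sequence.append(sequence[-1] + sequence[-2])
    return sequence


class TestFibonacciProperties(unittest.TestCase):
    def test_length_matches_n(self):
        # The function should return exactly n terms for any n >= 0.
        for n in range(0, 20):
            with self.subTest(n=n):
                self.assertEqual(len(fibonacci_sequence(n)), n)

    def test_each_term_is_sum_of_previous_two(self):
        sequence = fibonacci_sequence(30)
        for i in range(2, len(sequence)):
            with self.subTest(i=i):
                self.assertEqual(sequence[i], sequence[i - 1] + sequence[i - 2])

    def test_non_integer_input_raises(self):
        # range() rejects floats, so n=3.5 raises TypeError inside the loop.
        with self.assertRaises(TypeError):
            fibonacci_sequence(3.5)
```

Saved in a test file, these run with `python -m unittest`. The `subTest` context manager reports each failing `n` separately instead of stopping at the first one.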

Creating Visualizations

Creating detailed visualizations can be time-consuming. ChatGPT simplifies this by generating the necessary Python code to create plots.

Example

Using data from Kaggle, data scientists can input the dataset into ChatGPT and request visualizations.

Prompt: Create a bar chart of show_id by country.

Input Data:

Generated Plot and Code:

import matplotlib.pyplot as plt
import pandas as pd

data = pd.read_csv('path_to_your_data.csv')

data['country'].value_counts().head(10).plot(kind='bar')
plt.title('Show ID by Country')
plt.xlabel('Country')
plt.ylabel('Show ID Count')
plt.show()

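Refining the prompt to focus on the top 10 countries improves the plot. The code above reflects that refinement through .head(10), which keeps only the 10 most frequent countries and so keeps the x-axis readable.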
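
A quick way to sanity-check a generated chart is to compute the same counts without plotting. The sketch below uses only the standard library. The function name `top_countries` and the inline sample rows are made up for illustration and are not taken from the Kaggle dataset.

```python
import csv
import io
from collections import Counter


def top_countries(csv_file, n=10):
    """Count rows per country and return the n most common as (country, count)."""
    reader = csv.DictReader(csv_file)
    # Skip rows with a blank country, much as pandas' value_counts() skips NaN.
    counts = Counter(row["country"] for row in reader if row["country"])
    return counts.most_common(n)


# Made-up sample rows with the two columns the prompt mentions.
sample = io.StringIO(
    "show_id,country\n"
    "s1,India\n"
    "s2,Japan\n"
    "s3,India\n"
    "s4,\n"
    "s5,India\n"
    "s6,Japan\n"
)
print(top_countries(sample, n=2))  # [('India', 3), ('Japan', 2)]
```

If the bar heights in the plot disagree with these counts, the generated code is probably aggregating something other than what the prompt intended.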

Refactoring Code

Refactoring code to adhere to best practices is another area where ChatGPT excels. Data scientists can input their code and ask ChatGPT to make it “production standard.”

Example

Original Code:

import plotly.express as px
import pandas as pd
import os
import numpy as np

data = pd.read_csv('AirPassengers.csv')

def plotting(title, data, x, y, save_file_path, x_label, y_label):
    fig = px.line(data, x=data[x], y=data[y], labels={x: x_label, y: y_label})
    fig.update_layout(template="simple_white", font=dict(size=18),
                      title_text=title, width=650,
                      title_x=0.5, height=400)
    if not os.path.exists("../images"):
        os.mkdir("../images")
    fig.write_image("../images/" + str(save_file_path))
    fig.show()

plotting('Airline Passengers', data, 'Month', '#Passengers', 'passengers.png', 'Date', 'Passengers')
data["Passenger_Diff"] = data["#Passengers"].diff()
plotting('Airline Passengers', data, 'Month', 'Passenger_Diff', 'passengers_one_difference.png', 'Date', 'Passengers<br>Difference Transform')
data["Passenger_Log"] = np.log(data["#Passengers"])
plotting('Airline Passengers', data, 'Month', 'Passenger_Log', 'passenger_log.png', 'Date', 'Passengers<br>Log Transform')
data["Passenger_Diff_Log"] = data["Passenger_Log"].diff()
plotting('Airline Passengers', data, 'Month', 'Passenger_Diff_Log', 'passenger_difference_and_log.png', 'Date', 'Passengers<br>Log and Difference')

Refactored Code:

import plotly.express as px
import pandas as pd
import os
import numpy as np

data = pd.read_csv('AirPassengers.csv')

def plot_data(title, data, x, y, save_file_path, x_label, y_label):
    fig = px.line(data, x=x, y=y, labels={x: x_label, y: y_label})
    fig.update_layout(template="simple_white", font=dict(size=18),
                      title_text=title, width=650,
                      title_x=0.5, height=400)
    os.makedirs("../images", exist_ok=True)
    fig.write_image(f"../images/{save_file_path}")
    fig.show()

plot_data('Airline Passengers', data, 'Month', '#Passengers', 'passengers.png', 'Date', 'Passengers')

# Each entry: (column name, transformed series, output file, y-axis label).
# Labels match the original code ('Passengers', not 'Passenger').
passengers = data["#Passengers"]
transforms = [
    ("Passenger_Diff", passengers.diff(),
     'passengers_one_difference.png', 'Passengers<br>Difference Transform'),
    ("Passenger_Log", np.log(passengers),
     'passenger_log.png', 'Passengers<br>Log Transform'),
    ("Passenger_Diff_Log", np.log(passengers).diff(),
     'passenger_difference_and_log.png', 'Passengers<br>Log and Difference'),
]

for col_name, transform, file_name, y_label in transforms:
    data[col_name] = transform
    plot_data('Airline Passengers', data, 'Month', col_name, file_name, 'Date', y_label)

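The refactored version adheres to the DRY (Don’t Repeat Yourself) principle by looping over a list of transforms instead of repeating near-identical calls. It also replaces the manual directory check with os.makedirs(..., exist_ok=True), passes column names to px.line rather than whole Series, and moves the code closer to PEP 8 conventions, making it more maintainable and professional.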
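
For readers less familiar with the three transforms the loop plots, this standard-library sketch shows what each one computes. The helper names `difference` and `log_transform` are hypothetical, and the sample values are illustrative rather than read from AirPassengers.csv.

```python
import math


def difference(values):
    """First difference; the first entry is NaN, as with pandas' Series.diff()."""
    return [math.nan] + [curr - prev for prev, curr in zip(values, values[1:])]


def log_transform(values):
    """Natural log of each value, as np.log does element-wise."""
    return [math.log(v) for v in values]


passengers = [100, 110, 121]  # illustrative values growing 10% per step

passenger_diff = difference(passengers)          # raw steps: nan, 10, 11
passenger_log = log_transform(passengers)
passenger_diff_log = difference(passenger_log)   # each step is about log(1.1)
```

With steady percentage growth the raw differences keep increasing, while the differences of the logs stay roughly constant, which is the usual reason for combining the two transforms on a series like this.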

Summary & Further Thoughts

ChatGPT can significantly boost a data scientist's productivity in areas such as learning new concepts, writing unit tests, creating visualizations, and refactoring code. It is a powerful tool that, when integrated into the daily workflow, can save time and enhance the quality of work.

