Introduction
The whole world has heard of ChatGPT, yet according to a recent study, only 2% of people use it daily. For data scientists, integrating ChatGPT into their daily workflow can significantly enhance productivity. This article explores various ways data scientists can use ChatGPT to improve their output.
Studying & Learning
One of the most valuable features of ChatGPT is its ability to simplify complex concepts. Using the prompt “explain like I’m 5” (ELI5), data scientists can grasp complicated topics in an easy-to-understand manner. For instance, asking ChatGPT to explain recurrent neural networks (RNNs) in simple terms helps build intuition.
Example
Prompt: Explain to me recurrent neural networks like I’m 5 years old.
Response: [image of ChatGPT's explanation not included]
By providing such intuitive explanations, ChatGPT acts like a personal tutor, saving time and helping data scientists understand concepts deeply.
Unit Tests
Writing unit tests can be tedious, but they are essential for ensuring robust code. ChatGPT can expedite this process by generating testing templates that can be customized for specific edge cases.
Example
Function:
def fibonacci_sequence(n):
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]
    sequence = [0, 1]
    for i in range(2, n):
        next_fibonacci_number = sequence[-1] + sequence[-2]
        sequence.append(next_fibonacci_number)
    return sequence
Prompt: Write unit tests for the Fibonacci sequence function.
Generated Unit Tests:
import unittest

# fibonacci_sequence is the function shown above; define or import it
# here before running the tests.

class TestFibonacciSequence(unittest.TestCase):
    def test_negative_input(self):
        self.assertEqual(fibonacci_sequence(-1), [])

    def test_zero_input(self):
        self.assertEqual(fibonacci_sequence(0), [])

    def test_one_input(self):
        self.assertEqual(fibonacci_sequence(1), [0])

    def test_two_input(self):
        self.assertEqual(fibonacci_sequence(2), [0, 1])

    def test_five_input(self):
        self.assertEqual(fibonacci_sequence(5), [0, 1, 1, 2, 3])

    def test_ten_input(self):
        self.assertEqual(fibonacci_sequence(10), [0, 1, 1, 2, 3, 5, 8, 13, 21, 34])

if __name__ == '__main__':
    unittest.main()
Running these tests confirms that the function returns the expected output for each case, and starting from a generated template saves considerable setup time.
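As a rough illustration of customizing a generated template for specific edge cases, the sketch below adds a few checks that go beyond fixed expected lists. The class name TestFibonacciEdgeCases and the helper run_edge_case_tests are hypothetical names chosen for this example, and the function is repeated only so the snippet runs on its own.

```python
import unittest


def fibonacci_sequence(n):
    """Same function as in the example above, repeated so this sketch is self-contained."""
    if n <= 0:
        return []
    elif n == 1:
        return [0]
    elif n == 2:
        return [0, 1]
    sequence = [0, 1]
    for i in range(2, n):
        sequence.append(sequence[-1] + sequence[-2])
    return sequence


class TestFibonacciEdgeCases(unittest.TestCase):
    """Hypothetical extra checks added on top of the generated template."""

    def test_length_matches_n(self):
        # The returned list should contain exactly n numbers.
        for n in (0, 1, 2, 3, 20):
            with self.subTest(n=n):
                self.assertEqual(len(fibonacci_sequence(n)), n)

    def test_recurrence_holds(self):
        # Every element from index 2 on is the sum of the two before it.
        seq = fibonacci_sequence(30)
        for i in range(2, len(seq)):
            self.assertEqual(seq[i], seq[i - 1] + seq[i - 2])

    def test_large_negative_input(self):
        # Any non-positive n should give an empty list.
        self.assertEqual(fibonacci_sequence(-1000), [])


def run_edge_case_tests():
    """Run the edge-case tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestFibonacciEdgeCases)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_edge_case_tests() runs the three tests and returns a result whose wasSuccessful() method reports whether they all passed. Property-style checks like the recurrence test catch errors that a handful of hard-coded lists might miss.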
Creating Visualizations
Creating detailed visualizations can be time-consuming. ChatGPT simplifies this by generating the necessary Python code to create plots.
Example
Using data from Kaggle, data scientists can input the dataset into ChatGPT and request visualizations.
Prompt: Create a bar chart of show_id by country.
Input Data: [image of dataset preview not included]
Generated Plot and Code: [plot image not included]
import matplotlib.pyplot as plt
import pandas as pd
data = pd.read_csv('path_to_your_data.csv')
data['country'].value_counts().head(10).plot(kind='bar')
plt.title('Show ID by Country')
plt.xlabel('Country')
plt.ylabel('Show ID Count')
plt.show()
Refining the prompt to focus on the top 10 countries, which is what the .head(10) call in the code above does, keeps the chart readable when the dataset covers many countries.
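To make clear what the bars in such a chart represent, here is a minimal sketch of the same "count rows per country, keep the top N" aggregation in plain Python. It only stands in for the pandas call, not for the plotting; the function name top_countries and the sample rows are hypothetical and are not taken from the Kaggle dataset.

```python
from collections import Counter


def top_countries(rows, n=10):
    """Return the n most frequent countries as (country, count) pairs.

    Rows with a missing or empty country are skipped.
    """
    counts = Counter(row["country"] for row in rows if row.get("country"))
    return counts.most_common(n)


# Hypothetical sample rows standing in for the real dataset.
sample_rows = [
    {"show_id": "s1", "country": "United States"},
    {"show_id": "s2", "country": "India"},
    {"show_id": "s3", "country": "United States"},
    {"show_id": "s4", "country": None},
    {"show_id": "s5", "country": "Japan"},
    {"show_id": "s6", "country": "India"},
    {"show_id": "s7", "country": "United States"},
]

top_two = top_countries(sample_rows, n=2)
print(top_two)
```

Each bar height is the number of rows (shows) per country, so a y-axis label such as "Number of shows" may describe the chart more accurately than the show_id values themselves.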
Refactoring Code
Refactoring code to adhere to best practices is another area where ChatGPT excels. Data scientists can input their code and ask ChatGPT to make it “production standard.”
Example
Original Code:
import plotly.express as px
import pandas as pd
import os
import numpy as np

data = pd.read_csv('AirPassengers.csv')

def plotting(title, data, x, y, save_file_path, x_label, y_label):
    fig = px.line(data, x=data[x], y=data[y], labels={x: x_label, y: y_label})
    fig.update_layout(template="simple_white", font=dict(size=18),
                      title_text=title, width=650,
                      title_x=0.5, height=400)
    if not os.path.exists("../images"):
        os.mkdir("../images")
    fig.write_image("../images/" + str(save_file_path))
    fig.show()

plotting('Airline Passengers', data, 'Month', '#Passengers', 'passengers.png', 'Date', 'Passengers')
data["Passenger_Diff"] = data["#Passengers"].diff()
plotting('Airline Passengers', data, 'Month', 'Passenger_Diff', 'passengers_one_difference.png', 'Date', 'Passengers<br>Difference Transform')
data["Passenger_Log"] = np.log(data["#Passengers"])
plotting('Airline Passengers', data, 'Month', 'Passenger_Log', 'passenger_log.png', 'Date', 'Passengers<br>Log Transform')
data["Passenger_Diff_Log"] = data["Passenger_Log"].diff()
plotting('Airline Passengers', data, 'Month', 'Passenger_Diff_Log', 'passenger_difference_and_log.png', 'Date', 'Passengers<br>Log and Difference')
Refactored Code:
import plotly.express as px
import pandas as pd
import os
import numpy as np

data = pd.read_csv('AirPassengers.csv')

def plot_data(title, data, x, y, save_file_path, x_label, y_label):
    # Draw a line chart, save it under ../images, and display it.
    fig = px.line(data, x=x, y=y, labels={x: x_label, y: y_label})
    fig.update_layout(template="simple_white", font=dict(size=18),
                      title_text=title, width=650,
                      title_x=0.5, height=400)
    os.makedirs("../images", exist_ok=True)
    fig.write_image(f"../images/{save_file_path}")
    fig.show()

plot_data('Airline Passengers', data, 'Month', '#Passengers', 'passengers.png', 'Date', 'Passengers')

# Each entry: (new column name, transformed series, output file, y-axis label)
transforms = [
    ("Passenger_Diff", data["#Passengers"].diff(), 'passengers_one_difference.png', 'Passengers<br>Difference Transform'),
    ("Passenger_Log", np.log(data["#Passengers"]), 'passenger_log.png', 'Passengers<br>Log Transform'),
    ("Passenger_Diff_Log", np.log(data["#Passengers"]).diff(), 'passenger_difference_and_log.png', 'Passengers<br>Log and Difference')
]

for col_name, transform, file_name, y_label in transforms:
    data[col_name] = transform
    plot_data('Airline Passengers', data, 'Month', col_name, file_name, 'Date', y_label)
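This refactoring applies the DRY (Don’t Repeat Yourself) principle by looping over the transforms instead of repeating the transform-and-plot calls, replaces the manual directory check with os.makedirs(..., exist_ok=True), and moves the code closer to PEP 8 style, making it more maintainable and professional.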
Summary & Further Thoughts
ChatGPT can significantly boost productivity in areas such as learning new concepts, writing unit tests, creating visualizations, and refactoring code. Integrated into the daily workflow, it is a powerful tool that saves time and enhances the quality of work.