Dominic Nabiga

Where art meets data: Unveiling the captivating stories behind famous paintings through SQL analysis

SQL portfolio project

In the realm of art, captivating stories lie hidden beneath every brushstroke, waiting to be revealed. But what if we could unlock these tales not only through our eyes but also through the power of data? This portfolio project embarks on a journey to do just that, blending the artistry of the visual world with the precision of SQL data analysis.

Join me as I delve into a rich dataset of famous paintings and their creators, meticulously crafted from multiple sources. Together, we’ll explore the intricate relationships between artists, subjects, museums, and more, revealing insights that illuminate the world of art in a new light.

Transforming artistic fascination into data-driven insights

While the visual splendor of a masterpiece can stir our emotions, the stories behind its creation and display hold rich information waiting to be unraveled. This project goes beyond appreciating the artistry to transform our fascination into data-driven understanding using the powerful language of SQL.

By delving into a comprehensive dataset encompassing renowned paintings, artists, and museums, we’ll embark on a quest to answer intriguing business questions. Prepare to uncover fascinating truths, such as:

  • Hidden treasures: Which paintings remain unseen in museum halls?
  • Empty spaces: Do any museums stand devoid of artistic expression?
  • Market dynamics: Are there paintings valued higher than their regular price?
  • Hidden bargains: Which artistic gems are undervalued?
  • Popular choices: Which canvas sizes command the highest prices?
  • Data cleansing: How can we ensure the dataset’s accuracy by removing duplicates?
  • Operational efficiency: Which museums operate with invalid city information, hindering outreach?
  • Scheduling insights: Where can we find museums open on both Sundays and Mondays?
  • Museum dedication: Which museums showcase their collections every day?
  • Artistic acclaim: Who are the most prolific and celebrated artists?
  • Unveiling trends: What are the most and least popular canvas sizes and painting styles?
  • Open for business: Which museum boasts the longest daily opening hours?
  • Style havens: Where can we find the greatest concentration of a specific painting style?
  • Global reach: Which artists transcend borders with their exhibited works?
  • Cultural hubs: Which cities and countries foster the most vibrant art scenes?
  • Financial extremes: Where are the most expensive and least expensive paintings housed?
  • International recognition: Which nation boasts the fifth-highest number of paintings?
  • Evolving tastes: Which styles bask in the spotlight, and which fade into obscurity?
  • International acclaim: Which artist’s portraits resonate beyond their home nation?

Get ready to witness the beauty of art intertwined with the power of data as we embark on this analytical journey!

About the data

The dataset was taken from Kaggle; it contains data about famous artists and their paintings. Here is the link to the dataset

Here’s a brief overview of the files and their relationships:

1. artist.csv:

  • Data: Information about artists, including their names, nationalities, and birth/death years.
  • Relationships: Connects to the work.csv table via the artist_id to identify the artists who created each painting.

2. canvas_size.csv:

  • Data: Details about canvas sizes, including their labels (e.g., “small,” “medium,” “large”) and dimensions.
  • Relationships: Links to the work.csv table through the canvas_size_id to specify the canvas size of each painting.

3. image_link.csv:

  • Data: Contains URLs of images associated with each painting.
  • Relationships: Connects to the work.csv table using the painting_id to link image URLs to their corresponding paintings.

4. museum.csv:

  • Data: Information about museums, including their names, locations, and countries.
  • Relationships: Relates to the work.csv table through the museum_id to indicate where each painting is displayed.
  • Also connects to the museum_hours.csv table via the museum_id to provide information about museum opening hours.

5. museum_hours.csv:

  • Data: Details about museum opening hours, including days of the week and hours of operation.
  • Relationships: Links to the museum.csv table through the museum_id to provide opening hour information for each museum.

6. product_size.csv:

  • Data: Information about product sizes (likely for prints or merchandise related to the paintings), including dimensions and prices.
  • Relationships: Likely connects to the work.csv table to associate product sizes with specific paintings, but the exact relationship would need to be confirmed from the dataset structure.

7. subject.csv:

  • Data: Contains subject categories for paintings (e.g., “portrait,” “landscape,” “still life”).
  • Relationships: Links to the work.csv table through the subject_id to classify each painting’s subject matter.

8. work.csv:

  • Data: Information about individual paintings, including their titles, artists, creation dates, dimensions, prices, and museum locations.
  • Serves as the central table, connecting to multiple other tables to provide a comprehensive view of each painting’s details and context.

Data Preparation

Since the data are in CSV files, I had to upload them in MySQL and create tables for each. To prevent wasting tim by doing the process manually, I used a Python script to do the magic for me. Below is the code

import pandas as pd
from sqlalchemy import create_engine
from getpass import getpass

dbuser = "user1"
db_name = 'paintings'
dbpass = getpass()

conn_string = f"mysql+mysqlconnector://{dbuser}:{dbpass}@localhost:3306/{db_name}"
db = create_engine(conn_string)
conn = db.connect()

files = ['artist',
         'canvas_size',
         'image_link',
         'museum_hours',
         'museum',
         'product_size',
         'subject',
         'work']

for file in files:
    df = pd.read_csv(f'{file}.csv')
    df.to_sql(file, con=conn, if_exists='replace', index=False)

Answering Questions

Below are the questions and their SQL answers

1. Fetch all the paintings which are not displayed on any museums?

SELECT work_id, name, artist_id, style
FROM WORK
WHERE museum_id IS NULL;

2. Are there museums without any paintings?

SELECT m.name
FROM museum m
WHERE m.museum_id NOT IN
(SELECT museum_id
FROM work w
WHERE w.museum_id IS NOT NULL);
None. All the museums had at least one painting

3. How many paintings have an asking price of more than their regular price?

SELECT COUNT(*)
FROM product_size ps
WHERE sale_price>regular_price;

4. Identify the paintings whose asking price is less than 50% of its regular price

SELECT WORK.work_id, WORK.name, product_size.sale_price,product_size.regular_price
FROM product_size
JOIN WORK
ON WORK.work_id = product_size.work_id
WHERE sale_price <
(regular_price * 0.5);

5. Which canvas size costs the most?

SELECT canvas_size.size_id, canvas_size.label, product_size.sale_price
FROM canvas_size
JOIN product_size
ON canvas_size.size_id = product_size.size_id
ORDER BY product_size.sale_price DESC
LIMIT 1;

6. Identify the museums with invalid city information in the given dataset

SELECT museum.museum_id,museum.name,museum.city,museum.country
FROM museum
WHERE city not REGEXP '^[a-z]'

7. Fetch the top 10 most famous painting subject

SELECT SUBJECT,
COUNT(work_id),
RANK() OVER(ORDER by COUNT(work_id) DESC) AS ranking
FROM SUBJECT
GROUP BY SUBJECT
LIMIT 10

8. Identify the museums which are open on both Sunday and Monday. Display museum name, city.

SELECT mm.name, mm.city from
(SELECT * FROM museum_hours m
WHERE DAY IN ("Monday")) AS monhrs
JOIN museum mm
ON mm.museum_id = monhrs.museum_id
WHERE monhrs.museum_id in
(SELECT s.museum_id FROM museum_hours s
WHERE DAY IN ("Sunday"));

9. How many museums are open every single day?

SELECT COUNT(*) FROM
(SELECT museum_id, COUNT(DAY) num_days
FROM museum_hours GROUP BY museum_id) AS calc_days
WHERE calc_days.num_days=7;

Information about the museums that open every day

SELECT calc_days.museum_id,m.name,m.city, m.country
FROM
(SELECT museum_id, COUNT(DAY) num_days
FROM museum_hours GROUP BY museum_id) AS calc_days
JOIN museum m
ON m.museum_id=calc_days.museum_id
WHERE calc_days.num_days=7;

10. Which are the top 5 most popular museum? (Popularity is defined based on most no of paintings in a museum)

SELECT w.museum_id,m.name,m.city,m.country, COUNT(w.work_id) num_of_paintings
FROM work w
JOIN museum m
ON w.museum_id=m.museum_id
WHERE w.museum_id IS NOT NULL
GROUP BY w.museum_id,m.name,m.city,m.country
ORDER BY num_of_paintings DESC
LIMIT 5;

11. Who are the top 5 most popular artist? (Popularity is defined based on most no of paintings done by an artist)

SELECT w.artist_id, a.full_name,a.nationality,a.style,
COUNT(work_id) num_of_paintings
FROM work w
JOIN artist a
ON a.artist_id=w.artist_id
GROUP BY w.artist_id, a.full_name,a.nationality,a.style
ORDER BY num_of_paintings DESC
LIMIT 5;

12. Display the 3 least popular canvas sizes

SELECT ps.size_id,cv.label, COUNT(ps.size_id) canvas_popularity
FROM product_size ps
JOIN canvas_size cv
ON cv.size_id=ps.size_id
WHERE ps.size_id REGEXP '^[0-9]'
GROUP BY ps.size_id, cv.label
ORDER BY canvas_popularity ASC
LIMIT 3;

13. Which museum is open for the longest during a day. Dispay museum name, state and hours open and which DAY?

SELECT mh.museum_id,
mm.name,mm.city,
mh.open as open_time,
mh.close as close_time,
mh.day,
TIMEDIFF(STR_TO_DATE(mh.close, '%h:%i:%p'),
STR_TO_DATE(mh.open, '%h:%i:%p')) AS open_hours
FROM museum_hours mh
JOIN museum mm
ON mh.museum_id = mm.museum_id
ORDER BY open_hours DESC
LIMIT 1;

Leave a Comment

Your email address will not be published. Required fields are marked *

Scroll to Top