Intro

Predicting Early Hospital Readmissions in Diabetic Patients

Every time a patient ends up back in the hospital within 30 days of being discharged, it’s not just stressful — it’s expensive, and often preventable. For people living with diabetes, things like unstable blood sugar, medication changes, or lack of proper follow-up care can make readmission more likely.

In this project, I worked with real hospital data from 130 U.S. facilities, covering over 10 years of patient visits.

The goal is to find patterns that help explain why some diabetic patients are readmitted so quickly — and what signals we might use to catch those risks early.

  • My Task

    The Hospital Management Team has tasked me with reducing 30-day readmission rates for diabetic patients as part of a broader goal to cut hospital costs by 8% this year.

    The team asked me to identify key risk factors and intervention opportunities.

  • Why Hospitals Care About 30-Day Readmissions

    • High readmission rates hurt hospitals — financially and reputationally.
    • Readmissions also affect public rankings, accreditation, and patient trust.
    • Hospitals want more patients, not repeat visits — doing it right the first time pays off.
  • My Approach

    I applied a full ETL and data analysis pipeline to uncover patterns in diabetic patient visits:

    • Cleaned and transformed data across 50+ features
    • Identified high-risk patient segments
    • Proposed data-driven interventions to target preventable readmissions
  • Question 1

    What patient characteristics are most strongly associated with 30-day readmission among diabetic patients?

  • Question 2

    Are there specific combinations of treatments, discharge plans, or follow-up care gaps that correlate with higher readmission rates?

  • Question 3

    Are some admission types or discharge plans consistently better at preventing readmission for similar patients?

Let's get started!

ETL Workflow

Extract patient data from an open-source healthcare dataset.

  1. diabetic_data.csv: Contains patient-level hospital encounters, including demographics, diagnoses, treatments, lab procedures, and readmission outcomes.
  2. IDS_mapping.csv: Provides mappings for coded variables such as admission type, discharge disposition, and admission source.
  3. Hospital_General_Information.csv: Contains facility-level information, including hospital ratings and service offerings (not directly linked to the patient data, but analyzed for context).

Transform the raw data by cleaning and shaping it into a structured format suitable for further exploration.

  1. Replaced inconsistent missing markers ("?", "Not Available", "Not Mapped", "Unknown/Invalid") with NaN for clarity and consistency.
  2. Dropped the weight column due to excessive missingness (>95%).
  3. Retained columns with sufficient completeness (>90%) for analysis.
  4. Mapped coded variables like admission_type_id, discharge_disposition_id, and admission_source_id using IDS_mapping.csv to make categories interpretable.
  5. Bucketed ICD-9 codes from diag_1 into high-level diagnosis groups (e.g., Diabetes, Heart Failure, COPD) to identify common clinical themes in early readmissions.
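The bucketing in step 5 can be sketched roughly as follows. This is a minimal illustration, not the exact logic used in the project: the function name bucket_icd9 is hypothetical, and the code ranges (250 for Diabetes, 428 for Heart Failure, 490-496 for COPD) are my assumption about how the groups might be drawn from standard ICD-9 categories. Missing markers from step 1 are handled explicitly so they do not end up in a clinical group.

```python
import math

# Missing markers listed in step 1 of the transform.
MISSING_MARKERS = {"?", "Not Available", "Not Mapped", "Unknown/Invalid"}


def bucket_icd9(code):
    """Map a raw diag_1 value to a coarse diagnosis group.

    The ranges below are illustrative assumptions, not the project's
    actual grouping table.
    """
    # Treat None, NaN, and the known text markers as missing.
    if code is None or (isinstance(code, float) and math.isnan(code)):
        return "Missing"
    code = str(code).strip()
    if code in MISSING_MARKERS:
        return "Missing"
    # Supplementary ICD-9 codes start with V or E and have no numeric category.
    if code[:1] in ("V", "E"):
        return "Other"
    try:
        major = int(float(code))  # "250.83" -> 250
    except ValueError:
        return "Other"
    if major == 250:
        return "Diabetes"
    if major == 428:
        return "Heart Failure"
    if 490 <= major <= 496:
        return "COPD"
    return "Other"


groups = [bucket_icd9(c) for c in ["250.83", "428", "491", "V57", "?"]]
print(groups)  # ['Diabetes', 'Heart Failure', 'COPD', 'Other', 'Missing']
```

With the data in a pandas DataFrame named df (an assumed name), the same function could be applied as df["diag_1"].map(bucket_icd9).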

Load the prepared dataset into my analysis environment.

The cleaned datasets were loaded into a local SQLite database (healthcare_project.db) using SQLAlchemy. This enables fast querying using SQL for analysis and joins.

Tables created:

  1. diabetic_data (main patient dataset)
  2. ids_mapping (lookup table)
  3. hospital_info (hospital context)
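To give a sense of the kind of SQL this setup allows, here is a small self-contained sketch. It uses Python's built-in sqlite3 module with an in-memory database instead of SQLAlchemy and healthcare_project.db, so it runs without extra dependencies. The rows and the "Home" / "Home health" labels are invented purely for illustration, and the ids_mapping layout (field, id, description) is an assumed schema, not necessarily the one in the project. The readmitted values ("<30", ">30", "NO") follow the coding used in diabetic_data.csv.

```python
import sqlite3

# In-memory stand-in for healthcare_project.db (illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE diabetic_data (
    encounter_id INTEGER PRIMARY KEY,
    discharge_disposition_id INTEGER,
    readmitted TEXT
);
CREATE TABLE ids_mapping (      -- assumed layout for the lookup table
    field TEXT,
    id INTEGER,
    description TEXT
);
""")

# Made-up rows, NOT real patient data.
conn.executemany("INSERT INTO diabetic_data VALUES (?, ?, ?)", [
    (1, 1, "NO"), (2, 1, "<30"), (3, 1, ">30"), (4, 1, "NO"),
    (5, 6, "<30"), (6, 6, "NO"),
])
conn.executemany("INSERT INTO ids_mapping VALUES (?, ?, ?)", [
    ("discharge_disposition_id", 1, "Home"),
    ("discharge_disposition_id", 6, "Home health"),
])

# 30-day readmission rate per discharge disposition, with readable labels.
QUERY = """
SELECT m.description,
       COUNT(*) AS encounters,
       ROUND(100.0 * SUM(d.readmitted = '<30') / COUNT(*), 1) AS pct_readmit_30d
FROM diabetic_data AS d
JOIN ids_mapping AS m
  ON m.field = 'discharge_disposition_id'
 AND m.id = d.discharge_disposition_id
GROUP BY m.description
ORDER BY pct_readmit_30d DESC
"""
rows = conn.execute(QUERY).fetchall()
for description, encounters, pct in rows:
    print(description, encounters, pct)
```

The join against the lookup table is what turns numeric IDs into readable categories, and the same pattern would apply to admission_type_id and admission_source_id.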

Now I am ready to return to the three questions above and answer them.

Data Question 1: What patient characteristics are most strongly associated with 30-day readmission among diabetic patients?

Data Question 2: Are there specific combinations of treatments, discharge plans, or follow-up care gaps that correlate with higher readmission rates?

Data Question 3: Are some admission types or discharge plans consistently better at preventing readmission for similar patients?
