Data Transformation and Predictive Modeling in Patient Satisfaction Analysis

doi:https://doi.org/10.5281/zenodo.20326573

Volume 15, Issue 05 (May 2026)

Data Transformation and Predictive Modeling in Patient Satisfaction Analysis

DOI : https://doi.org/10.5281/zenodo.20326573

Download Full-Text PDF Cite this Publication

Open Access
Article Download / Views: 2
Authors : Yogeshwaran R, Naresk K, Yuvaraj Pandian K
Paper ID : IJERTV15IS051540
Volume & Issue : Volume 15, Issue 05 , May – 2026
Published (First Online): 21-05-2026
ISSN (Online) : 2278-0181
Publisher Name : IJERT
License: This work is licensed under a Creative Commons Attribution 4.0 International License

PDF Version

View

Text Only Version

Data Transformation and Predictive Modeling in Patient Satisfaction Analysis

Yogeshwaran R , Naresk K , Yuvaraj Pandian K

Panimalar Engineering college, Chennai

5.1 SAMPLE CODE

Data Transformation

data_NK <- read.table(“PROG8435-24W-Final_train.txt”, header = TRUE, sep = “,”) data_NK <- as.data.frame(data_NK)

Importing data set

Display the structure of the R object

CHAPTER 5 IMPLEMENTATION

str(data_NK)

## data.frame: 912 obs. of 9 variables:

##	$ Index	: int	1 2 4 5	…
##	$ Age	: int	74 6	77 3 43	…
##	$ Serverity	: int	52 4	50 5	…

## $ Surgical.Medical: int 1 1 0 1 …

## $ Anxiety ## $ Type

… ## $ Marital ## $ FSA

: num 7 4.6 4.2 6 6.6 4.3 5.3 5.6 6.6 2.2 …

: chr “Neuro” “Plastic Surgery” “No Surgery” “No Surgery”

: chr “M” “M” “M” “S” …

: chr “V6Y” “E3H” “S4L” “O7U” …

## $ Satisfaction : int 55 4 53 7 …

dim(data_NK)

## [1] 912 9

##	1	1	74	52	1	7.0 Neuro	M V6Y
##	2	2	67	43	1	4.6 Plastic Surgery	M E3H
##	3	3	68	56	0	4.2 No Surgery	M S4L
##	4	4	44	45	0	6.0 No Surgery	S O7U
##	5	5	63	58	1	6.6 Orthopedic	M T5Y
## ##	1	Satisfaction 55
##	2	46
##	3	44
##	4	71
##	5	79

head(data_NK,5) ## Index Age Serverity Surgical.Medical Anxiety Type Marital FSA

Transform character variables (Type and Martial) to factor variable

data_NK$Type <- as.factor(data_NK$Type) data_NK$Marital <-as.factor(data_NK$Marital)

Removing Index and FSA column from the dataset as it does not add any value

data_NK$Index <- NULL data_NK$FSA <- NULL

Summary and Descriptive Statistics of the dataset after Transformation

#options(width = 90)

summary(data_NK)

## ##	Min.	Age :-2.00	Serverity Min. :19.00		Surgical.Medical Min. :0.0000		Anxiety Min. :1.900
##	1st Qu.:37.00		1st Qu.:40.00		1st Qu.:0.0000		1st Qu.:3.700
##	Median :52.00		Median :46.00		Median :1.0000		Median :4.700
##	Mean :51.36		Mean :46.75		Mean :0.5702		Mean :4.796
##	3rd Qu.:65.00		3rd Qu.:54.00		3rd Qu.:1.0000		3rd Qu.:5.900
##	Max.	:79.00	Max.	:81.00	Max.	:1.0000	Max.	:7.800
##	Type			Marital	Satisfaction
##	Abdominal 135			M:653	Min. : 26.00
##	Cardiovascular : 97			S:259	1st Qu.: 59.00
##	Neuro : 79				Median : 69.00
##	No Surgery 392				Mean : 68.08
## ##	Orthopedic : 80 Plastic Surgery:129				3rd Qu.: 78.00 Max. :102.00

nbr.na

##	nbr.val	912.0000000 912.0000000	912.00000000	912.00000000	NA
##	nbr.null	0.0000000 0.0000000	392.00000000	0.00000000	NA
##	nbr.na	0.0000000 0.0000000	0.00000000	0.00000000	NA
##	min	-2.0000000 19.0000000	0.00000000	1.90000000	NA
##	max	79.0000000 81.0000000	1.00000000	7.80000000	NA
##	range	81.0000000 62.0000000	1.00000000	5.90000000	NA
##	sum	46838.0000000 42635.0000000	520.00000000	4374.20000000	NA
##	median	52.0000000 46.0000000	1.00000000	4.70000000	NA
##	mean	51.3574561 46.7489035	0.57017544	4.79627193	NA
##	SE.mean	0.5355493 0.3266344	0.01640177	0.04534840	NA
##	CI.mean.0.95	1.0510538 0.6410434	0.03218964	0.08899948	NA
##	var	261.5735119 97.3013160	0.24534443	1.87550749	NA
##	std.dev	16.1732344 9.8641429	0.49532255	1.36949169	NA
##	coef.var	0.3149150 0.2110027	0.86871954	0.28553254	NA
## ##	nbr.val	Marital Satisfaction NA 912.0000000
##	nbr.null	NA 0.0000000
##	NA 0.0000000
##	min	NA 26.0000000

stat.desc(data_NK) ## Age Serverity Surgical.Medical Anxiety Type

##	max	NA	102.0000000
##	range	NA	76.0000000
##	sum	NA 62085.0000000
##	median	NA	69.0000000
##	mean	NA	68.0756579
##	SE.mean	NA	0.4512237
##	CI.mean.0.95	NA	0.8855588
##	var	NA	185.6858173
##	std.dev	NA	13.6266583
##	coef.var	NA	0.2001693

Preliminary Analysis

Creating boxplots to the dataframe to identidy outliers and gain insights

par(mfrow=c(3,1))

for (i in 1:ncol(data_NK))

{

if (is.numeric(data_NK[,i])) {

boxplot(data_NK[i], main=names(data_NK)[i], horizontal=TRUE, pch=18)

}

Age

0 20 40 60 80

Serverity

20 30 40 50 60 70 80

Surgical.Medical

0.0 0.2 0.4 0.6 0.8 1.0

barplot(table(data_NK$Type), cex.names=.75,main=’Most people are with no Surgeries’,xlab=’Type’)

Anxiety

2 3 4 5 6 7 8

Satisfaction

40 60 80 100

0 300

Most people are with no Surgeries

Abdominal Cardiovascular Neuro No Surgery Orthopedic Plastic Surgery

Type

barplot(table(data_NK$Marital), cex.names=.75,main=’There are higher number of married patients with di

0 500

There are higher number of married patients with disease

M S

Marital

Removing records in dataset where Age is less than 0

There is a patient with age -2. It is not technically not possible

data_NK <- data_NK[!data_NK$Age < 0,]

Correlations on the dataset

corrgram(data_NK, order=TRUE, lower.panel=panel.shade, upper.panel=panel.pie, text.panel=panel.txt, main=”Correlations”)

Anxiety

Satisfaction

Surgical.Medical

Age

Serverity

Correlations

Some observations on correlation

Satisfaction has a negative correlation with Age, shows that Satisfaction might decrease for older age people

Distribution

qqnorm(data_NK$Age , main=”Age”, pch=20)

qqline(data_NK$Age)

60

70

80

Age

Sample Quantiles

30

40

50

3 2

1 0

1 2 3

Theoretical Quantiles

qqnorm(data_NK$Serverity , main=”Serverity”, pch=20)

qqline(data_NK$Serverity)

60

70

80

Serverity

Sample Quantiles

20

30

40

50

3 2

1 0

1 2 3

Theoretical Quantiles

qqnorm(data_NK$Anxiety , main=”Anxiety”, pch=20)

qqline(data_NK$Anxiety )

5

6

7

8

Anxiety

Sample Quantiles

2

3

4

3 2

1 0

1 2 3

Theoretical Quantiles

qqnorm(data_NK$Satisfaction , main=”Satisfaction”, pch=20)

qqline(data_NK$Satisfaction)

Sample Quantiles

80

100

Satisfaction

40

60

3 2 1 0 1 2 3

Theoretical Quantiles

Some observations on distribution Age is not evenly distributed Serverity is moderately distributed Anxiety is not evenly distributed

Satisfaction is evenly distributed, but there is drop for higher values

Model Development

Splitting the dataframe to Train and Test

# Setting the seed based on student number

set.seed(1464)

# Choosing sampling rate

tr <- 0.80

ts <- 0.20

# Finding the number of rows of data

n_row <- nrow(data_NK)

# Choose the rows for the training sample

training_rows <- sample(1:n_row, tr * n_row, replace = FALSE)

# Assign to the training sample

train <- subset(data_NK[training_rows,])

# Assign the balance to the test sample

test <- subset(data_NK[-c(training_rows),])

Summary of train dataset

summary(train)

##	Age	Serverity		Surgical.Medical		Anxiety
##	Min. :24.0	Min. :19.00		Min. :0.0000		Min. :2.000
##	1st Qu.:37.0	1st Qu.:40.00		1st Qu.:0.0000		1st Qu.:3.700
##	Median :52.0	Median :46.00		Median :1.0000		Median :4.700
##	Mean :51.4	Mean	:46.71	Mean	:0.5673	Mean	:4.765
##	3rd Qu.:66.0 3rd Qu.:54.00			3rd Qu.:1.0000		3rd Qu.:5.825
## ##	Max. :79.0 Max. :81.00 Type Marital			Max. :1.0000 Satisfaction		Max. :7.800
##	Abdominal :103 M:518			Min. : 26.00
##	Cardiovascular : 85 S:210			1st Qu.: 59.00
##	Neuro : 64			Median : 68.00
##	No Surgery 315			Mean : 68.04
##	Orthopedic : 63			3rd Qu.: 78.00
##	Plastic Surgery: 98			Max. :102.00

Correlations on train dataset

options(width = 100)

ht <- hetcor(train) #from polycor library

round(ht$correlations,2)

##	Age	Serverity	Surgical.Medical	Anxiety	Type	Marital	Satisfaction
## Age	1.00	0.03	-0.01	-0.05	0.00	-0.09	-0.75
## Serverity	0.03	1.00	-0.03	-0.03	-0.01	-0.04	-0.32
## Surgical.Medical	-0.01	-0.03	1.00	-0.01	-0.16	-0.07	0.04
## Anxiety	-0.05	-0.03	-0.01	1.00	0.02	-0.03	0.04
## Type	0.00	-0.01	-0.16	002	1.00	0.10	-0.11
## Marital	-0.09	-0.04	-0.07	-0.03	0.10	1.00	0.00
## Satisfaction	-0.75	-0.32	0.04	0.04	-0.11	0.00	1.00
Creating Full model

# Creating full model

full_model <- lm(Satisfaction ~ ., data = train)

# Summary of full model

summary(full_model)

##

## Call:

## lm(formula = Satisfaction ~ ., data = train) ##

##	Residuals:
##	Min 1Q	Median	3Q	Max
##	-23.5785 -5.1477	-0.0803	4.9491	25.7570

##
## Coefficients:	(1	not	defined because of singularities)
##			Estimate Std. Error t value Pr(>\|t\|)

##	(Intercept)	120.16332	2.06181	58.280	< 2e-16	***
##	Age	-0.62968	0.01812	-34.758	< 2e-16	***
##	Serverity	-0.41132	0.02946	-13.960	< 2e-16	***
##	Surgical.Medical	2.78991	0.89549	3.116	0.001909	**
##	Anxiety	-0.11049	0.21469	-0.515	0.606943
##	TypeCardiovascular	-1.40456	1.15306	-1.218	0.223579
##	TypeNeuro	0.08295	1.25798	0.066	0.947446
##	TypeNo Surgery	NA	NA	NA	NA
##	TypeOrthopedic	-4.41496	1.26137	-3.500	0.000494	***
##	TypePlastic Surgery	-4.55395	1.11181	-4.096	0.0000468	***
##	MaritalS	-1.54240	0.64804	-2.380	0.017569	*
##	—
##	Signif. codes: 0 * 0.001 0.01 * 0.05 . 0.1 1

##

## Residual standard error: 7.864 on 718 degrees of freedom ## Multiple R-squared: 0.6711, Adjusted R-squared: 0.667 ## F-statistic: 162.8 on 9 and 718 DF, p-value: < 2.2e-16

# Predicting Satisfaction for training dataset pred_train_full <- predict(full_model, newdata = train) # Calculating RMSE for training dataset

RMSE_train_full <- sqrt(mean((train$Satisfaction – pred_train_full)2))

# Predicting Satisfaction for test dataset pred_test_full <- predict(full_model, newdata = test) # Calculating RMSE for test dataset

RMSE_test_full <- sqrt(mean((test$Satisfaction – pred_test_full)2))

round(RMSE_train_full, 2)

## [1] 7.81

round(RMSE_test_full, 2)

## [1] 7.47

Creating Stepwise model

# Creating Stepwise model stepwise_model <- step(full_model,trace=0) # Summary Stepwise selection model summary(stepwise_model)

##

## Call:

## lm(formula = Satisfaction ~ Age + Serverity + Type + Marital, ## data = train)

##

## Residuals:

## Min 1Q Median 3Q Max ## -23.465 -5.168 -0.028 4.950 25.485

##

## Coefficients:

## Estimate Std. Error t value Pr(>|t|)

##	(Intercept)	122.39530	1.83431	66.725	< 2e-16 ***
##	Age	-0.62916	0.01808	-34.802	< 2e-16 ***
##	Serverity	-0.41089	0.02944	-13.958	< 2e-16 ***
##	TypeCardiovascular	-1.42509	1.15178	-1.237	0.216381
##	TypeNeuro	0.04069	1.25466	0.032	0.974136
##	TypeNo Surgery	-2.81047	0.89414	-3.143	0.001740 **
##	TypeOrthopedic	-4.43181	1.26030	-3.516	0.000465 ***
##	TypePlastic Surgery	-4.57105	1.11074	-4.115	0.0000431 ***
##	MaritalS	-1.53289	0.64745	-2.368	0.018168 *
##	—
##	Signif. codes: 0 * 0.001 0.01 * 0.05 . 0.1 1

##

## Residual standard error: 7.86 on 719 degrees of freedom ## Multiple R-squared: 0.671, Adjusted R-squared: 0.6673 ## F-statistic: 183.3 on 8 and 719 DF, p-value: < 2.2e-16

# Predicting Satisfaction for training dataset pred_train_stepwise <- predict(stepwise_model, newdata = train) # Calculating RMSE for training dataset

RMSE_train_stepwise <- sqrt(mean((train$Satisfaction – pred_train_stepwise)2))

# Predicting Satisfaction for test dataset

pred_test_stepwise <- predict(stepwise_model, newdata = test)

# Calculating RMSE for test dataset

RMSE_test_stepwise <- sqrt(mean((test$Satisfaction – pred_test_stepwise)2))

round(RMSE_train_stepwise, 2)

## [1] 7.81

round(RMSE_test_stepwise, 2)

## [1] 7.46

Creating backward selection model

# Creating backward model

backward_model <- step(full_model, direction = “backward”,trace=0)

# Summary backward selection model

summary(backward_model)

##

## Call:

## lm(formula = Satisfaction ~ Age + Serverity + Type + Marital, ## data = train)

##

##	Residuals:
##	Min	1Q Median	3Q Max
##	-23.465 -5.168		-0.028	4.950 25.485
## ##	Coefficients:

## Estimate Std. Error t value Pr(>|t|)

< 2e-16 ***

##	(Intercept)	122.39530	1.83431	66.725	< 2e-16 ***
##	Age	-0.62916	0.01808	-34.802
##	Serverity	-0.41089	0.02944	-13.958	< 2e-16 ***
##	TypeCardiovascular	-1.42509	1.15178	-1.237	0.216381
##	TypeNeuro	0.04069	1.25466	0.032	0.974136
##	TypeNo Surgery	-2.81047	0.89414	-3.143	0.001740 **
##	TypeOrthopedic	-4.43181	1.26030	-3.516	0.000465 ***
##	TypePlastic Surgery	-4.57105	1.11074	-4.115	0.0000431 ***
##	MaritalS	-1.53289	0.64745	-2.368	0.018168 *
##	—
##	Signif. codes: 0 * 0.001 0.01 * 0.05 . 0.1 1

##

## Residual standard error: 7.86 on 719 degrees of freedom ## Multiple R-squared: 0.671, Adjusted R-squared: 0.6673 ## F-statistic: 183.3 on 8 and 719 DF, p-value: < 2.2e-16

# Predicting Satisfaction for training dataset pred_train_backward <- predict(backward_model, newdata = train) # Calculating RMSE for training dataset

RMSE_train_backward <- sqrt(mean((train$Satisfaction – pred_train_backward)2))

# Predicting Satisfaction for test dataset pred_test_backward <- predict(backward_model, newdata = test) # Calculating RMSE for test dataset

RMSE_test_backward <- sqrt(mean((test$Satisfaction – pred_test_backward)2))

round(RMSE_train_backward, 2)

## [1] 7.81

round(RMSE_test_backward, 2)

## [1] 7.46

Graphical observation of models

par(mfrow = c(2, 2))

plot(full_model,main=”Full Model”)

Full Model

Residuals

10

Residuals vs Fitted

Full Model

Standardized residuals

3 0 2 4

QQ Residuals

5

232

438

2532

438

20

50 60 70 80 90 3 2 1 0 1 2 3

Fitted values Theoretical Quantiles

Standardized residuals

Full Model

ScaleLocation

Full Model

1.0

2

Residuals vs Leverage

5 438

232

394

5

Cook’s dista7n2c0e

0.0

Standardized residuals

2

50 60 70 80 90 0.000 0.010 0.020 0.030

Fitted values Leverage

par(mfrow = c(1, 1))

par(mfrow = c(2, 2))

plot(stepwise_model,main=”Stepwise Model”)

Stepwise Model

Residuals

10

Residuals vs Fitted

Stepwise Model

Standardized residuals

QQ Residuals

5

232

438

2532

438

20

3 0 2

50 60 70 80 90 3 2 1 0 1 2 3

Fitted values Theoretical Quantiles

Standardized residuals

Stepwise Model

ScaleLocation

Stepwise Model

1.0

2

Residuals vs Leverage

5 438

232

5 394

Cook’s dista7n2c0e

0.0

Standardized residuals

2

50 60 70 80 90 0.000 0.010 0.020 0.030

Fitted values Leverage

par(mfrow = c(1, 1))

par(mfrow = c(2, 2))

plot(backward_model,main=”Backward Model”)

Backward Model

Residuals

10

Residuals vs Fitted

Backward Model

Standardized residuals

QQ Residuals

5

232

438

2532

438

20

3 0 2

50 60 70 80 90 3 2 1 0 1 2 3

Fitted values Theoretical Quantiles

Standardized residuals

Backward Model

ScaleLocation

Backward Model

1.0

2

Residuals vs Leverage

5 438

232

5 394

Cook’s dista7n2c0e

0.0

Standardized residuals

2

50 60 70 80 90 0.000 0.010 0.020 0.030

Fitted values Leverage

par(mfrow = c(1, 1))

The residuals are normally distributed in all three models

The cooks distance tells that leverage and influence is satisfiable

Comparison between full model, Stepwise model and Backward model

Tests	Full	Stepwise	Backward
F-Stat	Pass	Pass	Pass
R2	0.671	0.671	0.671
Adj R2	0.667	0.6673	0.6673
Res Er	7.864	7.86	7.86
T-P.val	Pass	Pass	Pass
RMSE Tr	7.81	7.81	7.81
RMSE Ts	7.47	7.46	7.46

From the comparison we can say all three model perform good

In Full model RMSE test is 7.47 which is .01 close to the training, So I choose Full model for my prediction

Model Prediction – Full Model

Preparing Test data for prediction

Final_data <- read.table(“PROG8435-24W-Final_test.txt”, header = TRUE, sep = “,”)

Transform character variables (Type and Martial) to factor variable

Final_data$Type <- as.factor(Final_data$Type) Final_data$Marital <-as.factor(Final_data$Marital)

Removing Index and FSA column from the dataset

Final_data$Index <- NULL Final_data$FSA <- NULL

Summary of prediction file

str(Final_data)

## data.frame: 203 obs. of 6 variables:

## $ Age : int 72 59 76 58 64 68 73 59 52 71 …

## $ Serverity : int 31 43 31 49 31 53 47 49 49 31 …

## $ Surgical.Medical: int 1 0 0 1 1 0 0 1 1 1 …

## $ Anxiety : num 4.9 6.3 6.4 2.9 2.8 5.2 2.7 4.5 6.9 6.4 …

## $ Type : Factor w/ 6 levels “Abdominal”,”Cardiovascular”,..: 1 4 4 2 3

4 4 1 5 6 …

## $ Marital : Factor w/ 2 levels “M”,”S”: 2 2 1 2 2 1 1 1 1 2 …

Using full model and predicting the Satisfaction

Predicted <- round(predict(full_model, newdata=Final_data),0)

head(Predicted)

## 1 2 3 4 5 6

## 63 63 59 63 68 55

Appending the Predicated value to the final dataset

test_final <- cbind(Final_data,Predicted) head(test_final)

## Age Serverity Surgical.Medical Anxiety Type Marital Predicted

## 1	72	31	1	4.9	Abdominal	S	63
## 2	59	43	0	6.3	No Surgery	S	63
## 3	76	31	0	6.4	No Surgery	M	59
## 4	58	49	1	2.9	Cardiovascular	S	63
## 5	64	31	1	2.8	Neuro	S	68
## 6	68	53	0	5.2	No Sugery	M	55

write.csv(test_final,”PROG8435-24W-Final-NSK.txt”)

Data Transformation and Predictive Modeling in Patient Satisfaction Analysis

Importing data set

Display the structure of the R object

Transform character variables (Type and Martial) to factor variable

Removing Index and FSA column from the dataset as it does not add any value

Summary and Descriptive Statistics of the dataset after Transformation

Creating boxplots to the dataframe to identidy outliers and gain insights

Age

Serverity

Surgical.Medical

Anxiety

Satisfaction

Most people are with no Surgeries

There are higher number of married patients with disease

Removing records in dataset where Age is less than 0

Correlations on the dataset

Some observations on correlation

Distribution

Some observations on distribution Age is not evenly distributed Serverity is moderately distributed Anxiety is not evenly distributed

Splitting the dataframe to Train and Test

Summary of train dataset

Correlations on train dataset

Creating Stepwise model

Creating backward selection model

Graphical observation of models

Preparing Test data for prediction

Creating Final Dataset