Introduction
With the rise of AI, we have come to rely more on data-driven decision-making to simplify the lives of working professionals. Whether it's supply chain logistics or approving a loan for a customer, data holds the key. Leveraging the power of data science in the medical field can yield groundbreaking results. By analyzing vast amounts of modern medical data, data scientists can uncover patterns that may lead to new discoveries and treatments. With the potential to revolutionize the healthcare industry, integrating data science into the medical domain is not just a good idea; it's a necessity.

Learning Objectives
In this article, we'll explore how to analyze a medical dataset to build a model that predicts which drug a patient should take when faced with a particular diagnosis. It sounds intriguing, so let's dive right in!
This article was published as a part of the Data Science Blogathon.
Dataset
We will be downloading and using an open-source dataset from Kaggle.
The dataset contains:
- Age and gender of the patient
- Diagnosis of the patient
- Antibiotics used to treat the patient
- Dosage of the antibiotics in grams
- Route of application of the antibiotics
- Frequency of usage of the antibiotics
- Duration of treatment using antibiotics in days
- Indication of the antibiotics
Loading Libraries and Dataset
Here, we will import the relevant libraries required for our exercise. Then we will load the dataset into a dataframe and view a few rows.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.pyplot as plt
import collections
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, classification_report
df = pd.read_csv("/kaggle/input/hospital-antibiotics-usage/Hopsital Dataset.csv")
print(df.shape)
df.head()
Output:

The column names are quite intuitive, and they do make sense. Let us examine a few statistics.
Basic Statistics
# Let's look at some stats
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 833 entries, 0 to 832
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   Age                 833 non-null    object
 1   Date of Data Entry  833 non-null    object
 2   Gender              833 non-null    object
 3   Diagnosis           833 non-null    object
 4   Name of Drug        833 non-null    object
 5   Dosage (gram)       833 non-null    object
 6   Route               833 non-null    object
 7   Frequency           833 non-null    object
 8   Duration (days)     833 non-null    object
 9   Indication          832 non-null    object
dtypes: object(10)
memory usage: 65.2+ KB
Observations:
- Every column is a string (object) column.
- Let's convert the Age, Dosage (gram), and Duration (days) columns to numeric.
- Let's also convert Date of Data Entry into datetime.
- Indication has one null value; all other columns have no null values.
Data Preprocessing
Let's clean some columns. In the previous step, we saw that every column is a string. So, for starters, we'll convert Age, Dosage, and Duration to numeric values. Similarly, we'll convert the Date of Data Entry into datetime type. Instead of converting them in place, we'll create new columns, i.e. an Age2 column that will be a numeric version of the Age column, and so on.
# errors="coerce" turns unparseable values into NaN instead of raising an error
df['Age2'] = pd.to_numeric(df['Age'], errors="coerce")
df['Dosage (gram)2'] = pd.to_numeric(df['Dosage (gram)'], errors="coerce")
df['Duration (days)2'] = pd.to_numeric(df['Duration (days)'], errors="coerce")
df['Date of Data Entry2'] = pd.to_datetime(df['Date of Data Entry'], errors="coerce")
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 833 entries, 0 to 832
Data columns (total 14 columns):
 #   Column               Non-Null Count  Dtype
---  ------               --------------  -----
 0   Age                  833 non-null    object
 1   Date of Data Entry   833 non-null    object
 2   Gender               833 non-null    object
 3   Diagnosis            833 non-null    object
 4   Name of Drug         833 non-null    object
 5   Dosage (gram)        833 non-null    object
 6   Route                833 non-null    object
 7   Frequency            833 non-null    object
 8   Duration (days)      833 non-null    object
 9   Indication           832 non-null    object
 10  Age2                 832 non-null    float64
 11  Dosage (gram)2       831 non-null    float64
 12  Duration (days)2     831 non-null    float64
 13  Date of Data Entry2  831 non-null    datetime64[ns]
dtypes: datetime64[ns](1), float64(3), object(10)
memory usage: 91.2+ KB
After converting, we see some new null values. Let us look at those rows.
df[(df['Dosage (gram)2'].isnull())
   | (df['Duration (days)2'].isnull())
   | (df['Age2'].isnull())
   | (df['Date of Data Entry2'].isnull())
]
Output:

Looks like there are some garbage values in the dataset. Let's remove them.
Now, we will also copy the values of the new columns into the older columns and drop the newly created columns.
# keep only the rows where every converted column parsed successfully
df = df[~((df['Dosage (gram)2'].isnull())
    | (df['Duration (days)2'].isnull())
    | (df['Age2'].isnull())
    | (df['Date of Data Entry2'].isnull()))
]
# overwrite the original columns with the cleaned values and drop the helpers
df['Age'] = df['Age2'].astype('int')
df['Dosage (gram)'] = df['Dosage (gram)2']
df['Date of Data Entry'] = df['Date of Data Entry2']
df['Duration (days)'] = df['Duration (days)2'].astype('int')
df = df.drop(['Age2','Dosage (gram)2','Date of Data Entry2','Duration (days)2'],axis=1)
print(df.shape)
df.head()
print(df.form)
df.head()
Output:

Looking at Some Stats
Now, we will look at some statistics across all columns. We will use the describe function and pass the include argument to get statistics across all columns. Let us examine that.
df.describe(include="all")
Output:

Observations:
- We can see that the minimum age is 1, and the max is 90.
- Similarly, looking at the Data Entry column, we know that it contains data for 19-Dec-2019 between 1 and 7 pm.
- Gender has 2 unique values.
- Dosage has a minimum value of 0.02 grams and a max of 960 grams (seems high; see the quick check after this list).
- Duration has a minimum value of 1 day and a max value of 28 days.
- The Diagnosis, Name of Drug, Route, Frequency, and Indication columns have many distinct values. We will examine them.
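The 960-gram maximum looks suspicious. As a quick sanity check, here is a minimal sketch that lists the high-dosage rows; the 100-gram cutoff is an arbitrary choice for illustration:
# inspect suspiciously large dosages; 100 g is an arbitrary threshold
df[df['Dosage (gram)'] > 100][['Name of Drug', 'Dosage (gram)', 'Route', 'Duration (days)']]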
Univariate EDA: Route and Frequency Columns
Here, we'll examine the Route and Frequency columns. We will use the value_counts() function.
display(df['Route'].value_counts())
print()
df['Frequency'].value_counts()
Output:
Route
IV      534
Oral    293
IM        4
Name: count, dtype: int64

Frequency
BD     430
TDS    283
OD     110
QID      8
Name: count, dtype: int64
Observations:
- There are 3 different routes of drug treatment. The most used route is IV, and the least used is IM.
- There are 4 different frequencies of drug treatment. BD is the most used frequency, and the least used is QID. A quick bar chart (sketched below) makes this skew easy to see.
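As an optional visual check, here is a minimal sketch that plots both distributions with matplotlib, which we already imported:
# bar charts of the Route and Frequency distributions
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
df['Route'].value_counts().plot(kind='bar', ax=axes[0], title='Route')
df['Frequency'].value_counts().plot(kind='bar', ax=axes[1], title='Frequency')
plt.tight_layout()
plt.show()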
Univariate EDA: Diagnosis Column
Here, we'll examine the Diagnosis column. We will use the value_counts() function.
df['Diagnosis'].value_counts()
Output:

Upon running the above code, we see that there are 263 different values. Also, we can see that each patient can have multiple diagnoses (separated by commas). So, let us try to build a wordcloud to make more sense of this column. We will remove stopwords from the wordcloud to filter out unnecessary noise.
text = " ".join(diagnosis for diagnosis in df.Diagnosis)
print("There are {} characters in the combination of all diagnoses.".format(len(text)))
stopwords = set(STOPWORDS)
# Generate a word cloud image
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
# Display the generated image the matplotlib way
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Output:
There are 35359 characters in the combination of all diagnoses.

Observations:
- There are 35359 characters across all diagnoses (note that len counts characters, not words). A lot of the terms are repetitive.
- Looking at the wordcloud, chest infection and koch lung seem to be the more frequent diagnoses.
- We can see other conditions like diabetes, myeloma, etc.
Let's look at the top and bottom 10 words/phrases. We will use the Counter class from the collections library.
A = collections.Counter([i.strip().lower()
    for i in text.split(',') if (i.strip().lower()) not in stopwords])
print('Top 10 words/phrases')
display(A.most_common(10))
print('\nBottom 10 words/phrases')
display(A.most_common()[-11:-1])
Output:
Top 10 words/phrases
[('col', 77),
('chest infection', 68),
('ihd', 55),
('copd', 40),
('hypertension', 38),
('ccf', 36),
('type 2 dm', 32),
("koch's lung", 28),
('ckd', 28),
('uraemic gastritis', 18)]
Bottom 10 words/phrases
[('poorly controlled dm complicated uti', 1),
('poorly controlled dm septic shock', 1),
('hypertension hepatitis', 1),
('uti hepatitis', 1),
('uti uti', 1),
('cff ppt by chest infection early col', 1),
('operated spinal cord', 1),
('hematoma and paraplegia he', 1),
('rvi with col glandular fever', 1),
('fever with confusion ccf', 1)]
Observations:
- Among the top 10 words/phrases, col and chest infection are the most common: 77 patients are diagnosed with col, and 68 with chest infection.
- Among the bottom 10 words/phrases, each of the listed ones occurs only once.
Univariate EDA: Indication Column
Here, we'll examine the Indication column. We will use the value_counts() function.
display(df['Indication'].value_counts())
print()
df['Indication'].value_counts(1)
Output:
Indication
chest infection            92
col                        32
uti                        30
type 2 dm                  25
prevention of infection    22
                           ..
pad(lt u.l)                 1
old stroke                  1
fainting attack             1
cheat infection             1
centepede bite              1
Name: count, Length: 220, dtype: int64

Indication
chest infection            0.110843
col                        0.038554
uti                        0.036145
type 2 dm                  0.030120
prevention of infection    0.026506
                             ...
pad(lt u.l)                0.001205
old stroke                 0.001205
fainting attack            0.001205
cheat infection            0.001205
centepede bite             0.001205
Name: proportion, Length: 220, dtype: float64
We see that there are 220 distinct values in the Indication column. It seems some spelling errors are prevalent as well, e.g. chest vs cheat; a cleanup sketch follows. For the current exercise, let us consider the top 25% of indications.
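As a hedged sketch of how such typos could be cleaned (the mapping below is a hypothetical starting point and is not applied here, so the numbers that follow match the raw data):
# hypothetical typo map; extend it as more misspellings are spotted
typo_map = {'cheat infection': 'chest infection'}
cleaned_indication = df['Indication'].replace(typo_map)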
top_indications = df['Indication'].value_counts(1).reset_index()
top_indications['cum_proportion'] = top_indications['proportion'].cumsum()
top_indications = top_indications[top_indications['cum_proportion']<0.25]
top_indications
Output:
| | Indication | proportion | cum_proportion |
|---|---|---|---|
| 0 | chest infection | 0.110843 | 0.110843 |
| 1 | col | 0.038554 | 0.149398 |
| 2 | uti | 0.036145 | 0.185542 |
| 3 | type 2 dm | 0.030120 | 0.215663 |
| 4 | prevention of infection | 0.026506 | 0.242169 |
We will use this dataframe in the bivariate analysis and modeling exercise later.
Univariate EDA: Name of Drug Column
Here, we'll examine this column. We will use the value_counts() function.
display(df['Name of Drug'].nunique())
display(df['Name of Drug'].value_counts())
Output:
55
Name of Drug
ceftriaxone 221
co-amoxiclav 162
metronidazole 59
cefixime 58
septrin 37
clarithromycin 32
levofloxacin 31
amoxicillin+flucloxacillin 29
ceftazidime 24
cefepime 14
cefipime 13
clindamycin 12
rifaximin 10
amikacin 9
cefoperazone 9
coamoxiclav 9
meropenem 8
ciprofloxacin 7
gentamicin 5
pen v 5
rifampicin 5
azithromycin 5
cifran 4
mirox 4
amoxicillin 4
streptomycin 4
ceftazidine 4
clarthromycin 4
amoxicillin+flucoxacillin 3
cefoparazone+sulbactam 3
linezolid 3
ofloxacin 3
norfloxacin 3
imipenem 2
flucloxacillin 2
ceftiaxone 2
cefaziclime 2
ceftriaxone+sulbactam 2
cefexime 2
pipercillin+tazobactam 1
amoxicillin+flucloxiacillin 1
pentoxifylline 1
menopem 1
levefloxacin 1
pentoxyfylline 1
doxycyclin 1
amoxicillin+flucoxiacillin 1
vancomycin 1
cefteiaxone 1
dazolic 1
amoxicillin+flucloaxcin 1
amoxiclav 1
doxycycline 1
cefoperazone+sulbactam 1
nitrofurantoin 1
Observations:
- There are 55 unique drugs.
- Again, we see spelling errors, e.g. ceftriaxone vs ceftiaxone, so the true unique drug count may be lower. A fuzzy-matching sketch to surface such variants follows this list.
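As a hedged sketch, difflib from the standard library can surface likely spelling variants; the 0.9 cutoff is an arbitrary choice:
import difflib
# for each drug name, list the other names that are very close spellings
names = df['Name of Drug'].dropna().unique().tolist()
for name in names:
    close = difflib.get_close_matches(name, [n for n in names if n != name], n=3, cutoff=0.9)
    if close:
        print(name, '->', close)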
Let us consider the top 5 drugs here. We will create a cum_proportion column to store the cumulative proportion that each drug contributes.
top_drugs = (df['Name of Drug'].value_counts(1).reset_index())
top_drugs['cum_proportion'] = top_drugs['proportion'].cumsum()
top_drugs = top_drugs.head()
top_drugs
Output:
| | Name of Drug | proportion | cum_proportion |
|---|---|---|---|
| 0 | ceftriaxone | 0.265945 | 0.265945 |
| 1 | co-amoxiclav | 0.194946 | 0.460890 |
| 2 | metronidazole | 0.070999 | 0.531889 |
| 3 | cefixime | 0.069795 | 0.601685 |
| 4 | septrin | 0.044525 | 0.646209 |
The top 5 drugs are given to 64.6% of the patients.
PS: If we correct the spelling errors, this figure could be higher than 64.6%.
We will use this dataframe in the bivariate analysis and modelling exercise later.
Bivariate Analysis: Indication vs Name of Drug
Here, we'll take the top_indications and top_drugs dataframes we created and look at the distribution across them, i.e. we compare the top 5 drugs vs the top 25% indications. We will use the pivot_table() function.
(df[(df['Indication'].isin(top_indications['Indication']))
    & (df['Name of Drug'].isin(top_drugs['Name of Drug']))]
 .pivot_table(index='Indication', columns='Name of Drug', values='Age', aggfunc='count')
)
Output:
| Indication | cefixime | ceftriaxone | co-amoxiclav | metronidazole | septrin |
|---|---|---|---|---|---|
| chest infection | 7.0 | 22.0 | 27.0 | 3.0 | 1.0 |
| col | 1.0 | 14.0 | 3.0 | 3.0 | 1.0 |
| prevention of infection | 2.0 | 6.0 | 3.0 | 2.0 | 1.0 |
| type 2 dm | 2.0 | 7.0 | 4.0 | 1.0 | NaN |
| uti | 3.0 | 9.0 | 2.0 | NaN | NaN |
Observations:
- For a chest infection, co-amoxiclav is the most prescribed drug, followed by ceftriaxone.
- Similar observations can be drawn for the other indications. Since the row totals differ, the row-normalized view sketched below makes the comparison fairer.
Note: These medicines are prescribed based on several other factors, like the age, gender, diagnosis, and medical history of the patient.
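Because the row totals differ (92 chest infections vs 22 preventions of infection), a minimal sketch of a row-normalized view:
# share of each top drug within an indication; the non-null entries of each row sum to 1
pivot = (df[(df['Indication'].isin(top_indications['Indication']))
    & (df['Name of Drug'].isin(top_drugs['Name of Drug']))]
    .pivot_table(index='Indication', columns='Name of Drug', values='Age', aggfunc='count'))
pivot.div(pivot.sum(axis=1), axis=0).round(2)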
Bivariate Analysis: Indication vs Age
Here, we'll try to understand whether a few conditions appear mostly in older patients. We will consider the top_indications and examine the mean and median values of age.
(df[df['Indication'].isin(top_indications['Indication'])]
.groupby('Indication')['Age']
.agg(['mean','median','count'])
)
Output:
| Indication | mean | median | count |
|---|---|---|---|
| chest infection | 57.173913 | 61.5 | 92 |
| col | 48.031250 | 48.0 | 32 |
| prevention of infection | 42.136364 | 46.0 | 22 |
| type 2 dm | 63.560000 | 62.0 | 25 |
| uti | 50.233333 | 53.5 | 30 |
Observations:
- The indications col and prevention of infection are observed more in younger patients.
- uti is observed in middle-aged patients.
- The indications chest infection and type 2 dm are observed more in older patients.
Bivariate Analysis: Name of Drug vs Age
Here, we'll try to understand whether specific drugs are used for older patients. We will consider the top_drugs and examine the mean and median values of age.
(df[df['Name of Drug'].isin(top_drugs['Name of Drug'])]
 .groupby('Name of Drug')['Age']
 .agg(['mean','median','count']).sort_values(by='median')
)
Output:
| Name of Drug | mean | median | count |
|---|---|---|---|
| septrin | 44.513514 | 40.0 | 37 |
| cefixime | 43.137931 | 42.0 | 58 |
| ceftriaxone | 50.484163 | 49.0 | 221 |
| metronidazole | 53.661017 | 54.0 | 59 |
| co-amoxiclav | 56.518519 | 60.0 | 162 |
Observations:
- septrin and cefixime are prescribed to younger patients.
- ceftriaxone is prescribed to middle-aged patients.
- metronidazole and co-amoxiclav are prescribed to older patients.
Modeling Approach
Here, we'll try to assist the pharmacist or the prescribing doctor. The problem statement is to identify which drug can be given to the patient based on the Diagnosis, Age, and Gender columns.
A few considerations and assumptions:
- For this exercise, we'll consider only the top 5 drugs and mark the rest as Other.
- So, for each value of (Diagnosis, Age, Gender), the objective is to identify the drug that should be recommended. With 6 classes, a uniform random baseline gives (1/6) = 16.67% accuracy.
Let us see if we can beat that. We will create a copy of the dataframe for the modelling exercise.
adf = df.copy()
# mark every drug outside the top 5 as 'Other'
adf['Output'] = np.where(df['Name of Drug'].isin(top_drugs['Name of Drug']),
                         df['Name of Drug'], 'Other')
adf['Output'].value_counts()
Output:
Output
Other            294
ceftriaxone      221
co-amoxiclav     162
metronidazole     59
cefixime          58
septrin           37
Name: count, dtype: int64
We have seen this distribution earlier as well. It shows that the classes are not evenly distributed, which we will need to keep in mind while initializing the model.
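A stricter reference point than uniform random guessing is the majority-class baseline, i.e. always predicting the most frequent class. A minimal sketch:
# always predicting 'Other' (294 of 831 rows) gives roughly 35.4% accuracy
majority_baseline = adf['Output'].value_counts(normalize=True).max()
print("Majority-class baseline accuracy: {:.2%}".format(majority_baseline))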
Feature Engineering
We will try to capture the more frequent words/ngrams in the Diagnosis column. We will use the CountVectorizer module.
vectorizer = CountVectorizer(max_features=150, stop_words="english",
                             ngram_range=(1,3))
X = vectorizer.fit_transform(adf['Diagnosis'].str.lower())
vectorizer.get_feature_names_out()
Output:
array(['abscess', 'acute', 'acute ge', 'af', 'aki', 'aki ckd',
'aki ckd retro', 'alcoholic', 'anaemia', 'art', 'bite', 'bleeding',
'bone', 'ca', 'cap', 'ccf', 'ccf chest', 'ccf chest infection',
'ccf increased', 'ccf increased lft', 'ccf koch', 'ccf koch lung',
'cerebral', 'cerebral infarct', 'chest', 'chest infection',
'chest infection pre', 'chronic', 'ckd', 'ckd chest infection',
'ckd retro', 'col', 'col portal', 'col portal hypertension',
'copd', 'copd chest', 'copd chest infection', 'debility',
'debility excessive', 'debility excessive vomitting', 'diabetes',
'disease', 'disease renal', 'disease renal impairment', 'dm',
'dm ihd', 'edema', 'effusion', 'encephalopathy', 'excessive',
'excessive vomitting', 'excessive vomitting uraemic', 'failure',
'fever', 'gastritis', 'gastritis hcv', 'gastritis hcv aki', 'ge',
'general', 'general debility', 'general debility excessive',
'gi bleeding', 'hcv', 'hcv aki', 'hcv aki ckd', 'hcv col', 'heart',
'hepatic', 'hepatic encephalopathy', 'hepatitis', 'ht',
'ht disease', 'ht disease renal', 'hypertension', 'ihd',
'impairment', 'impairment koch', 'impairment koch lungs',
'increased', 'increased lft', 'infarct', 'infection',
'infection pre', 'infection pre diabetes', 'koch', 'koch lung',
'koch lung copd', 'koch lungs', 'koch lungs ccf', 'left',
'left sided', 'leg', 'lft', 'lung', 'lung copd', 'lung copd chest',
'lungs', 'lungs ccf', 'lungs ccf increased', 'marrow', 'multiple',
'multiple myeloma', 'multiple myeloma ckd', 'myeloma',
'myeloma ckd', 'newly', 'old', 'pleural', 'pleural effusion',
'pneumonia', 'portal', 'portal hypertension', 'pre',
'pre diabetes', 'pulmonary', 'pulmonary edema', 'renal',
'renal impairment', 'renal impairment koch', 'retro', 'right',
'rvi', 'rvi stage', 'rvi stage ht', 'septic', 'septic shock',
'severe', 'severe anaemia', 'severe anaemia multiple', 'shock',
'sided', 'snake', 'snake bite', 'stage', 'stage ht',
'stage ht disease', 'stroke', 'tb', 'type', 'type dm',
'type dm ihd', 'uraemic', 'uraemic gastritis',
'uraemic gastritis hcv', 'uti', 'uti type', 'uti type dm',
'vomitting', 'vomitting uraemic', 'vomitting uraemic gastritis'],
dtype=object)
The above list shows the top 150 ngrams observed in the Diagnosis column after removing the stopwords.
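To see how often each ngram actually occurs, here is a minimal sketch that sums the count matrix column-wise:
# total occurrences of each ngram across all diagnoses
counts = np.asarray(X.sum(axis=0)).ravel()
ngram_counts = pd.Series(counts, index=vectorizer.get_feature_names_out())
print(ngram_counts.sort_values(ascending=False).head(10))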
Dataset Creation
Here, we'll create a single dataframe containing the features we just made plus the Age and Gender columns as input. We will use a LabelEncoder to convert the drug names into numeric values.
feature_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
# use .to_numpy() so values are assigned positionally; feature_df has a fresh
# RangeIndex, while adf kept the original (gappy) index after the row drops
feature_df['Age'] = adf['Age'].fillna(0).astype('int').to_numpy()
feature_df['Gender_Male'] = np.where(adf['Gender']=='Male', 1, 0)
le = LabelEncoder()
feature_df['Output'] = le.fit_transform(adf['Output'])
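To keep track of which integer the encoder assigned to which drug, note that le.classes_ holds the original labels in encoded order; label_mapping below is a name introduced here for illustration:
# map each class name to its encoded integer (0, 1, 2, ...)
label_mapping = dict(zip(le.classes_, range(len(le.classes_))))
print(label_mapping)
We will rely on le.classes_ again later to label the classification reports.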
Now, we'll do a train-test split. We will keep 20% of the data as the test set. We will use the random_state argument to ensure reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    feature_df.drop('Output', axis=1).fillna(-1),
    feature_df['Output'],
    test_size=0.2, random_state=42)
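Since the classes are imbalanced, passing stratify would keep the class proportions similar in the train and test sets. A hedged alternative (not the split used for the numbers reported below):
X_train, X_test, y_train, y_test = train_test_split(
    feature_df.drop('Output', axis=1).fillna(-1),
    feature_df['Output'],
    test_size=0.2, random_state=42,
    stratify=feature_df['Output'])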
Modelling
Here, I have tried the Random Forest model; you can try other models as well. We will use the random_state argument to ensure reproducibility, and the class_weight parameter since we saw earlier that the classes are not evenly distributed.
clf = RandomForestClassifier(max_depth=6, random_state=0,
                             class_weight="balanced")
clf.fit(X_train, y_train)
Let us see the accuracy and other metrics on the training dataset.
# accuracy on the training data
final_accuracy = clf.score(X_train, y_train)
print("final_accuracy is : ", final_accuracy)
# per-class report; le.classes_ recovers the original drug names in encoded order
clf_predict = clf.predict(X_train)
print(classification_report(y_train, clf_predict,
      target_names=list(le.classes_)))
Output:
final_accuracy is :  0.411144578313253
               precision    recall  f1-score   support

        Other       0.84      0.30      0.44       236
     cefixime       0.16      0.86      0.27        49
  ceftriaxone       0.74      0.26      0.39       176
 co-amoxiclav       0.48      0.47      0.48       127
metronidazole       0.39      0.59      0.47        49
      septrin       0.45      0.93      0.61        27

     accuracy                           0.41       664
    macro avg       0.51      0.57      0.44       664
 weighted avg       0.64      0.41      0.43       664
Similarly, let us see the accuracy and other metrics on the test dataset.
# accuracy on the test data
final_accuracy = clf.score(X_test, y_test)
print("final_accuracy is : ", final_accuracy)
clf_predict = clf.predict(X_test)
print(classification_report(y_test, clf_predict,
      target_names=list(le.classes_)))
Output:
final_accuracy is :  0.38323353293413176
               precision    recall  f1-score   support

        Other       0.71      0.38      0.49        58
     cefixime       0.08      0.56      0.14         9
  ceftriaxone       0.36      0.09      0.14        45
 co-amoxiclav       0.64      0.71      0.68        35
metronidazole       0.31      0.40      0.35        10
      septrin       0.44      0.40      0.42        10

     accuracy                           0.38       167
    macro avg       0.42      0.42      0.37       167
 weighted avg       0.53      0.38      0.41       167
Key Observations:
- The model is 41.11% accurate on the train data and 38.32% on the test data.
- Looking at the f1-score for the Other label, the training dataset had a value of 0.44 and the test dataset 0.49.
- The test f1-score is lower for cefixime and ceftriaxone. While cefixime has only 9 test samples, ceftriaxone has 45. We need to analyze these data points to understand the scope for improving our feature set; the confusion-matrix sketch below shows which classes get mixed up.
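Since confusion_matrix was imported at the start but never used, here is a minimal sketch that shows which drugs get confused with each other on the test set:
# rows = actual class, columns = predicted class
cm = confusion_matrix(y_test, clf_predict)
cm_df = pd.DataFrame(cm, index=le.classes_, columns=le.classes_)
print(cm_df)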
Conclusion
In this article, we analyzed a medical dataset end to end. We cleaned the dataset and used basic statistics, distributions, and even a wordcloud to understand the columns. Then, we framed a problem statement to assist the pharmacist or the prescribing doctor with modern medicine: identify which drug should be given to a patient based on the Diagnosis, Age, and Gender columns.
Key Takeaways
- We considered the top 150 words/bigrams/trigrams from the Diagnosis column for the modeling exercise. In effect, we dissected the diagnosis column and created 150 features.
- We had the Age and Gender features as well, so in total we had 152 features.
- Using these 152 features, we tried to determine which medicine/drug would be prescribed.
- The baseline/random model was 16.67% accurate. Our model was 41.11% accurate on the train data and 38.32% on the test data. That is a significant improvement over the baseline model.
Some things to try out to improve the model performance:
- Try TF-IDF or embeddings to extract features (a TF-IDF sketch follows this list).
- Try different models and use grid search and cross-validation to optimize accuracy.
- Try predicting more medicines.
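For the first suggestion, a hedged sketch of swapping CountVectorizer for TfidfVectorizer; everything downstream stays the same:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=150, stop_words="english",
    ngram_range=(1,3))
X_tfidf = tfidf.fit_transform(adf['Diagnosis'].str.lower())
# X_tfidf can replace X in the dataset-creation step above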
Thanks for reading my article. Feel free to connect with me on LinkedIn to discuss this.
The media shown in this article is not owned by Analytics Vidhya and is used at the author's discretion.