Hello community
I was given a Python Code Challenge, that was successfully solved using Python with Panda libraries. However, I'm certain the challenge can also be resolved with T-SQL.
The python challenge was based on Train Test Split.
The challenge was to sample and randomly split data to create training and test datasets.
You had to create a function 'trainiTestSplit(df) that achieves the following:
- Samples 20 Percent of the dataset without replacement
- Randomly splits the sampled data to create train and test datasets with weights 0.8 and 0.2 respectively.
- Uses the summary function on the training dataset to return a dataset of the statistics mean, min, max
- Returns, in order the training dataset, test, dataset and the summary dataset
For those of you interested in the python code to achieve the above is as follows:
df=data.sample(frac=0.2) # sample data
print( df.head()) # added to show first few elements.
print(len(df)) # sampled size
# split data, still no changes...
def trainTestSplit(df):
train=df.sample(frac=0.8)
test=df.drop(train.index)
summary=train.describe(include=[np.number]).loc[['mean','min','max'],:]
return train,test,summary
(trf, tsf, sum) = trainTestSplit(df) # run the split function
print(trf) # print result-sets - this will show only part of the data as it is quite a huge array (160 or so elements, 40- ish shown).
print(tsf)
print(sum)
The table is as follows:
CREATE TABLE SparkDefinitive.dbo.CC1_TrainTestSpit (
discountId FLOAT NULL
,price FLOAT NULL
,active BIT NULL
,createdAt FLOAT NULL
) ON [PRIMARY]
GO
I have attached an image of some records from the dataset
The question basically, can someone achieve the same result using T-SQL?
Thanks