Resolve a Python Challenge with SQL

carltonp · September 23, 2019, 7:42pm

Hello community

I was given a Python Code Challenge, that was successfully solved using Python with Panda libraries. However, I'm certain the challenge can also be resolved with T-SQL.

The python challenge was based on Train Test Split.

The challenge was to sample and randomly split data to create training and test datasets.
You had to create a function 'trainiTestSplit(df) that achieves the following:

Samples 20 Percent of the dataset without replacement
Randomly splits the sampled data to create train and test datasets with weights 0.8 and 0.2 respectively.
Uses the summary function on the training dataset to return a dataset of the statistics mean, min, max
Returns, in order the training dataset, test, dataset and the summary dataset

For those of you interested in the python code to achieve the above is as follows:

df=data.sample(frac=0.2)   # sample data
print( df.head())          # added to show first few elements.
print(len(df))            # sampled size

# split data, still no changes...

def trainTestSplit(df):
    train=df.sample(frac=0.8)
    test=df.drop(train.index)

    summary=train.describe(include=[np.number]).loc[['mean','min','max'],:]

    return train,test,summary

(trf, tsf, sum) = trainTestSplit(df)      # run the split function

print(trf)     # print result-sets   - this will show only part of the data as it is quite a huge array (160 or so elements, 40- ish shown).
print(tsf)
print(sum)

The table is as follows:

CREATE TABLE SparkDefinitive.dbo.CC1_TrainTestSpit (
  discountId FLOAT NULL
 ,price FLOAT NULL
 ,active BIT NULL
 ,createdAt FLOAT NULL
) ON [PRIMARY]
GO

I have attached an image of dataset some records from the dataset

The question basically, can someone achieve the same result using T-SQL?

Thanks