I was given a Python code challenge that I successfully solved using Python with the pandas library. However, I'm certain the challenge can also be solved with T-SQL.
The Python challenge was based on a train/test split: sample the data, then randomly split the sample to create training and test datasets.
You had to create a function trainTestSplit(df) that achieves the following:
- Samples 20 percent of the dataset without replacement
- Randomly splits the sampled data to create train and test datasets with weights 0.8 and 0.2 respectively.
- Uses the describe/summary function on the training dataset to return a dataset of the statistics mean, min, and max
- Returns, in order: the training dataset, the test dataset, and the summary dataset
For those of you interested, the Python code to achieve the above is as follows:

import numpy as np
import pandas as pd

df = data.sample(frac=0.2)   # sample 20% of the data without replacement
print(df.head())             # added to show the first few elements
print(len(df))               # sampled size

def trainTestSplit(df):
    train = df.sample(frac=0.8)
    test = df.drop(train.index)
    summary = train.describe(include=[np.number]).loc[['mean', 'min', 'max'], :]
    return train, test, summary

# run the split function ("smry" rather than "sum", to avoid shadowing the built-in)
trf, tsf, smry = trainTestSplit(df)

print(trf)   # shows only part of the data, as it is quite a large frame (160 or so rows, 40-ish shown)
print(tsf)
print(smry)
The table is as follows:
CREATE TABLE SparkDefinitive.dbo.CC1_TrainTestSpit (
    discountId FLOAT NULL,
    price      FLOAT NULL,
    active     BIT   NULL,
    createdAt  FLOAT NULL
) ON [PRIMARY]
GO
I have attached an image of some records from the dataset
The question, basically: can someone achieve the same result using T-SQL?
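To show the direction I have in mind, here is a rough, untested sketch of one possible T-SQL approach, using the table from the DDL above. It assumes ORDER BY NEWID() as the shuffling mechanism and temp tables to hold the intermediate results; the 20%/80% figures are hard-coded rather than parameterized, and pandas' describe() is approximated with one UNION ALL row per statistic:

-- Untested sketch: sample 20% of the table at random (without replacement),
-- then split that sample 80/20 into train and test sets.

-- Step 1: random 20% sample. ORDER BY NEWID() shuffles the rows,
-- and TOP 20 PERCENT keeps the first fifth of the shuffle.
SELECT TOP 20 PERCENT *
INTO #sample
FROM SparkDefinitive.dbo.CC1_TrainTestSpit
ORDER BY NEWID();

-- Step 2: tag each sampled row with a random row number so the
-- 80/20 split can be made on a cutoff.
SELECT *, ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
INTO #shuffled
FROM #sample;

DECLARE @cut INT;
SELECT @cut = CEILING(0.8 * COUNT(*)) FROM #shuffled;

SELECT discountId, price, active, createdAt
INTO #train
FROM #shuffled
WHERE rn <= @cut;

SELECT discountId, price, active, createdAt
INTO #test
FROM #shuffled
WHERE rn > @cut;

-- Step 3: mean/min/max of the numeric columns in the training set,
-- roughly matching describe().loc[['mean','min','max'], :].
SELECT 'mean' AS stat, AVG(discountId) AS discountId, AVG(price) AS price, AVG(createdAt) AS createdAt FROM #train
UNION ALL
SELECT 'min',  MIN(discountId), MIN(price), MIN(createdAt) FROM #train
UNION ALL
SELECT 'max',  MAX(discountId), MAX(price), MAX(createdAt) FROM #train;

I don't know whether this is idiomatic or efficient for larger tables, so corrections or better approaches are welcome.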