Help with some quite simple SQL questions

Hey guys,

I'm having some problems with SQL. Im currently working in ProgresQL.

I have two tables with the following columns:

table case_data_order (
id numeric, -- An id for a single purchase
create_timestamp timestamp, -- The timestamp of when the purchase was made, in CET
cid numeric, -- An id for the customer who made the purchase
purchase_amount bigint, -- The amount of the purchase, in cents
currency text, -- The currency of the purchase
birth_year timestamp, -- The birth year of the customer, NULL for company purchases
gender text, -- The gender of the customer, none for company purchases
merchant_id bigint, -- An id for the merchant the purchase was made at
country text, -- The country of the customer
device text -- The device the purchase was made with, NULL if not known

table case_data_merchant (
merchant_id bigint, -- An id for the merchant, can be used to connect to orders
merchant_group text, -- The group of the merchant belongs to, is the Mechant Segment together with merchant_category
merchant_category text -- The category the merchant belongs to, is the Mechant Segment together with merchant_group

These are the questions I still haven't been able to answer. Ive been trying a lot of commands, but would greatly appreciate if someone could help me out with these remaining questions.

Which merchant, with more than 10 000 sales during 2014, had the best growth in quantity of transactions between the years 2014-2015?

In wich segment was the merchant who had the highest quantity of sales during 2015?

Which segment, with more than 1 000 000 transactions, had the highest share of female purchasers?

Which merchant, in the segment (merchant_group) Entertainment and with more than 10 000 sales during 2014, had the highest quantity of sales during 2015?

During which time interval, counted as a whole hour (13:00-13:59), occurred the highest number of purchases for the age group 18-25 during 2015 on a Tablet in the segment (merchant_group) Entertainment?

Best Regards

Pixxco

This is a Microsoft SQL Server forum, so you may not find folk here who are skilled in PostgreSQL

Your questions look like homework? TO get held on homework you will need to post whatever work you ave tried so far. You won't learn anything if we do homework for you, but we are happy to help once we see what you have tried already.

I see, thanks for the information.

This is not homework, my friend have a database which he wanted to know a few things about it and I said I could try help him out and learn a bit SQL.

Regarding this question:
Which merchant, with more than 10 000 sales during 2014, had the best growth in quantity of transactions between the years 2014-2015?
I have tried the following:

SELECT merchant_id, COUNT(id)
FROM case_data_order
WHERE create_timestamp BETWEEN '2014-01-01' AND '2014-12-31'
GROUP BY merchant_id
HAVING COUNT(id) > 10000;

however, I'm unsure if this is the right command since the database is quite big. And furthermore, this only gives me the merchants with more than 10 000 sales during 2014 and their merchant id. Im having problems figuring out how I can get the sales numbers for 2015 at the same time and then combine it with max("2015 count(id)" - 2014 count(id)").

My hope was just finding someone who could help me out, and then learn by discussing and asking questions. Not just get the whole answer and be done with it :slight_smile:

Great :slight_smile: But as I don't want to pin out every single detail (that you probably already know), I'll give you something to work with and you ask questions where you don't understand what I'm doing.

As @Kristen wrote, this forum is for MSSql so my suggestions might fail in syntax.

Here we go

select merchant_id
      ,sum(case when datepart(year,create_timestamp)=2014 then 1 else 0 end) as y2014
      ,sum(case when datepart(year,create_timestamp)=2015 then 1 else 0 end) as y2015
      ,sum(case when datepart(year,create_timestamp)=2015 then 1 else -1 end) as growth
  from case_data_order
 where create_timestamp>=cast('2014-01-01' as datetime)
   and create_timestamp<cast('2016-01-01' as datetime)
 group by merchant_id
 having sum(case when datepart(year,create_timestamp)=2014 then 1 else 0 end)>10000
 order by growth desc
;

As there is no quantity, I assume we want quantity of transactions (as previous query):

select a.merchant_id
      ,b.merchant_group
      ,a.qty
  from (select merchant_id
              ,count(*) as qty
          from case_data_order
         where create_timestamp>=cast('2015-01-01' as datetime)
           and create_timestamp<cast('2016-01-01' as datetime)
         group by merchant_id
         order by qty desc
         limit 1
       ) as a
       inner join case_data_merchant as b
               on b.merchantid=a.merchantid
;
select b.merchant_group
      ,sum(case when a.gender='female' then 1 else 0 end) as female_qty
  from case_data_order as a
       inner join case_data_merchant as b
               on b.merchantid=a.merchantid
 group by b.merchant_group
 having count(*)>1000000
 order by female_qty desc
 limit 1
;
select a.merchantid
      ,sum(case when datepart(year,a.create_timestamp)=2015 then 1 else 0 end) as qty
  from case_data_order as a
       inner join case_data_merchant as b
               on b.merchantid=a.merchantid
              and b.merchant_group='Entertainment'
 where create_timestamp>=cast('2014-01-01' as datetime)
   and create_timestamp<cast('2016-01-01' as datetime)
 group by a.merchantid
 having sum(case when datepart(year,a.create_timestamp)=2014 then 1 else 0 end)>10000
 order by qty desc
 limit 1
;
select datepart(hour,a.create_timestamp) as h
      ,count(*) as qty
  from case_data_order as a
       inner join case_data_merchant as b
               on b.merchantid=a.merchantid
              and b.merchant_group='Entertainment'
 where create_timestamp>=cast('2015-01-01' as datetime)
   and create_timestamp<cast('2016-01-01' as datetime)
   and a.device='Tablet'
   and date_part(year,age(a.birth_year,a.create_timestamp)) between 18 and 25
 group by datepart(hour,a.create_timestamp)
 order by qty desc
 limit 1
;

Wow, thanks a lot bitsmed.

I haven't got the time to go through all your suggestions yet, but the main structure seems to be right, and that can at least help me find the right commands and especially the right line of thinking. Greatly appreciated.

Quick question though, what if I would like to combine merchant_group with merchant_category and create a new column called merchant segment? Because, currently your suggestions are calculating the the segment of only the merchant group column. I would like to specify this column by joining it together with merchant category.

I hope you understand what I mean.

And thanks again!

On MSSql we would do:

select merchant_group+' '+merchant_category as merchant_segment

On PostgreSql I belive you'd do:

select merchant_group||' '||merchant_category as merchant_segment

Okay, thanks a lot mate! I'll work on it but at least I have something to go on :smiley:

Hi again guys..

Im having some problems with the question for which merchant segment that has the highest share of females. My problem is that the column cid gives me every unique person in the database, where each row represents a transaction.

select b.merchant_group
,sum(case when a.gender='female' then 1 else 0 end) as female_qty
from cdo2 as a
inner join cdm2 as b
on b.merchant_id=a.merchant_id
group by b.merchant_group
having count(*)>10
order by female_qty desc
;

This code gives me the merchant segment with the most transaction that "is marked as female", however each person (cid) could have made multiple transactions and therefor the female quantity isn't the right answer. How can I get the quantity of females corresponding to the distinct number of persons in the merchant segment?

Sorry for the unclarity of the question, Im trying to explain as well as possible.

Best regards
and thanks again

Pixxco

Could this be that you're looking for:

select b.merchant_group
      ,count(distinct a.cid) as female_qty
  from cdo2 as a
       inner join cdm2 as b
               on b.merchant_id=a.merchant_id
 where a.gender='female'
 group by b.merchant_group
 having count(distinct a.cid)>10
 order by female_qty desc
;

Ah, i see. You select every "unique person" and filter the male ones out by using where at the end of the command.

select b.merchant_group
,count(distinct a.cid) as female_qty
from case_data_order as a
inner join case_data_merchant as b
on b.merchant_id=a.merchant_id
where a.gender='female'
group by b.merchant_group
having count(*)>1000000
order by female_qty desc
;

The database is quite big so its hard for me to know for sure that this gives me the merchant group with the highest quantity of women. But anyways, what if i would like to calculate the share of women in each group?

count(distinct cid) should give me each unique person in the dataset, and then i could group this by merchant_group to get how many unique persons there are in each merchant group. Hence, my question is how could one divide the column i just created, female_qty, by the total number of unique person in each merchant_group? Im trying to write a command that gives me the merchant_group with the highest share of women.

Your help is much appreciated bitsmed!

And (haha) just a quick question. What does the cast command do? Is it to convert different type of date formats?

This might just work:

with cte1
  as (select b.merchant_group
            ,a.cid
            ,a.gender
            ,count(*) as transactions
        from case_data_order as a
             inner join case_data_merchant as b
                     on b.merchant_id=a.merchant_id
       group by b.merchant_group
               ,a.cid
               ,a.gender
     )
    ,cte2
  as (select merchant_group
        from cte1
       group by merchant_group
       having sum(transactions)>10000000
     )
select a.merchant_group
      ,count(*) as total_qty
      ,sum(case when b.gender='female' then 1 else 0 end) as female_qty
      ,1.        /* this line is to ensure we get decimals */
      *count(*)
      /sum(case when b.gender='female' then 1 else 0 end)
       as female_share
  from cte2 as a
       inner join cte1 as b
               on b.merchant_group=a.merchant_group
 group by merchant_group
 order by b.female_share desc
;

To make your query run as fast as possible, it's best to keep fields in same format - and I'm guessing the create_timestamp is of type datetime.

Hmm okay, after some time I think i get the subqueries.

But at the last stage where we use our two subqueries, the total_qty is based on count(*), which is not the same as persons since one person can do multiple transactions? Or is this problem solved since we in our first subquery specify a.cid? Or do we need to write count(distinct a.cid) here?

This is what i have now:

with sq1
as (select b.merchant_group
,a.cid
,a.gender
,count() as transactions
from case_data_order as a
inner join case_data_merchant as b
on b.merchant_id=a.merchant_id
group by b.merchant_group
,a.cid
,a.gender
)
,sq2
as (select a.merchant_group
from sq1 as a
group by merchant_group
having sum(transactions)>1000000
)
select a.merchant_group
,count(
) as total_qty
,sum(case when b.gender='female' then 1 else 0 end) as female_qty
,(sum(case when b.gender='female' then 1 else 0 end)/round(count(*)))
as female_share
from sq2 as a
inner join sq1 as b
on b.merchant_group=a.merchant_group
group by a.merchant_group
order by female_share desc
;

which gave me the right percentage, given that total qty is unique persons and not every single transaction. I wouldn't have been able to get it right without you though bitsmed! Thank you so much.

Regarding the last question I had, during which time interval there were most transactions made by 1) 18-25 years old, 2) during 2015, 3) on a tablet 4) in merchant group Entertainment, does this look right to you? I guess that maybe by now you have a bit more understanding of my data :slight_smile:

select extract(hour from a.create_timestamp) as h
,count(*) as qty
from case_data_order as a
inner join case_data_merchant as b
on b.merchant_id=a.merchant_id
and b.merchant_group='Entertainment'
where create_timestamp between '2015-01-01' and '2015-12-31'
and birth_year between '1991-01-01' and '1998-12-31'
and a.device='Tablet'
group by h
order by qty desc
;

Thanks a lot mate!