This is a detailed EDA of the data, shown in the second video of “Exploratory data analysis” lecture (week 2).
PLEASE NOTE: the dataset cannot be published, so this notebook is readonly.
Load data
In this competition hosted by solutions.se, the task was to predict the advertisement cost for a particular ad.


Let’s look at the data (notice that the table is transposed, so we can see all feature names).


0  1  2  3  4  

AdGroupId  78db034136  68a0110c69  21af1035af  f63fda0c33  cd868ebdcc 
AdGroupName  6d91d 25866 9c594  2657d cb2d0 6d91d  6d91d e33a0 9a99b  59991 9c594  6d91d 25866 9a99b 
AdNetworkType2  s  s  s  s  s 
AveragePosition  1.2  2  1  1  1.1 
CampaignId  273823cb71  273823cb71  273823cb71  273823cb71  273823cb71 
CampaignName  2657d 16cb2 74532 b4842 0136e 35aca f140d  2657d 16cb2 74532 b4842 0136e 35aca f140d  2657d 16cb2 74532 b4842 0136e 35aca f140d  2657d 16cb2 74532 b4842 0136e 35aca f140d  2657d 16cb2 74532 b4842 0136e 35aca f140d 
Clicks  0  0  0  0  3 
Conversions  0  0  0  0  0 
ConversionsManyPerClick  0  0  0  0  0 
Cost  0  0  0  0  0.94 
Date  20140101 00:00:00  20140101 00:00:00  20140101 00:00:00  20140101 00:00:00  20140101 00:00:00 
DestinationUrl  98035d60fc  c25f23cd08  01f87f7639  5c0e89f532  8888b55dde 
Device  t  t  t  d  d 
FirstPageCpc  1.06  2.94  0.42  1.75  0.17 
Impressions  32  1  4  1  22 
KeywordMatchType  b  b  b  b  b 
KeywordText  jze 10 +uxsgk  +jze +dznvgyhjclr  jze 100 +gzpxyk  jze 10 +uxsgk 1950k  jze 10 mykj +gzpxyk 
MaxCpc  0.28  1  0.22  0.54  0.12 
QualityScore  1  1  1  1  1 
Slot  s_2  s_2  s_1  s_2  s_1 
TopOfPageCpc  1.07  5.02  0.42  4  0.25 
KeywordId  7d20d63df9  a617d4f037  6e0b7024d2  9c2ea0cdf8  4c8ba7affd 
We see a lot of features with not obvious names. If you search for the CampaignId, AdGroupName, AdNetworkType2 using any web search engine, you will find this dataset was exported from Google AdWords. So what is the required domain knowledge here? The knowledge of how web advertisement and Google AdWords work! After you have learned it, the features will make sense to you and you can proceed.
For the sake of the story I will briefly describe Google AdWords system now. Basically every time a user queries a search engine, Google AdWords decides what ad will be shown along with the actual search results. On the other side of AdWords, the advertisers manage the ads – they can set a multiple keywords, that a user should query in order to their ad to be shown. If the keywords are set properly and are relevant to the ad, then the ad will be shown to relevant users and the ad will get clicked. Advertisers pay to Google for some type of events, happened with their ad: for example for a click event, i.e. the user saw this ad and clicked it. AdWords uses complex algorithms to decide which ad to show to a particular user with a particular search query. The advertisers can only indirectly influence AdWords decesion process by changing keywords and several other parameters. So at a high level, the task is to predict what will be the costs for the advertiser (how much he will pay to Google, column Cost) when the parameters (e.g. keywords) are changed.
The ads are grouped in groups, there are features AdGroupId AdGroupName describing them. A campaign corresponds to some specific parameters that an advertiser sets. Similarly, there are ID and name features CampaignId, CampaignName. And finally there is some information about keywords: KeywordId and KeywordText. Slot is $1$ when ad is shown on top of the page, and $2$ when on the side. Device is a categorical variable and can be either “tablet”, “mobile” or “pc”. And finally the Date is just the date, for which clicks were aggregated.


0  1  2  3  4  

Id  0  1  2  3  4 
AdGroupId  00096e7611  00096e7611  00096e7611  00096e7611  00096e7611 
AdGroupName  c8037 75b01 9a99b 3b678 52ba4 2657d  c8037 75b01 9a99b 3b678 52ba4 2657d  c8037 75b01 9a99b 3b678 52ba4 2657d  c8037 75b01 9a99b 3b678 52ba4 2657d  c8037 75b01 9a99b 3b678 52ba4 2657d 
AdNetworkType2  s  s  s  s  s 
AveragePosition  1  1  1  1  1 
CampaignId  e62b4bc4c3  e62b4bc4c3  e62b4bc4c3  e62b4bc4c3  e62b4bc4c3 
CampaignName  2657d 16cb2 74532 06feb 0136e 3a15d  2657d 16cb2 74532 06feb 0136e 3a15d  2657d 16cb2 74532 06feb 0136e 3a15d  2657d 16cb2 74532 06feb 0136e 3a15d  2657d 16cb2 74532 06feb 0136e 3a15d 
Date  20140601 00:00:00  20140601 00:00:00  20140601 00:00:00  20140601 00:00:00  20140601 00:00:00 
DestinationUrl  f5aad09031  f5aad09031  f5aad09031  f5aad09031  f5aad09031 
Device  t  d  m  t  d 
KeywordId  539778bb80  539778bb80  539778bb80  539778bb80  539778bb80 
KeywordMatchType  e  e  e  e  e 
KeywordText  tcjnw gzpxyk nyss ewzhy  tcjnw gzpxyk nyss ewzhy  tcjnw gzpxyk nyss ewzhy  tcjnw gzpxyk nyss ewzhy  tcjnw gzpxyk nyss ewzhy 
Slot  s_1  s_1  s_1  s_2  s_2 
Notice there is diffrent number of columns in test and train – our target is Cost column, but it is closly related to several other features, e.g. Clicks, Conversions. All of the related columns were deleted from the test set to avoid data leakages.
Let’s analyze
Are we ready to modeling? Not yet. Take a look at this statistic:


Train min/max date: 20140101 / 20140531
Test min/max date: 20140601 / 20140614
Number of days in train: 151
Number of days in test: 14
Train shape: 3493820 rows
Test shape: 8951040 rows
Train period is more than 10 times larger than the test period, but train set has fewer rows, how could that happen?
At this point I suggest you to stop and think yourself, what could be a reason, why this did happen. Unfortunately we cannot share the data for this competition, but the information from above should be enough to get a right idea.
Alternatively, you can go along for the explanation, if you want.
Investigation
Let’s take a look how many rows with each date we have in train and test.


20140602 639360
20140612 639360
20140609 639360
20140614 639360
20140601 639360
20140611 639360
20140608 639360
20140605 639360
20140610 639360
20140607 639360
20140604 639360
20140606 639360
20140603 639360
20140613 639360
Name: Date, dtype: int64


20140101 36869
20140104 36427
20140105 36137
20140102 34755
20140103 34693
20140106 31349
20140407 30950
20140209 30101
20140126 29830
20140208 29187
Name: Date, dtype: int64
Interesting, for the test set we have the same number of rows for every date, while in train set the number of rows is different for each day. It looks like that for each day in the test set a loop through some kind of IDs had been run. But what about train set? So far we don’t know, but let’s find the test IDs first.
Test
So now we know, that there is $639360$ different IDs. It should be easy to find the columns, that form ID, because if the ID is [‘col1’, ‘col2’], then to compute the number of combinations we should just multiply the number of unique elements in each.


Id 8951040
AdGroupId 13548
AdGroupName 2281
AdNetworkType2 2
AveragePosition 131
CampaignId 252
CampaignName 252
Date 14
DestinationUrl 52675
Device 3
KeywordId 12285
KeywordMatchType 3
KeywordText 11349
Slot 4
dtype: int64


Nothing found
Hmm, nothing found! The problem is that some features are tied, and the number of their combinations does not equal to product of individual unique number of elements. For example it does not make sense to create all possible combinations of DestinationUrl and AdGroupId as DestinationUrl belong to exactly one AdGroupId.


DestinationUrl
00010d62df 1
000249f717 1
00054cf3f8 1
000684bf0b 1
00072a9fa7 1
00077a6729 1
0007cc191f 1
0009388900 1
001144cae4 1
00115f6477 1
00141a299f 1
00169dc49b 1
0018b27e06 1
001b0b3d06 1
001ef8368e 1
00205e056a 1
002082ab8b 1
0020c585ea 1
0021419f7e 1
00225519cc 1
002498dc88 1
0026171436 1
00265dc4bb 1
0026833e5c 1
0027ffbad9 1
002b1deb25 1
002c55ccef 1
002e44290f 1
0030ca870e 1
0032b64beb 1
..
ffda377018 1
ffda3c412a 1
ffda5b53d6 1
ffda8c0d8c 1
ffdbf5d179 1
ffdc872fcf 1
ffde114af5 1
ffde41a800 1
ffe2fb7007 1
ffe4a040d4 1
ffe685e937 1
ffe8c3da53 1
ffe8f82e08 1
ffeb9fda9d 1
ffebd1d253 1
ffebea724f 1
ffecf398b1 1
ffecf3e7d4 1
ffed185438 1
fff02d7269 1
fff10adcb0 1
fff12e5f19 1
fff132d5bd 1
fff19836a0 1
fff3539204 1
fff4c5d255 1
fff55db78a 1
fff8c11ad9 1
fff90ea351 1
fffb248bf0 1
Name: AdGroupId, Length: 52675, dtype: int64
So, now let’s try to find ID differently. Let’s try to find a list of columns, such that threre is exazctly $639360$ unique combinations of their values in the test set (not overall). So, we want to find columns
, such that:


We could do it with a similar loop.


But it will take forever to compute. So it is easier to find the combination manually.
So after some time of trials and errors I figured out, that the four features KeywordId, AdGroupId, Device, Slot form the index. The number of unique rows is exactly 639360 as we wanted to find.


(639360, 4)
Looks reasonable. For each AdGroupId there is a distinct set of possible KeywordId’s, but Device and Slot variants are the same for each ad. And the target is to predict what will be the daily cost for using different KeywordId’s, Device type, Slot type to advertise ads from AdGroups.
Train
To this end, we found how test set was constructed, but what about the train set? Let us plot something, probably we will find it out.






Looks interesting, isn’t it? That is something we need to explain! How the same plot looks for the test set?




Just a dot!
Now let’s think, what we actually plotted? We grouped the data by the ID that we’ve found previously and we plotted average Date in the group versus the size of each group. We found that ID is an aggregation index – so for each date the Cost is aggreagated for each possible index. So group size shows for how many days we have Const information for each ID and mean relative date shows some information about these days.
For test set it is expectable that both average date and the size of the groups are the same for each group: the size of each group is $14$ (as we have $14$ test days) and mean date is $6.5$, because for each group (index) we have $14$ different days, and $\frac{0 + 1 + \dots + 13}{14} = 6.5$.
And now we can explain everything for the train set. Look at the top of the triangle: for those points (groups) we have Cost information for all the days in the train period, while on the sides we see groups, for which we have very few rows.
But why for some groups we have smaller number of rows, than number of days? Let’s look at the Impressions column.


1 1602929
2 565896
3 287128
4 175197
5 119092
6 86651
7 66443
8 53007
9 42984
10 35731
11 30248
12 25950
13 22629
14 20126
15 17503
16 15682
17 14100
18 12848
19 11597
20 10724
21 9864
22 8931
23 8316
24 7953
25 7168
26 6684
27 6196
28 5863
29 5556
30 5223
...
4978 1
15210 1
9076 1
13174 1
116535 1
4979 1
17273 1
90974 1
4976 1
5906 1
7023 1
60282 1
7955 1
13881 1
2921 1
4970 1
7019 1
17249 1
23394 1
28210 1
11116 1
15929 1
7017 1
95761 1
2923 1
15213 1
9070 1
5692 1
13162 1
13922 1
Name: Impressions, Length: 8135, dtype: int64
We never have $0$ value in Imressions column. But in reality, of course, some ads with some combination of keyword, slot, device were never shown. So this looks like a nice explanation for the data: in the train set we only have information about ads (IDs, groups) which were shown at least once. And for the test set, we, of course, want to predict Cost for every possible ID.
What it means for competitors, is that if one would just fit a model on the train set as is, the predictions for the test set will be biased by a lot. The predictions will be much higher than they should be, as we are only given a specific subset of rows as train.csv
file.
So, before modeling we should first extend the trainset and inject rows with 0
impressions. Such change will make train set very similar to the test set and the models will generalize nicely.