Data Purchasing Challenge 2022

Baseline + Exploration: random purchase vs full purchase

moto

Exploration: random purchase vs full purchase

I used the same code for my baseline at https://gitlab.aicrowd.com/moto/data-purchasing-hello/tree/submission-v0p1p5

Introduction

This competition will be hard. Here are my first experiment outcomes:

  • Random prediction got 0.478 acc
  • Models trained on the training set alone got the same acc
  • Models with 3K random purchases got 0.67 acc
  • Models with 10K (full purchase) got 0.69 acc
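As a rough sketch (not the exact submission code), the 3K random-purchase baseline amounts to spending the whole label budget on a uniform random subset of the 10K unlabelled pool; the function name and seed here are illustrative assumptions:

```python
import random

# Sketch of the random-purchase baseline: pick a uniform random subset of
# indices into the unlabelled pool, up to the label budget.
def random_purchase(n_unlabelled: int, budget: int, seed: int = 42):
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_unlabelled), budget))

purchased = random_purchase(n_unlabelled=10_000, budget=3_000)
print(len(purchased))
```

The purchased indices would then be sent to the purchase API and the returned labels appended to the 5K training set before retraining.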

Data Downloading

In [1]:
from IPython.display import clear_output

root_dir = "/content"
%cd $root_dir

!gdown --id  1hOBoA1gsUSqGMjqKVjVL00X4MPA3WgSy
!gdown --id 1ks-qMyqh5rnrqmkbFQiMwk7BXTx32h6D
!gdown --id 1vUX2rAKg9A2CRUZXWq9vyK5-j8CXyegt
clear_output()

!tar -xvf training.tar.gz
!tar -xvf unlabelled.tar.gz
!tar -xvf validation.tar.gz
clear_output()

%cd $root_dir
In [2]:
!mkdir -p $root_dir/input
%cd $root_dir
!mv training* $root_dir/input
!mv unlabelled* $root_dir/input
!mv validation* $root_dir/input
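After the moves, the rest of the notebook assumes a fixed layout under `$root_dir/input`. A small helper (hypothetical, just for a sanity check) makes the assumed paths explicit:

```python
from pathlib import Path

# Paths the later cells assume exist after the moves above
# ("/content/input" corresponds to $root_dir/input).
def expected_layout(root):
    root = Path(root)
    return [root / split / name
            for split in ("training", "unlabelled", "validation")
            for name in ("labels.csv", "images")]

for p in expected_layout("/content/input"):
    print(p)
```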

Libraries

In [3]:
!python -c "import monai" || pip install -q "monai-weekly[pillow, tqdm]"
!python -c "import matplotlib" || pip install -q matplotlib
clear_output()
%matplotlib inline
In [4]:
import os
import shutil
import tempfile
import matplotlib.pyplot as plt
import PIL
import torch
import numpy as np
from sklearn.metrics import classification_report

from monai.apps import download_and_extract
from monai.config import print_config
from monai.data import decollate_batch
from monai.metrics import ROCAUCMetric

print_config()
MONAI version: 0.9.dev2206
Numpy version: 1.19.5
Pytorch version: 1.10.0+cu111
MONAI flags: HAS_EXT = False, USE_COMPILED = False
MONAI rev id: 1cd65b599e1439e3507a05197c3a290b9aca9305
MONAI __file__: /usr/local/lib/python3.7/dist-packages/monai/__init__.py

Optional dependencies:
Pytorch Ignite version: NOT INSTALLED or UNKNOWN VERSION.
Nibabel version: 3.0.2
scikit-image version: 0.18.3
Pillow version: 7.1.2
Tensorboard version: 2.7.0
gdown version: 4.2.1
TorchVision version: 0.11.1+cu111
tqdm version: 4.62.3
lmdb version: 0.99
psutil version: 5.4.8
pandas version: 1.3.5
einops version: NOT INSTALLED or UNKNOWN VERSION.
transformers version: NOT INSTALLED or UNKNOWN VERSION.
mlflow version: NOT INSTALLED or UNKNOWN VERSION.

For details about installing the optional dependencies, please visit:
    https://docs.monai.io/en/latest/installation.html#installing-the-recommended-dependencies

Data Preparation

In [5]:
import pandas as pd
import numpy as np
In [6]:
all_classes = ['scratch_small', 'scratch_large', 'dent_small', 'dent_large']

def get_frame(filename):
    df = pd.read_csv(filename)
    pos = [df[df[c] == 1].shape[0] for c in all_classes]
    pos += [df.shape[0]]

    image_dir = "/".join(filename.split("/")[:-1]) + "/images/"
    df["filepath"] = df["filename"].apply(lambda s: image_dir + s)

    print(filename, df.shape)
    print("count:", pos)
    return df, pos

df_train, pos_train = get_frame(f"{root_dir}/input/training/labels.csv")
df_train.head()
/content/input/training/labels.csv (5000, 6)
count: [1273, 855, 1226, 589, 5000]
Out[6]:
filename scratch_small scratch_large dent_small dent_large filepath
0 002YxUqF3Q.png 0 0 0 0 /content/input/training/images/002YxUqF3Q.png
1 00Fo8XYcvC.png 0 0 0 0 /content/input/training/images/00Fo8XYcvC.png
2 02s1G8Wwg8.png 0 0 0 0 /content/input/training/images/02s1G8Wwg8.png
3 035EI0mrFh.png 0 0 0 0 /content/input/training/images/035EI0mrFh.png
4 0385gp8ksf.png 1 0 0 0 /content/input/training/images/0385gp8ksf.png
In [7]:
df_val, pos_val = get_frame(f"{root_dir}/input/validation/labels.csv")
df_val.head()
/content/input/validation/labels.csv (3000, 6)
count: [788, 538, 680, 348, 3000]
Out[7]:
filename scratch_small scratch_large dent_small dent_large filepath
0 000XeKZjyo.png 0 0 1 0 /content/input/validation/images/000XeKZjyo.png
1 01g3s4Cps8.png 1 0 0 0 /content/input/validation/images/01g3s4Cps8.png
2 039iQ7uPxX.png 0 0 0 0 /content/input/validation/images/039iQ7uPxX.png
3 05dwOdxVfr.png 1 0 0 0 /content/input/validation/images/05dwOdxVfr.png
4 06JHQtyeMm.png 0 0 0 0 /content/input/validation/images/06JHQtyeMm.png
In [8]:
df_unlabelled, pos_unlabelled = get_frame(f"{root_dir}/input/unlabelled/labels.csv")
df_unlabelled.head()
/content/input/unlabelled/labels.csv (10000, 6)
count: [2607, 1784, 2281, 1148, 10000]
Out[8]:
filename scratch_small scratch_large dent_small dent_large filepath
0 00O1rwvydO.png 0 0 0 0 /content/input/unlabelled/images/00O1rwvydO.png
1 01bB3DVokm.png 1 0 0 0 /content/input/unlabelled/images/01bB3DVokm.png
2 01sAbrP4Gm.png 0 0 0 0 /content/input/unlabelled/images/01sAbrP4Gm.png
3 02XiKLPuxY.png 0 0 0 0 /content/input/unlabelled/images/02XiKLPuxY.png
4 02bVIX0aMT.png 0 1 0 0 /content/input/unlabelled/images/02bVIX0aMT.png

Let's check the class distribution - all three sets seem to have the same distribution

In [9]:
rows = []
for name, pos in zip(["training", "validation", "unlabelled"],
                      [pos_train, pos_val, pos_unlabelled]):
    for col, p in zip(all_classes + ["total"], pos):
        rows.append((name, col, p))
df_count = pd.DataFrame(rows, columns=["set", "class", "count"])
df_count

df_count.pivot(index='set', columns='class', values='count').plot(kind='bar', figsize=(10,5))
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f8f01973b90>
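To back the "same distribution" observation numerically rather than just by eye, the counts printed by `get_frame` above can be turned into per-class positive rates (the values below are hard-coded from that output):

```python
import pandas as pd

# Per-class positive rates: positives per class divided by set size,
# using the counts printed by get_frame above (last entry is the total).
counts = {
    "training":   [1273, 855, 1226, 589, 5000],
    "validation": [788, 538, 680, 348, 3000],
    "unlabelled": [2607, 1784, 2281, 1148, 10000],
}
classes = ["scratch_small", "scratch_large", "dent_small", "dent_large"]
rates = pd.DataFrame(
    {name: [c / vals[-1] for c in vals[:-1]] for name, vals in counts.items()},
    index=classes,
)
print(rates.round(3))
```

Each class sits at roughly the same rate across the three sets (e.g. `scratch_small` is around 0.25-0.26 everywhere), which supports the observation above.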

Let's look at some sample images

In [10]:
plt.subplots(4, 4, figsize=(16, 16))

train_dir = f"{root_dir}/input/training/images"

def get_text_labels(row):
    label = [c for c in all_classes if row[c] > 0]
    if len(label) > 0:
        return ",".join(label)
    else:
        return "negative"

for i, row in df_train.sample(16).reset_index(drop=True).iterrows():
    filename = row["filename"]
    im = PIL.Image.open(f"{train_dir}/{filename}")
    arr = np.array(im)
    plt.subplot(4, 4, i + 1)
    plt.xlabel(get_text_labels(row), fontsize=18)
    plt.imshow(arr, cmap="gray", vmin=0, vmax=255)
plt.tight_layout()
plt.show()