Data Purchasing Challenge 2022
Baseline + Exploration: random purchase vs full purchase
I used the same code for my baseline at https://gitlab.aicrowd.com/moto/data-purchasing-hello/tree/submission-v0p1p5
Introduction
This competition will be hard. Here are the outcomes of my first experiments:
- Random prediction got 0.478 acc
- Models trained on the training set only got the same acc
- Models with 3K random purchases got 0.67 acc
- Models with 10K (everything purchased) got 0.69 acc
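To make the comparison concrete, here is a minimal sketch of what "random purchase" amounts to and how much of the full-purchase gain it recovers. The helper names `random_purchase` and `gain_recovered`, the seed, and the placeholder filenames are assumptions for illustration. They are not taken from the baseline repository.

```python
import pandas as pd


def random_purchase(df_pool: pd.DataFrame, budget: int, seed: int = 0) -> pd.DataFrame:
    """Pick `budget` rows uniformly at random, without replacement."""
    budget = min(budget, len(df_pool))
    return df_pool.sample(n=budget, random_state=seed)


def gain_recovered(acc_base: float, acc_partial: float, acc_full: float) -> float:
    """Share of the (full - base) accuracy gain reached by a partial purchase."""
    return (acc_partial - acc_base) / (acc_full - acc_base)


# Hypothetical stand-in for the 10K unlabelled pool (filenames are made up)
pool = pd.DataFrame({"filename": [f"{i:05d}.png" for i in range(10_000)]})
bought = random_purchase(pool, budget=3_000)
print(len(bought))

# Accuracies reported in the list above
print(gain_recovered(acc_base=0.478, acc_partial=0.67, acc_full=0.69))
```

With the numbers above, 3K random purchases already recover roughly 90% of the gain that buying all 10K gives, so a smarter purchase strategy has only about 0.02 acc of headroom to win.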
Data Downloading
In [1]:
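from IPython.display import clear_output
root_dir = "/content"
# "!cd" only affects a throwaway subshell; "%cd" changes the notebook's working directory
%cd $root_dir
# Download the three archives (training, unlabelled, validation) from Google Drive
!gdown --id 1hOBoA1gsUSqGMjqKVjVL00X4MPA3WgSy
!gdown --id 1ks-qMyqh5rnrqmkbFQiMwk7BXTx32h6D
!gdown --id 1vUX2rAKg9A2CRUZXWq9vyK5-j8CXyegt
clear_output()
# Unpack the archives into the working directory
!tar -xvf training.tar.gz
!tar -xvf unlabelled.tar.gz
!tar -xvf validation.tar.gz
clear_output()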
In [2]:
!mkdir -p $root_dir/input
%cd $root_dir
# Move the extracted folders (and their archives) under input/
!mv training* $root_dir/input
!mv unlabelled* $root_dir/input
!mv validation* $root_dir/input
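Before going further it can help to confirm that the move left everything where the loading code expects it. The sketch below assumes each split folder holds a `labels.csv` and an `images` directory, which is the layout the rest of this notebook reads from. The helper name `missing_inputs` is hypothetical, and a temporary directory stands in for `/content/input` so the example runs anywhere.

```python
import tempfile
from pathlib import Path


def missing_inputs(input_dir, splits=("training", "validation", "unlabelled")):
    """Return the expected paths that do not exist under input_dir."""
    missing = []
    for split in splits:
        split_dir = Path(input_dir) / split
        for path in (split_dir / "labels.csv", split_dir / "images"):
            if not path.exists():
                missing.append(str(path))
    return missing


# Demo on a throwaway directory where only the training split is present
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "training" / "images").mkdir(parents=True)
    (Path(tmp) / "training" / "labels.csv").touch()
    demo_missing = missing_inputs(tmp)

# The four entries for validation and unlabelled are reported as missing
print(len(demo_missing))
```

In the notebook itself the call would be `missing_inputs(f"{root_dir}/input")`, and an empty list means all three splits are in place.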
Libraries
In [3]:
!python -c "import monai" || pip install -q "monai-weekly[pillow, tqdm]"
!python -c "import matplotlib" || pip install -q matplotlib
clear_output()
%matplotlib inline
In [4]:
import os
import shutil
import tempfile
import matplotlib.pyplot as plt
import PIL
import torch
import numpy as np
from sklearn.metrics import classification_report
from monai.apps import download_and_extract
from monai.config import print_config
from monai.data import decollate_batch
from monai.metrics import ROCAUCMetric
print_config()
Data Preparation
In [5]:
import pandas as pd
import numpy as np
In [6]:
all_classes = ['scratch_small', 'scratch_large', 'dent_small', 'dent_large']

def get_frame(filename):
    """Load a labels CSV; return the frame and the per-class positive counts plus the row total."""
    df = pd.read_csv(filename)
    # Number of positive rows per class, followed by the total number of rows
    pos = [df[df[c] == 1].shape[0] for c in all_classes]
    pos += [df.shape[0]]
    # Images live in an "images" folder next to the CSV
    image_dir = "/".join(filename.split("/")[:-1]) + "/images/"
    df["filepath"] = df["filename"].apply(lambda s: image_dir + s)
    print(filename, df.shape)
    print("count:", pos)
    return df, pos

df_train, pos_train = get_frame(f"{root_dir}/input/training/labels.csv")
df_train.head()
In [7]:
df_val, pos_val = get_frame(f"{root_dir}/input/validation/labels.csv")
df_val.head()
In [8]:
df_unlabelled, pos_unlabelled = get_frame(f"{root_dir}/input/unlabelled/labels.csv")
df_unlabelled.head()
Let's see the distribution. All sets seem to have the same distribution
In [9]:
rows = []
for name, pos in zip(["training", "validation", "unlabelled"],
                     [pos_train, pos_val, pos_unlabelled]):
    for col, p in zip(all_classes + ["total"], pos):
        rows.append((name, col, p))
df_count = pd.DataFrame(rows, columns=["set", "class", "count"])
df_count
df_count.pivot(index='set', columns='class', values='count').plot(kind='bar', figsize=(10,5))
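The bar chart compares raw counts, so if the sets differ in size the bars are hard to compare directly. A possible follow-up, sketched here under the assumption that `df_count` keeps the long `(set, class, count)` layout built above, is to divide each class count by the set total. The helper name `class_fractions` and the demo counts are made up for illustration.

```python
import pandas as pd


def class_fractions(df_count: pd.DataFrame) -> pd.DataFrame:
    """Turn long-format (set, class, count) rows into per-set fractions of the total."""
    wide = df_count.pivot(index="set", columns="class", values="count")
    # Divide every class column by that set's "total" row count
    return wide.drop(columns="total").div(wide["total"], axis=0)


# Made-up counts, only to show the shape of the result
demo = pd.DataFrame(
    [
        ("training", "scratch_small", 40),
        ("training", "dent_small", 10),
        ("training", "total", 100),
        ("unlabelled", "scratch_small", 400),
        ("unlabelled", "dent_small", 100),
        ("unlabelled", "total", 1000),
    ],
    columns=["set", "class", "count"],
)
frac = class_fractions(demo)
print(frac)
```

Rows that look alike in this table mean the sets share the same class balance regardless of their size. Calling `class_fractions(df_count).plot(kind='bar')` would give the normalized version of the chart.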
Let's look at some images
In [10]:
plt.subplots(4, 4, figsize=(16, 16))
train_dir = f"{root_dir}/input/training/images"

def get_text_labels(row):
    """Return the positive class names joined by commas, or "negative" if there are none."""
    label = [c for c in all_classes if row[c] > 0]
    if len(label) > 0:
        return ",".join(label)
    else:
        return "negative"

# Show 16 random training images with their labels as captions
for i, row in df_train.sample(16).reset_index(drop=True).iterrows():
    filename = row["filename"]
    im = PIL.Image.open(f"{train_dir}/{filename}")
    arr = np.array(im)
    plt.subplot(4, 4, i + 1)
    plt.xlabel(get_text_labels(row), fontsize=18)
    plt.imshow(arr, cmap="gray", vmin=0, vmax=255)
plt.tight_layout()
plt.show()
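The caption helper joins several class names with commas, which suggests a single image can carry more than one label. Under that assumption, a quick way to see which label combinations occur and how often is sketched below. The helper name `label_combo_counts` and the four demo rows are made up, not real competition data.

```python
import pandas as pd

all_classes = ["scratch_small", "scratch_large", "dent_small", "dent_large"]


def label_combo_counts(df: pd.DataFrame, classes=all_classes) -> pd.Series:
    """Count how often each combination of positive classes occurs."""
    combos = df[classes].apply(
        lambda row: ",".join(c for c in classes if row[c] > 0) or "negative",
        axis=1,
    )
    return combos.value_counts()


# Four made-up rows: one single label, two identical pairs, one negative
demo = pd.DataFrame(
    [[1, 0, 0, 0], [1, 0, 1, 0], [0, 0, 0, 0], [1, 0, 1, 0]],
    columns=all_classes,
)
counts = label_combo_counts(demo)
print(counts)
```

On the real data this would be called as `label_combo_counts(df_train)`. Rare combinations found this way may be worth keeping in mind when deciding which images to purchase.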