Kaggle Shopee Product Matching Competition: A Multimodal Baseline Model (with Code)
Competition overview
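The Shopee Price Match Guarantee competition asks us to decide, from a product's image and title, which listings are the same product.
Put simply, if I search for "switch" on Shopee ( http://xiapi.xiapibuy.com), I get a results page like the one below.
Some of the results are the Switch console itself, some are a Switch bundled with Ring Fit, and others are protective cases, storage bags, and the like. The goal of this competition is to tell which listings are the same product using only the image and the product title, so that Shopee can build more accurate product recommendations, price comparison, and possibly even counterfeit detection (the same product listed at very different prices).
The actual data looks like this: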
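Task analysis
The three most important features are image, title, and label_group.
- image: the file name of the product image
- title: the product title
- label_group: the product's group, which is the target we want to predict (one group can contain several listings)
- image_phash is a basic perceptual image hash (the more similar two images are, the closer their hash values). It is the most basic baseline in this competition, but since most participants extract their own image features, image_phash ends up largely unused.
- What we need to predict: given a new listing (again with an image and a title), find which listings belong to the same group.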
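To make the idea of "closer hash values" concrete, here is a minimal sketch that compares two perceptual hashes by Hamming distance (the number of differing bits). It assumes image_phash is stored as a hexadecimal string; the function name and the example hash values are made up for illustration and are not part of the original notebook.

```python
def phash_hamming(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex-encoded hashes."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")

# Hypothetical hash values, for illustration only.
same = phash_hamming("94974f937d4c2433", "94974f937d4c2433")  # identical hashes
near = phash_hamming("94974f937d4c2433", "94974f937d4c2432")  # last hex digit differs by one bit
```

A small distance suggests near-duplicate images, while a distance of 0 means the hashes are identical.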
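The hardest part of this competition is extracting good features from the image and the title.
Below are some images from the data. Shooting style and image quality vary widely, which is one of the things that makes classifying product images hard.
The evaluation metric is the F1 score, computed per listing and then averaged. Since it is a standard metric, it is not covered in detail here.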
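The evaluation code later in this post calls a helper named getMetric that is never shown. As a minimal sketch of what such a helper might look like, the function below computes the F1 score for one row from its true matches and its predicted matches. The row layout (a "target" entry plus one prediction column) follows the code in this post; the implementation itself and the toy ids are assumptions.

```python
def getMetric(col):
    """Return a function that scores one row: F1 between row['target'] and row[col]."""
    def f1score(row):
        target = set(row["target"])
        pred = set(row[col])
        n = len(target & pred)  # correctly predicted matches
        return 2 * n / (len(target) + len(pred))
    return f1score

# Toy row with hypothetical posting ids: 2 of 3 true matches found, no false positives.
row = {"target": ["a", "b", "c"], "oof": ["a", "b"]}
score = getMetric("oof")(row)  # 2*2 / (3+2) = 0.8
```

Because every listing matches at least itself, both sets are expected to be non-empty, so the denominator is not guarded against zero here.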
A multimodal product matching model based on text and images
3.1 Imports
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import cv2, matplotlib.pyplot as plt
from tqdm import tqdm_notebook
import cudf, cuml, cupy
from cuml.feature_extraction.text import TfidfVectorizer
from cuml.neighbors import NearestNeighbors
from PIL import Image
import torch
torch.manual_seed(0)
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.benchmark = True
import torchvision.models as models
import torchvision.transforms as transforms
import torchvision.datasets as datasets
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.autograd import Variable
from torch.utils.data.dataset import Dataset
3.2 Loading the data
DATA_PATH = '../input/shopee-product-matching/'  # assumed Kaggle input directory; not defined in the post
COMPUTE_CV = True
test = pd.read_csv(DATA_PATH + 'test.csv')
# The public test.csv has only 3 rows; a longer file means the hidden test set is in use.
if len(test) > 3:
    COMPUTE_CV = False
else:
    print('this submission notebook will compute CV score, but commit notebook will not')
# COMPUTE_CV = False
if COMPUTE_CV:
    train = pd.read_csv(DATA_PATH + 'train.csv')
    train['image'] = DATA_PATH + 'train_images/' + train['image']
    # target = all posting_ids that share this row's label_group
    tmp = train.groupby('label_group').posting_id.agg('unique').to_dict()
    train['target'] = train.label_group.map(tmp)
    train_gf = cudf.read_csv(DATA_PATH + 'train.csv')
else:
    train = pd.read_csv(DATA_PATH + 'test.csv')
    train['image'] = DATA_PATH + 'test_images/' + train['image']
    train_gf = cudf.read_csv(DATA_PATH + 'test.csv')
print('train shape is', train.shape)
train.head()
Some of the whitespace may not be obvious in the listings here, so pay attention to indentation when typing the code yourself.
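To show what the target column built in the loading code ends up containing, here is a small pandas sketch on made-up data. The column names posting_id and label_group come from the code above; the toy values are purely illustrative.

```python
import pandas as pd

# Toy data with made-up ids; only the two columns needed for the target are included.
toy = pd.DataFrame({
    "posting_id": ["p1", "p2", "p3", "p4"],
    "label_group": [10, 10, 20, 10],
})

# Map each label_group to the array of posting_ids it contains.
groups = toy.groupby("label_group").posting_id.agg("unique").to_dict()
toy["target"] = toy.label_group.map(groups)

targets = [list(t) for t in toy["target"]]
```

Each row's target lists every listing in its group, including the row itself, which is why a listing always counts as a match for itself.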
3.3 Extracting image features with ResNet18
The module below extracts image features from the product images.
class ShopeeImageEmbeddingNet(nn.Module):
    def __init__(self):
        super(ShopeeImageEmbeddingNet, self).__init__()
        model = models.resnet18(True)
        model.avgpool = nn.AdaptiveMaxPool2d(output_size=(1, 1))
        model = nn.Sequential(*list(model.children())[:-1])
        model.eval()
        self.model = model
    def forward(self, img):
        out = self.model(img)
        return out
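The class above swaps ResNet18's average pooling for nn.AdaptiveMaxPool2d with output size (1, 1) and drops the final classification layer, so each image is reduced to one value per channel. As a rough illustration of what that pooling step computes, here is a NumPy sketch on a made-up feature map; the shapes and the function name are illustrative assumptions, not taken from the original code.

```python
import numpy as np

def global_max_pool(feature_map: np.ndarray) -> np.ndarray:
    """Reduce a (batch, channels, height, width) array to (batch, channels)
    by taking the maximum over the two spatial dimensions."""
    return feature_map.max(axis=(2, 3))

# Made-up feature map: 2 images, 3 channels, 4x4 spatial grid.
fmap = np.arange(2 * 3 * 4 * 4, dtype=np.float32).reshape(2, 3, 4, 4)
pooled = global_max_pool(fmap)  # shape (2, 3)
```

In the real model the same reduction runs on ResNet18's last convolutional output, which has 512 channels, so each image becomes a 512-dimensional vector.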
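Compute and store the image features for every image:
DEVICE = 'cuda'
imgmodel = ShopeeImageEmbeddingNet()
imgmodel = imgmodel.to(DEVICE)
imagefeat = []
# imageloader is not defined in this post; it is expected to be a DataLoader that
# yields batches of preprocessed image tensors in the same order as the rows of train.
with torch.no_grad():
    for data in tqdm_notebook(imageloader):
        data = data.to(DEVICE)
        feat = imgmodel(data)
        feat = feat.reshape(feat.shape[0], feat.shape[1])  # (batch, 512, 1, 1) -> (batch, 512)
        feat = feat.data.cpu().numpy()
        imagefeat.append(feat)
# Fix: stack the per-batch arrays into one (n_images, 512) matrix and L2-normalize each row,
# so that the dot products thresholded at 0.95 in 3.4 are cosine similarities.
imagefeat = np.vstack(imagefeat)
imagefeat = imagefeat / np.linalg.norm(imagefeat, axis=1, keepdims=True)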
3.4 Building candidate image matches with KNN
KNN = 50
if len(test) == 3:
    KNN = 2
# Note: this KNN model is fitted but never queried below; the matches come from the similarity threshold.
model = NearestNeighbors(n_neighbors=KNN)
model.fit(imagefeat)
preds = []
CHUNK = 1024*4
imagefeat = cupy.array(imagefeat)
print('Finding similar images...')
CTS = len(imagefeat)//CHUNK
if len(imagefeat)%CHUNK != 0:
    CTS += 1
for j in range(CTS):
    a = j*CHUNK
    b = (j+1)*CHUNK
    b = min(b, len(imagefeat))
    print('chunk', a, 'to', b)
    # Dot products between feature vectors (cosine similarity if the rows are L2-normalized)
    distances = cupy.matmul(imagefeat, imagefeat[a:b].T).T
    # distances = np.dot(imagefeat[a:b,], imagefeat.T)
    for k in range(b-a):
        IDX = cupy.where(distances[k,]>0.95)[0]
        # IDX = np.where(distances[k,]>0.95)[0][:]
        o = train.iloc[cupy.asnumpy(IDX)].posting_id.values
        preds.append(o)
# del imagefeat, imgmodel
train['oof_cnn'] = preds  # image-based candidates, read as row.oof_cnn in 3.6
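The loop above scores every image against every other image with a matrix product, processed in chunks to limit memory use. As a CPU-only sketch of the same idea, the function below normalizes the rows first so that the dot product is a cosine similarity, then keeps the indices above a threshold. The function name, the toy features, and the small chunk size are assumptions for illustration.

```python
import numpy as np

def similar_rows(features: np.ndarray, threshold: float = 0.95, chunk: int = 4096):
    """For each row, return the indices of rows whose cosine similarity exceeds threshold."""
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norms, 1e-12, None)  # L2-normalize; guard against zero rows
    matches = []
    for start in range(0, len(unit), chunk):
        sims = unit[start:start + chunk] @ unit.T  # (chunk, n) cosine similarities
        for row in sims:
            matches.append(np.where(row > threshold)[0])
    return matches

# Made-up 2-D features: rows 0 and 1 point in nearly the same direction, row 2 does not.
feats = np.array([[1.0, 0.0], [2.0, 0.1], [0.0, 1.0]])
result = [m.tolist() for m in similar_rows(feats, chunk=2)]
```

Chunking only changes how much of the similarity matrix is held in memory at once; the matches are the same as with a single full matrix product.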
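3.5 Candidate matches from TF-IDF vectors and cosine similarity
# The post omits the vectorizer setup; the two lines below are an assumed reconstruction
# (the parameter values are a guess, not taken from the post).
model = TfidfVectorizer(stop_words='english', binary=True, max_features=25000)
text_embeddings = model.fit_transform(train_gf.title).toarray()
preds = []
CHUNK = 1024*4
print('Finding similar titles...')
CTS = len(train)//CHUNK
if len(train)%CHUNK != 0:
    CTS += 1
for j in range(CTS):
    a = j*CHUNK
    b = (j+1)*CHUNK
    b = min(b, len(train))
    print('chunk', a, 'to', b)
    # COSINE SIMILARITY DISTANCE
    # cts = np.dot( text_embeddings, text_embeddings[a:b].T).T
    cts = cupy.matmul(text_embeddings, text_embeddings[a:b].T).T
    for k in range(b-a):
        # IDX = np.where(cts[k,]>0.7)[0]
        IDX = cupy.where(cts[k,]>0.7)[0]
        o = train.iloc[cupy.asnumpy(IDX)].posting_id.values
        preds.append(o)
del model, text_embeddings
train['oof_text'] = preds  # text-based candidates, read as row.oof_text in 3.6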
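For readers without a GPU, the title-matching step above can be sketched in plain Python. This is not the cuML implementation: it uses whitespace tokenization, binary term presence, and a smoothed IDF of the form ln((1 + n) / (1 + df)) + 1, all of which are simplifying assumptions, as are the function names and the toy titles.

```python
import math
from collections import Counter

def tfidf_vectors(titles):
    """Return L2-normalized binary TF-IDF vectors, one dict (token -> weight) per title."""
    docs = [set(t.lower().split()) for t in titles]
    n = len(docs)
    df = Counter(tok for d in docs for tok in d)  # document frequency per token
    idf = {tok: math.log((1 + n) / (1 + c)) + 1 for tok, c in df.items()}
    vecs = []
    for d in docs:
        norm = math.sqrt(sum(idf[t] ** 2 for t in d)) or 1.0
        vecs.append({t: idf[t] / norm for t in d})
    return vecs

def cosine(u, v):
    """Dot product of two sparse unit vectors, i.e. their cosine similarity."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v.get(t, 0.0) for t, w in u.items())

def text_matches(titles, threshold=0.7):
    """For each title, list the indices of titles with cosine similarity above threshold."""
    vecs = tfidf_vectors(titles)
    return [[j for j, v in enumerate(vecs) if cosine(u, v) > threshold] for u in vecs]

# Made-up titles: the first two differ only in letter case.
titles = ["nintendo switch console", "Nintendo Switch CONSOLE", "phone case blue"]
matches = text_matches(titles)
```

The pairwise loop is quadratic in the number of titles, so this sketch is only meant for small examples; the matrix-product approach in the post is what makes the full dataset tractable.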
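3.6 Merging the image and text results
# Listings with an identical image_phash are a third source of candidates.
# This step is not shown in the post; the two lines below are an assumed reconstruction.
tmp = train.groupby('image_phash').posting_id.agg('unique').to_dict()
train['oof_hash'] = train.image_phash.map(tmp)
def combine_for_sub(row):
    # Union of all candidates as a space-separated string (submission format)
    x = np.concatenate([row.oof_text, row.oof_cnn, row.oof_hash])
    return ' '.join(np.unique(x))
def combine_for_cv(row):
    # Union of all candidates as an array (for local scoring)
    x = np.concatenate([row.oof_text, row.oof_cnn, row.oof_hash])
    return np.unique(x)
if COMPUTE_CV:
    tmp = train.groupby('label_group').posting_id.agg('unique').to_dict()
    train['target'] = train.label_group.map(tmp)
    train['oof'] = train.apply(combine_for_cv, axis=1)
    # getMetric is not defined in this post; it is expected to return a function
    # that computes the per-row F1 between row.target and row[col].
    train['f1'] = train.apply(getMetric('oof'), axis=1)
    print('CV Score =', train.f1.mean())
train['matches'] = train.apply(combine_for_sub, axis=1)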
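The merge step above simply takes the union of the candidates from each source. Here is a minimal standalone sketch of that union on made-up posting ids; the function name and the inputs are hypothetical.

```python
import numpy as np

def combine(*candidate_lists) -> str:
    """Union several arrays of posting ids and format them as a space-separated string."""
    merged = np.concatenate([np.asarray(c) for c in candidate_lists])
    return " ".join(np.unique(merged))  # np.unique also sorts the ids

# Hypothetical posting ids from the text, image, and hash candidate sources.
matches = combine(["p1", "p2"], ["p2", "p3"], ["p1"])
```

Taking the union favors recall: a listing is kept if any one of the sources proposes it, so the individual thresholds (0.95 for images, 0.7 for titles) are what keep false positives in check.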
That's it for today's hands-on Kaggle case study. The full code is available through the official account with the keyword “kaggle21”.