Overview
Competition:
ICR - Identifying Age-Related Conditions
(Featured Code Competition)
Tags:
Tabular / Binary Classification / Health
Timeline:
2023/05/12 - 2023/08/11 (JST)
Evaluation:
balanced Log Loss
Result:
206th / 6,430 (Solo Silver medal), Private Score: 0.40019, Submission Entries: 7
Code
Notebook Option: GPU is not used / Internet off
Preparation
import libraries (to use data/model handling)
# import libraries
import sys
sys.path.append("/kaggle/input/iterativestratification")
import gc
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from catboost import CatBoostClassifier, Pool
from sklearn.preprocessing import LabelEncoder
from iterstrat.ml_stratifiers import MultilabelStratifiedKFold
import warnings
warnings.simplefilter("ignore")
print("imported.")
read train/test/option data as pandas dataframe
# read train-data; several raw CSV column names carry a trailing space
# ("BD " etc.) — normalize them so downstream code can use clean names
df_train = pd.read_csv("/kaggle/input/icr-identify-age-related-conditions/train.csv")
df_train = df_train.rename(columns={"BD ": "BD", "CD ": "CD", "CW ": "CW", "FD ": "FD"})
df_train
# read test-data; apply the same trailing-space column cleanup as for train
df_test = pd.read_csv("/kaggle/input/icr-identify-age-related-conditions/test.csv")
df_test = df_test.rename(columns={"BD ": "BD", "CD ": "CD", "CW ": "CW", "FD ": "FD"})
df_test
# read greeks-data (per-sample meta-targets, train rows only); cast all
# columns to pandas "category" dtype for the encoding/stratification below
df_greeks = pd.read_csv("/kaggle/input/icr-identify-age-related-conditions/greeks.csv")
df_greeks = df_greeks.astype("category")
df_greeks
Data preprocessing
data handling with greeks(optional data)
# split the Epsilon date strings ("Month/Day/Year") into three integer
# columns; the sentinel value "Unknown" becomes (0, 0, 0)
eps = [
    [int(part) for part in value.split("/")] if value != "Unknown" else [0, 0, 0]
    for value in df_greeks["Epsilon"]
]
df_epsilon = pd.DataFrame(eps, columns=["Month", "Day", "Year"])
df_greeks = pd.concat([df_greeks, df_epsilon], axis=1)
df_greeks
encoding categorical features
# label-encode the lone categorical feature EJ: fit on train, apply the
# same mapping to test, then show the class counts for both splits
le = LabelEncoder()
df_train["EJ"] = le.fit_transform(df_train["EJ"])
df_test["EJ"] = le.transform(df_test["EJ"])
for tag, frame in (("train", df_train), ("test", df_test)):
    print(f"[{tag}]")
    print(frame["EJ"].value_counts())
# encode categorical features (Greeks): label-encode each meta-target
# column. The shared LabelEncoder instance is re-fit per column —
# fit_transform discards the previous fit, so reusing one object is safe.
greek_cols = ["Alpha", "Beta", "Gamma", "Delta", "Epsilon"]
for col in greek_cols:
    df_greeks[col] = le.fit_transform(df_greeks[col])
# report the encoded value counts in the original column order
for col in greek_cols:
    print(df_greeks[col].value_counts())
shaping datasets for model-fitting
# set dataset for train: drop the row identifier plus columns BZ and DV
# (presumably removed by prior feature selection — TODO confirm with author)
train_data = df_train.drop(columns=["Id", "BZ", "DV"])
cols = train_data.columns[:-1]  # feature columns; the last column is the "Class" target
train_data
# set dataset for test: same feature drop; Id kept aside for the submission file
x_test = df_test.drop(columns=["Id", "BZ", "DV"])
id_test = df_test[["Id"]]
x_test
definition of Evaluation index(balanced Log Loss)
# evaluation metric: class-weighted log loss with weights inversely
# proportional to class frequency (weighted variant used as the CV metric)
def balancedLogLoss(y_true, y_pred):
    """Return the balanced log loss of ``y_pred`` against ``y_true``.

    y_true: array-like of 0/1 labels; y_pred: (n, 2) array of class
    probabilities, column 0 for class 0 and column 1 for class 1.
    """
    labels = np.asarray(y_true)
    # clip so the logarithm never sees 0 or 1 exactly
    probs = np.clip(y_pred, 1e-15, 1 - 1e-15)
    counts = np.bincount(labels)
    total = labels.shape[0]
    # per-class weights: inverse of the class frequency
    weight_zero = total / counts[0]
    weight_one = total / counts[1]
    is_one = (labels != 0).astype(int)
    sum_zero = np.sum((1 - is_one) * np.log(probs[:, 0]))
    sum_one = np.sum(is_one * np.log(probs[:, 1]))
    numerator = (weight_zero / counts[0]) * sum_zero + (weight_one / counts[1]) * sum_one
    return -numerator / (weight_zero + weight_one)
print("defined.")
Fitting and Prediction
setting parameters
# seeds used for seed averaging: every integer in [0, 124)
seeds = range(124)

# cross-validation configuration and trackers for the best seed so far
n_splits = 5
best_BLL = 100000                              # lowest OOF balanced log loss seen
best_y_test_preds = []                         # per-fold test predictions of the best seed
best_oof = np.zeros((len(train_data), 2))      # OOF probabilities of the best seed
best_imp = pd.DataFrame()                      # feature importances of the best seed
fitting and prediction using CatBoostClassifier with MultilabelStratifiedKFold and seed-averaging
# fitting by CatBoost with Multilabel Stratified K-Fold cross-validation,
# repeated over many seeds; the seed whose out-of-fold (OOF) balanced log
# loss is lowest wins, and its predictions/importances are kept.
for seed in seeds:
    y_test_preds = []                     # per-fold test-set predictions
    oof = np.zeros((len(train_data), 2))  # OOF class probabilities for this seed
    imp = pd.DataFrame()                  # per-fold feature importances
    # folds are stratified on all Greeks meta-target columns (1:), not just Class
    cv = list(MultilabelStratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed).split(train_data, df_greeks.iloc[:, 1:]))
    print("-"*20, "seed:", seed, "-"*20)
    params = {
        "loss_function": "MultiClass",
        "eval_metric": "MultiClass:use_weights=False",
        "n_estimators": 10000,
        "learning_rate": 0.005,
        "random_state": seed,
        "l2_leaf_reg": 1,
        "auto_class_weights": "Balanced",  # compensates the class imbalance
        "use_best_model": True,
        "max_ctr_complexity": 15,
        "depth": 10,
        "grow_policy": "Lossguide",
        "max_leaves": 64,
        "min_data_in_leaf": 40,
    }
    for nfold in np.arange(n_splits):
        idx_tr, idx_va = cv[nfold][0], cv[nfold][1]
        x_tr, y_tr = train_data.loc[idx_tr, cols], train_data.loc[idx_tr, "Class"]
        x_va, y_va = train_data.loc[idx_va, cols], train_data.loc[idx_va, "Class"]
        train_pool = Pool(x_tr, y_tr)
        valid_pool = Pool(x_va, y_va)
        # fitting: early stopping on the validation fold, best iteration kept
        model = CatBoostClassifier(**params)
        model.fit(train_pool, eval_set=valid_pool,
                  verbose=False,
                  early_stopping_rounds=1000,
                  use_best_model=True
                  )
        # prediction
        y_tr_pred = model.predict_proba(x_tr)
        y_va_pred = model.predict_proba(x_va)
        oof[idx_va, :] = y_va_pred
        y_test_preds.append(model.predict_proba(x_test))
        print("Balanced LogLoss", nfold, ":", "{:.5f}".format(balancedLogLoss(y_va, y_va_pred)))
        # importance of features for this fold
        _imp = pd.DataFrame({"features": cols, "importance": model.feature_importances_, "nfold": nfold})
        imp = pd.concat([imp, _imp], axis=0, ignore_index=True)
        # free fold-local objects before the next fold (124 seeds x 5 folds)
        del idx_tr, idx_va, x_tr, x_va, y_tr, y_va, train_pool, valid_pool, model, y_tr_pred, y_va_pred
        gc.collect()
    # Balanced LogLoss of the full-train OOF predictions for this seed
    BLL = balancedLogLoss(train_data["Class"], oof)
    if BLL < best_BLL:
        best_BLL = BLL
        best_y_test_preds = y_test_preds
        best_oof = oof
        best_imp = imp
    print("Best Balanced LogLoss(Temporary):", "{:.5f}".format(best_BLL))
    print("")
    del BLL, y_test_preds, oof, imp
    gc.collect()
# final report: OOF score of the winning seed, then fold-averaged importances
print("-"*20, "result", "-"*20)
print("Best Balanced LogLoss:", "{:.5f}".format(balancedLogLoss(train_data["Class"], best_oof)))
print("")
print("-"*14, "feature importance", "-"*14)
print("")
best_imp = best_imp.groupby("features")["importance"].agg(["mean", "std"])
best_imp.columns = ["importance", "importance_std"]
# coefficient of variation of each feature's importance across folds (std / mean)
best_imp["importance_cov"] = best_imp["importance_std"] / best_imp["importance"]
best_imp = best_imp.reset_index(drop=False)
display(best_imp.sort_values("importance", ascending=False, ignore_index=True))
Submission
Taking the average of the prediction results and setting them as pandas dataframe
# set dataset for submission: reuse the sample file's column order
sample_sub = pd.read_csv("/kaggle/input/icr-identify-age-related-conditions/sample_submission.csv")
df_submit = pd.DataFrame(columns=sample_sub.columns)
df_submit["Id"] = id_test["Id"]
# average the per-fold test predictions of the best seed (simple mean over folds)
df_submit[["class_0", "class_1"]] = np.mean(best_y_test_preds, axis=0)
df_submit
# submission
df_submit.to_csv("submission.csv", index=None)
print("completed.")
Best regards, e271828