[ML] Dataset Splitting - Lung-Yu,Tsai's Blog

Data processing often requires splitting a dataset; the most common split is into a training set and a test set.

When certain techniques are used (such as early stopping), a validation set (valid set) is also needed.
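To make the role of the validation set concrete, here is a minimal sketch of an early-stopping check. The function name `early_stop_epoch`, the `patience` parameter, and the loss values are all hypothetical and chosen only for illustration; real training frameworks offer their own variants.

```python
def early_stop_epoch(valid_losses, patience=2):
    """Return the epoch index at which training would stop.

    valid_losses: validation loss measured after each epoch (made-up values here).
    patience: number of consecutive non-improving epochs tolerated before stopping.
    """
    best = float("inf")
    bad_epochs = 0
    for epoch, loss in enumerate(valid_losses):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            bad_epochs = 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch
    # Never ran out of patience: training ran through the last epoch.
    return len(valid_losses) - 1


losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.64]
stop_at = early_stop_epoch(losses, patience=2)
print(stop_at)  # 4
```

Because the stopping decision is driven by the validation loss, the test set stays untouched until the final evaluation.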

Assuming each row holds one sample, the following function automatically splits the data by the desired ratios and saves each part to its own file.

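import random

filename = "source_file_name"  # input base name; ".txt" is appended

def shift(filename, train_rate, test_rate, hasValid=False):
    # Read every row (one sample per line) into a list.
    with open(filename + ".txt", "r") as source:
        src_ls = source.readlines()
    size = len(src_ls)

    f_train = open(filename + "_train.txt", "w")
    f_test = open(filename + "_test.txt", "w")
    f_valid = open(filename + "_valid.txt", "w") if hasValid else None

    # Approximate size of each split.
    print("train = ", int(train_rate * size),
          "valid = ", int((1.0 - train_rate - test_rate) * size),
          "test = ", int(test_rate * size))

    # Shuffle in place so each split is a random sample.
    random.shuffle(src_ls)

    for i in range(size):
        data = src_ls[i]
        if i < size * train_rate:
            f_train.write(data)
        elif i < size * (test_rate + train_rate):
            f_test.write(data)
        elif hasValid:
            # Remaining rows go to the validation set; without hasValid they are dropped.
            f_valid.write(data)

    # Close the output files so everything is flushed to disk.
    f_train.close()
    f_test.close()
    if hasValid:
        f_valid.close()

def main():
    # 10% train and 10% test; with hasValid=False the other 80% is not written anywhere.
    shift(filename, train_rate=0.1, test_rate=0.1, hasValid=False)

if __name__ == '__main__':
    main()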

How it works: the data is first read into a list, then random.shuffle rearranges it into random order.
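If the same split needs to be reproduced later, the shuffle can be seeded. The sketch below is one possible approach, not part of the original function; the helper name `shuffled_copy` and its `seed` parameter are assumptions made for illustration. It uses a private `random.Random` instance so the global random state is left alone.

```python
import random


def shuffled_copy(rows, seed=None):
    """Return a shuffled copy of rows, leaving the input list untouched."""
    rng = random.Random(seed)  # private generator; global random state is unaffected
    out = list(rows)           # copy so the caller's list keeps its order
    rng.shuffle(out)
    return out


rows = ["row%d\n" % i for i in range(10)]
a = shuffled_copy(rows, seed=42)
b = shuffled_copy(rows, seed=42)
print(a == b)  # True: the same seed gives the same order
```

Passing `seed=None` falls back to a non-deterministic seed, which matches the behaviour of a plain `random.shuffle` call.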

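Once the data is fully shuffled, the rows are written out in order as the training set, the test set, and (when requested) the validation set.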
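The sequential assignment can also be expressed with list slices instead of a per-row loop. This is a sketch of a variant that works in memory, with a hypothetical name (`split_rows`); it truncates the split sizes with `int`, so boundary rows may land differently than in the loop-based function when the sizes are not whole numbers.

```python
def split_rows(rows, train_rate, test_rate, has_valid=False):
    """Split already-shuffled rows into (train, test, valid) lists."""
    size = len(rows)
    n_train = int(size * train_rate)
    n_test = int(size * test_rate)
    train = rows[:n_train]
    test = rows[n_train:n_train + n_test]
    # Whatever is left becomes the validation set, or is dropped if not requested.
    valid = rows[n_train + n_test:] if has_valid else []
    return train, test, valid


rows = list(range(20))  # stand-in for shuffled lines of a file
train, test, valid = split_rows(rows, 0.5, 0.25, has_valid=True)
print(len(train), len(test), len(valid))  # 10 5 5
```

Since the three slices are contiguous and non-overlapping, no row can end up in more than one set.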


Lung-Yu,Tsai
