Data processing often requires splitting a dataset; the most common split is into a training set and a test set.

Some methods, such as early stopping, also need a validation set.
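
To show why early stopping needs a validation set, here is a minimal sketch of the idea: training stops once the validation loss has not improved for a set number of epochs. The function name `early_stopping_epoch`, the `patience` parameter, and the loss values are all made up for illustration and do not come from any particular library.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would stop.

    val_losses: validation loss measured after each epoch (hypothetical values).
    patience: number of consecutive epochs without improvement that is tolerated.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: reset the counter.
            best = loss
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    # Patience never ran out, so training runs to the last epoch.
    return len(val_losses) - 1


# Made-up validation losses: improvement stalls after epoch 2.
losses = [0.90, 0.70, 0.65, 0.66, 0.68, 0.64]
stop_at = early_stopping_epoch(losses, patience=2)
print("stop at epoch", stop_at)
```

The losses here are measured on the validation set rather than the test set, so the test set stays untouched until the final evaluation.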

 

Assuming each row holds one record, the function below automatically splits the data into the desired proportions and saves each part to its own file.

import random

filename = "source_file_name"

def shift(filename, train_rate, test_rate, hasValid=False):
    # Read every line (one record per row) into a list.
    with open(filename + ".txt", "r") as source:
        src_ls = source.readlines()
    size = len(src_ls)

    # Slice boundaries: [0, train_end) is train, [train_end, test_end) is test.
    train_end = int(size * train_rate)
    test_end = train_end + int(size * test_rate)

    print("train =", train_end,
          "valid =", size - test_end if hasValid else 0,
          "test =", test_end - train_end)

    # Shuffle in place so that each split is a random sample.
    random.shuffle(src_ls)

    with open(filename + "_train.txt", "w") as f_train:
        f_train.writelines(src_ls[:train_end])
    with open(filename + "_test.txt", "w") as f_test:
        f_test.writelines(src_ls[train_end:test_end])

    # Whatever is left after train and test becomes the validation set.
    if hasValid:
        with open(filename + "_valid.txt", "w") as f_valid:
            f_valid.writelines(src_ls[test_end:])

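def main():
    # With these rates and hasValid=False, only 20% of the rows are written out;
    # the remaining rows are not saved anywhere.
    shift(filename, train_rate=0.1, test_rate=0.1, hasValid=False)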

if __name__ == '__main__':
    main()

 

The idea is to read the data into a list and then use random.shuffle to put it in random order.

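Once the data is fully shuffled, it is written out in order as the training set, the test set and, if requested, the validation set.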
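
The same shuffle-then-slice idea can be tried in memory without touching any files. The sketch below is only an illustration: the names `split_lines`, `has_valid`, and `seed` are assumptions that do not appear in the script above. It uses a private `random.Random(seed)` generator so that a fixed seed gives the same split on every run.

```python
import random


def split_lines(lines, train_rate, test_rate, has_valid=False, seed=None):
    """Shuffle a copy of `lines` and slice it into train / test / valid lists."""
    rng = random.Random(seed)  # private generator; a fixed seed repeats the split
    shuffled = list(lines)     # copy, so the caller's list is left untouched
    rng.shuffle(shuffled)

    train_end = int(len(shuffled) * train_rate)
    test_end = train_end + int(len(shuffled) * test_rate)

    train = shuffled[:train_end]
    test = shuffled[train_end:test_end]
    # The remainder is the validation set, kept only when requested.
    valid = shuffled[test_end:] if has_valid else []
    return train, test, valid


rows = ["row %d\n" % i for i in range(100)]
train, test, valid = split_lines(rows, 0.8, 0.1, has_valid=True, seed=42)
print(len(train), len(test), len(valid))  # 80 10 10
```

Because the three lists are slices of one shuffled list, no row can end up in more than one set.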

Posted by Lung-Yu,Tsai on PIXNET