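Data processing often requires splitting a dataset. The most common split is into a training set and a test set.
When certain techniques are used (such as early stopping), a validation set is needed as well.
Assuming each row holds one record, the following function automatically splits the data in the desired proportions and saves each split to its own file.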
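import random

filename = "source_file_name"

def shift(filename, train_rate, test_rate, hasValid=False):
    # Read every row of the source file into a list (one record per line).
    with open(filename + ".txt", "r") as source:
        src_ls = source.readlines()
    size = len(src_ls)

    # Index boundaries: [0, train_end) is the training set,
    # [train_end, test_end) is the test set, and the remainder is the
    # validation set (written only when hasValid is True).
    train_end = int(size * train_rate)
    test_end = int(size * (train_rate + test_rate))
    valid_size = size - test_end if hasValid else 0
    print("train =", train_end, "valid =", valid_size, "test =", test_end - train_end)

    # Shuffle in place so that each split is a random sample of the rows.
    random.shuffle(src_ls)

    with open(filename + "_train.txt", "w") as f_train:
        f_train.writelines(src_ls[:train_end])
    with open(filename + "_test.txt", "w") as f_test:
        f_test.writelines(src_ls[train_end:test_end])
    if hasValid:
        # Rows left over after the train and test splits go to the validation set.
        with open(filename + "_valid.txt", "w") as f_valid:
            f_valid.writelines(src_ls[test_end:])

def main():
    # With hasValid=False, rows beyond train_rate + test_rate are not written anywhere.
    shift(filename, train_rate=0.1, test_rate=0.1, hasValid=False)

if __name__ == '__main__':
    main()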
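It works by reading the data into a list and then shuffling it with random.shuffle.
Once the data is fully shuffled, the rows are written out in order to the training set, the test set, and (when enabled) the validation set.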
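The shuffle-then-slice idea can also be tried in memory, without touching any files. The sketch below is only an illustration under a few assumptions that are not in the original: the helper name `split_rows`, the `seed` parameter for reproducible shuffles, and the use of `round` (rather than truncation) to pick the cut points are all hypothetical choices.

```python
import random

def split_rows(rows, train_rate, test_rate, seed=None):
    """Shuffle a copy of rows and cut it into (train, test, valid) lists.

    Hypothetical helper: whatever is left after the train and test
    slices is returned as the validation set.
    """
    if train_rate < 0 or test_rate < 0 or train_rate + test_rate > 1 + 1e-9:
        raise ValueError("rates must be non-negative and sum to at most 1")

    # Work on a copy so the caller's list keeps its original order.
    shuffled = list(rows)
    # A dedicated Random instance makes the shuffle repeatable for a given seed.
    random.Random(seed).shuffle(shuffled)

    # Cut points are rounded to the nearest row to avoid float truncation surprises.
    train_end = round(len(shuffled) * train_rate)
    test_end = round(len(shuffled) * (train_rate + test_rate))
    return shuffled[:train_end], shuffled[train_end:test_end], shuffled[test_end:]

if __name__ == "__main__":
    rows = ["row %d\n" % i for i in range(100)]
    train, test, valid = split_rows(rows, train_rate=0.8, test_rate=0.1, seed=0)
    print(len(train), len(test), len(valid))
```

Returning lists instead of writing files makes it easy to check that every row lands in exactly one split before anything is saved. Passing a fixed seed gives the same split on every run, which may help when comparing experiments.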