attributeerror: 'dataframe' object has no attribute 'data

I am getting the following error : attributeerror: 'dataframe' object has no attribute 'data_type'" . I am trying to recreate the code from this link which is based on this article with my own dataset which is similar to the article

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.15, 
                                                  random_state=42, 
                                                  stratify=df.label.values)

df['data_type'] = ['not_set']*df.shape[0]

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

df.groupby(['Conference', 'label', 'data_type']).count()

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased',
                                          do_lower_case=True)

encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].example.values,
    add_special_tokens=True,
    return_attention_mask=True,
    pad_to_max_length=True,
    max_length=256,
    return_tensors='pt'
)

and this is the error I get:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_24180/2662883887.py in <module>
      3 
      4 encoded_data_train = tokenizer.batch_encode_plus(
----> 5     df[df.data_type=='train'].example.values,
      6     add_special_tokens=True,
      7     return_attention_mask=True,

C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5485         ):
   5486             return self[name]
-> 5487         return object.__getattribute__(self, name)
   5488 
   5489     def __setattr__(self, name: str, value) -> None:

AttributeError: 'DataFrame' object has no attribute 'data_type'

I am using python: 3.9; pytorch :1.10.1; pandas: 1.3.5; transformers: 4.15.0

CodePudding user response：

The error means you have no data_type column in your dataframe because you missed this step

from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.15, 
                                                  random_state=42, 
                                                  stratify=df.label.values)

df['data_type'] = ['not_set']*df.shape[0]  # <- HERE

df.loc[X_train, 'data_type'] = 'train'  # <- HERE
df.loc[X_val, 'data_type'] = 'val'  # <- HERE

df.groupby(['Conference', 'label', 'data_type']).count()

Demo

Setup

import pandas as pd
from sklearn.model_selection import train_test_split

# The Data
df = pd.read_csv('data/title_conference.csv')
df['label'] = pd.factorize(df['Conference'])[0]

# Train and Validation Split
X_train, X_val, y_train, y_val = train_test_split(df.index.values, 
                                                  df.label.values, 
                                                  test_size=0.15, 
                                                  random_state=42, 
                                                  stratify=df.label.values)

df['data_type'] = ['not_set']*df.shape[0]

df.loc[X_train, 'data_type'] = 'train'
df.loc[X_val, 'data_type'] = 'val'

Code

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', 
                                          do_lower_case=True)

encoded_data_train = tokenizer.batch_encode_plus(
    df[df.data_type=='train'].Title.values, 
    add_special_tokens=True, 
    return_attention_mask=True, 
    pad_to_max_length=True, 
    max_length=256, 
    return_tensors='pt'
)

Output:

>>> encoded_data_train
{'input_ids': tensor([[  101,  8144,  1999,  ...,     0,     0,     0],
        [  101,  2152,  2836,  ...,     0,     0,     0],
        [  101, 22454, 25806,  ...,     0,     0,     0],
        ...,
        [  101,  1037,  2047,  ...,     0,     0,     0],
        [  101, 13229,  7375,  ...,     0,     0,     0],
        [  101,  2006,  1996,  ...,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}