extract values of specific columns from dataframe to insert to a new dataframe row by row with panda-CodePudding

i have a dataframe, namely data, with a datetime index and the below columns :

                                id  activity     x          y           z
datetime                    
1970-01-01 00:42:00.219142823   1623    A   -0.152512   -8.585220   -1.219192
1970-01-01 00:42:00.269496827   1623    A   0.999466    -8.196548   -0.758926
1970-01-01 00:42:00.319850830   1623    A   0.450241    -8.701187   -1.290024
1970-01-01 00:42:00.370204834   1623    A   -0.042175   -9.739563   -1.787415
1970-01-01 00:42:00.420558838   1623    A   3.551483    -10.745132  -1.266403
... ... ... ... ... ...
1970-01-22 01:26:29.872699000   1644    A   2.239343    -8.408914   2.074087
1970-01-22 01:26:29.892898000   1644    A   2.548301    -8.157437   1.820215
1970-01-22 01:26:29.912994000   1644    A   2.636917    -7.786209   2.057322
1970-01-22 01:26:29.933195000   1644    A   2.545906    -7.743098   1.801055
1970-01-22 01:26:29.953291000   1644    A   2.373464    -8.071217   1.585503
279817 rows × 5 columns

every 119 rows i want to extract only the values of x, y, z columns, as well with the activity label, and put them in a new dataframe row by row. the values of each column followed up by the next column. like below :

values of column x|values of column y|values of column z|activity

next row after 119 rows of values of the dataframe data

values of column x|values of column y|values of column z|activity

etc

any ideas would be very helpful and much appreciated.

Thanks in advance for your time!

CodePudding user response：

If your data frame is called data, then you can use

data.iloc[::119, [data.columns.get_loc(col) for col in ['x', 'y', 'z', 'activity']]]

CodePudding user response：

Edit: after understanding that the question is really about getting data every 5 seconds, we can say so more directly:

wanted = ['x', 'y', 'z', 'activity']
newdf = df.resample('5s', origin='start').first()[wanted].copy()

Note: look at the various possibilities for the origin argument. Using 'start' means we start with the first index in df. But usually it is more natural to used the default ('start_day') which makes "round datetimes".

Original answer:

newdf = df.iloc[::119][['x', 'y', 'z', 'activity']].copy()

Example (reproducible setup):

import numpy as np

np.random.seed(0)
n = 279817
df = pd.DataFrame({
    'id': np.linspace(1623, 1644, n).round().astype(int),
    'activity': ['A'] * n,
}, index=pd.date_range(start='1970-01-01 00:42:00', end='1970-01-22 01:26:30', periods=n))
df[list('xyz')] = np.random.normal(size=(n, 3))

# code above, then

>>> newdf
                                      x         y         z activity
1970-01-01 00:42:00.000000000  1.764052  0.400157  0.978738        A
1970-01-01 00:54:52.762565400 -0.477974 -0.479656  0.620358        A
1970-01-01 01:07:45.525130800  1.327783 -0.101281 -0.803141        A
1970-01-01 01:20:38.287696200 -1.429991 -0.061638 -1.432735        A
1970-01-01 01:33:31.050261600 -1.109478 -0.547518  0.665967        A
...                                 ...       ...       ...      ...
1970-01-22 00:29:53.740994081 -0.267150  0.414945 -0.627917        A
1970-01-22 00:42:46.503559481 -0.001320  0.743854  0.137284        A
1970-01-22 00:55:39.266124882 -0.469361 -1.204360 -1.688222        A
1970-01-22 01:08:32.028690282  0.340768 -0.911023  0.069327        A
1970-01-22 01:21:24.791255682 -1.461045  1.938692 -0.188877        A

[2352 rows x 4 columns]