Outputting pandas series to txt file


Solution 1

I would suggest turning your pd.Series into a pd.DataFrame first.

df = pd.DataFrame.from_items(zip(series.index, series.str.split(' '))).T

As long as every entry in the Series holds the same number of space-separated values, this returns a DataFrame in this format:

Out[49]: 
      0     1    2     3     4
0  3072   648  457  1035   260
1  1196   475  150  7153   671
2   838     1  300   953  1210
3  2278   151   21  4993  2628
4  1259  1035  339  2571  7153
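Note that DataFrame.from_items was deprecated in pandas 0.23 and removed in pandas 1.0, so the from_items line fails on current versions. A minimal sketch of an equivalent, assuming a Series named series of space-separated strings (the two sample rows are illustrative values taken from the question, and the astype(int) cast is optional):

```python
import pandas as pd

# Illustrative sample: the index is the user id, the values are
# space-separated movie ids (assumed shape of the asker's Series).
series = pd.Series(
    ["3072 1196 838 2278 1259", "648 475 1 151 1035"],
    index=pd.Index([1, 2], name="userId"),
)

# One row per Series entry, one column per movie id; no transpose needed.
df = pd.DataFrame(series.str.split(" ").tolist(), index=series.index)

# Optional: cast the string pieces to integers.
df = df.astype(int)
print(df)
```

Building the frame from a list of lists keeps each Series entry as a row, so the original index (the user ids) carries over directly.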

Next I would name the columns appropriately

df.columns = ['movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']

Finally, the dataframe is indexed by customer id (I am supposing this based upon your series index). We want to move that into the dataframe, and then reorganise the columns.

df['customer_id'] = df.index
df = df[['customer_id', 'movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']]

This now leaves you with a dataframe like this

   customer_id movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
0            0      3072       648       457      1035       260
1            1      1196       475       150      7153       671
2            2       838         1       300       953      1210
3            3      2278       151        21      4993      2628
4            4      1259      1035       339      2571      7153

which I would recommend you write to disk as a csv using

df.to_csv('filepath.csv', index=False)

If however you want to write it as a text file, with only spaces separating, you can use the same function but pass the separator.

df.to_csv('filepath.txt', sep=' ', index=False)
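The format requested in the question has no header row. As a hedged sketch, to_csv also accepts header=False; the example writes to an in-memory buffer (an assumption made here so the result can be inspected without touching disk) and uses a small illustrative frame:

```python
import io

import pandas as pd

# Small illustrative frame; column names follow the answer above.
df = pd.DataFrame(
    {"customer_id": [1, 2], "movie-id1": [3072, 648], "movie-id2": [1196, 475]}
)

# Write space-separated rows with neither the index nor the header line.
buffer = io.StringIO()
df.to_csv(buffer, sep=" ", index=False, header=False)
text = buffer.getvalue()
print(text)
```

With the default quoting, a field that itself contains a space is wrapped in quotes, so a space separator is only unambiguous for values such as these ids.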

I don't think a Series is the right data structure for the problem you want to solve. Treating numerical data as numerical data (in a DataFrame) is far easier than maintaining space-delimited string conversions, in my opinion.

Solution 2

You can use the following approach: split each item of your Series object (which I called s) into a list, prepend its index label, and convert the resulting list of lists into a DataFrame object (which I called df):

df = pd.DataFrame([[idx] + ids.split(' ') for idx, ids in s.items()])

The [idx] + ids.split(' ') part prepends the index label to the list of movie ids, and this is done for every row of the series. Iterating with s.items() avoids looking rows up by position, which fails when the index labels (such as user ids starting at 1) do not match the positions.

After that, you could just dump the DataFrame to a .txt file using a space as separator:

df.to_csv('output.txt', sep=' ', index=False)

You could also name your columns before dumping it, as suggested earlier.

Solution 3

It's also worth avoiding the csv-writing machinery altogether when the series holds text, since that sidesteps escaping and quoting problems. For example:

with open(filename, 'w') as f:
    for entry in df['target_column']:
        f.write(f'{entry}\n')

Of course you can add the series index yourself in the loop, if desired.
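A minimal sketch of adding the index in the loop, assuming a Series named s shaped like the one in the question (the sample rows and the path output.txt are illustrative):

```python
import pandas as pd

# Illustrative sample shaped like the asker's Series.
s = pd.Series(
    ["3072 1196 838 2278 1259", "648 475 1 151 1035"],
    index=pd.Index([1, 2], name="userId"),
)

path = "output.txt"  # placeholder path

# Series.items() yields (index label, value) pairs, so the label can be
# any unique key (a name, an id), not only a running counter.
with open(path, "w") as f:
    for user_id, movie_ids in s.items():
        f.write(f"{user_id} {movie_ids}\n")
```

Because the values are written verbatim, nothing is quoted or escaped.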

Solution 4

I suggest modifying the code as shown below

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np


def predict(l):
    # finds the userIds corresponding to the top 5 similarities
    # calculate the prediction according to the formula
    return (df[l.index] * l).sum(axis=1) / l.sum()


# use userId as columns for convenience when interpreting the formula
df = pd.read_csv('ratings.csv').pivot(columns='userId',
                                      index='movieId',
                                      values='rating')
df = df - df.mean()
similarity = pd.DataFrame(cosine_similarity(
    df.T.fillna(0)), index=df.columns, columns=df.columns)

res = df.apply(lambda col: (0 * col).fillna(
    predict(similarity[col.name].nlargest(6).iloc[1:])
).nlargest(5).index.tolist()
).apply(pd.Series).rename(
    columns=lambda col_name: 'movie-id{}'.format(col_name + 1)).reset_index(
).rename(columns={'userId': 'customer_id'})
# convert to csv
res.to_csv('filepath.txt', sep=' ', index=False)

res.head()

In [2]: res.head()
Out[2]: 
   customer_id  movie-id1  movie-id2  movie-id3  movie-id4  movie-id5
0            1       3072       1196        838       2278       1259
1            2        648        475          1        151       1035
2            3        457        150        300         21        339
3            4       1035       7153        953       4993       2571
4            5        260        671       1210       2628       7153

Show the file:

In [3]: ! head -5 filepath.txt
customer_id movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
1 3072 1196 838 2278 1259
2 648 475 1 151 1035
3 457 150 300 21 339
4 1035 7153 953 4993 2571
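The reshaping at the end of that chain (a Series of lists expanded into named columns) can be seen in isolation. This is a sketch under assumptions: the Series top5 is a hypothetical stand-in for the per-user lists, and the DataFrame constructor is swapped in for .apply(pd.Series) as a simpler way to do the same expansion:

```python
import pandas as pd

# Hypothetical stand-in for the per-user top-5 lists built in the chain above.
top5 = pd.Series(
    [[3072, 1196, 838, 2278, 1259], [648, 475, 1, 151, 1035]],
    index=pd.Index([1, 2], name="userId"),
)

# Expand each list into its own row of columns 0..4.
res = pd.DataFrame(top5.tolist(), index=top5.index)

# Rename 0..4 to movie-id1..movie-id5, then move the index into a column.
res.columns = [f"movie-id{i + 1}" for i in res.columns]
res = res.reset_index().rename(columns={"userId": "customer_id"})
print(res)
```

Since the lists hold integers from the start, the resulting columns are numeric and no string round trip is involved.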
Author by

mrsquid

Updated on June 04, 2022

Comments

  • mrsquid
    mrsquid almost 2 years

    I have a pandas series object

    <class 'pandas.core.series.Series'>
    

    that looks like this:

    userId
    1          3072 1196 838 2278 1259
    2               648 475 1 151 1035
    3               457 150 300 21 339
    4          1035 7153 953 4993 2571
    5           260 671 1210 2628 7153
    6          4993 1210 2291 589 1196
    7               150 457 111 246 25
    8       1221 8132 30749 44191 1721
    9           296 377 2858 3578 3256
    10          2762 377 2858 1617 858
    11           527 593 2396 318 1258
    12        3578 2683 2762 2571 2580
    13        7153 150 5952 35836 2028
    14        1197 2580 2712 2762 1968
    15        1245 1090 1080 2529 1261
    16         296 2324 4993 7153 1203
    17       1208 1234 6796 55820 1060
    18            1377 1 1073 1356 592
    19           778 1173 272 3022 909
    20              329 534 377 73 272
    21            608 904 903 1204 111
    22       1221 1136 1258 4973 48516
    23        1214 1200 1148 2761 2791
    24             593 318 162 480 733
    25               314 969 25 85 766
    26        293 253 4878 46578 64614
    27          1193 2716 24 2959 2841
    28         318 260 58559 8961 4226
    29            318 260 1196 2959 50
    30        1077 1136 1230 1203 3481
    
    642            123 593 750 1212 50
    643         750 671 1663 2427 5618
    644            780 3114 1584 11 62
    645         912 2858 1617 1035 903
    646           608 527 21 2710 1704
    647         1196 720 5060 2599 594
    648         46578 50 745 1223 5995
    649            318 300 110 529 246
    650            733 110 151 318 364
    651         1240 1210 541 589 1247
    652      4993 296 95510 122900 736
    653            858 1225 1961 25 36
    654        333 1221 3039 1610 4011
    655           318 47 6377 527 2028
    656          527 1193 1073 1265 73
    657             527 349 454 357 97
    658            457 590 480 589 329
    659              474 508 1 288 477
    660         904 1197 1247 858 1221
    661           780 1527 3 1376 5481
    662             110 590 50 593 733
    663          2028 919 527 2791 110
    664    1201 64839 1228 122886 1203
    665        1197 858 7153 1221 6539
    666            318 300 161 500 337
    667            527 260 318 593 223
    668            161 527 151 110 300
    669          50 2858 4993 318 2628
    670          296 5952 508 272 1196
    671         1210 1200 7153 593 110
    

    What is the best way to output this to a txt file (e.g. output.txt) so that the format looks like this?

    User-id1 movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
    User-id2 movie-id1 movie-id2 movie-id3 movie-id4 movie-id5
    

    The values on the far left are the userIds and the other values are the movieIds.

    Here is the code that generated the above:

    import pandas as pd
    from sklearn.metrics.pairwise import cosine_similarity
    import numpy as np
    
    def predict(l):
        # finds the userIds corresponding to the top 5 similarities
        # calculate the prediction according to the formula
        return (df[l.index] * l).sum(axis=1) / l.sum()
    
    
    # use userId as columns for convenience when interpreting the formula
    df = pd.read_csv('ratings.csv').pivot(columns='userId',
                                          index='movieId',
                                          values='rating')
    df = df - df.mean()
    similarity = pd.DataFrame(cosine_similarity(
        df.T.fillna(0)), index=df.columns, columns=df.columns)
    
    res = df.apply(lambda col: ' '.join('{}'.format(mid) for mid in (0 * col).fillna(
        predict(similarity[col.name].nlargest(6).iloc[1:])).nlargest(5).index))
    
    
    
    #Do not understand why this does not work for me but works below
    df = pd.DataFrame.from_items(zip(res.index, res.str.split(' ')))
    #print(df)
    df.columns = ['movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']
    df['customer_id'] = df.index
    df = df[['customer_id', 'movie-id1', 'movie-id2', 'movie-id3', 'movie-id4', 'movie-id5']]
    df.to_csv('filepath.txt', sep=' ', index=False)
    

    I tried implementing @emmet02's solution but got this error, and I do not understand why:

    ValueError: Length mismatch: Expected axis has 671 elements, new values have 5 elements
    

    Any advice is appreciated, please let me know if you need any more information or clarification.

  • mrsquid
    mrsquid over 6 years
    I was just attempting something like this! Wasn't able to get it to work though. Thank you! I see my mistake now! Also, do you have any changes to the above that would keep the values as integers? I was only able to sort using strings, though I will probably ask another question.
  • emmet02
    emmet02 over 6 years
    If the entire dataframe is integers use df = df.astype(int) to cast all values to ints.
  • piRSquared
    piRSquared over 6 years
    @mrsquid you now have enough rep to upvote. Consider showing additional appreciation by upvoting any answers you found helpful, including this one.
  • mrsquid
    mrsquid over 6 years
    Yes, I will be sure to! I was waiting to give upvotes since these guys definitely deserve them!
  • mrsquid
    mrsquid over 6 years
    One quick question: when I implement your solution I get an error. I have updated the question above; do you know why I get an error while you do not?
  • mrsquid
    mrsquid over 6 years
    Like I am unable to get the same output[49] you get.
  • mrsquid
    mrsquid over 6 years
    For some reason my df looks different than yours when I add the code, it causes my df to have 671 columns and 5 rows rather than the opposite
  • mrsquid
    mrsquid over 6 years
    Actually I was able to get it! just needed a simple transpose, still not sure why your output originally looks different. Thank you for your help!
  • mrsquid
    mrsquid over 6 years
    Brilliant! Thank you!
  • holypriest
    holypriest over 6 years
    @mrsquid Nice to help. I think zip will not work as intended here, giving you a transposed version of what you want.
  • mrsquid
    mrsquid over 6 years
    Yeah, you're right, it gave a transposed version. I was able to transpose it to get the desired output. I didn't know that zip did that in this scenario, and I'm still not sure why.
  • mrsquid
    mrsquid about 6 years
    Apologies for the delay in saying this, but thank you for your help again!
  • Minions
    Minions almost 5 years
    Thanks, it works. I have a question: how does pandas handle a blank space as the separator when the tweets contain many other spaces?
  • Casivio
    Casivio almost 4 years
    How do you add the index? I am assuming that the index is unique names, not a simple iterator.
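
On the transposition and the "Length mismatch" error discussed in the comments: pairing each index label with its list of values makes every pair a column, so the result has one column per user and five rows. A sketch that reproduces the shape with a plain dict (an assumption, since from_items no longer exists in current pandas) and three illustrative users:

```python
import pandas as pd

# Three illustrative users from the question.
res = pd.Series(
    ["3072 1196 838 2278 1259", "648 475 1 151 1035", "457 150 300 21 339"],
    index=pd.Index([1, 2, 3], name="userId"),
)

# Each (user id, list of ids) pair becomes a COLUMN of the frame.
wide = pd.DataFrame(dict(zip(res.index, res.str.split(" "))))
print(wide.shape)  # 5 rows, one column per user

names = [f"movie-id{i}" for i in range(1, 6)]
try:
    wide.columns = names  # 3 columns but 5 names
except ValueError as exc:
    print(exc)  # the same kind of length-mismatch error as in the question

# Transposing puts users on the rows, so the five names now fit.
tall = wide.T
tall.columns = names
print(tall)
```

With 671 users the untransposed frame has 671 columns, which is why assigning five column names fails until .T is applied.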