The goal of this guide is to build a system capable of chatting like you, using your own WhatsApp and Telegram chats as an ML dataset. What we will do could be summarized in the following steps:。 So let’s start our adventure!。 It’s always better not to run random scripts on personal information (like personal chat messages).。 I guarantee there’s no catch, but you can always check the native code that’s being used: messaging-chat-parser and pistoBot, all the sources are open-source on github.。 First of all, we need to gather the data from our chat applications. We will now learn how to export data from two of the most commonly used instant messaging apps: WhatsApp and Telegram.。 We have to export one .txt file for each chat we want to include in the final dataset. So, as described on the official WhatsApp website:。 Note that only 1 to 1 chats are allowed (namely individual), we suggest to export chats with the highest number of messages, in order to achieve a bigger dataset and get better final results.。 Now you should have more files, each with a structure that looks like the snippet below:。 Take note of the text you find under <YourName> placeholder in your exported chats. This parameter is your name for the WhatsApp app and we will use this value later.。 The process here will be faster than WhatsApp because Telegram will export everything in a single .json file, without having the limit of exporting one chat at a time.。 So, as described on the official Telegram website:。 Now you should have one file named telegram_dump.json with this structure:。 To train a GPT-2 neural network, first of all we need to pre-process the data, in order to obtain a single .txt with a machine-learning compatible structure.。 For the sake of simplicity and since the ML model we will use requires a GPU to work, we are going to use Google Colab for the next step.。 If you don’t know what Google Colab is, check this other article.。 Open this Colab notebook and follow these steps:。 To work with the data, we need to upload them on Colab, into the right folders.。 Select all your .txt files and upload everything into the following notebook folder:。 ./messaging-chat-parser/data/chat_raw/whatsapp/。 Get the file telegram_dump.json and upload it into the following notebook folder:。 ./messaging-chat-parser/data/chat_raw/telegram/。 Now, run all the cells up until the block “2️⃣ Parse the data”.。 Here we need to replace the variable “whatsapp_user_name” with your WhatsApp name, called <YourName> on the 1.1 chapter.。 You can also change the date format parsing system if some of the exported data show a different format due to local time formatting.。 So, for example, if my name is “Bob” and I’m from America, the code I should use is the following:。 Now execute the cell under the “3️⃣ Train a GTP2 model” notebook chapter, it will run a new training using your provided data.。 A progress bar will be shown, and the training could take up to 10 hours, it depends mostly on which GPU type Colab is running and how many much messages were provided.。 Wait until the process ends.。 After the training is completed, run all the remaining notebook cells: the last one will show a text block with a ✍ symbol on the left.。 You could use this text box to insert the messages you want to “send” to the ML model. So write your message and then press enter.。 After the first message is sent, the system will prompt some information about the conversation.。 You will now see the most interesting results as a list of messages:。 After the replied message is generated, you could continue to chat for a total of 5 messages. After this, you could re-run the cell to start a new conversation with the model.。 So in this guide we have seen how simple it is to train your GPT-2 model from scratch, the task is simple (but not trivial!) only thanks to to aitextgen package that runs under the pistoBot hood.。 Note that if your chat messages are in English you could easily obtain better results than the ones we got with this standard approach, since you could use the transfer learning from a GPT-2 pretrained model.。 The pistoBot repository allows you to train (or fine-tune) different models, including the chance to start from a GPT-2 pretrained model: check the repository folder for more information.。 We have chosen the standard, un-trained GPT-2 model so that even the non-english users could use this AI.。 This article was written by Simone Guardati and originally published on Towards Data Science. You can read it here. 。A quick warning before we get started。
Get the data。
WhatsApp export。
1.2 Telegram。
2. Parse the data。
2.1 Google Colab。
2.2 Start the notebook。
2.3 Load the data。
WhatsApp chats。
Telegram JSON。
2.4 Parse the data。
3. Train a GPT-2 model。
4. Chat with the model。
4.1 How read the results。
The ones you have sent to the model。
The ones generated by the model.。5. Conclusion。
How to create an AI that chats like you on WhatsApp
••
(1272)
Stop calling it bias. AI is racist
Previous2024-05-19 06:41
MIT’s coronavirus
Next 2024-05-19 08:38