I am going to start my bachelor's thesis in the next couple of days, hopefully. Originally I planned on starting today, but my supervisor still needs a bit of time to review my plan. During a previous internship I collected huge amounts of data from Twitter, which will now be analyzed. The overarching theme of this work is to create a social media "corporate genome" that can encode and describe a company. My thesis is a first step towards this, limiting the scope to data supplied by Twitter.
The idea is to learn an interesting embedding from the social network graph that clusters accounts of the same company together – or encodes relationships between accounts as shifts in the latent space. The technique most likely to be implemented is node2vec, although an unsupervised autoencoder for graphs would be interesting too. The original authors supply a reference implementation, but it is unfortunately not feasible to use here: the amount of data I want to process requires a GPU to finish in reasonable time. I am considering implementing the algorithm either in TensorFlow or with nvGRAPH, a graph library from Nvidia. Regardless of which technique ends up being used, a major contribution of this thesis lies in the dataset gathered over the previous months.
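To make the technique concrete, here is a minimal pure-Python sketch of the part of node2vec that distinguishes it from plain DeepWalk: the biased second-order random walk. The walks generated this way would then be fed into a skip-gram model to learn the actual embeddings; the parameters `p` and `q` control the tendency to return to the previous node versus exploring outward. This is an illustrative sketch, not the authors' implementation.

```python
import random
from collections import defaultdict

def build_adj(edges):
    """Build an undirected adjacency list from an edge list."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    return adj

def node2vec_walk(adj, start, length, p=1.0, q=1.0, rng=random):
    """One biased random walk as in node2vec.

    p: return parameter (high p discourages going back),
    q: in-out parameter (low q encourages outward exploration).
    """
    walk = [start]
    while len(walk) < length:
        cur = walk[-1]
        nbrs = adj[cur]
        if not nbrs:          # dead end: stop the walk early
            break
        if len(walk) == 1:    # first step is uniform
            walk.append(rng.choice(nbrs))
            continue
        prev = walk[-2]
        # second-order bias: 1/p to return, 1 for neighbors shared
        # with the previous node, 1/q to move further away
        weights = []
        for nxt in nbrs:
            if nxt == prev:
                weights.append(1.0 / p)
            elif nxt in adj[prev]:
                weights.append(1.0)
            else:
                weights.append(1.0 / q)
        walk.append(rng.choices(nbrs, weights=weights, k=1)[0])
    return walk
```

In the full algorithm one would generate many such walks per node and train a word2vec-style skip-gram model on them, treating nodes as words and walks as sentences.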
The data still needs a fair bit of cleaning. For example, I could remove all tweets that do not mention anyone else. This would simplify the network, but it would also discard possibly important relations: an account that posts often should be distinguishable from one that rarely does. Another option is removing accounts that do not interact with anyone else. This should be safe to do, but it will require some effort. In general, cleaning requires the graph to exist as a whole; otherwise any analysis is simply not possible.
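The removal of non-interacting accounts could be sketched as a two-pass filter over the raw edge list: count how many interaction edges touch each account, then keep only edges whose endpoints interact often enough. The function names and the `min_degree` threshold here are my own illustration, not part of the actual pipeline.

```python
from collections import Counter

def interaction_counts(edge_iter):
    """First pass: count how many interaction edges touch each account."""
    deg = Counter()
    for u, v in edge_iter:
        deg[u] += 1
        deg[v] += 1
    return deg

def filter_edges(edge_iter, deg, min_degree=2):
    """Second pass: keep only edges whose endpoints both
    reach the minimum interaction count."""
    for u, v in edge_iter:
        if deg[u] >= min_degree and deg[v] >= min_degree:
            yield (u, v)
```

Because each pass streams over the edges, this particular cleanup step would not need the whole graph in memory at once – only the per-account counters.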
Currently, the graph is too big to keep in memory in its entirety. Hopefully the cleaning process slims it down to a size that fits into RAM. The nvGRAPH website reports the ability to handle up to two billion edges, and I am currently at three billion – before cleaning. It will be interesting to see whether I can reduce that count, or whether the library can handle the number of edges I currently have.
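A quick back-of-envelope calculation shows why the edge count matters so much. Assuming a CSR-style layout with 32-bit node indices (which works as long as there are fewer than ~4.3 billion nodes), the edge array alone is already substantial:

```python
# Rough memory estimate for the edge array of a CSR graph,
# assuming 32-bit node indices. Row pointers add comparatively
# little on top (8 bytes per node for 64-bit offsets).
edges = 3_000_000_000        # current edge count, before cleaning
bytes_per_entry = 4          # one 32-bit column index per directed edge
gigabytes = edges * bytes_per_entry / 1e9
print(gigabytes)             # 12 GB for one direction
```

If the undirected graph is stored with each edge in both directions, that doubles to roughly 24 GB – before any per-edge weights or auxiliary arrays, which explains why the full graph does not fit comfortably in RAM.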
This post starts a series on my bachelor's thesis. Hopefully I can write more posts; they could serve me well once I actually need to write up the results.