The website Footnote 2 was used to obtain tweet-ids Footnote 3 ; this website provides researchers with metadata from a (third-party-collected) corpus of Dutch tweets (Tjong Kim Sang and Van den Bosch, 2013). [...] (i.e., the historical limit when requesting tweets based on a search query). The R package 'rtweet' and the complementary 'lookup_status' function were used to collect tweets in JSON format. The JSON file comprises a table with the tweets' information, such as the creation date, the tweet text, and the source (i.e., type of Twitter client).
Data cleaning and preprocessing
The JSON Footnote 4 files were converted into an R data frame object. Non-Dutch tweets, retweets, and automated tweets (e.g., forecast-, advertisement-, and traffic-related tweets) were removed. In addition, we excluded tweets based on three user-related criteria: (1) We removed tweets that belonged to the top 0.5 percentile of user activity, such as users who created more than 2000 tweets within four weeks, because we considered them non-representative of the normal user population. (2) Tweets from users with early access to the 280-character limit were removed. (3) Tweets from users who were not represented in both the pre-CLC and post-CLC datasets were removed; this procedure ensured a consistent user sample over time (within-group design, N_users = 109,661). All cleaning procedures and corresponding exclusion numbers are presented in Table 2.
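The user-related criteria (1) and (3) can be illustrated with a minimal Python sketch. It is not the authors' R code. The function name `filter_users`, the `(user_id, text)` tuple layout, and the reading of "top 0.5 percentile" as the most active 0.5% of users, counted over both periods combined, are assumptions made for illustration.

```python
from collections import Counter

def filter_users(pre, post, top_fraction=0.005):
    """Apply two user-level filters to lists of (user_id, text) tuples.

    Assumption: "top 0.5 percentile of user activity" is read as the
    most active 0.5% of users, with activity counted over both periods.
    """
    # Count tweets per user across both periods.
    activity = Counter(user for user, _ in pre + post)
    # Rank users from most to least active and mark the top fraction.
    ranked = sorted(activity, key=activity.get, reverse=True)
    n_drop = int(len(ranked) * top_fraction)
    too_active = set(ranked[:n_drop])
    # Keep only users who appear in both the pre and post datasets.
    in_both = {u for u, _ in pre} & {u for u, _ in post}
    keep = in_both - too_active
    pre_kept = [t for t in pre if t[0] in keep]
    post_kept = [t for t in post if t[0] in keep]
    return pre_kept, post_kept
```

Restricting both datasets to the same set of users is what makes the later pre/post comparison a within-group comparison.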
The tweet texts were converted to ASCII encoding. URLs, line breaks, tweet headers, screen names, and references to screen names were removed. URLs add to the character count when they are located within the tweet. However, URLs do not add to the character count when they are located at the end of a tweet. To prevent a misrepresentation of the actual character limit that users had to deal with, tweets with URLs (but not media URLs such as added pictures or videos) were excluded.
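A rough Python sketch of this text-cleaning step is given below. It is an illustration under stated assumptions, not the authors' R pipeline: the regular expressions, the helper names `has_url` and `clean_tweet`, and the choice to drop (rather than transliterate) non-ASCII characters are all hypothetical. Distinguishing media URLs from other URLs would require tweet metadata that is not modeled here.

```python
import re

# Assumed patterns; the study does not specify the ones it used.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")

def has_url(text):
    """Return True if the tweet text contains any URL."""
    return URL_RE.search(text) is not None

def clean_tweet(text):
    """Strip non-ASCII characters, URLs, screen-name references, line breaks."""
    # Assumption: characters outside ASCII are dropped, not transliterated.
    text = text.encode("ascii", errors="ignore").decode("ascii")
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    # Collapse line breaks and repeated whitespace into single spaces.
    return " ".join(text.split())
```

In this sketch, `has_url` would be used to exclude tweets before cleaning, and `clean_tweet` would be applied to the tweets that remain.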
Token and bigram analysis
The R package Footnote 5 'quanteda' was used to tokenize the tweet texts into tokens (i.e., isolated words, punctuation [...]). In addition, token-frequency-matrices were computed with: the frequency pre-CLC [f(token pre)], the relative frequency pre-CLC [P(token pre)], the frequency post-CLC [f(token post)], the relative frequency post-CLC, and T-scores. The T-test is similar to a standard T-statistic and computes the statistical difference between means (i.e., the relative word frequencies). Negative T-scores indicate a relatively higher occurrence of a token pre-CLC, whereas positive T-scores indicate a relatively higher occurrence of a token post-CLC. The T-score equation used in the analysis is presented as Eq. (1) and (2). N is the total number of tokens per dataset (i.e., pre- and post-CLC). This equation is based on the method for linguistic computations by Church et al. (1991; Tjong Kim Sang, 2011).
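Eq. (1) and (2) are not reproduced in this section. As a hedged illustration only, the Python sketch below assumes a common form of this kind of T-score: the relative frequency is P(token) = f(token) / N, the variance of each relative frequency is approximated by P(token) / N, and the score is the post-CLC minus the pre-CLC relative frequency divided by the combined standard error. The function name `t_scores` and this exact formula are assumptions and may differ from the equations used in the study.

```python
import math
from collections import Counter

def t_scores(pre_tokens, post_tokens):
    """Return {token: T-score}; positive means relatively more frequent post."""
    if not pre_tokens or not post_tokens:
        raise ValueError("both token lists must be non-empty")
    f_pre, f_post = Counter(pre_tokens), Counter(post_tokens)
    n_pre, n_post = len(pre_tokens), len(post_tokens)
    scores = {}
    for tok in set(f_pre) | set(f_post):
        # Relative frequencies P(token) = f(token) / N per dataset.
        p_pre = f_pre[tok] / n_pre
        p_post = f_post[tok] / n_post
        # Assumed variance approximation: var(P) ~ P / N.
        se = math.sqrt(p_pre / n_pre + p_post / n_post)
        scores[tok] = (p_post - p_pre) / se
    return scores
```

With this sign convention, a token that is relatively more frequent pre-CLC receives a negative score and a token that is relatively more frequent post-CLC receives a positive one, matching the interpretation given above.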
Part-of-speech (POS) analysis
The R package Footnote 6 'openNLP' was used to classify and count POS categories in the tweets (i.e., adjectives, adverbs, articles, conjunctives, interjections, nouns, numerals, prepositions, pronouns, punctuation, verbs, and miscellaneous). The POS tagger uses a maximum entropy (maxent) probability model to predict the POS category based on contextual features (Ratnaparkhi, 1996). The Dutch maxent model used for the POS classification was trained on CoNLL-X Alpino Dutch Treebank data (Buchholz and ). The openNLP POS model has been reported to have an accuracy score of 87.3% when used for English social media data (Horsmann et al., 2015). An ostensible limitation of the current analysis is the accuracy of the POS tagger. However, similar analyses were performed for both the pre-CLC and post-CLC datasets, meaning the accuracy of the POS tagger should be consistent across both datasets. Therefore, we assume there are no systematic confounds.