SI Dec 11, 2025
Understanding Toxic Interaction Across User and Video Clusters in Social Video Platforms
Qiao Wang, Liang Liu, Mitsuo Yoshida
Social video platforms shape how people access information, while recommendation systems can narrow exposure and increase the risk of toxic interaction. Previous research has often examined text or users in isolation, overlooking the structural context in which such toxic interactions occur. Without considering who interacts with whom and around what content, it is difficult to explain why negative expressions cluster within particular communities. To address this issue, this study focuses on the Chinese social video platform Bilibili, incorporating video-level information as the environment for user expression and modeling users and videos in an interaction matrix. After normalization and dimensionality reduction, we cluster each side of the video-user interaction matrix separately with K-means. Cluster assignments facilitate comparisons of user behavior, including message length, posting frequency, and source (barrage and comment), as well as textual features such as sentiment and toxicity, and video attributes defined by uploaders. Such a clustering approach integrates structural ties with content signals to identify stable groups of videos and users. We find clear stratification in interaction style (message length, comment ratio) across user clusters, while sentiment and toxicity differences are weak or inconsistent across video clusters. Across video clusters, viewing volume exhibits a clear hierarchy, with higher-exposure groups concentrating more toxic expressions. For such groups, platforms should intervene promptly during periods of rapid growth. Across user clusters, comment ratio and message length form distinct hierarchies, and several clusters with longer and comment-oriented messages exhibit lower toxicity. For such groups, platforms should strengthen mechanisms that sustain rational dialogue and encourage engagement across topics.
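The two-sided clustering step described in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the toy `interactions` matrix, the L2 row normalization, the farthest-point initialization, and the choice of `k=2` are all assumptions made for illustration, and the dimensionality reduction step mentioned in the abstract is omitted.

```python
import math

def l2_normalize(rows):
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    out = []
    for row in rows:
        norm = math.sqrt(sum(v * v for v in row)) or 1.0
        out.append([v / norm for v in row])
    return out

def dist2(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Plain Lloyd's K-means with a deterministic farthest-point start."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point farthest from all centroids chosen so far.
        far = max(points, key=lambda p: min(dist2(p, c) for c in centroids))
        centroids.append(list(far))
    labels = []
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda j: dist2(p, centroids[j]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical user-by-video interaction counts (rows: users, columns: videos).
interactions = [
    [5, 3, 0, 0],
    [4, 6, 0, 0],
    [0, 0, 2, 7],
    [0, 1, 3, 5],
]

# Cluster users on the rows, then videos on the transposed matrix.
user_labels = kmeans(l2_normalize(interactions), k=2)
videos_by_user = [list(col) for col in zip(*interactions)]
video_labels = kmeans(l2_normalize(videos_by_user), k=2)
print(user_labels, video_labels)
```

Once both label lists exist, per-cluster statistics such as mean message length or toxicity can be compared by grouping on `user_labels` or `video_labels`.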
SI Dec 11, 2025
The Circulate and Recapture Dynamic of Fan Mobility in Agency-Affiliated VTuber Networks
Tomohiro Murakami, Mitsuo Yoshida
VTuber agencies -- multichannel networks (MCNs) that bundle Virtual YouTubers (VTubers) on YouTube -- curate portfolios of channels and coordinate programming, cross-appearances, and branding in the live-streaming VTuber ecosystem. It remains unclear whether affiliation binds fans to a single channel or instead encourages movement within a portfolio that buffers exit, and how these micro-level dynamics relate to meso-level audience overlap. This study examines how affiliation shapes short-horizon viewer trajectories and the organization of audience overlap networks by contrasting agency-affiliated and independent VTubers. Using a large, multiyear, fan-centered panel of VTuber live-stream engagement on YouTube, we construct monthly audience overlap between creators with a similarity measure that is robust to audience-size asymmetries. At the micro level, we track retention, changes in the primary creator watched (oshi), and inactivity; at the meso level, we compare structural properties of affiliation-specific subgraphs and visualize viewer state transitions. The analysis identifies a pattern of loose mobility: fans tend to remain active while reallocating attention within the same affiliation type, with limited leakage across affiliation types. Network results indicate convergence in global overlap while local neighborhoods within affiliated subgraphs remain persistently denser. Flow diagrams reveal circulate-and-recapture dynamics that stabilize participation without relying on single-channel lock-in. We contribute a reusable measurement framework for VTuber live streaming that links micro-level trajectories to meso-level organization and informs research on creator labor, influencer marketing, and platform governance on video platforms. We do not claim causal effects; the observed regularities are consistent with proximity engineered by VTuber agencies and coordinated recapture.
CL Nov 5, 2021
Feature Selective Likelihood Ratio Estimator for Low- and Zero-frequency N-grams
Masato Kikuchi, Mitsuo Yoshida, Kyoji Umemura et al.
In natural language processing (NLP), the likelihood ratios (LRs) of N-grams are often estimated from the frequency information. However, a corpus contains only a fraction of the possible N-grams, and most of them occur infrequently. Hence, we desire an LR estimator for low- and zero-frequency N-grams. One way to achieve this is to decompose the N-grams into discrete values, such as letters and words, and take the product of the LRs for the values. However, because this method deals with a large number of discrete values, the running time and memory usage for estimation are problematic. Moreover, use of unnecessary discrete values causes deterioration of the estimation accuracy. Therefore, this paper proposes combining the aforementioned method with the feature selection method used in document classification, and shows that our estimator provides effective and efficient estimation results for low- and zero-frequency N-grams.
CL Oct 3, 2021
Unified Likelihood Ratio Estimation for High- to Zero-frequency N-grams
Masato Kikuchi, Kento Kawakami, Kazuho Watanabe et al.
Likelihood ratios (LRs), which are commonly used for probabilistic data processing, are often estimated based on the frequency counts of individual elements obtained from samples. In natural language processing, an element can be a continuous sequence of $N$ items, called an $N$-gram, in which each item is a word, letter, etc. In this paper, we attempt to estimate LRs based on $N$-gram frequency information. A naive estimation approach that uses only $N$-gram frequencies is sensitive to low-frequency (rare) $N$-grams and not applicable to zero-frequency (unobserved) $N$-grams; these are known as the low- and zero-frequency problems, respectively. To address these problems, we propose a method for decomposing $N$-grams into item units and then applying their frequencies along with the original $N$-gram frequencies. Our method can obtain the estimates of unobserved $N$-grams by using the unit frequencies. Although using only unit frequencies ignores dependencies between items, our method takes advantage of the fact that certain items often co-occur in practice and therefore maintains their dependencies by using the relevant $N$-gram frequencies. We also introduce a regularization to achieve robust estimation for rare $N$-grams. Our experimental results demonstrate that our method is effective at solving both problems and can effectively control dependencies.
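The contrast between the naive estimate and the unit-based decomposition can be sketched as follows. This is only an illustration of the two baseline ideas named in the abstract, not the proposed estimator: it has no regularization and, by multiplying unit ratios, it ignores dependencies between items. The toy counts and all function names are hypothetical.

```python
from collections import Counter

def ratio(f_num, n_num, f_den, n_den):
    """Ratio of relative frequencies; None when either count is zero,
    which this sketch treats as 'no estimate available'."""
    if f_num == 0 or f_den == 0:
        return None
    return (f_num * n_den) / (f_den * n_num)

def naive_lr(ngram, num_counts, den_counts):
    """Naive LR that uses only the frequency of the whole N-gram."""
    return ratio(num_counts[ngram], sum(num_counts.values()),
                 den_counts[ngram], sum(den_counts.values()))

def unit_counts(ngram_counts):
    """Item-level counts obtained by decomposing every N-gram."""
    units = Counter()
    for ngram, count in ngram_counts.items():
        for item in ngram:
            units[item] += count
    return units

def unit_product_lr(ngram, num_counts, den_counts):
    """Product of item-level LRs (assumes items are independent)."""
    num_units, den_units = unit_counts(num_counts), unit_counts(den_counts)
    n_num, n_den = sum(num_units.values()), sum(den_units.values())
    lr = 1.0
    for item in ngram:
        r = ratio(num_units[item], n_num, den_units[item], n_den)
        if r is None:
            return None
        lr *= r
    return lr

# Hypothetical bigram counts from two samples (numerator and denominator).
num = Counter({("new", "york"): 4, ("new", "car"): 2})
den = Counter({("new", "york"): 1, ("old", "car"): 2, ("new", "car"): 3})

observed = naive_lr(("new", "york"), num, den)        # bigram seen in both samples
unseen_naive = naive_lr(("york", "car"), num, den)    # bigram seen in neither sample
unseen_units = unit_product_lr(("york", "car"), num, den)
print(observed, unseen_naive, unseen_units)
```

The unobserved bigram gets no naive estimate but does get one from its unit frequencies, which is the gap the decomposition is meant to close.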
IR Dec 27, 2020
Analysis of Short Dwell Time in Relation to User Interest in a News Application
Ryosuke Homma, Yoshifumi Seki, Mitsuo Yoshida et al.
Dwell time has been widely used in various fields to evaluate content quality and user engagement. Although many studies have shown that content with long dwell time is of good quality, content with short dwell time has not been discussed in detail. We hypothesize that content with short dwell time is not always low quality and does not always have low user engagement, but is instead related to user interest. The purpose of this study is to clarify the meaning of short dwell time browsing in a mobile news application. First, we analyze the relation of short dwell time to user interest using large-scale user behavior logs from a mobile news application. This analysis was conducted on a vector space based on users' click histories, in which users and articles were mapped into the same space. The users with short dwell time are concentrated at a specific position in this space; thus, the length of dwell time is related to their interest. Moreover, we also analyze the characteristics of short dwell time browsing by excluding these browses from users' click histories. Surprisingly, after excluding short dwell time clicks, we found that the short dwell time click history included some aspect of user interest in 30.87% of the instances in which a user's cluster changed. These findings demonstrate that short dwell time does not always indicate a low level of user engagement, but can also reflect the level of user interest.
IR Dec 27, 2020
The metrics of keywords to understand the difference between Retweet and Like in each category
Kenshin Sekimoto, Yoshifumi Seki, Mitsuo Yoshida et al.
The purpose of this study is to clarify what kind of news is easily retweeted and what kind of news is easily Liked. We believe these actions, retweeting and Liking, have different meanings for users. Understanding this difference is important for understanding people's interests on Twitter. To analyze the difference between retweets (RTs) and Likes on Twitter in detail, we focus on word appearances in news titles. First, we calculate basic statistics and confirm that tweets containing news URLs have different RT and Like tendencies compared to other tweets. Next, we compare RTs and Likes for each category and confirm that the tendencies differ by category. Therefore, we propose metrics, based on those used in the $\chi$-square test, for clarifying the differences between the two actions in each category, in order to perform an analysis focusing on the topic. The proposed metrics are more useful than simple counts and TF-IDF for extracting meaningful words to understand the difference between RTs and Likes. We analyzed each category using the proposed metrics and quantitatively confirmed that the difference in the roles of retweeting and Liking appeared in the content depending on the category. Moreover, by aggregating tweets chronologically, we showed the trends of RTs and Likes as lists of words and clarified how the characteristic words of each week were related to current events for retweeting and Liking.
HC Aug 22, 2020
Brushing Feature Values in Immersive Graph Visualization Environment
Hinako Sassa, Maxime Cordeil, Mitsuo Yoshida et al.
There are a variety of graphs where multidimensional feature values are assigned to the nodes. Visualization of such datasets is not an easy task since they are complex and often huge. Immersive Analytics is a powerful approach to support the interactive exploration of such large and complex data. Many recent studies on graph visualization have applied immersive analytics frameworks. However, there have been few studies on immersive analytics for visualization of multidimensional attributes associated with the input graphs. This paper presents a new immersive analytics system that supports the interactive exploration of multidimensional feature values assigned to the nodes of input graphs. The presented system displays label-axes corresponding to the dimensions of feature values, and label-edges that connect label-axes and the corresponding nodes. The system supports brushing operations that control the display of edges that connect a label-axis and nodes of the graph. This paper introduces visualization examples with a graph dataset of Twitter users and reviews by experts on graph data analysis.
CY Sep 2, 2019
Analysis of Bias in Gathering Information Between User Attributes in News Application
Yoshifumi Seki, Mitsuo Yoshida
In the process of information gathering on the web, confirmation bias is known to exist, exemplified in phenomena such as echo chambers and filter bubbles. Our purpose is to reveal how people consume news and discuss these phenomena. In web services, we are able to use action logs of a service to investigate these phenomena. However, many existing studies about these phenomena are conducted via questionnaires, and there are few studies using action logs. In this paper, we attempt to discover biases of information gathering due to differences in user demographic attributes, such as age and gender, from the behavior log of a news distribution service. First, we summarized the actions in the service for each user attribute and showed the difference in user behavior depending on the attributes. Next, the degree of correlation between the attributes was measured using the correlation coefficient, and a strong correlation was found to exist in the browsing tendency of the news articles between the attributes. Then, keywords whose associated behavior was biased among the attributes were found using the parameters of a regression analysis. Since most of the discovered keywords can be explained by major news events, our proposed method is effective in detecting biased keywords.
CY Aug 23, 2019
Analysis of User Dwell Time by Category in News Application
Yoshifumi Seki, Mitsuo Yoshida
Dwell time indicates how long a user looked at a page, and it is used especially in fields where ratings from users are important, such as search engines, recommender systems, and advertisements. Despite the importance of this index, however, its characteristics are not well known. In this paper, we analyze the dwell time of news pages according to category in a smartphone application. Our aim is to clarify the characteristics of dwell time and the relation between length of news page and dwell time, for each category. The results indicated different dwell time trends for each category. For example, the social category had fewer news pages with dwell times shorter than the peak, compared to other categories, and there were a few news pages with remarkably short dwell time. We also found a large difference by category in the correlation value between dwell time and length of news page. Specifically, political news had the highest correlation value and technology news had the lowest. In addition, we found that a user tends to get sufficient information about the news content from the news title in short dwell times.
CL Jun 11, 2019
Journal Name Extraction from Japanese Scientific News Articles
Masato Kikuchi, Mitsuo Yoshida, Kyoji Umemura
In Japanese scientific news articles, although the research results are described clearly, the sources of the articles tend to be uncited. This makes it difficult for readers to know the details of the research. In this paper, we address the task of extracting journal names from Japanese scientific news articles. We hypothesize that a journal name is likely to occur in a specific context. To support the hypothesis, we construct a character-based method and extract journal names using this method. This method uses only the left and right context features of journal names. The results of the journal name extractions suggest that the distribution hypothesis plays an important role in identifying the journal names.
HC Mar 1, 2019
Analysis of User Dwell Time on Non-News Pages
Ryosuke Homma, Keiichi Soejima, Mitsuo Yoshida et al.
Dwell time is one of the indicators of user behavior; it indicates how long a user looked at a page. Dwell time is especially useful in fields where user ratings are important, such as search engines, recommender systems, and advertisements. Despite the importance of this index, however, its characteristics are not well known. In this paper, we analyze the dwell times of various websites on desktop and mobile devices using one year of data. Our aim is to clarify the characteristics of dwell time on non-news websites in order to discover which features are effective for predicting the dwell time. In this analysis, we focus on device types, access times, behavior on the website, and scroll depth. The results indicated that the number of sessions decreased as the dwell time increased, for both desktop and mobile devices. We also found that hour and month greatly affected the dwell time, but day of the week had little effect. Moreover, we discovered that inside and click users tended to have longer dwell times than outside and non-click users. However, we could not find a relationship between dwell time and scroll depth. This is because even if a user browsed the bottom of the page, the user might not necessarily have read the entire page.
SD Apr 16, 2018
Computing Information Quantity as Similarity Measure for Music Classification Task
Ayaka Takamoto, Mitsuo Yoshida, Kyoji Umemura et al.
This paper proposes a novel method that can replace the compression-based dissimilarity measure (CDM) in the composer estimation task. The main features of the proposed method are clarity and scalability. First, since the proposed method is formalized by the information quantity, reproduction of the result is easier compared with the CDM method, where the result depends on a particular compression program. Second, the proposed method has a lower computational complexity in terms of the number of learning data compared with the CDM method. The number of correct results was compared with that of the CDM for the composer estimation task on 75 piano scores by five composers. The proposed method performed better than the CDM method that uses the file size compressed by a particular program.
SD Oct 4, 2017
Improving Compression Based Dissimilarity Measure for Music Score Analysis
Ayaka Takamoto, Mayu Umemura, Mitsuo Yoshida et al.
In this paper, we propose a way to improve the compression based dissimilarity measure, CDM. We propose to use a modified value of the file size, where the original CDM uses an unmodified file size. Our application is a music score analysis. We have chosen piano pieces from five different composers. We have selected 75 famous pieces (15 pieces for each composer). We computed the distances among all pieces by using the modified CDM. We use the K-nearest neighbor method when we estimate the composer of each piece of music. The modified CDM shows improved accuracy. The difference is statistically significant.
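For orientation, the original (unmodified) CDM with a nearest-neighbor decision can be sketched as below. The sketch assumes the standard definition CDM(x, y) = C(xy) / (C(x) + C(y)), where C is a compressed size, and uses `zlib` as the compressor. The modification to the file size proposed in this paper is not reproduced, and the byte strings and labels are hypothetical stand-ins for encoded scores.

```python
import zlib

def csize(data: bytes) -> int:
    """Compressed size in bytes; zlib is an assumed choice of compressor."""
    return len(zlib.compress(data, 9))

def cdm(x: bytes, y: bytes) -> float:
    """Compression-based dissimilarity: C(xy) / (C(x) + C(y)).
    Values near 0.5 suggest shared structure, values near 1.0 little overlap."""
    return csize(x + y) / (csize(x) + csize(y))

def nearest_label(query: bytes, labeled):
    """1-nearest-neighbor decision over (data, label) pairs using CDM."""
    return min(labeled, key=lambda item: cdm(query, item[0]))[1]

# Hypothetical encoded pieces: a repeated motif versus unrelated byte content.
same_style = b"CDEFGAB" * 60
other_style = bytes(range(256)) * 3
query = b"CDEFGAB" * 40

labeled = [(same_style, "composer_a"), (other_style, "composer_b")]
d_same = cdm(query, same_style)
d_other = cdm(query, other_style)
predicted = nearest_label(query, labeled)
print(d_same, d_other, predicted)
```

Because the result depends on the compressor and its settings, swapping `zlib` for another program can change the distances, which is the reproducibility concern raised in the related abstract on information quantity.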
DS Sep 26, 2017
Polysemy Detection in Distributed Representation of Word Sense
Kana Oomoto, Haruka Oikawa, Eiko Yamamoto et al.
In this paper, we propose a statistical test to determine whether a given word is used as a polysemic word or not. The statistic of the word in this test roughly corresponds to the fluctuation in the senses of the neighboring words and the word itself. Even though the sense of a word corresponds to a single vector, we discuss how polysemy of the words affects the position of vectors. Finally, we also explain the method to detect this effect.
CV Sep 25, 2017
Realizing Half-Diminished Reality from Video Stream of Manipulating Objects
Hayato Okumoto, Mitsuo Yoshida, Kyoji Umemura
When we watch a video in which human hands manipulate objects, these hands may obscure some parts of those objects. We aim to make clear how the objects are manipulated by making the image of the hands semi-transparent, showing the complete images of both the hands and the object. By carefully choosing a Half-Diminished Reality method, this paper proposes a method that can process the video in real time and verifies that the proposed method works well.
SI Sep 3, 2017
Home Location Estimation Using Weather Observation Data
Yuki Kondo, Masatsugu Hangyo, Mitsuo Yoshida et al.
We can extract useful information from social media data by adding the user's home location. However, since the user's home location is generally not publicly available, many researchers have been attempting to develop more accurate home location estimation methods. In this study, we propose a method to estimate a Twitter user's home location by using weather observation data from AMeDAS. In our method, we first use a tweet to estimate the weather of the area from which the estimation target user posted. Next, we check the estimated weather against weather observation data and narrow down the area from which the user posted. Finally, the user's home location is estimated as the area from which the user frequently posts. In our experiments, the results indicate that our method functions effectively and also demonstrate that accuracy improves under certain conditions.
SI Sep 7, 2015
Wikipedia Page View Reflects Web Search Trend
Mitsuo Yoshida, Yuki Arase, Takaaki Tsunoda et al.
The frequency of a web search keyword generally reflects the degree of public interest in a particular subject matter. Search logs are therefore useful resources for trend analysis. However, access to search logs is typically restricted to search engine providers. In this paper, we investigate whether search frequency can be estimated from a different resource, namely Wikipedia page views, which are available as open data. We found frequently searched keywords to have remarkably high correlations with Wikipedia page views. This suggests that Wikipedia page views can be an effective tool for determining popular global web search trends.
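A correlation check of this kind can be sketched as follows. The abstract does not state which correlation coefficient was used, so Pearson's r is an assumption here, and the two series are made-up toy numbers rather than data from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical daily counts for one keyword and its Wikipedia article.
search_frequency = [120, 340, 90, 560, 210]
page_views = [1500, 3900, 1100, 6100, 2600]

r = pearson(search_frequency, page_views)
print(round(r, 3))
```

A value of `r` close to 1 would indicate that page views rise and fall together with search frequency for that keyword; the function raises `ZeroDivisionError` if either series is constant.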