65.3IRMar 14
The Reasoning Bottleneck in Graph-RAG: Structured Prompting and Context Compression for Multi-Hop QAYasaman Zarinkia, Venkatesh Srinivasan, Alex Thomo
Graph-RAG systems achieve strong multi-hop question answering by indexing documents into knowledge graphs, but strong retrieval does not guarantee strong answers. Evaluating KET-RAG, a leading Graph-RAG system, on three multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA), we find that 77% to 91% of questions have the gold answer in the retrieved context, yet accuracy is only 35% to 78%, and 73% to 84% of errors are reasoning failures. We propose two augmentations: (i) SPARQL chain-of-thought prompting, which decomposes questions into triple-pattern queries aligned with the entity-relationship context, and (ii) graph-walk compression, which compresses the context by ~60% via knowledge-graph traversal with no LLM calls. SPARQL CoT improves accuracy by +2 to +14 pp; graph-walk compression adds +6 pp on average when paired with structured prompting on smaller models. Surprisingly, we show that, with question-type routing, a fully augmented budget open-weight Llama-8B model matches or exceeds the unaugmented Llama-70B baseline on all three benchmarks at ~12x lower cost. A replication on LightRAG confirms that our augmentations transfer across Graph-RAG systems.
MLAug 13, 2021
Multi-Stage Graph Peeling Algorithm for Probabilistic Core DecompositionYang Guo, Xuekui Zhang, Fatemeh Esfahani et al.
Mining dense subgraphs where vertices connect closely with each other is a common task when analyzing graphs. A very popular notion in subgraph analysis is core decomposition. Recently, Esfahani et al. presented a probabilistic core decomposition algorithm based on graph peeling and Central Limit Theorem (CLT) that is capable of handling very large graphs. Their proposed peeling algorithm (PA) starts from the lowest degree vertices and recursively deletes these vertices, assigning core numbers, and updating the degree of neighbour vertices until it reached the maximum core. However, in many applications, particularly in biology, more valuable information can be obtained from dense sub-communities and we are not interested in small cores where vertices do not interact much with others. To make the previous PA focus more on dense subgraphs, we propose a multi-stage graph peeling algorithm (M-PA) that has a two-stage data screening procedure added before the previous PA. After removing vertices from the graph based on the user-defined thresholds, we can reduce the graph complexity largely and without affecting the vertices in subgraphs that we are interested in. We show that M-PA is more efficient than the previous PA and with the properly set filtering threshold, can produce very similar if not identical dense subgraphs to the previous PA (in terms of graph density and clustering coefficient).
CROct 12, 2020
MMH* with arbitrary modulus is always almost-universalKhodakhast Bibak, Bruce M. Kapron, Venkatesh Srinivasan
Universal hash functions, discovered by Carter and Wegman in 1979, are of great importance in computer science with many applications. MMH$^*$ is a well-known $\triangle$-universal hash function family, based on the evaluation of a dot product modulo a prime. In this paper, we introduce a generalization of MMH$^*$, that we call GMMH$^*$, using the same construction as MMH$^*$ but with an arbitrary integer modulus $n>1$, and show that GMMH$^*$ is $\frac{1}{p}$-almost-$\triangle$-universal, where $p$ is the smallest prime divisor of $n$. This bound is tight.
SESep 21, 2018
Predicting the Programming Language of Questions and Snippets of StackOverflow Using Natural Language ProcessingKamel Alreshedy, Dhanush Dharmaretnam, Daniel M. German et al.
Stack Overflow is the most popular Q&A website among software developers. As a platform for knowledge sharing and acquisition, the questions posted in Stack Overflow usually contain a code snippet. Stack Overflow relies on users to properly tag the programming language of a question and it simply assumes that the programming language of the snippets inside a question is the same as the tag of the question itself. In this paper, we propose a classifier to predict the programming language of questions posted in Stack Overflow using Natural Language Processing (NLP) and Machine Learning (ML). The classifier achieves an accuracy of 91.1% in predicting the 24 most popular programming languages by combining features from the title, body and the code snippets of the question. We also propose a classifier that only uses the title and body of the question and has an accuracy of 81.1%. Finally, we propose a classifier of code snippets only that achieves an accuracy of 77.7%. These results show that deploying Machine Learning techniques on the combination of text and the code snippets of a question provides the best performance. These results demonstrate also that it is possible to identify the programming language of a snippet of few lines of source code. We visualize the feature space of two programming languages Java and SQL in order to identify some special properties of information inside the questions in Stack Overflow corresponding to these languages.
SESep 21, 2018
SCC: Automatic Classification of Code SnippetsKamel Alreshedy, Dhanush Dharmaretnam, Daniel M. German et al.
Determining the programming language of a source code file has been considered in the research community; it has been shown that Machine Learning (ML) and Natural Language Processing (NLP) algorithms can be effective in identifying the programming language of source code files. However, determining the programming language of a code snippet or a few lines of source code is still a challenging task. Online forums such as Stack Overflow and code repositories such as GitHub contain a large number of code snippets. In this paper, we describe Source Code Classification (SCC), a classifier that can identify the programming language of code snippets written in 21 different programming languages. A Multinomial Naive Bayes (MNB) classifier is employed which is trained using Stack Overflow posts. It is shown to achieve an accuracy of 75% which is higher than that with Programming Languages Identification (PLI a proprietary online classifier of snippets) whose accuracy is only 55.5%. The average score for precision, recall and the F1 score with the proposed tool are 0.76, 0.75 and 0.75, respectively. In addition, it can distinguish between code snippets from a family of programming languages such as C, C++ and C#, and can also identify the programming language version such as C# 3.0, C# 4.0 and C# 5.0.
CRJul 8, 2015
On an almost-universal hash function family with applications to authentication and secrecy codesKhodakhast Bibak, Bruce M. Kapron, Venkatesh Srinivasan et al.
Universal hashing, discovered by Carter and Wegman in 1979, has many important applications in computer science. MMH$^*$, which was shown to be $Δ$-universal by Halevi and Krawczyk in 1997, is a well-known universal hash function family. We introduce a variant of MMH$^*$, that we call GRDH, where we use an arbitrary integer $n>1$ instead of prime $p$ and let the keys $\mathbf{x}=\langle x_1, \ldots, x_k \rangle \in \mathbb{Z}_n^k$ satisfy the conditions $\gcd(x_i,n)=t_i$ ($1\leq i\leq k$), where $t_1,\ldots,t_k$ are given positive divisors of $n$. Then via connecting the universal hashing problem to the number of solutions of restricted linear congruences, we prove that the family GRDH is an $\varepsilon$-almost-$Δ$-universal family of hash functions for some $\varepsilon<1$ if and only if $n$ is odd and $\gcd(x_i,n)=t_i=1$ $(1\leq i\leq k)$. Furthermore, if these conditions are satisfied then GRDH is $\frac{1}{p-1}$-almost-$Δ$-universal, where $p$ is the smallest prime divisor of $n$. Finally, as an application of our results, we propose an authentication code with secrecy scheme which strongly generalizes the scheme studied by Alomair et al. [{\it J. Math. Cryptol.} {\bf 4} (2010), 121--148], and [{\it J.UCS} {\bf 15} (2009), 2937--2956].
NTMar 5, 2015
Restricted linear congruencesKhodakhast Bibak, Bruce M. Kapron, Venkatesh Srinivasan et al.
In this paper, using properties of Ramanujan sums and of the discrete Fourier transform of arithmetic functions, we give an explicit formula for the number of solutions of the linear congruence $a_1x_1+\cdots +a_kx_k\equiv b \pmod{n}$, with $\gcd(x_i,n)=t_i$ ($1\leq i\leq k$), where $a_1,t_1,\ldots,a_k,t_k, b,n$ ($n\geq 1$) are arbitrary integers. As a consequence, we derive necessary and sufficient conditions under which the above restricted linear congruence has no solutions. The number of solutions of this kind of congruence was first considered by Rademacher in 1925 and Brauer in 1926, in the special case of $a_i=t_i=1$ $(1\leq i \leq k)$. Since then, this problem has been studied, in several other special cases, in many papers; in particular, Jacobson and Williams [{\it Duke Math. J.} {\bf 39} (1972), 521--527] gave a nice explicit formula for the number of such solutions when $(a_1,\ldots,a_k)=t_i=1$ $(1\leq i \leq k)$. The problem is very well-motivated and has found intriguing applications in several areas of mathematics, computer science, and physics, and there is promise for more applications/implications in these or other directions.
IRAug 7, 2012
The Best Answers? Think Twice: Online Detection of Commercial Campaigns in the CQA ForumsCheng Chen, Kui Wu, Venkatesh Srinivasan et al.
In an emerging trend, more and more Internet users search for information from Community Question and Answer (CQA) websites, as interactive communication in such websites provides users with a rare feeling of trust. More often than not, end users look for instant help when they browse the CQA websites for the best answers. Hence, it is imperative that they should be warned of any potential commercial campaigns hidden behind the answers. However, existing research focuses more on the quality of answers and does not meet the above need. In this paper, we develop a system that automatically analyzes the hidden patterns of commercial spam and raises alarms instantaneously to end users whenever a potential commercial campaign is detected. Our detection method integrates semantic analysis and posters' track records and utilizes the special features of CQA websites largely different from those in other types of forums such as microblogs or news reports. Our system is adaptive and accommodates new evidence uncovered by the detection algorithms over time. Validated with real-world trace data from a popular Chinese CQA website over a period of three months, our system shows great potential towards adaptive online detection of CQA spams.