Kusum Lata

4.2CLJul 13, 2024

Building pre-train LLM Dataset for the INDIC Languages: a case study on Hindi

Shantipriya Parida, Shakshi Panwar, Kusum Lata et al.

Large language models (LLMs) demonstrated transformative capabilities in many applications that require automatically generating responses based on human instruction. However, the major challenge for building LLMs, particularly in Indic languages, is the availability of high-quality data for building foundation LLMs. In this paper, we are proposing a large pre-train dataset in Hindi useful for the Indic language Hindi. We have collected the data span across several domains including major dialects in Hindi. The dataset contains 1.28 billion Hindi tokens. We have explained our pipeline including data collection, pre-processing, and availability for LLM pre-training. The proposed approach can be easily extended to other Indic and low-resource languages and will be available freely for LLM pre-training and LLM research purposes.

3.8CRNov 29, 2021

Hardware Software Co-design framework for Data Encryption in Image Processing Systems for the Internet of Things Environmen

Kusum Lata, Surbhi Chhabra, Sandeep Saini

Data protection is a severe constraint in the heterogeneous IoT era. This article presents a Hardware-Software Co-Simulation of AES-128 bit encryption and decryption for IoT Edge devices using the Xilinx System Generator (XSG). VHDL implementation of AES-128 bit algorithm is done with ECB and CTR mode using loop unrolled and FSM-based architecture. It is found that AES-CTR and FSM architecture performance is better than loop unrolled architecture with lesser power consumption and area. For performing the Hardware-Software Co-Simulation on Zedboard and Kintex-Ultra scale KCU105 Evaluation Platform, Xilinx Vivado 2016.2 and MATLAB 2015b is used. Hardware emulation is done for grey images successfully. To give a practical example of the usage of proposed framework, we have applied it for Biomedical Images (CTScan Image) as a case study. Security analysis in terms of the histogram, correlation, information entropy analysis, and keyspace analysis using exhaustive search and key sensitivity tests is also done to encrypt and decrypt images successfully.

Kusum Lata

2 Papers