Using statistical and analytical methods to solve «big data» problems
DOI №______
Abstract
Big data analysis is becoming an increasingly popular practice, adopted by many organizations to extract valuable information from large volumes of data. Large volumes of data present both opportunities and challenges for statisticians. Working with such data raises many problems: expensive equipment; large amounts of unstructured data, which prevent valuable information from being found quickly; and processing speed, since the work involves billions of gigabytes. From the point of view of data processing, the key operations in regression analysis are calculating the current total difference between predicted and observed values and adjusting the parameter values. The first operation parallelizes in an obvious way; the second is more complicated. In the most general case, adjusting the weights relies on a well-known mathematical fact: a function of several parameters increases in the direction of its gradient and decreases in the opposite direction. Calculating the gradient, in turn, means calculating the partial derivatives of the function with respect to each parameter, which reduces to discrete differentiation based on weighted sums. As a result, the adjustment of parameter values also reduces to summation, which can be parallelized. The difficulty with clustering big data is that existing algorithms assume that any information entity in the source data can be accessed directly. The source data, however, may be distributed across different servers, and there is no guarantee that each cluster is stored entirely on one server. If the distribution of data across servers is made transparent to the clustering algorithm, so that the algorithm treats the data as residing in a distributed virtual memory, this will inevitably lead to copying large amounts of data from one server to another. This article discusses statistical and analytical methods for combating «bad» data.
Keywords: big data; analytics; big data analysis; sampling; statistical methods.
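
The abstract's claim that both regression operations reduce to summation can be illustrated with a minimal sketch. It assumes a simple linear model y = w*x + b fitted by gradient descent on the mean squared error, with analytic partial derivatives standing in for the discrete differentiation mentioned in the abstract. The names `partial_stats`, `combine`, and `fit`, the learning rate, and the toy data are illustrative assumptions, not taken from the article. Each chunk stands in for the data held on one server; only four numbers per chunk are passed to the combining step.

```python
from functools import reduce


def partial_stats(chunk, w, b):
    """Per-chunk sums: squared error, gradient sums for w and b, and row count."""
    err = gw = gb = 0.0
    for x, y in chunk:
        r = (w * x + b) - y  # difference between predicted and observed value
        err += r * r
        gw += r * x
        gb += r
    return err, gw, gb, len(chunk)


def combine(a, c):
    """Merge two partial results by element-wise addition."""
    return tuple(p + q for p, q in zip(a, c))


def fit(chunks, lr=0.05, steps=2000):
    """Gradient descent in which every pass over the data is a sum of chunk sums."""
    w = b = 0.0
    mse = float("inf")
    for _ in range(steps):
        # Map step: each call touches one chunk only, so it could run
        # on the server that stores that chunk.
        parts = [partial_stats(c, w, b) for c in chunks]
        # Reduce step: only the small per-chunk summaries are combined.
        err, gw, gb, n = reduce(combine, parts)
        mse = err / n  # error measured before this step's update
        # Move against the gradient of the mean squared error.
        w -= lr * 2.0 * gw / n
        b -= lr * 2.0 * gb / n
    return w, b, mse


# Toy data generated from y = 2x + 1, split into two hypothetical "servers".
chunks = [
    [(0.0, 1.0), (1.0, 3.0)],
    [(2.0, 5.0), (3.0, 7.0)],
]
w, b, mse = fit(chunks)
print(w, b, mse)  # w and b end up close to 2.0 and 1.0
```

Because addition is associative, the chunks can be processed in any order and on any number of workers without changing the combined sums, up to floating-point rounding. No raw observations need to be copied between servers for the parameter update.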