The DataOps Blog

Where Change Is Welcome

Clickstream Analysis on Amazon EMR, Amazon Redshift and Elasticsearch

Posted in Engineering, April 23, 2020

Learn how quickly you can start ingesting and aggregating clickstream logs on Amazon EMR using StreamSets Transformer Engine, and see how the data is analyzed in Elasticsearch, Kibana, and Amazon Redshift.

What is Clickstream Analysis?

Clickstream analysis is the process of collecting, analyzing, and reporting aggregate information about webpage visits. In this blog, we will review how StreamSets Transformer Engine, a Spark ETL engine running on Amazon EMR, can help ingest and aggregate clickstream logs.

Pipeline Overview

Here are the details of the dataset and the data pipeline components:

  • Dataset and Data Source: Clickstream logs read from Amazon S3
  • Transformations: These include aggregations such as:
    • Number of views for each session with respect to action for a specific URL
    • The total number of sessions for each client IP address
    • Number of events captured for each brand of products
  • Destinations:
    • Aggregations are stored in Amazon Redshift tables. (Note: if the tables don't already exist, the destination can be configured to auto-create them.)
    • All the logs are sent to Elasticsearch for searching and quick visualizations in Kibana. (Note: if the index doesn't already exist in Elasticsearch, the destination can be configured to auto-create it.)
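
To make the three aggregations concrete, here is a minimal sketch in plain Python rather than in the Spark pipeline itself. The sample events and the field names session_id, action, url, and brand are assumptions made up for illustration (only clientip appears in the Redshift query later in this post), so treat this as a rough picture of the logic, not the pipeline's actual configuration.

```python
from collections import Counter

# Hypothetical sample events; field names other than clientip are assumptions.
events = [
    {"session_id": "s1", "clientip": "10.0.0.1", "action": "view", "url": "/home", "brand": "acme"},
    {"session_id": "s1", "clientip": "10.0.0.1", "action": "view", "url": "/home", "brand": "acme"},
    {"session_id": "s2", "clientip": "10.0.0.1", "action": "cart", "url": "/cart", "brand": "globex"},
    {"session_id": "s3", "clientip": "10.0.0.2", "action": "view", "url": "/home", "brand": "acme"},
]


def views_per_session(events):
    """Count events for each (session, action, URL) combination."""
    return Counter((e["session_id"], e["action"], e["url"]) for e in events)


def sessions_per_ip(events):
    """Count distinct sessions seen for each client IP address."""
    unique_pairs = {(e["clientip"], e["session_id"]) for e in events}
    return Counter(ip for ip, _ in unique_pairs)


def events_per_brand(events):
    """Count events captured for each product brand."""
    return Counter(e["brand"] for e in events)
```

In a Spark job the same logic would typically be expressed as group-by aggregations; the Counter version above is only meant to show what each metric counts.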

Aggregations

Here are the aggregated stats being collected and stored in Amazon Redshift.

Elasticsearch and Kibana

Once the logs were available in Elasticsearch, I created an index pattern called clickstream_data with all the attributes of the logs.

Using the clickstream_data index pattern as the source, I then created a dashboard with different visualizations in Kibana.

  • Session Wise Analysis: Number of views for each session with respect to action for a specific URL
  • Client Wise Analysis: The total number of sessions for each client IP address
  • Brand Analysis: Number of events captured for each brand of products
  • HTTP Response Analysis: Number of events captured with a response status such as Successful, Requested, No Response, Error Response, etc.
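
As a rough illustration of the HTTP Response Analysis, the sketch below buckets status codes by their standard HTTP class (2xx, 3xx, and so on) and counts events per bucket. The bucket labels and the function names are assumptions chosen for this example; the dashboard in this post may group responses differently.

```python
from collections import Counter

# Standard HTTP status classes, keyed by the first digit of the status code.
STATUS_CLASSES = {
    1: "Informational",
    2: "Successful",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def status_class(code):
    """Map a status code (int or numeric string) to its HTTP class label."""
    return STATUS_CLASSES.get(int(code) // 100, "Unknown")


def events_per_response_class(status_codes):
    """Count events for each HTTP response class."""
    return Counter(status_class(code) for code in status_codes)
```

For example, events_per_response_class([200, 200, 301, 404]) counts two Successful events, one Redirection, and one Client Error.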

Querying Amazon Redshift

Aggregations are stored in the number_of_views_per_session, number_of_sessions_per_ip, and number_of_events_per_http_response Redshift tables for fast querying. For example, this query returns the top 5 IP addresses from which HTTP sessions were initiated.

SELECT DISTINCT clientip, total_sessions FROM number_of_sessions_per_ip ORDER BY total_sessions DESC LIMIT 5;
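
If you want to try the shape of this query without a Redshift cluster, here is a minimal sketch that runs it against an in-memory SQLite table from Python. SQLite is only a stand-in for Redshift here, the sample rows are invented, and top_ips is a hypothetical helper name; the table and column names come from the query above.

```python
import sqlite3


def top_ips(rows, limit=5):
    """Load (clientip, total_sessions) rows into SQLite and return the top IPs."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE number_of_sessions_per_ip "
            "(clientip TEXT, total_sessions INTEGER)"
        )
        conn.executemany(
            "INSERT INTO number_of_sessions_per_ip VALUES (?, ?)", rows
        )
        cursor = conn.execute(
            "SELECT DISTINCT clientip, total_sessions "
            "FROM number_of_sessions_per_ip "
            "ORDER BY total_sessions DESC LIMIT ?",
            (limit,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


# Invented sample data, for illustration only.
sample_rows = [
    ("10.0.0.1", 12),
    ("10.0.0.2", 7),
    ("10.0.0.3", 31),
    ("10.0.0.4", 3),
    ("10.0.0.5", 19),
    ("10.0.0.6", 1),
]
top5 = top_ips(sample_rows)
```

With the sample rows above, the IP with the most sessions comes first and the sixth, least active IP is dropped by the limit.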

Watch Demo Video

Summary

While there are different ways to dissect and analyze data, hopefully this blog and demo video give you ideas on how to use some of the tools you might have at your disposal to make better, data-driven decisions, fast.

Learn more about StreamSets for AWS and StreamSets Transformer Engine. Also check out the getting started resources to jumpstart designing your Spark ETL pipelines.
