Learn how quickly you can start ingesting and aggregating clickstream logs on Amazon EMR using StreamSets Transformer Engine, and see how the data is analyzed in Elasticsearch, Kibana, and Amazon Redshift.
What is Clickstream Analysis?
Clickstream analysis is the process of collecting, analyzing, and reporting aggregate information about webpage visits. In this blog, we will review how StreamSets Transformer Engine, a Spark ETL engine running on Amazon EMR, can help ingest and aggregate clickstream logs.
Here are the details of the dataset and data pipeline components:
- Dataset and Data Source: Clickstream logs read from Amazon S3
- Transformations: Include aggregations such as:
  - Number of views for each session with respect to action for a specific URL
  - The total number of sessions for each client IP address
  - Number of events captured for each brand of products
- Aggregations are stored in Amazon Redshift tables. (Note: if the tables don’t already exist, the destination can be configured to auto-create them.)
- All the logs are sent to Elasticsearch for searching and quick visualizations in Kibana. (Note: if the index doesn’t already exist in Elasticsearch, the destination can be configured to auto-create it.)
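To make the aggregations listed above concrete, here is a minimal sketch in plain Python. It is not the Transformer pipeline itself, which runs on Spark; the sample records and the field names (session_id, clientip, action, url, brand) are assumptions made for illustration and may not match the actual log schema.

```python
from collections import Counter

# Hypothetical sample records; the field names are assumptions,
# not the real schema of the clickstream logs.
logs = [
    {"session_id": "s1", "clientip": "10.0.0.1", "action": "view", "url": "/home", "brand": "acme"},
    {"session_id": "s1", "clientip": "10.0.0.1", "action": "view", "url": "/home", "brand": "acme"},
    {"session_id": "s2", "clientip": "10.0.0.1", "action": "click", "url": "/cart", "brand": "globex"},
    {"session_id": "s3", "clientip": "10.0.0.2", "action": "view", "url": "/home", "brand": "acme"},
]


def views_per_session(records):
    """Count events for each (session, action, URL) combination."""
    return Counter((r["session_id"], r["action"], r["url"]) for r in records)


def sessions_per_ip(records):
    """Count distinct sessions for each client IP address."""
    distinct_pairs = {(r["clientip"], r["session_id"]) for r in records}
    return Counter(ip for ip, _ in distinct_pairs)


def events_per_brand(records):
    """Count events captured for each product brand."""
    return Counter(r["brand"] for r in records)
```

Each function corresponds to one bullet in the list: a group-by on the relevant key followed by a count. In a Spark pipeline the same logic would typically be expressed as grouped aggregations over a DataFrame.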
Here are the aggregated stats being collected and stored in Amazon Redshift.
Elasticsearch and Kibana
Once the logs were available in Elasticsearch, I created an index pattern called clickstream_data with all the attributes of the logs.
Using the clickstream_data index pattern as the source, I then created a dashboard with different visualizations in Kibana.
- Session Wise Analysis: Number of views for each session with respect to action for a specific URL
- Client Wise Analysis: The total number of sessions for each client IP address
- Brand Analysis: Number of events captured for each brand of products
- HTTP Response Analysis: Number of events captured with a response status such as Successful, Requested, No Response, Bad Response, etc.
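As a rough illustration of the HTTP response breakdown, the sketch below buckets numeric status codes into the standard HTTP status classes and counts events per class. The labels used here are the generic class names, which is an assumption; the dashboard in this post may use different labels, and the function names are hypothetical.

```python
from collections import Counter


def status_class(code):
    """Map a numeric HTTP status code to its standard class name.

    The labels are generic HTTP class names (an assumption); they are
    not necessarily the labels shown in the Kibana dashboard.
    """
    if 100 <= code < 200:
        return "Informational"
    if 200 <= code < 300:
        return "Successful"
    if 300 <= code < 400:
        return "Redirection"
    if 400 <= code < 500:
        return "Client Error"
    if 500 <= code < 600:
        return "Server Error"
    return "Unknown"


def events_per_status_class(codes):
    """Count events for each HTTP status class."""
    return Counter(status_class(c) for c in codes)
```

For example, events_per_status_class([200, 200, 404, 500]) groups the two 200 responses together and counts the 404 and 500 responses under their own classes.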
Aggregations are stored in the number_of_views_per_session, number_of_sessions_per_ip, and number_of_events_per_http_response Redshift tables for fast querying. For example, the following query returns the top 5 IP addresses from which HTTP sessions were initiated.
SELECT DISTINCT clientip, total_sessions FROM number_of_sessions_per_ip ORDER BY total_sessions DESC LIMIT 5;
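To try the shape of this query without a Redshift cluster, here is a small sketch that runs the same statement against an in-memory SQLite database. SQLite stands in for Redshift purely for local illustration, the column types are assumed, and the sample rows are made up; only the table and column names come from the query above.

```python
import sqlite3


def top_ips(rows, limit=5):
    """Return the client IPs with the most sessions, highest first.

    Uses an in-memory SQLite database as a stand-in for Redshift.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE number_of_sessions_per_ip "
        "(clientip TEXT, total_sessions INTEGER)"  # column types are assumed
    )
    conn.executemany("INSERT INTO number_of_sessions_per_ip VALUES (?, ?)", rows)
    result = conn.execute(
        "SELECT DISTINCT clientip, total_sessions "
        "FROM number_of_sessions_per_ip "
        "ORDER BY total_sessions DESC LIMIT ?",
        (limit,),
    ).fetchall()
    conn.close()
    return result


# Made-up rows for illustration only, including one duplicate.
sample = [("10.0.0.1", 7), ("10.0.0.2", 12), ("10.0.0.3", 3), ("10.0.0.2", 12)]
```

Calling top_ips(sample, 2) returns the two busiest addresses. DISTINCT collapses the duplicated row, so each IP and session-count pair appears once in the result.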
Watch Demo Video
While there are different ways to dissect and analyze data, hopefully this blog and demo video give you ideas on how to use some of the tools you might have at your disposal to make better, data-driven decisions, fast.