The Shared Bike Analysis System aims to optimize the deployment and pricing strategies of shared bikes. Through the analysis of user usage data, recommendations are made for the strategic placement of bikes at high-demand times and locations. Additionally, pricing adjustments are suggested based on the usage patterns of different bike types and membership categories.
The system requires Python 3 and Java 8 (Java 8 is required by Spark). It builds on a variety of tools and services, including Apache Spark, Hadoop, and Scala.
- System Overview
- Functional Modules
- Environment Setup
- Data Source
- Installation
- Usage
- Troubleshooting
- Project Report
- Additional Information
The Shared Bike Analysis System is built on PySpark + Hive and delivers a complete workflow around Metro Bike Share historical trip data, covering data ingestion, cleaning, data warehouse construction, statistical analysis, and result export. The system aims to improve bike dispatch efficiency and optimize operational strategy, with emphasis on the following business questions:
- How demand changes across different time periods (hourly granularity);
- Usage distribution across trip route types, passholder types, and bike types;
- Spatial characteristics of riding behavior within the city area (LA);
- Structured outputs for downstream visualization and operations decision-making.
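The first question above reduces to grouping trips by their start hour. The real pipeline does this in Spark SQL over the warehouse tables; as an illustration, the same aggregation can be sketched in plain Python (the `start_time` field name is an assumption, not the project's actual schema):

```python
from collections import Counter
from datetime import datetime

def hourly_demand(trips):
    """Count trip starts per hour of day (0-23).

    `trips` is an iterable of dicts with a `start_time` field holding an
    ISO-8601 timestamp; the field name is illustrative only.
    """
    counts = Counter(
        datetime.fromisoformat(t["start_time"]).hour for t in trips
    )
    # Return a dense 24-slot list so every hour appears, even with zero trips.
    return [counts.get(h, 0) for h in range(24)]

demand = hourly_demand([
    {"start_time": "2021-07-01T08:15:00"},
    {"start_time": "2021-07-01T08:40:00"},
    {"start_time": "2021-07-01T17:05:00"},
])
```

The dense 24-slot output makes downstream charting trivial: each index is an hour, each value a trip count.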
The main execution starts in main.py: after Spark/Hive session initialization, MasterController orchestrates data warehouse setup, trip data processing, statistics, and application-layer metric generation.
- `spark_util.py`: builds the SparkSession / SparkContext and manages runtime parameters;
- `hive_util.py`: manages Hive tables/partitions, executes SQL, and exports results.
- `master_controller.py`: acts as the system orchestrator for data warehouse initialization, the primary processing flow, statistics jobs, and resource cleanup;
- `trip_controller.py`: handles trip-domain processing, including raw CSV ingestion, field cleaning, UDF-based feature engineering, partitioned warehousing, and application-layer aggregation.
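The UDF-based feature engineering presumably derives per-trip fields such as duration and route type. A hypothetical sketch of that kind of row-level computation, in plain Python (all field names here are assumptions; in the pipeline this logic would be registered as a Spark UDF or expressed in Spark SQL):

```python
from datetime import datetime

def enrich_trip(row):
    """Derive example features from one raw trip record.

    Field names (`start_time`, `end_time`, `start_station`, `end_station`)
    are illustrative, not the project's actual schema.
    """
    start = datetime.fromisoformat(row["start_time"])
    end = datetime.fromisoformat(row["end_time"])
    return {
        **row,
        # Trip length in whole minutes.
        "duration_min": int((end - start).total_seconds() // 60),
        # A trip that returns to its origin station counts as a round trip.
        "trip_route_category": (
            "Round Trip" if row["start_station"] == row["end_station"]
            else "One Way"
        ),
    }

row = enrich_trip({
    "start_time": "2021-07-01T08:15:00",
    "end_time": "2021-07-01T08:47:00",
    "start_station": "3005",
    "end_station": "3005",
})
```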
- Provides common capabilities such as file operations, time utilities, geospatial tools, JSON handling, and logging;
- Supports trip data normalization, geolocation checks from lat/lon, and business feature computation.
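The geolocation check ("does this lat/lon fall inside the LA boundary?") can be implemented as a standard ray-casting point-in-polygon test against the polygons stored under `map_poly_json/` / `geo_shape/`. A minimal self-contained sketch (the project may use a geospatial library instead):

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: does (lon, lat) fall inside `polygon`?

    `polygon` is a list of (lon, lat) vertices, as in a GeoJSON ring.
    Cast a horizontal ray from the point and count edge crossings:
    an odd count means the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line at the point's latitude?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# A unit square standing in for a real LA boundary polygon.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
```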
- `chart_util.py`: provides statistical plotting utilities based on Matplotlib/Seaborn, enabling quick visual exploration of analysis results (e.g., count distributions).
- `data/`: stores raw quarterly Metro Bike Share datasets;
- `map_poly_json/` and `geo_shape/`: store geographic boundary data for region checks (e.g., the LA area);
- `results/`: stores statistical outputs and charts (e.g., correlation matrix, clustering figures).
The project's main script is main.py, and Jupyter Notebook visualization is provided via analysis.ipynb.
Note: Spark only supports Java 8 (not JDK 11).
The data directory is located at ./data. Please ensure your data files are placed here.
Our primary data source is the Metro Bike Share data site. We are only using data from the past three years, as older data may have different formats.
You can also download the data from our GitHub repository.
Configure the environment variables as follows:
```shell
# spark
export SPARK_HOME=/usr/local/Cellar/apache-spark/3.2.0
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/libexec/python
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9.2-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/build:$PYTHONPATH

# hadoop
export HADOOP_HOME=/usr/local/Cellar/hadoop/3.3.1
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

# scala
export SCALA_HOME=/usr/local/Cellar/scala/2.13.7
export PATH=$PATH:$SCALA_HOME/bin
```

Install the virtualenv package using pip:

```shell
pip install virtualenv
```

Install Jupyter Notebook:

```shell
pip install jupyter
```

A venv directory is provided, which includes all necessary Python packages.
To activate the virtual environment, run `source venv/bin/activate`; to exit it, run `deactivate`.
To generate the tables, run:

```shell
spark-submit main.py
```

To start Jupyter Notebook:

```shell
jupyter notebook
```

Then open the analysis notebook in your browser at http://0.0.0.0:8888/notebooks/analysis.ipynb.

To start the visualization tool, run:

```shell
flask run --host=0.0.0.0 --port=5000
```
If you encounter issues, please check the following potential solutions:
- If you encounter the error "ERROR XSDB6: Another instance of Derby may have already booted the database", find and kill the stale Spark process:

  ```shell
  ps -ef | grep spark-shell
  kill -9 <processID>
  ```

- If you get a "Java version unsupported" error from `org.apache.spark.storage.StorageUtils`, please ensure you're using Java 8.
- If you need to use a specific Java version, you can modify the `java8_location` variable in `spark_util.py` to set `JAVA_HOME` for this program.
- If you're having trouble with Hive, try deleting `db.lck` and `dbex.lck` in the `metastore_db` directory.
- For other issues, please submit them via the project's issue tracker.
The detailed report for this project is provided in "ANALYSIS OF SHARED BICYCLE OPERATION.pdf". It includes a comprehensive overview of the entire project, covering the following sections:
- Introduction
- Preprocessing
- Analytical Framework
- System Design
- Evaluation
- Related Research
- Conclusion
We plan to add weather features to the analysis. Potential weather data sources include:
- FiveThirtyEight US Weather History
- CIMIS Weather Station Data
- Weather Underground KLAX Data
- NOAA GHCN PDS
- Daily Weather Station Data