The Shared Bike Analysis System aims to optimize the deployment and pricing strategies of shared bikes. Through the analysis of user usage data, recommendations are made for the strategic placement of bikes at high-demand times and locations. Additionally, pricing adjustments are suggested based on the usage patterns of different bike types and membership categories.
The system requires Python 3 and Java 8 (Java 8 is required by Spark). It builds on a variety of tools and services, including Apache Spark, Hadoop, and Scala.
- System Overview
- Functional Modules
- Environment Setup
- Data Source
- Installation
- Usage
- Troubleshooting
- Project Report
- Additional Information
The Shared Bike Analysis System is built on PySpark + Hive and delivers a complete workflow around Metro Bike Share historical trip data, covering data ingestion, cleaning, data warehouse construction, statistical analysis, and result export. The system aims to improve bike dispatch efficiency and optimize operational strategy, with emphasis on the following business questions:
- How demand changes across different time periods (hourly granularity);
- Usage distribution across trip route types, passholder types, and bike types;
- Spatial characteristics of riding behavior within the city area (LA);
- Structured outputs for downstream visualization and operations decision-making.
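The first question above reduces to grouping trips by their start hour. The real pipeline does this in Spark SQL over the warehouse tables; as an illustration, the same aggregation can be sketched in plain Python (the `start_time` field name is an assumption, not the project's actual schema):

```python
from collections import Counter
from datetime import datetime

def hourly_demand(trips):
    """Count trip starts per hour of day (0-23).

    `trips` is an iterable of dicts with a `start_time` field holding an
    ISO-8601 timestamp; the field name is illustrative only.
    """
    counts = Counter(
        datetime.fromisoformat(t["start_time"]).hour for t in trips
    )
    # Return a dense 24-slot list so every hour appears, even with zero trips.
    return [counts.get(h, 0) for h in range(24)]

demand = hourly_demand([
    {"start_time": "2021-07-01T08:15:00"},
    {"start_time": "2021-07-01T08:40:00"},
    {"start_time": "2021-07-01T17:05:00"},
])
```

The dense 24-slot output makes downstream charting trivial: each index is an hour, each value a trip count.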
The main execution starts in main.py: after Spark/Hive session initialization, MasterController orchestrates data warehouse setup, trip data processing, statistics, and application-layer metric generation.
- `spark_util.py`: builds the SparkSession / SparkContext and manages runtime parameters;
- `hive_util.py`: manages Hive tables/partitions, executes SQL, and exports results.
- `master_controller.py`: acts as the system orchestrator for data warehouse initialization, the primary processing flow, statistics jobs, and resource cleanup;
- `trip_controller.py`: handles trip-domain processing, including raw CSV ingestion, field cleaning, UDF-based feature engineering, partitioned warehousing, and application-layer aggregation.
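The UDF-based feature engineering presumably derives per-trip fields such as duration and route type. A hypothetical sketch of that kind of row-level computation, in plain Python (all field names here are assumptions; in the pipeline this logic would be registered as a Spark UDF or expressed in Spark SQL):

```python
from datetime import datetime

def enrich_trip(row):
    """Derive example features from one raw trip record.

    Field names (`start_time`, `end_time`, `start_station`, `end_station`)
    are illustrative, not the project's actual schema.
    """
    start = datetime.fromisoformat(row["start_time"])
    end = datetime.fromisoformat(row["end_time"])
    return {
        **row,
        # Trip length in whole minutes.
        "duration_min": int((end - start).total_seconds() // 60),
        # A trip that returns to its origin station counts as a round trip.
        "trip_route_category": (
            "Round Trip" if row["start_station"] == row["end_station"]
            else "One Way"
        ),
    }

row = enrich_trip({
    "start_time": "2021-07-01T08:15:00",
    "end_time": "2021-07-01T08:47:00",
    "start_station": "3005",
    "end_station": "3005",
})
```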
- Provides common capabilities such as file operations, time utilities, geospatial tools, JSON handling, and logging;
- Supports trip data normalization, geolocation checks from lat/lon, and business feature computation.
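The geolocation check ("does this lat/lon fall inside the LA boundary?") can be implemented as a standard ray-casting point-in-polygon test against the polygons stored under `map_poly_json/` / `geo_shape/`. A minimal self-contained sketch (the project may use a geospatial library instead):

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test: does (lon, lat) fall inside `polygon`?

    `polygon` is a list of (lon, lat) vertices, as in a GeoJSON ring.
    Cast a horizontal ray from the point and count edge crossings:
    an odd count means the point is inside.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line at the point's latitude?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside

# A unit square standing in for a real LA boundary polygon.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
```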
- `chart_util.py`: provides statistical plotting utilities based on Matplotlib/Seaborn, enabling quick visual exploration of analysis results (e.g., count distributions).
- `data/`: stores raw quarterly Metro Bike Share datasets;
- `map_poly_json/` and `geo_shape/`: store geographic boundary data for region checks (e.g., the LA area);
- `results/`: stores statistical outputs and charts (e.g., correlation matrix, clustering figures).
The project's main script is main.py, and Jupyter Notebook visualization is provided via analysis.ipynb.
Note: Spark only supports Java 8 (not JDK 11).
The data directory is located at ./data. Please ensure your data files are placed here.
Our primary data source is the Metro Bike Share data site. We are only using data from the past three years, as older data may have different formats.
You can also download the data from our GitHub repository.
Configure the environment variables as follows:
```shell
# spark
export SPARK_HOME=/usr/local/Cellar/apache-spark/3.2.0
export PATH=$PATH:$SPARK_HOME/bin
export PYTHONPATH=$SPARK_HOME/libexec/python
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.10.9.2-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/build:$PYTHONPATH

# hadoop
export HADOOP_HOME=/usr/local/Cellar/hadoop/3.3.1
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"

# scala
export SCALA_HOME=/usr/local/Cellar/scala/2.13.7
export PATH=$PATH:$SCALA_HOME/bin
```

Install the virtualenv package using pip:

```shell
pip install virtualenv
```

Install Jupyter Notebook:

```shell
pip install jupyter
```

A venv directory is provided, which includes all necessary Python packages.
To activate the virtual environment, run `source venv/bin/activate`; to exit it, run `deactivate`.
To generate the tables, run:

```shell
spark-submit main.py
```

To start Jupyter Notebook:

```shell
jupyter notebook
```

Then open the analysis notebook in your browser at http://0.0.0.0:8888/notebooks/analysis.ipynb.

To start the visualization tool, run:

```shell
flask run --host=0.0.0.0 --port=5000
```
If you encounter issues, please check the following potential solutions:
- If you encounter the error "ERROR XSDB6: Another instance of Derby may have already booted the database", find and kill the stale Spark process:

  ```shell
  ps -ef | grep spark-shell
  kill -9 <processID>
  ```

- If you get a "Java version unsupported" error from `org.apache.spark.storage.StorageUtils`, please ensure you're using Java 8.
- If you need to use a specific Java version, you can modify the `java8_location` variable in `spark_util.py` to set `JAVA_HOME` for this program.
- If you're having trouble with Hive, try deleting `db.lck` and `dbex.lck` in the `metastore_db` directory.
- For other issues, please submit them via the project's issue tracker.
The detailed report for this project is provided in "ANALYSIS OF SHARED BICYCLE OPERATION.pdf". It includes a comprehensive overview of the entire project, covering the following sections:
- Introduction
- Preprocessing
- Analytical Framework
- System Design
- Evaluation
- Related Research
- Conclusion
We plan to add weather features to the analysis. Potential weather data sources include:
- FiveThirtyEight US Weather History
- CIMIS Weather Station Data
- Weather Underground KLAX Data
- NOAA GHCN PDS
- Daily Weather Station Data